July saw a steady stream of data center outages due to equipment failures, several of which attracted media attention. The latest incident to make headlines was a July 30 outage at Seattle's Fisher Plaza, which CRM Buyer describes today in a story titled Unsinkable Data Center Crashes in Seattle. The article digs into the disputed cause of the downtime: Fisher Plaza cited an equipment problem after a Seattle City Light outage, but the power company says it was never offline.
The recent string of incidents provides a painful reminder that Murphy's Law has jurisdiction over even the most wired data centers. An AFCOM member survey from April predicted that within the next five years, power failures and shortages will halt data center operations (at least briefly) at more than 90% of all companies.
The uptime industry is in the business of trying to anticipate everything that can go wrong, and engineering solutions for even the most improbable scenarios. Although SLAs promising 100% uptime are common nowadays, stuff happens. "Failure is inevitable. Fail small," said Richard Sawyer, Director of Data Center Technology for American Power Conversion, in discussing the AFCOM results. Outages are painful, but offer lessons as well. In that spirit, here's a recap of some of the recent incidents:
- On July 23 a Level 3 data center in London was offline for more than five hours. The downtime was noted by The Register, which said the generators failed to work and the batteries began giving out shortly afterward. Even before the outage, the facility had struggled with rising heat loads.
- The Garland Building (1200 W. 7th) in Los Angeles suffered building-wide power outages during grid blackouts on July 24 and again on July 28. The first incident made headlines because it caused extended downtime for MySpace, one of the Internet's busiest sites. The downtime was also publicly discussed by both DreamHost and Media Temple, two of the hosting providers in the building.
- On July 30 came the outage at Fisher Plaza, which knocked some Seattle-area TV and radio stations offline. The outage was discussed on the NANOG mailing list, where the culprit was reported to be a bad breaker in the main bypass from city power to the generators.
- But there were plenty of other situations where the equipment worked. A July 13 electrical disruption at Los Angeles telecom super-hub One Wilshire forced the evacuation of the building and left some tenants unable to access the building the following day. That's not a good thing. But ponder the implications and ripple effects if One Wilshire had gone completely offline. "The building's backup generators came online without delay and ensured the uninterrupted delivery of power to essential building services and to the building's community of communications service providers," landlord CRG West noted in an incident report to tenants.
Which brings us to a potential learning opportunity in crisis communications. Providers housed at the Garland Building and Fisher Plaza have complained publicly about the lack of information during the outages, which made it more difficult for them to relay useful information to their customers. Nobody likes to talk about downtime, and this was a theme in the CRM Buyer article.
"Attempts to maintain secrecy and shift responsibility for the causes and consequences of the Fisher Plaza outage have made the events of July 30 as much about corporate culture and professional integrity in the face of adversity than about the technical issues surrounding the outage," writes CRM Buyer's Anthony Mitchell. (It's worth noting that Mitchell is not an entirely detached reporter/commentator, as his company uses one of the providers at Fisher Plaza affected by the outage, a wrinkle that isn't explicitly stated in the story but can be ascertained via his bio blurb).
At the other end of the spectrum is DreamHost, which published a lengthy blog item giving customers a blow-by-blow account of its downtime at the Garland Building. The level of disclosure was praised by marketing expert Seth Godin in a blog item titled Important Lessons From DreamHost.
"Lesson one: when things get messed up, being clear, self-critical and apologetic is really the only way to deal with customers if you expect them to give you another chance," Seth writes. "Lesson two: your story is all you've got. If you sell the 'up-time' story, better over-invest in whatever it takes to be sure your story is true."