The Day After: A Brutal Week for Uptime

Last week was a brutal one for the data center industry, with high-profile outages at several companies with strong reputations for uptime, and a fire at a data center complex that raised tough questions about the redundancy and responsiveness of a number of high-profile sites. There may not be a single root cause to analyze, but plenty of issues and “lessons learned” emerged from last week’s problems. Let’s start with a look at the week’s events:

  • On Monday, June 29, Rackspace Hosting (RAX) experienced a power outage at its Dallas data center that left several areas of the facility without power for about 45 minutes, knocking many popular customer web sites offline.
  • Early Thursday, Equinix Inc. (EQIX) data centers in Sydney, Australia and Paris each experienced power failures. While the power outages were brief – Equinix said the Sydney event lasted 12 minutes, while power was restored in Paris in just one minute – many key customer sites took considerably longer to recover their systems. The Sydney event led to disruptions to VoIP service in parts of Australia, while the Paris outage caused downtime for the popular video site DailyMotion and the French portal for hosting firm ClaraNet.
  • Google App Engine, the company’s cloud computing platform, had lengthy performance problems on Thursday, experiencing high latency and data loss.
  • A fire at Fisher Plaza in Seattle late Thursday night left many of the building’s data centers without power. The fire in a basement-level electrical room triggered sprinklers and caused extensive damage to generators and electrical equipment. The damage left even tenants with backup plans offline for hours, while those without backup sites were down until temporary generators restored power early Saturday morning. The biggest impact was at payment gateway Authorize.net, which was offline for more than 12 hours, leaving its merchant customers unable to process credit card sales. Other sites experiencing lengthy downtime included AdHost, GeoCaching and Microsoft’s Bing Travel.
  • Early Sunday, July 5, a fire at 151 Front Street, the major carrier hotel in Toronto, knocked out power on several floors of the facility used by Peer 1 Networks. Power was restored in about 3 hours, after a damaged UPS unit was bypassed.

Now about those tough questions and (hopefully) lessons learned:

Crisis Communications Moves Fast
We’ve made this point many times before. But good communication with stakeholders is absolutely essential during an outage, and in the age of Twitter that means timely communication as well. Rackspace hosts a ton of high-profile sites, meaning its outage was on TechCrunch within minutes. Fortunately, the company was already active on Twitter, acknowledged the issue quickly and then began ramping up a series of updates via Twitter and the company blog.

The Fisher Plaza fire was another matter. “Fisher Plaza lacked any official communication with the first responders at the scene,” writes Jeremy Irish of GeoCaching in a post-mortem on the event. “Many clients of the building were in the dark, both figuratively and literally, while we were waiting outside for news of what really happened. Instead we had to join in on Twitter to figure out what happened. … If someone walked out of the building with some authority and told us what they knew – we could have passed that information on to our customers.”

What the Heck Happened at Bing Travel?
Two of Microsoft’s next-generation data centers (in Quincy, Wash. and San Antonio) have been online for many months. So why was Bing Travel being hosted at Fisher Plaza, an older third-party facility with a recent history of power outages? The site wound up being offline for about a day and a half. “Bing Travel is a complex system of servers, databases and networking hardware that runs at massive scale,” Microsoft spokeswoman Whitney Burk told TechFlash. “It takes a bit of time after an interruption of power such as this one to bring it back online. Given power was restored at 2am today, we feel we had the service back up as quickly as was possible.”

Except that Microsoft has spent about $2 billion building new data centers that are optimized to support complex systems of servers and databases on a massive scale. If Bing is the new hotness – and Microsoft’s marketing budget suggests that it is – why leave the new brand exposed to headline risk from older facilities?

Failover to Backup: Not Always Simple
Seattle real estate site Redfin emerged from the Fisher Plaza incident as the poster child for thinking ahead. “We were pretty embarrassed last June when Adhost had a similar electrical fire and took our site down for 8 hours,” Redfin CTO Michael Young told TechFlash. “So by October 2008, we basically instituted a disaster avoidance plan where we had redundant-everything for our mission-critical databases, servers and networks in separate buildings. When the problem happened last night, our beepers went off, we saw what looked like a major outage in one building, and were able to switch to the redundant systems.” Redfin was back online by 4 a.m., about 5 hours after the fire.

Not so for payment gateway Authorize.net, arguably a far more critical service in light of its impact on e-commerce. Authorize.net said that it had a backup facility, but it was not able to switch over in a timely fashion due to a “perfect storm” of challenges.
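
Redfin’s setup boils down to a health-check-and-promote pattern: probe the primary site on a schedule, and when it stops answering, page someone and cut traffic over to the standby. Below is a minimal sketch of that idea in Python; the health-check URL and the promote_secondary() hook are hypothetical stand-ins for whatever DNS or load-balancer change a given shop would actually make.

```python
import time
import urllib.request

PRIMARY_URL = "https://primary.example.com/health"  # hypothetical health endpoint
CHECK_INTERVAL = 30              # seconds between probes
FAILURES_BEFORE_FAILOVER = 3     # tolerate brief blips before cutting over


def primary_is_healthy(timeout=5):
    """Return True if the primary site answers its health check with a 200."""
    try:
        with urllib.request.urlopen(PRIMARY_URL, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


def promote_secondary():
    """Placeholder: flip DNS or the load balancer to the standby site and page on-call."""
    print("Primary looks down: promoting standby site and paging on-call")


def watch():
    failures = 0
    while True:
        if primary_is_healthy():
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FAILOVER:
                promote_secondary()
                return
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    watch()
```

The monitoring loop is the easy part; the promotion step is where plans like this tend to break down if they are never rehearsed.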

Murphy is Alive and Well
Modern data centers are architected to avoid single points of failure. But how many points of failure can you anticipate and plan for? An unusual series of equipment failures contributed to the Rackspace outage, the most significant being a bank of generators that malfunctioned. Other failures involved equipment that connects the data center to its two utility feeds, a UPS system and a transfer switch.

An Opportunity for Amazon?
Portent Interactive, a provider hosted at AdHost, says it can’t afford to back up all of its customer sites in a second data center. In the wake of last week’s outage, Portent will instead offer customers the option to host backup copies of static files on Amazon S3, the storage component of Amazon Web Services. “That way, moving to a new location in an emergency might take time but is still possible, even if all connectivity to Adhost is lost,” Portent wrote on its blog.
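
For the static-file piece of a plan like Portent’s, the copy itself is simple to script. Here is a rough sketch using the boto3 SDK for Python; the bucket name and local path are made-up examples, and AWS credentials are assumed to be configured separately.

```python
import os

import boto3  # AWS SDK for Python

BUCKET = "example-dr-static-backup"   # hypothetical backup bucket
LOCAL_ROOT = "/var/www/static"        # hypothetical directory of static site files


def backup_static_files():
    """Copy every file under LOCAL_ROOT to S3, preserving relative paths as keys."""
    s3 = boto3.client("s3")
    for dirpath, _dirnames, filenames in os.walk(LOCAL_ROOT):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            key = os.path.relpath(local_path, LOCAL_ROOT)
            s3.upload_file(local_path, BUCKET, key)
            print(f"uploaded {key}")


if __name__ == "__main__":
    backup_static_files()
```

Run on a schedule, a script along these lines keeps a current copy of the static content outside the primary facility, which is exactly the point Portent is making.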

About the Author

Rich Miller is the founder and editor-in-chief of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.

24 Comments

  1. Han

    A fire also took out parts of Peer1 over the weekend: http://forums.peer1.com/viewtopic.php?f=37&t=117

  2. Nate

    I just wanted to point out that you have Geocaching's CEO named incorrectly in this article. His name is "Jeremy Irish" not "Jeremy Fisher".

  3. Zandr

    Let me get this straight, 5 hour downtime is considered a success? I think I'd be in serious conversation about my employment if I had built out fully redundant datacenters and couldn't fail over in less than an hour... and honestly "no outage detected" would be a design goal.

  4. One correction first...Geocaching is run by Jeremy IRISH. Not Jeremy Fisher. That unfortunate mistake might connect him to owning Fisher Plaza in some people's eyes when the connection is merely that Fisher Plaza houses the servers for Geocaching.com. My question, and a question that I've heard a few others ask, is why there were sprinklers in the same room as electrical generators? DUH...water and electricity don't mix. Why didn't they use a Class C fire extinguisher system that uses dry, non-conductive chemicals? For example, CO2 or Argon would be much safer and have been commonly used for years to fight electrical fires without damaging the electrical equipment.

  5. Thanks for the correction on Jeremy Irish's last name, which has been updated. My apologies to Jeremy.

  6. Some of the affected sites had critical DNS records with a TTL of 24 hours, preventing rapid failover to even a "status page". More details documented on the Authorize.Net failure here: http://dynamicnetworkservices.com/journal/AnalysisAndLessonsLearnedFromTheAuthorizeNetDatacenterOutage
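
     A quick way to catch this ahead of time is to audit the TTLs your records actually advertise. A rough sketch with dnspython (the hostnames are stand-ins for whatever records would have to move in a failover):

     ```python
     import dns.resolver  # dnspython

     # Hypothetical records that would need to move during a failover
     RECORDS = ["www.example.com", "api.example.com"]
     MAX_TTL = 300  # anything above 5 minutes makes a fast DNS failover painful

     for name in RECORDS:
         answer = dns.resolver.resolve(name, "A")
         ttl = answer.rrset.ttl
         flag = "OK" if ttl <= MAX_TTL else "TOO HIGH"
         print(f"{name}: TTL {ttl}s [{flag}]")
     ```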

  7. Lennie

    I'm sorry, but I've noticed Amazon S3 has been slower and I've seen quite a few 404s these past few weeks. But never bet on one horse is maybe what you should be doing. It might, however, be very complicated architecture-wise.

  8. Thanks for making the correction. It would be my understanding that in the case of an electrical fire the sprinklers would go off after the power was cut. Water is still considered the worst-case suppression tool for fire as far as I know, which isn't much. In any case, the sprinkler system that went off was in the parking garage and not at the colocation facility, so it wasn't like we were wading in water. In fact, the entire time I never even smelled a whiff of fire damage since it was contained in the depths of the garage.

  9. Scott

    Depending on local fire code and the size of the area, dual-action dry-pipe water sprinkler systems may be required by law. No matter what, datacenter facilities still have to comply with local and state regulations regarding fire suppression, even if there is a "better" alternative.

  10. Bob

    Scott is correct about compliance with local Fire Codes. They vary by location. One method of protection that satisfies most codes is a combination clean agent suppression system using any of the latest generation chemicals such as FM-200, ECARO, Sapphire, Inergen, etc., backed up with a dual interlock pre-action sprinkler system. The clean agent suppresses the fire with no damage to the equipment and the sprinkler system protects the structure, if needed. I would stay away from CO2 in a potentially occupied space for safety reasons.

  11. Fire sprinklers are actually a low cost, effective, and reliable method for controlling fires in most rooms including electrical rooms. If a fire in an electrical room is large enough to activate a 155 or 200 degree sprinkler, then the room was shot anyway. It's better to put sprinkler water on it sooner than a fire hose later. However...for critical rooms, why not use sensitive detection with a clean agent suppression system (CO2, FM200, Novec 1230, Ecaro) and deal with the fire BEFORE it builds to catastrophic levels?

  12. Adam Steffes

    Customers should never have the power go out, save for a fire or flood. More to the point, how can anyone afford to operate an Internet service without a fully-redundant and instant-failover deployment from day 1?

  13. This seems on-topic: I've extensively tested three website monitoring services - Wormly, Site24x7 and Pingability - the results of which you can find here: http://opinionroad.com/2009/07/12/website-monitoring-services/ Feedback about the article is very welcome.