The Day After: A Brutal Week for Uptime

Last week was a brutal one for the data center industry, with high-profile outages for several companies with strong reputations for uptime, and a fire at a data center complex that raised tough questions about redundancy and responsiveness of a number of high-profile sites. There may not be a single root cause to analyze, but there are plenty of issues and "lessons learned" to emerge from last week's problems. Let's start with a look at the week's events:

On Monday June 29, Rackspace Hosting (RAX) experienced a power outage at its Dallas data center that left several areas of the facility without power for about 45 minutes, knocking many popular customer web sites offline.
Early Thursday Equinix Inc. (EQIX) data centers in Sydney, Australia and Paris each experienced power failures. While the power outages were brief - Equinix said the Sydney event lasted 12 minutes while power was restored in Paris in just one minute - many key customer sites took considerably longer to recover their systems. The Sydney event led to disruptions for VoIP service in parts of Australia, while the Paris outage caused downtime for the popular video site DailyMotion and the French portal for hosting firm ClaraNet.
Google App Engine, the company's cloud computing platform, had lengthy performance problems on Thursday, experiencing high latency and data loss.
A fire at Fisher Plaza in Seattle late Thursday night left many of the building's data centers without power. The fire in an basement-level electrical room triggered sprinklers and caused extensive damage to generators and electrical equipment. The damage left tenants with backup plans offline for hours, and those without backup sites down until temporary generators restored power early Saturday morning. The biggest impact was at payment gateway Authorize.net, which was offline for more than 12 hours, leaving its merchant customers unable to process credit card sales. Other sites experiencing lengthy downtime included AdHost, GeoCaching and Microsoft's Bing Travel.
Early Sunday, July 5, a fire at 151 Front Street, the major carrier hotel in Toronto, knocked out power on several floors of the facility used by Peer 1 networks. Power was restored in about 3 hours, after a damaged UPS unit was bypassed.

Now about those tough questions and (hopefully) lessons learned:

Crisis Communications Moves Fast:
We've made this point many times before. But good communication with stakeholders is absolutely essential during an outage, and in the age of Twitter that means good and timely communications. Rackspace hosts a ton of high-profile sites, meaning its outage was on TechCrunch within minutes. Fortunately, the company was already busy on Twitter, acknowledged the issue quickly and then slowly began ramping up a series of updates via Twitter and the company blog.

The Fisher Plaza fire was another matter. "Fisher Plaza lacked any official communication with the first responders at the scene," writes Jeremy Irish of GeoCaching in a post-mortem on the event. "Many clients of the building were in the dark, both figuratively and literally, while we were waiting outside for news of what really happened. Instead we had to join in on Twitter to figure out what happened. ... If someone walked out of the building with some authority and told us what they knew - we could have passed that information on to our customers."

What the Heck Happened at Bing Travel?
Two of Microsoft's next-generation data centers (in Quincy, Wash. and San Antonio) have been online for many months. So why was Bing Travel being hosted at Fisher Plaza, an older third-party facility with a recent history of power outages? The site wound up being offline for about a day and a half. "Bing Travel is a complex system of servers, databases and networking hardware that runs at massive scale," Microsoft spokeswoman Whitney Burk told TechFlash. "It takes a bit of time after an interruption of power such as this one to bring it back online. Given power was restored at 2am today, we feel we had the service back up as quickly as was possible."

Except that Microsoft has spent about $2 billion building new data centers that are optimized to support complex systems of servers and databases on a massive scale. If Bing is the new hotness - and Microsoft's marketing budget suggests that it is - why leave the new brand exposed to headline risk from older facilities?

Failover to Backup: Not Always Simple:
Seattle real estate site Redfin emerged from the Fisher Plaza incident as the poster child for thinking ahead. "We were pretty embarrassed last June when Adhost had a similar electrical fire and took our site down for 8 hours," Redfin CTO Michael Young told TechFlash. "So by October 2008, we basically instituted a disaster avoidance plan where we had redundant-everything for our mission-critical databases, servers and networks in separate buildings. When the problem happened last night, our beepers went off, we saw what looked like a major outage in one building, and were able to switch to the redundant systems." Redfin was back online by 4 a.m., about 5 hours after the fire.

Not so for payment gateway Authorize.net, arguably a far more critical service in light of its impact on e-commerce. Authorize.net said that it had a backup facility, but it was not able to switch over in a timely fashion due to a "perfect storm"of challenges.

Murphy is Alive and Well
Modern data centers are designed to architect around single points of failure. But how many points of failure can you anticipate and plan for? An unusual series of equipment failures contributed to the Rackspace outage, with the most significant being a bank of generators that malfunctioned. Other failures involved equipment that connects the data center to its two utility feeds, a UPS system and a transfer switch.

An Opportunity for Amazon?
Portent Interactive is a provider hosted at AdHost which says it can't afford to back up all of its customer sites in a second data center. In the wake of last week's outage, Portent will instead offer customers the option to host backup copies of static files on Amazon S3, the storage component of Amazon Web Services. "That way, moving to a new location in an emergency might take time but is still possible, even if all connectivity to Adhost is lost," Portent wrote on its blog.

Comments

Plain text