Posted By Rich Miller On July 6, 2009 @ 9:55 am In Downtime | 24 Comments
Last week was a brutal one for the data center industry, with high-profile outages for several companies with strong reputations for uptime, and a fire at a data center complex that raised tough questions about redundancy and responsiveness of a number of high-profile sites. There may not be a single root cause to analyze, but there are plenty of issues and “lessons learned” to emerge from last week’s problems. Let’s start with a look at the week’s events:
Now about those tough questions and (hopefully) lessons learned:
Crisis Communications Moves Fast:
We’ve made this point many times before. But good communication with stakeholders is absolutely essential during an outage, and in the age of Twitter that means good and timely communications. Rackspace hosts a ton of high-profile sites, meaning its outage was on TechCrunch within minutes. Fortunately, the company was already busy on Twitter, acknowledged the issue quickly and then slowly began ramping up a series of updates via Twitter and the company blog.
The Fisher Plaza fire was another matter. “Fisher Plaza lacked any official communication with the first responders at the scene,” writes Jeremy Irish of GeoCaching in a post-mortem on the event. “Many clients of the building were in the dark, both figuratively and literally, while we were waiting outside for news of what really happened. Instead we had to join in on Twitter to figure out what happened. … If someone walked out of the building with some authority and told us what they knew – we could have passed that information on to our customers.”
What the Heck Happened at Bing Travel?
Two of Microsoft’s next-generation data centers (in Quincy, Wash. and San Antonio) have been online for many months. So why was Bing Travel being hosted at Fisher Plaza, an older third-party facility with a recent history of power outages? The site wound up being offline for about a day and a half. “Bing Travel is a complex system of servers, databases and networking hardware that runs at massive scale,” Microsoft spokeswoman Whitney Burk told TechFlash. “It takes a bit of time after an interruption of power such as this one to bring it back online. Given power was restored at 2am today, we feel we had the service back up as quickly as was possible.”
Except that Microsoft has spent about $2 billion building new data centers that are optimized to support complex systems of servers and databases on a massive scale. If Bing is the new hotness – and Microsoft’s marketing budget suggests that it is – why leave the new brand exposed to headline risk from older facilities?
Failover to Backup: Not Always Simple:
Seattle real estate site Redfin emerged from the Fisher Plaza incident as the poster child for thinking ahead. “We were pretty embarrassed last June when Adhost had a similar electrical fire and took our site down for 8 hours,” Redfin CTO Michael Young told TechFlash. “So by October 2008, we basically instituted a disaster avoidance plan where we had redundant-everything for our mission-critical databases, servers and networks in separate buildings. When the problem happened last night, our beepers went off, we saw what looked like a major outage in one building, and were able to switch to the redundant systems.” Redfin was back online by 4 a.m., about 5 hours after the fire.
Not so for payment gateway Authorize.net, arguably a far more critical service in light of its impact on e-commerce. Authorize.net said that it had a backup facility, but it was not able to switch over in a timely fashion due to a “perfect storm” of challenges.
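For readers who want to see the shape of a failover decision, here is a minimal sketch in Python. It is not how Redfin or Authorize.net actually handle failover; the `Site` class, the function names and the three-check threshold are all hypothetical, chosen only to illustrate the idea of switching to a backup once the primary has clearly failed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Site:
    # Hypothetical record for one facility: a name plus its most recent
    # health-check results (True = check passed), newest last.
    name: str
    checks: List[bool] = field(default_factory=list)


def is_down(site: Site, threshold: int = 3) -> bool:
    """Treat a site as down only after `threshold` consecutive failed checks."""
    recent = site.checks[-threshold:]
    return len(recent) == threshold and not any(recent)


def choose_active(primary: Site, backup: Site, threshold: int = 3) -> Optional[str]:
    """Return the name of the site that should serve traffic, or None."""
    if not is_down(primary, threshold):
        return primary.name
    if not is_down(backup, threshold):
        return backup.name
    return None  # both facilities look down; a human has to step in


if __name__ == "__main__":
    primary = Site("primary", [True, False, False, False])
    backup = Site("backup", [True, True, True])
    print(choose_active(primary, backup))
```

Requiring several consecutive failures before switching is one way to avoid flapping on a single missed check. The decision is also the easy part: the backup site still needs current data and tested procedures before a switch like this is safe.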
Murphy is Alive and Well
Modern data centers are designed to eliminate single points of failure. But how many points of failure can you anticipate and plan for? An unusual series of equipment failures contributed to the Rackspace outage, with the most significant being a bank of generators that malfunctioned. Other failures involved equipment that connects the data center to its two utility feeds, a UPS system and a transfer switch.
An Opportunity for Amazon?
Portent Interactive, a provider hosted at Adhost, says it can’t afford to back up all of its customer sites in a second data center. In the wake of last week’s outage, Portent will instead offer customers the option to host backup copies of static files on Amazon S3, the storage component of Amazon Web Services. “That way, moving to a new location in an emergency might take time but is still possible, even if all connectivity to Adhost is lost,” Portent wrote on its blog.
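As a rough illustration of what an off-site backup of static files involves, the Python sketch below builds a checksum manifest for a directory and works out which files have changed since the last backup. It is not Portent’s process: the function names and the manifest format are assumptions, and the upload to S3 (or any other storage service) is deliberately left out.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def build_manifest(root: str) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    base = Path(root)
    manifest = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(base).as_posix()] = digest
    return manifest


def files_to_upload(current: Dict[str, str], previous: Dict[str, str]) -> List[str]:
    """Return the files that are new or changed since the previous manifest."""
    return sorted(
        name for name, digest in current.items()
        if previous.get(name) != digest
    )
```

Comparing digests rather than timestamps means only files whose contents actually changed get copied again, which keeps a periodic backup cheap.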
Article printed from Data Center Knowledge: http://www.datacenterknowledge.com
URL to article: http://www.datacenterknowledge.com/archives/2009/07/06/the-day-after-a-brutal-week-for-uptime/
URLs in this post:
 power outage at its Dallas data center: http://www.datacenterknowledge.com/archives/2009/06/29/outage-for-rackspace-customers/
 power failures: http://www.datacenterknowledge.com/archives/2009/07/02/equinix-hit-by-outages-in-sydney-paris/
 performance problems on Thursday: http://www.datacenterknowledge.com/archives/2009/07/02/google-app-engine-hit-by-outage/
 fire at Fisher Plaza: http://www.datacenterknowledge.com/archives/2009/07/03/major-outage-at-seattle-data-center/
 fire at 151 Front Street: http://www.datacenterknowledge.com/archives/2009/07/06/fire-causes-outage-at-toronto-carrier-hotel/
 many: http://www.datacenterknowledge.com/archives/2009/03/17/data-center-outages-and-staff-scalability/
 times: http://www.datacenterknowledge.com/archives/2009/05/07/weathering-the-customer-service-tweetstorm/
 before: http://www.datacenterknowledge.com/archives/2008/10/03/typepad-tweets-its-downtime/
 Jeremy Irish: http://locuslingua.blogspot.com/2009/07/colotastrophe-day-after.html
 TechFlash: http://www.techflash.com/venture/Bing_Travel_is_back_36_hours_later_49935637.html
 TechFlash: http://www.techflash.com/venture/How_one_CTO_avoided_a_Web_site_disaster_after_data_center_fire_49891132.html
 had a backup facility,: http://twitter.com/AuthorizeNet/status/2465678179
 series of equipment failures: http://www.datacenterknowledge.com/archives/2009/07/02/rackspace-expects-credits-of-25-million/
 wrote on its blog: http://www.portentinteractive.com/blog/fisher-plaza-service-restored.htm
 Rich Miller: http://www.datacenterknowledge.com/archives/author/richm/
Copyright © 2012 Data Center Knowledge. All rights reserved.