Posted By Rich Miller On March 8, 2010 @ 8:00 am In Google,Power | 7 Comments
What happens when the power goes out at a Google data center? We found out on Feb. 24, when a power outage at a Google facility caused more than two hours of downtime for Google App Engine, the company’s cloud computing platform for developers. Last week the company released a detailed incident report on the outage, which underscored the critical importance of good documentation, even in huge data center networks with failover capacity.
Most of Google’s recent high-profile outages have been caused by routing or network capacity problems, including outages in May and September of last year (see How Google Routes Around Outages for more). But not so with the Feb. 24 event.
“The underlying cause of the outage was a power failure in our primary datacenter,” Google reported. “While the Google App Engine infrastructure is designed to quickly recover from these sort of failures, this type of rare problem, combined with internal procedural issues extended the time required to restore the service.”
Power Down for 30 Minutes
Data center power outages typically fall into two categories: those in which the entire data center loses power for an extended period, and those in which power is restored relatively quickly but hardware within the data center has trouble restarting properly. The Google App Engine downtime appears to fall into the latter category. Power to the primary data center was restored within a half hour, but a key group of servers failed to restart properly. The somewhat unusual pattern of the recovery presented the first challenge.
“We failed to plan for the case of a power outage that might affect some, but not all, of our machines in a datacenter (in this case, about 25%),” Google reported. “In particular, this led to incorrect analysis of the serving state of the failed datacenter and when it might recover.”
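That failure mode can be illustrated with a minimal sketch (hypothetical code, not Google's actual monitoring): a health check that treats a datacenter as simply "up" or "down" will misread a partial outage like the roughly 25% machine loss described here, while a capacity-aware check will not.

```python
# Hypothetical sketch: why an all-or-nothing health check misjudges
# a partially failed datacenter. Names and thresholds are illustrative.

def naive_is_healthy(machines_up, machines_total):
    # All-or-nothing assumption: any responding machine means "up".
    return machines_up > 0

def capacity_aware_is_healthy(machines_up, machines_total, min_fraction=0.9):
    # Treat the datacenter as serving only if enough capacity survived
    # to carry the full load.
    return machines_up / machines_total >= min_fraction

# With ~25% of machines lost to the power failure (750 of 1000 up):
print(naive_is_healthy(750, 1000))           # looks recovered
print(capacity_aware_is_healthy(750, 1000))  # correctly still degraded
```

The point of the sketch is that a partial outage defeats any analysis built on the assumption that a site fails completely or not at all, which is what the incident report describes.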
This in turn complicated the decision-making process about whether and when to shift Google App Engine to a second “failover” facility. It’s also the point where Google’s documentation became part of the story.
“Recent work to migrate the datastore for better multihoming changed and improved the procedure for handling these failures significantly,” Google noted. “However, some documentation detailing the procedure to support the datastore during failover incorrectly referred to the old configuration. This led to confusion during the event.
“Although we had procedures ready for this sort of outage, the oncall staff was unfamiliar with them and had not trained sufficiently with the specific recovery procedure for this type of failure,” the incident report continued.
As a result, Google engineers had trouble deciding whether to commit to the primary or secondary data center at a key moment. At one point, they reversed a decision to focus on the failover data center, believing the primary facility might have fully recovered. It hadn’t, and the misstep resulted in a slightly longer outage for App Engine customers.
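One common remedy for this kind of flip-flopping, sketched below with hypothetical logic rather than Google's actual procedure, is a one-way failover decision: once a recovery deadline passes, commit to the secondary site and do not revert mid-incident.

```python
# Hypothetical sketch of a one-way failover rule. Once the recovery
# deadline has passed, commit to the secondary datacenter even if the
# primary appears to recover, avoiding a costly mid-incident reversal.

def choose_site(primary_recovered, minutes_elapsed, commit_deadline=30):
    if primary_recovered and minutes_elapsed < commit_deadline:
        return "primary"
    # Past the deadline, or primary still degraded: stay committed.
    return "secondary"

print(choose_site(False, 10))  # primary still down: fail over
print(choose_site(True, 45))   # primary "looks" back, but too late to revert
```

A rule like this trades a possibly slower best case for a predictable worst case, which is usually the right trade under the stress of a live incident.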
Lessons Learned, Steps Outlined
As in any good post-mortem, the Google team outlined a series of steps it is taking to address the issues that arose in the Feb. 24 incident.
Read the entire post-mortem report  for additional details, including a full timeline of the incident.
The scope and detail of the report drew plaudits. Lenny Rachitsky of the Transparent Uptime blog called it “nearly a perfect model for others.”
“A vast majority of the issues were training related,” Rachitsky wrote. “This is an important lesson: all of the technology and process in the world won’t help you if your on-call team is unaware of what to do. This is especially true during the stress of a large incident.”
Article printed from Data Center Knowledge: http://www.datacenterknowledge.com
URL to article: http://www.datacenterknowledge.com/archives/2010/03/08/when-the-power-goes-out-at-google/
URLs in this post:
incident report: https://groups.google.com/group/google-appengine/browse_thread/thread/a7640a2743922dcf?pli=1
May: http://www.datacenterknowledge.com/archives/2009/05/14/outage-for-google-news/
September: http://www.datacenterknowledge.com/archives/2009/09/01/router-ripples-cited-in-gmail-outage/
How Google Routes Around Outages: http://www.datacenterknowledge.com/archives/2009/03/25/how-google-routes-around-outages/
entire post-mortem report: https://groups.google.com/group/google-appengine/browse_thread/thread/a7640a2743922dcf?pli=1
Transparent Uptime: http://www.transparentuptime.com/2010/03/google-app-engine-downtime-postmortem.html
Rich Miller: http://www.datacenterknowledge.com/archives/author/richm/
Copyright © 2012 Data Center Knowledge. All rights reserved.