What happens when the power goes out at a Google data center? We found out on Feb. 24, when a power outage at a Google facility caused more than two hours of downtime for Google App Engine, the company’s cloud computing platform for developers. Last week the company released a detailed incident report on the outage, which underscored the critical importance of good documentation, even in huge data center networks with failover capacity.
Most of Google’s recent high-profile outages have been caused by routing or network capacity problems, including outages in May and September of last year (see How Google Routes Around Outages for more). But not so with the Feb. 24 event.
“The underlying cause of the outage was a power failure in our primary datacenter,” Google reported. “While the Google App Engine infrastructure is designed to quickly recover from these sort of failures, this type of rare problem, combined with internal procedural issues extended the time required to restore the service.”
Power Down for 30 Minutes
Data center power outages typically fall into two categories: those in which the entire data center loses power for an extended period, and those in which power is restored relatively quickly but hardware within the data center has trouble restarting properly. The Google App Engine downtime appears to fall into the latter category. Power to the primary data center was restored within a half hour, but a key group of servers failed to restart properly. The somewhat unusual pattern of the recovery presented the first challenge.
“We failed to plan for the case of a power outage that might affect some, but not all, of our machines in a datacenter (in this case, about 25%),” Google reported. “In particular, this led to incorrect analysis of the serving state of the failed datacenter and when it might recover.”
This in turn complicated the decision-making process about whether and when to shift Google App Engine to a second “failover” facility. It’s also the point where Google’s documentation became part of the story.
“Recent work to migrate the datastore for better multihoming changed and improved the procedure for handling these failures significantly,” Google noted. “However, some documentation detailing the procedure to support the datastore during failover incorrectly referred to the old configuration. This led to confusion during the event.
“Although we had procedures ready for this sort of outage, the oncall staff was unfamiliar with them and had not trained sufficiently with the specific recovery procedure for this type of failure,” the incident report continued.
As a result, Google engineers had trouble deciding whether to commit to the primary or secondary data center at a key moment. At one point, they reversed a decision to focus on the failover data center, believing the primary facility might have fully recovered. It hadn’t, and the misstep resulted in a slightly longer outage for App Engine customers.
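One way to picture the kind of decision rule that was missing is a small sketch like the one below. It is not Google's actual policy: the thresholds, names, and the rule that a failover is never reversed mid-incident are all assumptions made for illustration. The point is that a partial outage (here, 25% of machines) gets an explicit answer instead of a judgment call.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    STAY_ON_PRIMARY = "stay_on_primary"
    FAIL_OVER = "fail_over"


@dataclass(frozen=True)
class FailoverPolicy:
    # Hypothetical thresholds, chosen only for illustration.
    max_unhealthy_fraction: float = 0.10  # tolerated share of failed machines
    max_degraded_minutes: int = 20        # how long to wait for recovery


def decide(policy, unhealthy_fraction, degraded_minutes, already_failing_over):
    """Return the action the oncall engineer should take right now."""
    # Assumed rule: once a failover has started, do not reverse it mid-incident.
    if already_failing_over:
        return Action.FAIL_OVER
    # A partial outage counts if it is both large enough and long enough.
    if (unhealthy_fraction > policy.max_unhealthy_fraction
            and degraded_minutes >= policy.max_degraded_minutes):
        return Action.FAIL_OVER
    return Action.STAY_ON_PRIMARY


policy = FailoverPolicy()
print(decide(policy, 0.25, 30, already_failing_over=False).value)
```

Under these assumed thresholds, a quarter of the machines being down for 30 minutes yields a failover, and a later report that the primary "might have recovered" does not flip the decision back.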
Lessons Learned, Steps Outlined
As in any good post-mortem, the Google team shared a series of steps it is taking to address the issues that arose in the Feb. 24 incident:
- Google will schedule additional drills for all oncall staff to review production procedures, including “rare and complicated procedures.” All members of the team will be required to complete the drills before joining the oncall rotation.
- The company will also implement a regular bi-monthly audit of operations docs, and ensure that all out-of-date docs are properly marked “Deprecated.”
- The company will “establish a clear policy framework to assist oncall staff to quickly and decisively make decisions about taking intrusive, user-facing actions during failures. This will allow them to act confidently and without delay in emergency situations.”
- Google said it will make a major infrastructural change in App Engine, which currently provides a “one-size-fits-all” Datastore. In the wake of the Feb. 24 outage, Google says it will offer two different Datastore configurations: the current option of low-latency and lower availability during unexpected failures, and a new option for higher availability using synchronous replication for reads and writes, “at the cost of significantly higher latency.”
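The documentation audit in the list above could be approximated with a check like the following. This is a minimal sketch under stated assumptions: the `OpsDoc` record, the `find_stale_docs` helper, and the 60-day window (one reading of "bi-monthly") are hypothetical and do not come from Google's report.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class OpsDoc:
    # Hypothetical record for one operations document.
    title: str
    last_reviewed: date
    deprecated: bool = False


def find_stale_docs(docs, today, max_age_days=60):
    """Return docs that are not marked deprecated and are overdue for review."""
    cutoff = today - timedelta(days=max_age_days)
    return [d for d in docs if not d.deprecated and d.last_reviewed < cutoff]


docs = [
    OpsDoc("Datastore failover (old configuration)", date(2010, 1, 15)),
    OpsDoc("Datastore failover (multihomed)", date(2010, 5, 20)),
    OpsDoc("Legacy restart runbook", date(2009, 3, 1), deprecated=True),
]

for doc in find_stale_docs(docs, today=date(2010, 6, 1)):
    print(doc.title)
```

Documents already marked deprecated are skipped, so the audit surfaces only those that still look current but have not been reviewed within the window.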
Read the entire post-mortem report for additional details, including a full timeline of the incident.
The scope and detail of the report drew plaudits. Lenny Rachitsky of the Transparent Uptime blog called it “nearly a perfect model for others.”
“A vast majority of the issues were training related,” Rachitsky wrote. “This is an important lesson: all of the technology and process in the world won’t help you if your on-call team is unaware of what to do. This is especially true during the stress of a large incident.”