Multiple Generator Failures Caused Amazon Outage
Amazon Web Services says that the repeated failure of multiple generators in a single data center caused last Friday night’s power outage, which led to downtime for Netflix, Instagram and many other popular web sites. The generators in this facility failed to operate properly during two utility outages over a short period Friday evening, depleting the emergency power in the uninterruptible power supply (UPS) systems.
Amazon said the data center outage affected a small percentage of its operations, but was exacerbated by problems with systems that allow customers to spread workloads across multiple data centers. The company apologized for the outage and outlined the steps it will take to address the problems and prevent a recurrence.
The generator failures in the June 29 incident came just two weeks after a June 14 outage that was caused by a series of problems with generators and electrical switching equipment.
Just 7 Percent of Instances Affected
Amazon said the incident affected only one availability zone within its US-East-1 region, and that only 7 percent of instances went offline. The company did not identify the location of the data center, but said it was one of 10 facilities serving the US-East-1 region.
When the UPS units ran out of power at 8:04 p.m., the data center was left without power. Shortly afterward, Amazon staffers were able to manually start the generators, and power was restored at 8:24 p.m. Although the servers lost power for only 20 minutes, recovery took much longer. “The vast majority of these instances came back online between 11:15pm PDT and just after midnight,” Amazon said in its incident report.
Amazon said a bug in its Elastic Load Balancing (ELB) system prevented customers from quickly shifting workloads to other availability zones. This had the effect of magnifying the impact of the outage, as customers that normally use more than one availability zone to improve their reliability (such as Netflix) were unable to shift capacity.
Amazon: We Tested & Maintained The Generators
Amazon said the generators and electrical switching equipment that failed were all the same brand and all installed in late 2010 and early 2011, and had been tested regularly and rigorously maintained. “The generators and electrical equipment in this datacenter are less than two years old, maintained by manufacturer representatives to manufacturer standards, and tested weekly. In addition, these generators operated flawlessly, once brought online Friday night, for just over 30 hours until utility power was restored to this datacenter. The equipment will be repaired, recertified by the manufacturer, and retested at full load onsite or it will be replaced entirely.”
In the meantime, Amazon said it would adjust several settings in the process that switches the electrical load to the generators, making it easier to transfer power in the event the generators start slowly or experience uneven power quality as they come online. The company will also have additional staff available to start the generators manually if needed.
Amazon also addressed why the power outage was so widely felt, even though it apparently affected just 7 percent of virtual machine instances in the US-East-1 region.
Though the resources in this datacenter … represent a single-digit percentage of the total resources in the US East-1 Region, there was significant impact to many customers. The impact manifested in two forms. The first was the unavailability of instances and volumes running in the affected datacenter. This kind of impact was limited to the affected Availability Zone. Other Availability Zones in the US East-1 Region continued functioning normally. The second form of impact was degradation of service “control planes” which allow customers to take action and create, remove, or change resources across the Region. While control planes aren’t required for the ongoing use of resources, they are particularly useful in outages where customers are trying to react to the loss of resources in one Availability Zone by moving to another.
Load Balancing Bug Limited Workload Shifts
The incident report provides extensive details on the outage’s impact on control planes for its EC2 compute service, Elastic Block Storage (EBS) service and Relational Database Service (RDS). Of particular interest is Amazon’s explanation of the issues affecting its Elastic Load Balancing (ELB) service. The ELB service is important because it is widely used to improve reliability, allowing customers to shift capacity between availability zones, an important strategy for preserving uptime when a single data center experiences problems. Here’s a key excerpt from Amazon’s incident report regarding the issues with ELB during the June 29 outage.
During the disruption this past Friday night, the control plane (which encompasses calls to add a new ELB, scale an ELB, add EC2 instances to an ELB, and remove traffic from ELBs) began performing traffic shifts to account for the loss of load balancers in the affected Availability Zone. As the power and systems returned, a large number of ELBs came up in a state which triggered a bug we hadn’t seen before. The bug caused the ELB control plane to attempt to scale these ELBs to larger ELB instance sizes. This resulted in a sudden flood of requests which began to backlog the control plane. At the same time, customers began launching new EC2 instances to replace capacity lost in the impacted Availability Zone, requesting the instances be added to existing load balancers in the other zones. These requests further increased the ELB control plane backlog. Because the ELB control plane currently manages requests for the US East-1 Region through a shared queue, it fell increasingly behind in processing these requests; and pretty soon, these requests started taking a very long time to complete.
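The shared-queue bottleneck Amazon describes can be illustrated with a toy simulation (the request names and timings below are hypothetical; the real control plane is far more complex): with one FIFO queue for the whole region, a burst of slow scaling requests delays every request behind it, whereas a separate queue would isolate ordinary customer requests from the backlog.

```python
from collections import deque

def drain(queue):
    """Process a FIFO queue sequentially; return each request's finish time."""
    clock = 0.0
    finish_times = {}
    while queue:
        name, cost = queue.popleft()
        clock += cost
        finish_times[name] = clock
    return finish_times

# Hypothetical workload: a burst of slow ELB "scale up" requests (10s each)
# lands ahead of ordinary customer requests (1s each) in one shared queue.
shared = deque([(f"scale-{i}", 10.0) for i in range(5)] +
               [(f"customer-{i}", 1.0) for i in range(5)])
shared_times = drain(shared)

# With the slow requests isolated in their own queue, the same customer
# requests complete almost immediately.
isolated_times = drain(deque([(f"customer-{i}", 1.0) for i in range(5)]))

print(shared_times["customer-4"])    # 55.0 -- stuck behind the backlog
print(isolated_times["customer-4"])  # 5.0  -- unaffected
```

The head-of-line blocking shown here is why a regional shared queue turns a zone-local failure into region-wide control-plane slowness.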
While direct impact was limited to those ELBs which had failed in the power-affected datacenter and hadn’t yet had their traffic shifted, the ELB service’s inability to quickly process new requests delayed recovery for many customers who were replacing lost EC2 capacity by launching new instances in other Availability Zones. For multi-Availability Zone ELBs, if a client attempted to connect to an ELB in a healthy Availability Zone, it succeeded. If a client attempted to connect to an ELB in the impacted Availability Zone and didn’t retry using one of the alternate IP addresses returned, it would fail to connect until the backlogged traffic shift occurred and it issued a new DNS query. As mentioned, many modern web browsers perform multiple attempts when given multiple IP addresses; but many clients, especially game consoles and other consumer electronics, only use one IP address returned from the DNS query.
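The client behavior described above can be sketched in a few lines. This is a hypothetical illustration, not Amazon’s or any vendor’s code: it resolves every address behind a hostname and tries each in turn, which is roughly what resilient browsers did during the outage and what single-IP clients (game consoles and other consumer electronics) did not.

```python
import socket

def connect_any(host, port, timeout=3.0):
    """Resolve all addresses for host and try each in turn,
    instead of giving up after the first (possibly dead) one."""
    last_error = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(addr)       # first reachable address wins
            return sock
        except OSError as err:
            sock.close()
            last_error = err         # dead endpoint: fall through and retry
    raise last_error or OSError(f"no addresses for {host}")

# Usage: sock = connect_any("example.com", 80); sock.close()
```

A client that stops after the first DNS answer stays broken until the backlogged traffic shift completes and a fresh DNS query returns healthy addresses; a client like the one above fails over on its own.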
M.T. Field (Posted July 3rd, 2012)
Excellent information – thanks for the follow-up.
T. Evans (Posted July 3rd, 2012)
“In the meantime, Amazon said it would adjust several settings in the process that switches the electrical load to the generators, making it easier to transfer power in the event the generators start slowly or experience uneven power quality as they come online. The company will also have additional staff available to start the generators manually if needed.”
Not jumping to conclusions, and putting the eventual post mortem aside, this preliminary explanation again appears to reaffirm that just building a world-class DC is not enough. Operations have to be equally, if not more, fine-tuned than the facilities infrastructure itself. No matter how well engineered, systems fail, and manual intervention will always have a role to play.

We can presume AWS saw the storm coming (what major DC doesn’t have the weather channel constantly streaming in the NOC?). If so, why wasn’t there a preemptive move onto site-provided power (hindsight’s a bear)? Leaving foresight alone, not being able to manually turn on back-up generators should realistically NEVER take down a site of AWS’s criticality (unless it’s on flywheels, that is): 10 minutes of battery should be enough time to diagnose an ATS/STS auto-start failure and remediate through manual intervention, even with paralleling gensets. (I realize the presumptions I’m making when saying this, but it appears to be the case and has been the cause of many outages in the past.)

We often look at DC operations as a 9-to-5 business when it clearly is not. Yes, you may have a facility engineer onsite during the day to quickly respond to a situation like AWS just experienced, but is your glorified security guard up to the same task after hours? Probably not, and thinking that you will have the time or communication skills to solve these problems over the phone is just kidding yourself.

A good article on DCK said that you have to “embrace failure” in this business to learn and build ever-better systems. As AWS is learning now, and as many other DC providers have learned in the past, applying those lessons only to how the DC is built and configured is a disservice to your company and your customers. DC operations require just as many (if not more) redundant/fail-safe features as the electrical/mechanical systems that they run.
A formula 1 race car may kick ass in itself, but it won’t win many races without Mario Andretti holding the wheel.
Jim Leach (Posted July 4th, 2012)
One point that can be drawn from the incident report is that not all “testing” is the same.
Amazon said the generators were “tested regularly and rigorously maintained.” But what were the testing protocols and procedures?
The litmus test for high availability data centers is the ability to do maintenance and repairs at the same time on the electrical and mechanical infrastructure elements.
It just goes to show that even big hitters like Amazon suffer the occasional period of downtime; glad to see they are back online.
TeeRoy (Posted July 9th, 2012)
the storm (or incident) that caused the outage might not have been bad or even happened at the site. i’ve seen switchgear fooled by one leg dead on 3 phase on multiple occasions. due to a lightning strike i’ve seen a fuse at the commercial feed burnt enough to fail under load but carry voltage to cause switch gear to assume commercial power was ok. and don’t even think about throwing building ground fault into this mix. good luck (even with the andretti of DC’s) troubleshooting these scenarios in under 10 minutes. or perhaps one could write policy and procedures that cover almost any scenario known to man and then you could read and remember those in 10 minutes at a time of need.
the problem wasn’t that the generators failed so much as the simultaneous software side “failover” had bugs. perhaps if testing of the knowns (software) had been done, amazon would have been better prepared for the unknowns (nature)
Sounds like the generators were in manual. Interesting that although they had 20 minutes of runtime on the UPS systems, they couldn’t get it fixed in time, or they hadn’t arrived at work by then. This is a case for testing the systems by pulling the main breaker and making sure things do what you want them to do. Sounds like they were gun-shy on the testing!!