Last week's outage at Amazon Web Services was triggered by a series of failures in the power infrastructure in a northern Virginia data center, including the failure of a generator cooling fan while the facility was on emergency power. The downtime affected AWS customers Heroku, Pinterest, Quora and HootSuite, along with a host of smaller sites.
The incident began at 8:44 p.m. Pacific time on June 14, when the Amazon data center lost utility power. The facility switched to generator power, as designed. But nine minutes later, a defective cooling fan caused one of the backup generators to overheat and shut itself down.
"At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity)," Amazon wrote in its incident report at the AWS Service Health Dashboard.
Breaker Misconfiguration Compounds Issue
"Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power."
The generator fan was fixed and the generator was restarted at 10:19 p.m. Pacific time. As is often the case, once power was restored it took some time for customers to fully restore databases and applications. Amazon said a primary datastore for its Elastic Block Store (EBS) lost power during the incident and "did not fail cleanly," resulting in some additional disruption.
Once the event was resolved, Amazon conducted an audit of its back-up power distribution circuits. "We found one additional breaker that needed corrective action," AWS reported. "We've now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes."
The outage was the third significant downtime in the last 14 months for the US-East-1 availability zone, which is Amazon’s oldest availability zone and resides in a data center in Ashburn, Virginia. The US-East-1 zone had a major outage in April 2011 and another less serious incident in March. Amazon’s U.S. East region also was hit by a series of four outages in a single week in 2010.