Last week’s outage [1] at Amazon Web Services was triggered by a series of failures in the power infrastructure in a northern Virginia data center, including the failure of a generator cooling fan while the facility was on emergency power. The downtime affected AWS customers Heroku, Pinterest, Quora and HootSuite, along with a host of smaller sites.
The incident began at 8:44 p.m. Pacific time on June 14, when the Amazon data center lost utility power. The facility switched to generator power, as designed. But nine minutes later, a defective cooling fan caused one of the backup generators to overheat and shut itself down.
“At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity),” Amazon wrote in its incident report at the AWS Service Health Dashboard.
Breaker Misconfiguration Compounds Issue
“Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power.”
The generator fan was fixed and the generator was restarted at 10:19 p.m. Pacific time. As is often the case, once power was restored it took some time for customers to fully restore databases and applications. Amazon said a primary datastore for its Elastic Block Store (EBS) lost power during the incident and “did not fail cleanly,” resulting in some additional disruption.
Once the event was resolved, Amazon conducted an audit of its back-up power distribution circuits. “We found one additional breaker that needed corrective action,” AWS reported. “We’ve now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes.”
The outage was the third significant downtime in the last 14 months for the US-East-1 availability zone, which is Amazon’s oldest availability zone and resides in a data center in Ashburn, Virginia. The US-East-1 zone had a major outage in April 2011 [2] and another less serious incident in March [3]. Amazon’s U.S. East region also was hit by a series of four outages in a single week [4] in 2010.
Rich Miller is the founder and editor-in-chief of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.
Article printed from Data Center Knowledge: http://www.datacenterknowledge.com
URL to article: http://www.datacenterknowledge.com/archives/2012/06/21/aws-outage/
URLs in this post:
[1] outage: http://www.datacenterknowledge.com/archives/2012/06/15/power-outage-affects-amazon-customers/
[2] April 2011: http://www.datacenterknowledge.com/archives/2011/04/21/major-amazon-outage-ripples-across-web/
[3] incident in March: http://www.datacenterknowledge.com/archives/2012/03/15/amazon-ec2-recovers-after-brief-downtime/
[4] four outages in a single week: http://www.datacenterknowledge.com/archives/2010/05/10/amazon-addresses-ec2-power-outages/
[5] Rich Miller: http://www.datacenterknowledge.com/archives/author/richm/