Generator Fan Failure Triggered AWS Outage

Last week’s outage at Amazon Web Services was triggered by a series of failures in the power infrastructure at a northern Virginia data center, including the failure of a generator cooling fan while the facility was on emergency power. The downtime affected AWS customers Heroku, Pinterest, Quora and HootSuite, along with a host of smaller sites.

The incident began at 8:44 p.m. Pacific time on June 14, when the Amazon data center lost utility power. The facility switched to generator power, as designed. But nine minutes later, a defective cooling fan caused one of the backup generators to overheat and shut itself down.

“At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity),” Amazon wrote in its incident report at the AWS Service Health Dashboard.

Breaker Misconfiguration Compounds Issue

“Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power.”
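
In other words, the breaker's trip setting was lower than the load that arrived when the failover occurred, so it opened the moment the transfer happened. A minimal sketch with hypothetical numbers (Amazon did not disclose the actual settings) illustrates the mismatch:

```python
# Illustrative only: hypothetical figures, not Amazon's actual breaker settings.
# A trip threshold set below the load that arrives on failover opens the
# breaker as soon as the transfer happens, cutting power to the equipment.

def breaker_opens(load_kw: float, trip_threshold_kw: float) -> bool:
    """Return True if the breaker would trip at the given load."""
    return load_kw > trip_threshold_kw

failover_load_kw = 800.0             # hypothetical load transferred from the failed generator's circuit
correct_threshold_kw = 1000.0        # threshold matching the circuit's rated capacity
misconfigured_threshold_kw = 500.0   # threshold set too low

print(breaker_opens(failover_load_kw, correct_threshold_kw))        # False: breaker holds
print(breaker_opens(failover_load_kw, misconfigured_threshold_kw))  # True: breaker opens, power is lost
```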

The generator fan was fixed and the generator was restarted at 10:19 p.m. Pacific time. As is often the case, once power was restored it took some time for customers to fully restore databases and applications. Amazon said a primary datastore for its Elastic Block Store (EBS) service lost power during the incident and “did not fail cleanly,” resulting in some additional disruption.
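
For customers working through that kind of recovery, one practical first step is to see which volumes the EC2 API reports as impaired before remounting filesystems or restarting databases. Below is a present-day sketch using boto3; it is not something Amazon published, and the region is an assumption, but describe_volume_status and the "impaired" status filter are standard EC2 API features:

```python
# Sketch: list the EBS volumes the EC2 API currently reports as impaired.
# Assumes boto3 is installed and AWS credentials are configured; the
# region (us-east-1) is chosen to match the affected facility.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

paginator = ec2.get_paginator("describe_volume_status")
for page in paginator.paginate(
    Filters=[{"Name": "volume-status.status", "Values": ["impaired"]}]
):
    for vol in page["VolumeStatuses"]:
        # Each entry carries per-check detail (e.g. io-enabled) alongside the summary status.
        checks = {d["Name"]: d["Status"] for d in vol["VolumeStatus"]["Details"]}
        print(vol["VolumeId"], vol["VolumeStatus"]["Status"], checks)
```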

Once the event was resolved, Amazon conducted an audit of its back-up power distribution circuits. “We found one additional breaker that needed corrective action,” AWS reported. “We’ve now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes.”
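
The check Amazon describes amounts to a fleet-wide configuration audit: compare each breaker's as-built trip setting against the specification for its circuit and flag any mismatch. A minimal sketch of that idea follows, with hypothetical circuit names and values that do not reflect Amazon's actual tooling:

```python
# Sketch of a breaker configuration audit: flag any breaker whose recorded
# trip threshold does not match the spec for its circuit. All data is hypothetical.

EXPECTED_TRIP_KW = {
    "backup-circuit-A": 1000.0,
    "backup-circuit-B": 1000.0,
}

observed_breakers = [
    {"breaker": "BRK-101", "circuit": "backup-circuit-A", "trip_kw": 1000.0},
    {"breaker": "BRK-102", "circuit": "backup-circuit-B", "trip_kw": 500.0},  # set too low
]

def audit(breakers, expected):
    """Yield (breaker, configured, expected) for every setting that is off-spec."""
    for b in breakers:
        spec = expected.get(b["circuit"])
        if spec is None or b["trip_kw"] != spec:
            yield b["breaker"], b["trip_kw"], spec

for breaker_id, configured, spec in audit(observed_breakers, EXPECTED_TRIP_KW):
    print(f"{breaker_id}: configured {configured} kW, expected {spec} kW")
```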

The outage was the third significant downtime in the last 14 months for the US-East-1 region, Amazon’s oldest, which is based in Ashburn, Virginia. US-East-1 had a major outage in April 2011 and another, less serious incident in March. Amazon’s U.S. East region was also hit by a series of four outages in a single week in 2010.

About the Author

Rich Miller is the founder and editor-in-chief of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.

8 Comments

  1. Lance Bradey

    So am I to believe that a fan belt broke on the genset? If so, this is a lack of maintenance. Just because you don't use gensets doesn't mean items such as fan belts don't wear; even when unused, they still perish.

  2. Nik Simpson

    "But nine minutes later, a defective cooling fan caused one of the backup generators to overheat and shit itself down." So this is presumably the point at which the sh*t hit the fan ;-)

  3. ChiefSnipe

    Let me see, why didn't the data center just switch to the secondary power feed into the building... oh wait, I bet the power plants that used to feed a second primary into this data center have been taken offline, and with the grid being overtaxed I bet they are on generator power much more than they ever thought they would be. I still can't quite figure out how a single fan could take it down, though I understand that equipment will sometimes just come apart for no reason, even when it's new. What I really can't figure out is why this breaker configuration wasn't caught when the data center was set up.

  4. Alan Brown

    Trust is earned and lost every day in this industry. People assume that Amazon employs best practices in everything they do, but they clearly do not. Three outages in 14 months would be a nightmare scenario for a colocation facility. I'm amazed that in the midst of all of this they were able to repair a defective fan in under 82 minutes. That would be a great response, depending upon how it was defective. So Amazon, how was the cooling fan defective? Bad belt, as Lance suggests, or was something else the actual cause? “We found one additional breaker that needed corrective action,” AWS reported. “We’ve now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes.” I am guessing someone from Amazon didn't check breaker positions post-maintenance on key equipment. Anything mechanical can break, so they get a pass there, assuming they were maintaining the equipment. What might be more troublesome is that they weren't already regularly testing and auditing. Or that their EBS storage system didn't "fail cleanly" on power failover. Why not? Too much current inrush on a branch circuit? Or some other single point of failure, with a device like a switch not plugged into A & B power, or something else? To be fair to Amazon I shouldn't speculate, but their statements raise questions. It could just be marketing fluff - but if I was a direct customer I'd be looking for a much more definitive explanation.

  5. Jim Leach

    Even the best cloud system in the world will fail if it runs out of gas or overheats. That's not a fancy virtualization or automation problem -- it's a data center blocking and tackling issue. That's why DreamHost just upgraded its data centers. They have great hosting and cloud products running in high availability data centers. http://www.datacenterknowledge.com/archives/2012/06/05/dreamhost-goes-east-expands-with-ragingwire/

  6. Sadly, if Amazon had the attitude of "What's the most we can do" vs. "What's the least we can get away with," a second power supply in their servers on a different bus would have avoided this issue. At Peak, this is the only way we deploy servers, because we know that bean counters don't care when a small % of servers are offline, but the customers sure do. I wonder when big public companies will wake up to the realization that doing the least redundancy possible and making it up in volume sounds great on paper, but if you're using the service, it's a different experience entirely.

  7. Baldrick

    My previous employer had a strict rule for all outages: work out what went wrong, but put nothing in writing. Then we would have a meeting to determine what management would be told, and it was never the truth. They once spent millions of dollars on an elaborate smoke screen to cover up a simple process error. Their whole Post Incident Review system was nothing more than a sham. At no stage did they ever correct the real cause, nor were they even interested. Now, I have no knowledge of these outages and no reason to doubt what’s reported here. I would just like to say that the only way to improve the reliability of a DC is to embrace failure. You must be open and honest about what went wrong and why, and use it as a learning experience. That most often requires a cultural change in a company, and that is the hardest change of all.