Car Crash Triggers Amazon Power Outage

Amazon’s EC2 cloud computing service suffered its fourth power outage in a week on Tuesday, with some customers in its US East Region losing service for about an hour. The incident was triggered when a vehicle crashed into a utility pole near one of the company’s data centers, and a transfer switch failed to properly manage the shift from utility power to the facility’s generators.

Amazon Web Services said a “small number of instances” on EC2 lost service at 12:05 p.m. Pacific time Tuesday, with most of the interrupted apps recovering by 1:08 p.m. The incident affected a different Availability Zone than the ones that experienced three power outages last week.

The sequence of events was reminiscent of a 2007 incident in which a truck struck a utility pole near a Rackspace data center in Dallas, taking out a transformer. The outage triggered a thermal event when chillers struggled to restart during multiple utility power interruptions.

Crash Triggers Utility Outage
“Tuesday’s event was triggered when a vehicle crashed into a high voltage utility pole on a road near one of our datacenters, creating a large external electrical ground fault and cutting utility power to this datacenter,” Amazon said in an update on its Service Health Dashboard. “When the utility power failed, most of the facility seamlessly switched to redundant generator power.”

A ground fault occurs when electrical current flows into the earth, creating a potential hazard to people and equipment as it seeks a path to the ground.

“One of the switches used to initiate the cutover from utility to generator power misinterpreted the power signature to be from a ground fault that happened inside the building rather than outside, and immediately halted the cutover to protect both internal equipment and personnel,” the report continued. “This meant that the small set of instances associated with this switch didn’t immediately get back-up power. After validating there was no power failure inside our facility, we were able to manually engage the secondary power source for those instances and get them up and running quickly.”
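For readers trying to picture the failure mode, here is a minimal sketch, in Python, of the cutover decision Amazon describes: an automatic transfer switch that halts the transfer when it classifies a detected ground fault as internal to the building. The names, data structures, and "misconfigured" shortcut are illustrative assumptions, not Amazon's actual switch logic.

    # Hypothetical model of the transfer-switch decision described above.
    # All names and the "misconfigured" shortcut are illustrative assumptions.
    from dataclasses import dataclass
    from enum import Enum, auto

    class FaultLocation(Enum):
        EXTERNAL = auto()   # fault on the utility side, e.g. a downed pole
        INTERNAL = auto()   # fault inside the facility

    @dataclass
    class PowerEvent:
        utility_available: bool
        ground_fault_detected: bool
        classified_location: FaultLocation

    def handle_utility_failure(event: PowerEvent, misconfigured: bool) -> str:
        """Decide whether to cut over to generator power automatically."""
        if event.utility_available:
            return "stay on utility"
        location = event.classified_location
        # A misconfigured switch treats an external ground fault as internal,
        # which is what Amazon says happened here.
        if misconfigured and event.ground_fault_detected:
            location = FaultLocation.INTERNAL
        if location is FaultLocation.INTERNAL:
            # Halting protects equipment and personnel, but leaves the
            # downstream load dark until operators intervene.
            return "halt cutover; wait for manual transfer to secondary power"
        return "transfer to generator power"

    # The crash scenario: utility lost, external ground fault, switch misconfigured.
    event = PowerEvent(utility_available=False,
                       ground_fault_detected=True,
                       classified_location=FaultLocation.EXTERNAL)
    print(handle_utility_failure(event, misconfigured=True))
    # -> halt cutover; wait for manual transfer to secondary power

In this toy model, a correctly configured switch (misconfigured=False) would have transferred the same load to generator power automatically.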

Switch Default Setting Faulted
Amazon said the switch that failed arrived from the manufacturer with a different default configuration than the rest of the data centers’ switches, causing it to misinterpret this power event. “We have already made configuration changes to the switch which will prevent it from misinterpreting any similar event in the future and have done a full audit to ensure no other switches in any of our data centers have this incorrect setting,” Amazon reported.
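The audit Amazon describes amounts to comparing each switch's settings against an approved baseline and flagging any drift. A minimal sketch of that idea, using invented switch names and setting keys:

    # Hypothetical configuration audit: compare transfer-switch settings to a
    # baseline and flag mismatches. Switch names and settings are invented.
    BASELINE = {
        "ground_fault_scope": "external_and_internal",
        "auto_transfer": "enabled",
    }

    fleet = {
        "dc1-ats-01": {"ground_fault_scope": "external_and_internal", "auto_transfer": "enabled"},
        "dc2-ats-07": {"ground_fault_scope": "internal_only", "auto_transfer": "enabled"},  # factory default
    }

    for switch, settings in fleet.items():
        drift = {k: v for k, v in settings.items() if BASELINE.get(k) != v}
        if drift:
            print(f"{switch}: non-baseline settings {drift}")
    # -> dc2-ats-07: non-baseline settings {'ground_fault_scope': 'internal_only'}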

Amazon Web Services said Sunday that it is making changes in its data centers to address a series of power outages last week. Amazon EC2 experienced two power outages on May 4 and an extended power loss early on Saturday, May 8. In each case, a group of users in a single availability zone lost service, while the majority of EC2 users remained unaffected.

About the Author

Rich Miller is the founder and editor-in-chief of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.

22 Comments

  1. Ed

    I'd like to know who commissioned this facility...

  2. Was the transfer switch setting outside a desirable limit? Did Amazon start up the transfer switch equipment before cutting over to building load? Typically the install is commissioned before assuming building load to discover setting issues. Surprised Amazon would not have performed that check.

  3. Bob

    If there was a real UPS between the transfer switch and the load, then a 10 or 15 minute outage wouldn't cause too much trouble. (Ditto if the load devices were wired to independent A/B power sources with separate upstream transfer switches and independent utility feeders.) So what WAS between the transfer switch and the load? Nothing? One of those Hitec rotational-energy-storage units (with 15 seconds of run time)? This is just about money. The guys who build the data centers never want to spend any more than absolutely necessary. They sleaze by with the minimum possible configurations, just like the BP oil platform in the Gulf, the single-hulled tanker ships and the collapsed-ring SONET systems. It's always about money. Bigwigs pocket the fat bonus checks while the country takes the express elevator down to the fiery hot place below.

  4. Walter White

    I wonder who manufactured the switch.

  5. ATS guy

    Do they use a UL 1008 listed contactor style transfer switch or a cheaper circuit breaker style?

  6. Data Center guy

    In response to Bob: while I am not sure what type of UPS Amazon uses, the UPS often isn't the problem. When utility power is lost, so is cooling, because most sites do not put large three-phase motors on UPS power, and that is what takes down many sites. Also, most UPSs don't have more than a 7-minute run time. This is a technical forum; if you want to post about politics, I am sure you can find a bunch of like-minded folks at the Washington Post.

  7. NDL

    Wow. I thought redundancy took things like that out of the equation. So much for redundant this and redundant that.

  8. Tushar

    Can one of you data center guys provide some guidance? How long does it usually take the standby generators to kick in fully (i.e. to power the CRAC units) once a power outage occurs? In that time frame, how much heat gain/load is to be expected, especially in these centers with high-density blade server racks? Would the increased heat load during this time frame result in damage to equipment? Have you seen this happen, or is it more of a theoretical possibility? Thanks

  9. Knowing the critical nature of the switch, its settings should have been compared to those of the existing switches to prevent this issue from ever occurring.

  10. SuperDC

    Typically our generators kick in within 10 seconds of a power outage.

  11. Bob

    It just shows you that you can't throw stuff in the cloud and think you have given up responsibility. You still have to think about disasters, prepare for them and test for them. Do you know how robust your cloud provider is?

  12. HardwareFreak

    The problem here isn't the switch supposedly not working properly. That's a BS PR excuse to blame the vendor. The facility management team would have made sure all switches were properly set, unless they're incompetent. The real problem is above-ground power, which allows vehicles, storm systems, etc. to damage AC infrastructure and kill the feeds. Any modern data center should be receiving all grid power from cables buried all the way from the substation to the facility. There is zero excuse for any data center in 2010 to be receiving power via overhead lines. Knowing Amazon, I'd bet this facility does have buried cable service. It's just too easy to blame the incident on a truck taking down a pole, so they did, knowing full well that the story is too small for any entity to do some investigative journalism. Their story (likely a falsehood) will stand until someone digs up traffic accident reports for that city on that day and confirms whether a power pole was struck.

  13. In my many years of building, commissioning and maintaining data centres, I must say this is the sorriest excuse I have seen a data centre come up with when an outage has occurred. Even if the ATS has ground fault sensing... and in the event this sensing "failed", the other sensing features of the ATS would have started the generator and transferred the load. Every modern ATS monitors both voltage and frequency of utility power, and in the event they fall outside the set parameters, will start the generator and transfer the site. With all that said, if Amazon sticks by this story, then they must admit that they didn't commission the ATSs and generators correctly, as well as admit that they are not performing proper monthly maintenance of the backup system... Personally, I think they should come clean with the real fault!

  14. Roberto Cazador

    Can you please comment on why Amazon doesn't have the same power outage protection as Terremark? See recent article: http://www.datacenterknowledge.com/archives/2010/05/04/terremark-extinguishes-fire-stays-online/comment-page-1/#comment-16661 What's different? Thanks in advance.

  15. Ian

    Unless they were supporting a hospital surgical unit or a military operation, there isn't much that can't survive a one-hour outage. The 'never down, ever' attitude only applies in life-or-death situations, which is what justifies the big bills. If someone's phone app wasn't working, or their social network was delayed, so what? Amazon has to credit the cost back to the customer; it's not a free ride for them when there's an interruption. So lesson learned: correct for it and move on.

  16. Gaz

    A lot of the comments here make me laugh. Even with multiple levels of redundancy, there are always occasions where downtime may happen over the course of several years. Just goes to show, a lot of people like to think they know it all. If they worked in IT, they might understand the pressures and complexities of these systems.