Amazon’s EC2 cloud computing service on Tuesday suffered its fourth power outage in a week, with some customers in its US East Region losing service for about an hour. The incident began when a vehicle crashed into a utility pole near one of the company’s data centers and a transfer switch failed to properly manage the shift from utility power to the facility’s generators.
Amazon Web Services said a “small number of instances” on EC2 lost service at 12:05 p.m. Pacific time Tuesday, with most of the interrupted apps recovering by 1:08 p.m. The incident affected a different Availability Zone than the ones that experienced three power outages last week.
The sequence of events was reminiscent of a 2007 incident in which a truck struck a utility pole near a Rackspace data center in Dallas, taking out a transformer. That outage triggered a thermal event when chillers struggled to restart during multiple utility power interruptions.
Crash Triggers Utility Outage
“Tuesday’s event was triggered when a vehicle crashed into a high voltage utility pole on a road near one of our datacenters, creating a large external electrical ground fault and cutting utility power to this datacenter,” Amazon said in an update on its Service Health Dashboard. “When the utility power failed, most of the facility seamlessly switched to redundant generator power.”
A ground fault occurs when electrical current strays from its intended circuit and flows to the earth through an unintended path, creating a potential hazard to people and equipment.
“One of the switches used to initiate the cutover from utility to generator power misinterpreted the power signature to be from a ground fault that happened inside the building rather than outside, and immediately halted the cutover to protect both internal equipment and personnel,” the report continued. “This meant that the small set of instances associated with this switch didn’t immediately get back-up power. After validating there was no power failure inside our facility, we were able to manually engage the secondary power source for those instances and get them up and running quickly.”
Switch Default Setting Faulted
Amazon said the switch that failed arrived from the manufacturer with a different default configuration than the rest of the data centers’ switches, causing it to misinterpret this power event. “We have already made configuration changes to the switch which will prevent it from misinterpreting any similar event in the future and have done a full audit to ensure no other switches in any of our data centers have this incorrect setting,” Amazon reported.
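As a rough illustration of the kind of audit Amazon describes, the Python sketch below compares each switch's settings against a known-good baseline and reports any that differ. The setting names, values and switch identifiers are hypothetical; Amazon has not published the actual configuration parameters or how it carried out its audit.

```python
# Hypothetical baseline: setting names and values are illustrative only,
# not taken from any real transfer switch or from Amazon's report.
BASELINE = {
    "ground_fault_mode": "distinguish_external",
    "transfer_delay_s": 0,
}


def audit_switches(switch_configs, baseline=BASELINE):
    """Return {switch_id: {setting: (found, expected)}} for every mismatch."""
    mismatches = {}
    for switch_id, config in switch_configs.items():
        # Collect each baseline setting whose value is missing or different.
        diff = {
            key: (config.get(key), expected)
            for key, expected in baseline.items()
            if config.get(key) != expected
        }
        if diff:
            mismatches[switch_id] = diff
    return mismatches


# Example fleet: one switch matches the baseline, one shipped with a
# different (hypothetical) factory default.
fleet = {
    "switch-a": {"ground_fault_mode": "distinguish_external", "transfer_delay_s": 0},
    "switch-b": {"ground_fault_mode": "halt_on_any_fault", "transfer_delay_s": 0},
}

for switch_id, diff in audit_switches(fleet).items():
    print(switch_id, diff)
```

In this sketch only "switch-b" would be flagged, mirroring a single unit that arrived with a different default than the rest.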
Amazon Web Services said Sunday that it is making changes in its data centers to address a series of power outages last week. Amazon EC2 experienced two power outages on May 4 and an extended power loss early on Saturday, May 8. In each case, a group of users in a single Availability Zone lost service, while the majority of EC2 users remained unaffected.