Amazon Web Services says it is making changes in its data centers to address a series of power outages last week that affected some users of its EC2 cloud computing service. Amazon EC2 experienced two power outages on May 4 and an extended power loss early on Saturday, May 8. In each case, a group of users in a single availability zone lost service, while the majority of EC2 users remained unaffected.
“As a result of the recent power events, we are working on non-intrusive changes to our power distribution architecture to significantly reduce the number of instances that can be affected by failures like we have seen in the last week,” Amazon said Sunday in an incident report on its status dashboard. “This work is under way and will be carefully performed over the coming months throughout our datacenters.”
Saturday’s outage began at about 12:20 a.m., lasted until 7:20 a.m. and affected a “set of racks,” according to Amazon, which said the bulk of customers in its U.S. East availability zone remained unaffected.
“The loss of power was caused by an electrical ground fault and short circuit in a major power distribution panel that interrupted power to some instances in this particular Availability Zone,” Amazon reported. “Before restoring power to the impacted instances, facility engineering had to find and correct the ground fault and test all switch gear and related power distribution equipment.”
Data Loss for Small Number of Users
Amazon said the power outages caused some data loss for a “very small number” of users of its Elastic Block Store (EBS) service, which has prompted a discussion of reliability issues among EBS users. One user posted an email he had received notifying him that “multiple failures of the underlying hardware components” had caused the data loss.
“Though yesterday’s issues caused impact to a very small number of volumes, even those customers were able to recover quickly if they were using the Amazon EBS feature that creates point-in-time snapshots that are persisted to Amazon S3,” Amazon said.
The company said Saturday’s incident was unrelated to the pair of outages last Tuesday, May 4, which occurred in a different zone and involved a different type of failure.
The first outage on May 4 occurred as data center technicians were shifting power to a new substation from the local power utility. “During a power cut-over at 2:22 a.m. PDT, a single UPS failed to appropriately transfer power to the back-up generators,” Amazon reported.
The UPS unit failed to detect a drop in input power and shift the load to batteries, and subsequently to the generator. Amazon was able to bypass the problematic UPS and provide generator power directly to the affected racks.
Amazon said more than half of customer instances were recovered by 3:40 a.m. Pacific, and virtually all of them were recovered by 6:35 a.m.
Generator Goes Offline
Later that day, while the affected racks were still being powered by the back-up generator, a human error caused the back-up generator to lose power, Amazon said. The generator was reset and power was again restored to the racks, with most customers regaining service between 6:45 p.m. and 7:40 p.m. Pacific time. The next day Amazon completed installation of a replacement UPS unit.
In addition to the changes in its power distribution system, Amazon reminded EC2 users of the ability to deploy instances across multiple availability zones to guard against failures like last week’s, which affected a single zone. EC2 customers can also use Amazon CloudWatch and Auto Scaling to quickly recover from instance failures in an availability zone, the company said.
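The multi-zone advice can be illustrated with a small sketch. It is not Amazon's tooling and makes no AWS API calls. It only models the idea of spreading instances evenly across zones and counting how many would survive the loss of one zone. The zone labels and function names are hypothetical, chosen for illustration.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_count, zones):
    """Assign instances to zones round-robin so the load is spread evenly."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def surviving_after_zone_loss(placement, failed_zone):
    """Count instances still running if every instance in failed_zone goes down."""
    return sum(1 for zone in placement if zone != failed_zone)


# Hypothetical zone labels, for illustration only.
zones = ["zone-a", "zone-b", "zone-c"]
placement = spread_across_zones(6, zones)
per_zone = Counter(placement)
remaining = surviving_after_zone_loss(placement, "zone-a")
```

In this model, a deployment confined to one zone loses every instance when that zone fails, while the same six instances spread across three zones lose only a third of them.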