Amazon Addresses EC2 Power Outages

Amazon Web Services says it is making changes in its data centers to address a series of power outages last week that affected some users of its EC2 cloud computing service. Amazon EC2 experienced two power outages on May 4 and an extended power loss early on Saturday, May 8. In each case, a group of users in a single availability zone lost service, while the majority of EC2 users remained unaffected. 

“As a result of the recent power events, we are working on non-intrusive changes to our power distribution architecture to significantly reduce the number of instances that can be affected by failures like we have seen in the last week,” Amazon said Sunday in an incident report on its status dashboard. “This work is under way and will be carefully performed over the coming months throughout our datacenters.”

Seven-Hour Outage
Saturday’s outage began at about 12:20 a.m. and lasted until 7:20 a.m., and affected a “set of racks,” according to Amazon, which said the bulk of customers in its U.S. East availability zone remained unaffected. 

“The loss of power was caused by an electrical ground fault and short circuit in a major power distribution panel that interrupted power to some instances in this particular Availability Zone,” Amazon reported. “Before restoring power to the impacted instances, facility engineering had to find and correct the ground fault and test all switch gear and related power distribution equipment.”

Data Loss for Small Number of Users
Amazon said the power outages caused some data loss for a “very small number” of users of its Elastic Block Store (EBS) service, which has prompted a discussion of reliability issues among EBS users. One user posted an email notifying him that “multiple failures of the underlying hardware components” had caused the data loss.

“Though yesterday’s issues caused impact to a very small number of volumes, even those customers were able to recover quickly if they were using the Amazon EBS feature that creates point-in-time snapshots that are persisted to Amazon S3,” Amazon said.

The company said Saturday’s incident was unrelated to a pair of outages last Tuesday, which occurred in a different zone, and involved a different type of failure.

The first outage on May 4 occurred as data center technicians were shifting power to a new substation from the local power utility. “During a power cut-over at 2:22 a.m. PDT, a single UPS failed to appropriately transfer power to the back-up generators,” Amazon reported.

The UPS unit failed to detect a drop in input power and shift the load to batteries, and subsequently the generator. Amazon was able to bypass the problematic UPS and provide generator power directly to the affected racks.

Amazon said more than half of customer instances were recovered by 3:40 a.m. Pacific, and virtually all of them were recovered by 6:35 a.m.

Generator Goes Offline
Later that day, while the affected racks were still being powered by the back-up generator, a human error caused the back-up generator to lose power, Amazon said. The generator was reset and power was again restored to the racks, with most customers regaining service between 6:45 p.m. and 7:40 p.m. Pacific time. The next day Amazon completed installation of a replacement UPS unit.

In addition to the changes in its power distribution system, Amazon reminded EC2 users of the ability to deploy instances across multiple availability zones to guard against failures like last week’s, which affected a single zone. EC2 customers can also use Amazon CloudWatch and Auto Scaling to quickly recover from instance failures in an availability zone, the company said.
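As a rough illustration of why spreading instances across zones limits the impact of a single-zone failure, here is a minimal Python sketch. It does not call any Amazon API; the zone names, instance names, and both helper functions are hypothetical and exist only for this example.

```python
from itertools import cycle


def spread_across_zones(instance_ids, zones):
    """Assign each instance to a zone in round-robin order (hypothetical helper)."""
    if not zones:
        raise ValueError("at least one zone is required")
    return dict(zip(instance_ids, cycle(zones)))


def surviving_instances(placement, failed_zone):
    """Return the instances that are not placed in the failed zone."""
    return sorted(name for name, zone in placement.items() if zone != failed_zone)


# Hypothetical zone and instance names, for illustration only.
zones = ["zone-a", "zone-b", "zone-c"]
instances = [f"web-{n}" for n in range(6)]

placement = spread_across_zones(instances, zones)
survivors = surviving_instances(placement, "zone-a")
print(f"{len(survivors)} of {len(instances)} instances survive the loss of zone-a")
```

With six instances spread over three zones, losing one zone takes out two instances and leaves four running. Placing all six in a single zone would leave none running if that zone lost power.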

About the Author

Rich Miller is the founder and editor-in-chief of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.

7 Comments

  1. David Meyer

    This is just another in a series of significant outages with "Cloud" providers (I say "cloud" because not all clouds are true clouds). Terremark's vCloud outage on March 17th and multiple outages at Rackspace should give everyone pause about Cloud. Not cloud computing as a whole (because that is clearly the future of computing), but rather a focus on what the cloud infrastructure actually looks like, and what redundancies are in place. I have seen some "cloud" infrastructures that are, quite frankly, disturbing. Cloud infrastructures are about more than the servers and the software stack. Having been in this business for some time, I would encourage anyone looking at the cloud to ask for, and GET, detailed information about the physical infrastructure. If they don't give you that information, or if they are not willing to SHOW you the infrastructure, then look elsewhere. This is your business and your data. Trust...but verify.