Amazon Addresses EC2 Power Outages
May 10th, 2010 By: Rich Miller
Amazon Web Services says it is making changes in its data centers to address a series of power outages last week that affected some users of its EC2 cloud computing service. Amazon EC2 experienced two power outages on May 4 and an extended power loss early on Saturday, May 8. In each case, a group of users in a single availability zone lost service, while the majority of EC2 users remained unaffected.
“As a result of the recent power events, we are working on non-intrusive changes to our power distribution architecture to significantly reduce the number of instances that can be affected by failures like we have seen in the last week,” Amazon said Sunday in an incident report on its status dashboard. “This work is under way and will be carefully performed over the coming months throughout our datacenters.”
Saturday’s outage began at about 12:20 a.m. and lasted until 7:20 a.m., and affected a “set of racks,” according to Amazon, which said the bulk of customers in its U.S. East availability zone remained unaffected.
“The loss of power was caused by an electrical ground fault and short circuit in a major power distribution panel that interrupted power to some instances in this particular Availability Zone,” Amazon reported. “Before restoring power to the impacted instances, facility engineering had to find and correct the ground fault and test all switch gear and related power distribution equipment.”
Data Loss for Small Number of Users
Amazon said the power outages caused some data loss for a “very small number” of users of its Elastic Block Store (EBS) service, which has prompted a discussion of reliability issues among EBS users. One user posted an email notifying him that “multiple failures of the underlying hardware components” had caused the data loss.
“Though yesterday’s issues caused impact to a very small number of volumes, even those customers were able to recover quickly if they were using the Amazon EBS feature that creates point-in-time snapshots that are persisted to Amazon S3,” Amazon said.
The company said Saturday’s incident was unrelated to a pair of outages last Tuesday, which occurred in a different zone, and involved a different type of failure.
The first outage on May 4 occurred as data center technicians were shifting power to a new substation from the local power utility. “During a power cut-over at 2:22 a.m. PDT, a single UPS failed to appropriately transfer power to the back-up generators,” Amazon reported.
The UPS unit failed to detect a drop in input power and shift the load to batteries, and subsequently to the generator. Amazon was able to bypass the problematic UPS and provide generator power directly to the affected racks.
Amazon said more than half of customer instances were recovered by 3:40 a.m. Pacific, and virtually all of them were recovered by 6:35 a.m.
Generator Goes Offline
Later that day, while the affected racks were still being powered by the back-up generator, a human error caused the back-up generator to lose power, Amazon said. The generator was reset and power was again restored to the racks, with most customers regaining service between 6:45 p.m. and 7:40 p.m. Pacific time. The next day Amazon completed installation of a replacement UPS unit.
In addition to the changes in its power distribution system, Amazon reminded EC2 users of the ability to deploy instances across multiple availability zones to guard against failures like last week’s, which affected a single zone. EC2 customers can also use Amazon CloudWatch and Auto Scaling to quickly recover from instance failures in an availability zone, the company said.
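As a rough illustration of why spreading instances across zones limits the impact of a single-zone power failure, here is a minimal Python sketch. It makes no AWS calls; the zone names and helper functions are hypothetical and exist only for this example.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_count, zones):
    """Assign instances to zones round-robin so the load is spread evenly."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def surviving_instances(placement, failed_zone):
    """Count the instances still running if one zone loses power."""
    return sum(1 for zone in placement if zone != failed_zone)


# Hypothetical zone names, used only for illustration.
zones = ["zone-a", "zone-b", "zone-c"]
placement = spread_across_zones(6, zones)

print(Counter(placement))                        # two instances per zone
print(surviving_instances(placement, "zone-a"))  # 4 of 6 remain
```

With all six instances in one zone, an outage in that zone would take down every one of them; spread across three zones, the same outage leaves four running.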
David Meyer Posted May 10th, 2010
This is just another in a series of significant outages with “Cloud” providers (I say “cloud” because not all clouds are true clouds). Terremark’s vCloud outage on March 17th and multiple outages at Rackspace should give everyone pause about Cloud. Not about cloud computing as a whole (because that is clearly the future of computing), but about what the cloud infrastructure actually looks like, and what redundancies are in place.
I have seen some “cloud” infrastructures that are, quite frankly, disturbing. Cloud infrastructures are about more than the servers and the software stack. Having been in this business for some time, I would encourage anyone looking at the cloud to ask for, and GET, detailed information about the physical infrastructure. If they don’t give you that information, or if they are not willing to SHOW you the infrastructure, then look elsewhere. This is your business and your data. Trust…but verify.