Brief Power Outage for Amazon Data Center
Amazon Web Services experienced an outage in one of the East Coast availability zones for its EC2 service early Wednesday due to power problems in a data center in northern Virginia. Failures in a power distribution unit (PDU) resulted in some servers in the data center losing power for about 45 minutes. It took several more hours to get customer instances back online, with all but a “small number” of instances restored within five hours.
“This incident impacted a subset of instances in a single Availability Zone,” said Amazon spokesperson Kay Kinton. “Most of that subset of instances were back online in 45 minutes.”
The issues started at 4 am East Coast time Wednesday, and affected one of the three availability zones in Amazon’s East Coast operation. The zones are designed to provide redundancy for developers by allowing them to deploy apps across several zones.
“A single component of the redundant power distribution system failed in this zone,” AWS said in its status report. “Prior to completing the repair of this unit, a second component, used to assure redundant power paths, failed as well, resulting in a portion of the servers in that availability zone losing power. Impacted customers experienced a loss of connectivity to their instances. As soon as the defective power distribution units were bypassed, servers restarted and instances began to come online shortly thereafter.”
[...] Today we got the story about Amazon’s outage of part of its EC2 cloud. In this post, I’ll examine what happened and what you can do to avoid the same [...]
Ernie, posted December 11th, 2009
All too often a data center becomes inadequate immediately after it is built. Data center engineers get complacent with their corner of the world as long as everything is operating correctly. I’m sure a post-mortem will be held in a conference room somewhere to discuss what could have or should have been done.
Phrases such as “We could not have known the second path would fail” or “It wasn’t our fault that the second PDU went down” will get tossed around. I have to point my finger at Amazon’s maintenance program. I will bet the proverbial dime to a donut that all their UPS, PDU, air conditioning and generator maintenance is up to date. I will then double down and bet they haven’t completed an electrical assessment of their facility since it was built.
Given that the life cycle of a data center begins to reach its half-life six months after it is built, staying on top of the electrical infrastructure is paramount. Too many times I have seen the “My UPS is running fine, I don’t need to worry about anything else” attitude. Once you begin to move servers around, you have changed the physical characteristics of the data center. NFPA 70E says you have to perform a circuit breaker coordination study every 5 years or after any changes.
Things change daily in a data center, and those changes need to be tracked. Data center software such as Aperture does this. If you are not willing to invest the money in a program like that, then you had better invest in full-time staff to do nothing but monitor changes to your electrical infrastructure. I have preached the gospel of “Just because it is new does not mean it works” my entire life.
If there is a data center that has not completed an electrical assessment, then that data center is a ticking time bomb. Time to lose the “UPS is running fine” mentality and pay attention to the electrical infrastructure of each and every data center. Cloud computing will not become a skyscraper as long as the foundation is built on sand.
Jeff, posted December 11th, 2009
Apparently the future isn’t here just yet.
Isn’t the point of “cloud computing” to allow redundancy at the server level so that any component failure can be corrected seamlessly (or at least in near-realtime) because some other part of the cloud is still operating? It seems to me that if EC2 isn’t structured like this, then it is nothing more than managed hosting, the same old thing that’s been popular for over a decade.
Wake me up when “the cloud” is an ether of compute resources that can seamlessly tolerate a loss of any number of nodes up to the capacity threshold. Until then, “what’s old is new again” is in full effect at EC2.
The underlying question is whether the “redundant” PDU actually failed because of an unknown defect, or whether this was simply a case of a classic cascade failure.
This happens when the total load, which is usually split 50/50 across the two paths, exceeds the maximum rating of a single path: for example, 120 kVA split across 2 x 100 kVA paths. Each side sees only a 60 kVA load (60% of its rating), so it “seems” OK. When one path fails (for any reason), the entire load shifts to the remaining path, which overloads and drops offline.
Hence the classic cascade failure.
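To make the arithmetic concrete, here is a minimal Python sketch of the example above. The 100 kVA path rating and 120 kVA total load are the hypothetical figures from this comment, not Amazon’s actual values.

# Two 100 kVA paths carrying a 120 kVA total load look healthy at 60%
# each, but the surviving path cannot absorb the full load if one fails.
PATH_RATING_KVA = 100.0   # rating of each power path (hypothetical)
TOTAL_LOAD_KVA = 120.0    # total critical load, split 50/50 (hypothetical)

per_path = TOTAL_LOAD_KVA / 2
print(f"Normal: {per_path:.0f} kVA per path ({per_path / PATH_RATING_KVA:.0%} of rating)")

# One path fails: the whole load shifts to the survivor.
if TOTAL_LOAD_KVA > PATH_RATING_KVA:
    print(f"After failure: {TOTAL_LOAD_KVA:.0f} kVA on one {PATH_RATING_KVA:.0f} kVA path - overload, cascade failure")
else:
    print("After failure: surviving path carries the load safely")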
[...] The risk with Active/Active is that load does not scale linearly. If you have two systems running at 40% load, that does not mean one will be able to handle the load of both and run at 80%. More likely you will hit an inflection point, an unanticipated bottleneck, be it CPU, memory bandwidth, disk I/O, or some system that is providing external API resources. It can even be the power system. If servers have redundant power supplies, and each PSU is attached to a separate Power Distribution Unit (PDU), the critical load for each PDU is now 40% of the rating. If one circuit fails, all load switches to the other PDU, and if that PDU is now asked to carry more than 80% of its rating, overload circuits will trip, leading to a total outage. There is some speculation that a cascading failure of this type was behind the recent Amazon EC2 outage. [...]
Reported cloud outages for Amazon, Google, Microsoft and Salesforce.com in 2008 and 2009 « Muon Cloud, posted January 31st, 2010
In the above case, how should we operate a parallel redundant system so that a cascade failure does not occur when one of the paths fails?
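One common answer, following the arithmetic in the comments above, is to size and operate the system so that the surviving path can carry the entire load: keep the total critical load at or below the rating (ideally a derated rating) of a single path, which means running each path below roughly 50% in normal operation. Below is a minimal Python sketch of that check; the example numbers and the 80% derating factor are assumptions for illustration, not values from Amazon’s report.

def survives_single_path_failure(total_load_kva, path_rating_kva, derating=0.8):
    # True if one path can carry the full load after the other fails.
    # `derating` leaves headroom below the nameplate rating, since
    # breakers may trip well before 100% under sustained load (assumed 80%).
    return total_load_kva <= path_rating_kva * derating

# The 120 kVA across 2 x 100 kVA case from the earlier comment:
print(survives_single_path_failure(120, 100))  # False: cascade risk
# A load sized so each path runs under 40% in normal operation:
print(survives_single_path_failure(75, 100))   # True: survivor runs at 75% of rating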