Brief Power Outage for Amazon Data Center

Amazon Web Services experienced an outage in one of the East Coast availability zones for its EC2 service early Wednesday due to power problems in a data center in northern Virginia. Failures in a power distribution unit (PDU) resulted in some servers in the data center losing power for about 45 minutes. It took several more hours to get customer instances back online, with all but a “small number” of instances restored within five hours.

“This incident impacted a subset of instances in a single Availability Zone,” said Amazon spokesperson Kay Kinton. “Most of that subset of instances were back online in 45 minutes.”

The issues started at 4 a.m. Eastern time Wednesday and affected one of the three availability zones in Amazon’s East Coast operation. The zones are designed to provide redundancy for developers by allowing them to deploy apps across several zones.
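
As a rough illustration of that multi-zone pattern, here is a minimal sketch using the modern boto3 SDK. The AMI ID is a placeholder and the script assumes AWS credentials are already configured; it is not Amazon's recommended deployment tooling, just one way the idea looks in code.

```python
# Minimal sketch: spread identical EC2 instances across several Availability
# Zones so that a power event in one zone does not take the whole app offline.
# Assumes boto3 is installed and AWS credentials are configured; the AMI ID
# below is a placeholder, not a real image.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

for zone in zones:
    ec2.run_instances(
        ImageId="ami-12345678",                # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},  # pin this instance to one zone
    )
```

A load balancer or DNS layer in front of those instances would then route traffic around any zone that goes dark.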

“A single component of the redundant power distribution system failed in this zone,” AWS said in its status report. “Prior to completing the repair of this unit, a second component, used to assure redundant power paths, failed as well, resulting in a portion of the servers in that availability zone losing power. Impacted customers experienced a loss of connectivity to their instances. As soon as the defective power distribution units were bypassed, servers restarted and instances began to come online shortly thereafter.”

Amazon is known to operate a major data center in Ashburn, Virginia. EC2 previously experienced brief downtime in both June and July.

About the Author

Rich Miller is the founder and editor at large of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.

7 Comments

  1. Ernie

    All too often a data center becomes inadequate immediately after it is built. Data center engineers get complacent with their corner of the world as long as everything is operating correctly. I’m sure an autopsy will be performed in a conference room somewhere, full of what could have or should have been, with lines like “we could not have known the second path would fail” or “it wasn’t our fault that the second PDU went down.” I have to point my finger at Amazon’s maintenance program. I will bet the proverbial dime to a donut that all their UPS, PDU, air conditioning and generator maintenance is up to date. I will then double down by betting they haven’t completed an electrical assessment of the facility since it was built. With the life cycle of a data center reaching its half-life six months after it is built, staying on top of the electrical infrastructure is paramount. Too many times I have seen the “my UPS is running fine, I don’t need to worry about anything else” attitude. Once you begin to move servers around, you have changed the physical characteristics of the data center. NFPA 70E says you have to perform a circuit breaker coordination study every five years or after any changes. Things change daily in a data center, and those changes need to be tracked. Data center software such as Aperture does this. If you are not willing to invest the money in a program like that, then you had better invest in full-time staff to do nothing but monitor changes to your electrical infrastructure. I have preached the gospel of “just because it is new does not mean it works” my entire life. A data center that has not completed an electrical assessment is a ticking time bomb. Time to lose the “UPS is running fine” mentality and pay attention to the electrical infrastructure of each and every data center. Cloud computing will not become a skyscraper as long as its foundation is built on sand.

  2. Jeff

    Apparently the future isn't here just yet. Isn't the point of "cloud computing" to allow redundancy at the server level so that any component failure can be corrected seamlessly (or at least in near-realtime) because some other part of the cloud is still operating? It seems to me that if EC2 isn't structured like this, then it is nothing more than managed hosting, the same old thing that's been popular for over a decade. Wake me up when "the cloud" is an ether of compute resources that can seamlessly tolerate a loss of any number of nodes up to the capacity threshold. Until then, "what's old is new again" is in full effect at EC2.

  3. The underlying question is: did the “redundant PDU” actually fail because of an unknown defect, or was it simply a case of a classic “cascade failure”? That is what happens when the total power, usually split 50/50 across the two paths, exceeds the maximum rating of a single path, e.g. a 120 kVA load split across 2 x 100 kVA paths (each side sees only 60 kVA, 60% of its rating, so it “seems” OK). When one path fails, for any reason, the entire load shifts to the remaining path, which then overloads and drops offline as well. Hence the classic cascade failure (a worked sketch of the arithmetic follows the comments). See http://www.eweek.com/c/a/IT-Infrastructure/How-to-Avoid-a-Redundant-Path-to-Power-Failure/

  4. Hi. In the above case, how should we operate the parallel redundant system so that a cascade failure does not occur when one of the paths fails? Thanks.
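
To make the arithmetic in comment 3 concrete, here is a minimal sketch. The 120 kVA load and 100 kVA path ratings are the example figures from that comment, not Amazon’s actual numbers; the rule it illustrates is simply that a dual-path design only protects you if the total load stays at or below what a single path can carry on its own.

```python
# Sketch of the cascade-failure arithmetic from comment 3: a 120 kVA load shared
# across two 100 kVA power paths looks healthy until one path drops out.
TOTAL_LOAD_KVA = 120.0   # example figure from the comment
PATH_RATING_KVA = 100.0  # example figure from the comment
NUM_PATHS = 2

# Normal operation: the load splits evenly, so each path carries 60 kVA (60% of rating).
per_path_normal = TOTAL_LOAD_KVA / NUM_PATHS
print(f"Normal: {per_path_normal:.0f} kVA per path "
      f"({per_path_normal / PATH_RATING_KVA:.0%} of rating)")

# One path fails: the entire load shifts to the survivor, which is now at 120%
# of its rating and trips offline too -- the cascade failure.
per_path_after_failure = TOTAL_LOAD_KVA / (NUM_PATHS - 1)
overloaded = per_path_after_failure > PATH_RATING_KVA
print(f"After one path fails: {per_path_after_failure:.0f} kVA on the remaining path "
      f"({per_path_after_failure / PATH_RATING_KVA:.0%} of rating) -> "
      f"{'overload, cascade' if overloaded else 'still within rating'}")

# The usual way to avoid this (toward comment 4's question): keep the total load
# at or below the rating of a single path (here, <= 100 kVA), so either path can
# carry everything alone when its partner fails.
```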