Amazon Data Center Loses Power During Storm
June 30th, 2012 By: Rich Miller
An Amazon Web Services data center in northern Virginia lost power Friday night, causing extended downtime for services including Netflix, Heroku, Pinterest, Instagram and many others. The incident occurred as a powerful electrical storm struck the Washington, D.C. area, leaving as many as 1.5 million residents without power.
The data center in Ashburn, Virginia that hosts the US-East-1 region lost power for about 30 minutes, but customers were affected for a longer period as Amazon worked to recover virtual machine instances. “We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area,” Amazon reported at 8:30 pm Pacific time. An update 20 minutes later said that “power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online.”
By 1:42 AM Pacific time, Amazon reported that it had “recovered the majority of EC2 instances and are continuing to work to recover the remaining EBS (Elastic Block Storage) volumes.”
UPDATE: While most Amazon customers recovered within several hours, a number of prominent services were offline for much longer. The photo-sharing service Instagram was unavailable until about noon Pacific time Saturday, more than 15 hours after the incident began. Cloud infrastructure provider Heroku, which runs its platform atop AWS, reported 8 hours of downtime for some services.
Latest in Series of Outages
The outage marked the second time this month that the Amazon data center hosting the US-East-1 region lost power during a utility outage. Major data centers are equipped with large backup generators to maintain power during utility outages, but the Amazon facility was apparently unable to make the switch to backup power.
Amazon experienced an outage June 15 in its US-East-1 region that was triggered by a series of failures in the power infrastructure, including the failure of a generator cooling fan while the facility was on emergency power. The same data center also experienced problems early Friday, when customers experienced connectivity problems.
Even Netflix Impacted
The latest outage was unusual in that it affected Netflix, a marquee customer for Amazon Web Services that is known to spread its resources across multiple AWS availability zones, a strategy that allows cloud users to route around problems at a single data center. Netflix has remained online through past AWS outages affecting a single availability zone.
Adrian Cockcroft, the Director of Architecture at Netflix, said the problem was a failure of Amazon’s Elastic Load Balancing service. “We only lost hardware in one zone, we replicate data over three,” Cockcroft tweeted. “Problem was traffic routing was broken across all zones.”
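Cockcroft’s point can be put concretely: replicating data across three zones protects against losing a zone, but not against a broken routing layer. The sketch below illustrates the distinction with hypothetical zone names and a toy health check; it is not Netflix’s actual architecture or code.

```python
# Toy model of multi-zone failover. Data lives in three zones; a request
# can be served from any healthy zone -- unless the routing layer itself
# (the load balancer) is down, in which case no zone is reachable even
# though replicated data in the surviving zones is intact.

ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]  # hypothetical names

def route(request, zone_healthy, router_healthy=True):
    """Return the zone that serves the request, or None if routing fails."""
    if not router_healthy:
        # Load-balancer outage: replication alone cannot keep traffic flowing.
        return None
    for zone in ZONES:
        if zone_healthy.get(zone, False):
            return zone
    return None

# One zone loses power: requests still succeed via a surviving zone.
health = {"us-east-1a": False, "us-east-1b": True, "us-east-1c": True}
print(route("GET /stream", health))                        # us-east-1b
# The routing layer fails: every request fails despite two healthy zones.
print(route("GET /stream", health, router_healthy=False))  # None
```

This is why the June 29 incident was unusual for Netflix: its zone-level redundancy worked as designed, but the shared routing service in front of the zones did not.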
The Washington area was hit by powerful storms late Friday that left two people dead and more than 1.5 million residents without power. Dominion Power’s outage map showed that sporadic outages continued to affect the Ashburn area. Although the storm was intense, there were no immediate reports of other data centers in the region losing power. Ashburn is one of the busiest data center hubs in the country, and home to key infrastructure for dozens of providers and hundreds of Internet services.
Here’s a look at the Twitter updates from some of the companies affected:
We’re sorry for the outage and working to get your Friday streaming back to normal as quickly as possible. Thank you for bearing with us.
— Netflix (@netflix) June 30, 2012
Pinterest is currently unavailable due to server outages. Our goal is to be back up by 10:30PM PST. Thanks for your patience!
— Pinterest (@Pinterest) June 30, 2012
It looks like AWS is having issues again,we have numerous EC2 instances that are no longer responding,we will let you know when we know more
— dotcloud status (@dotcloudstatus) June 30, 2012
We’re currently experiencing technical difficulties and we’re working to correct the issues. Thanks for your patience
— Instagram Support (@InstagramHelp) June 30, 2012
M.T.Field Posted June 30th, 2012
Typically, enterprise-class data centers are doubly protected by both UPS systems and standby generators. I’d love to know if Amazon skipped this step to cut costs / increase profit, or if a more complex chain of events occurred. If the latter, it would be nice to learn from this incident.
jeff Posted June 30th, 2012
Unfortunately, all too many companies do not invest in the proper power infrastructure to save a buck, and it costs their clients a tremendous amount of money. Buying a brand over proper engineering is not always the smartest idea either. Do your homework on what is under the covers, not just what logo the company has or its price in a race to the bottom.
Well, Amazon isn’t famous for its price. It’s rather at the higher end of the market, and this episode shows the premium isn’t justified.
DCDriver Posted July 1st, 2012
Hey guys, it’s not about UPSes in a data center. Typically, a UPS will hold a server or so up for about 30 minutes, sometimes longer, depending on the overkill factor. Most data center users (at sites also called colocation facilities, or colos) don’t have UPSes. I’m in my data center right now, in Ashburn, VA, with about 200 servers and NO UPSes. What the data center is supposed to provide for us is primary power from the grid, or if the grid fails, diesel power to replace it. I’m here because I had one server down. I have not heard about this data center losing power from the grid. I have only 2 cages out of a lot of cages, and there are lots and lots of servers in here. The place is not flooded with techs.
Soenke Ruempler Posted July 1st, 2012
You are referring to the US-East-1 region, not an availability zone. There are several AZs in one region. Please correct this in your article. Thx!
Tom Posted July 1st, 2012
I wonder if Amazon is utilizing flywheels instead of battery UPSs at that facility? The flywheels are sold as “better for the environment” but only give you about 4 minutes for the genset to come online. That’s crazy to me – no time to address issues.
Jim Leach Posted July 1st, 2012
Fool me once, shame on you. Fool me twice, shame on me.
There are technologies, systems, and procedures to make sure the power is always on in a data center.
Brian james Posted July 1st, 2012
DCDriver, it’s unlikely your colocation provider does not have a large UPS between your circuits and the generator. A UPS is needed to smooth the cutover between grid and generator, since generators generally take a minimum of about 1 to 2 minutes before they are operational and able to take on full load; in that time your systems would be long down. I have been colocating servers since 1996 and currently have four data center locations where we colocate servers in both cabinets and private cage space. In all cases our circuits are fed by multiple PDUs, which feed off separate UPS banks, with generators on the other side of the UPSes. If you are colocating somewhere with utility power fed straight to a generator, I hope you have your own UPSes in your racks…
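The cutover gap Brian describes is easy to put in rough numbers. The figures below are illustrative assumptions for a generic colo suite, not measurements from the Amazon facility:

```python
# Back-of-envelope check of why a UPS must bridge the grid-to-generator
# cutover. All values are illustrative assumptions, not data from the
# Amazon facility.

load_kw = 500.0       # assumed critical IT load for a colo suite
gen_start_s = 120.0   # generator start + transfer, ~1-2 minutes per Brian

# Energy the UPS bank must deliver to bridge one clean cutover, in kWh:
bridge_kwh = load_kw * gen_start_s / 3600.0
print(f"one cutover needs ~{bridge_kwh:.1f} kWh of UPS energy")  # ~16.7 kWh

# A flywheel UPS with roughly 4 minutes of runtime (the figure from Tom's
# comment above) covers a single clean start, but leaves little margin for
# a failed start or for repeated utility sags in quick succession:
flywheel_s = 240.0
starts_covered = int(flywheel_s // gen_start_s)
print("generator start attempts covered:", starts_covered)  # 2
```

The point of the arithmetic is that the UPS is sized for minutes, not hours: it exists only to carry the load across the transfer window, so any hiccup in the generator start sequence consumes its margin very quickly.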
Jon Posted July 2nd, 2012
Does Amazon own this data center, or do they lease it from someone else?
There should be a generator and a UPS to ride through the power breaks. I think Amazon keeps its servers in a godown.
Karion Posted July 10th, 2012
I think Amazon must build crank-driven servers, or the best option is to add pedals like bicycles have, so each server has an employee sitting on it with feet on the pedals. If there is an electric failure, an alarm should be raised and the employee starts pedaling the server.
Paulus Posted June 27th, 2013
My guess is that the thunderstorm caused serious voltage sags on the grid supplying the Amazon data center. The voltage sags might have happened a few times within a short period, say, 60 minutes. When the voltage dropped outside the flywheel UPS’s input range, -10%, the flywheel UPS recognized this as an outage event and started to discharge. The voltage was probably back in the normal range within less than 2 seconds, and the flywheel UPS stopped discharging and changed into charging mode without triggering the backup genset to start. However, say 10 or 15 minutes later, another sub-2-second sag happened, and the flywheel UPS repeated the same protection procedure. But at the end of the discharge, even a partial one, the flywheel was overheated and had to cool down. Since the flywheel is in a closed vacuum space, heat dissipation takes a long time. Unfortunately, a third sag happened. This time the flywheel UPS was dead and the load was lost.