Amazon Data Center Loses Power During Storm


Amazon Web Services says an electrical storm caused a service outage Friday night at a data center in northern Virginia. (Lightning Photo via NOAA).

An Amazon Web Services data center in northern Virginia lost power Friday night, causing extended downtime for services including Netflix, Heroku, Pinterest, Instagram and many others. The incident occurred as a powerful electrical storm struck the Washington, D.C. area, leaving as many as 1.5 million residents without power.

The data center in Ashburn, Virginia, which hosts part of Amazon’s US-East-1 region, lost power for about 30 minutes, but customers were affected for a longer period as Amazon worked to recover virtual machine instances. “We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area,” Amazon reported at 8:30 p.m. Pacific time. An update 20 minutes later said that “power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online.”

By 1:42 a.m. Pacific time, Amazon reported that it had “recovered the majority of EC2 instances and are continuing to work to recover the remaining EBS (Elastic Block Store) volumes.”
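For context on what that recovery looks like from the customer side, here is a minimal, hypothetical sketch of how an AWS user might poll instance and EBS volume health in the affected region. It uses the current boto3 SDK purely as an illustration; the zone name is an assumption and none of this comes from Amazon’s own status updates.

```python
# Minimal, hypothetical sketch: watching recovery of instances and EBS
# volumes after an outage like this one, using the boto3 SDK.
# The zone name below is an assumption for the example.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ZONE = "us-east-1a"  # hypothetical: whichever Availability Zone lost power

# Instance status, including instances that have not come back up yet
status = ec2.describe_instance_status(IncludeAllInstances=True)
for s in status["InstanceStatuses"]:
    if s["AvailabilityZone"] == ZONE:
        print(s["InstanceId"],
              s["InstanceState"]["Name"],
              "system:", s["SystemStatus"]["Status"])

# EBS volume status -- volumes still recovering typically show as "impaired"
vols = ec2.describe_volume_status(
    Filters=[{"Name": "availability-zone", "Values": [ZONE]}])
for v in vols["VolumeStatuses"]:
    print(v["VolumeId"], v["VolumeStatus"]["Status"])
```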

UPDATE: While most Amazon customers recovered within several hours, a number of prominent services were offline for much longer. The photo-sharing service Instagram was unavailable until about noon Pacific time Saturday, more than 15 hours after the incident began. Cloud infrastructure provider Heroku, which runs its platform atop AWS, reported 8 hours of downtime for some services.

Latest in Series of Outages

The outage marked the second time this month that an Amazon data center serving the US-East-1 region has lost power during a utility outage. Major data centers are equipped with large backup generators to maintain power during utility outages, but the Amazon facility was apparently unable to make the switch to backup power.

Amazon experienced an outage June 15 in its US-East-1 region that was triggered by a series of failures in the power infrastructure, including the failure of a generator cooling fan while the facility was on emergency power. The same data center also had problems early Friday, when customers reported connectivity issues.

Even Netflix Impacted

The latest outage was unusual in that it affected Netflix, a marquee customer for Amazon Web Services that is known to spread its resources across multiple AWS availability zones, a strategy that allows cloud users to route around problems at a single data center. Netflix has remained online through past AWS outages affecting a single availability zone.

Adrian Cockcroft, the Director of Architecture at Netflix, said the problem was a failure of Amazon’s Elastic Load Balancing service. “We only lost hardware in one zone, we replicate data over three,” Cockcroft tweeted. “Problem was traffic routing was broken across all zones.”
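To make the multi-zone point concrete, here is a minimal, hypothetical sketch of how a deployment like Netflix’s would normally see per-zone health through the classic Elastic Load Balancing API; when the ELB service itself fails, this routing layer is what breaks, even though instances in the other zones remain healthy. The load balancer name and region are assumptions for the example, not details from the article.

```python
# Illustrative sketch (assumed load balancer name and region): a multi-AZ
# deployment normally routes around a single bad zone because the classic
# ELB reports per-instance health and only sends traffic to zones with
# InService instances. If the ELB service fails, that routing breaks
# across all zones even when most instances are fine.
from collections import Counter

import boto3

elb = boto3.client("elb", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

LB_NAME = "frontend-lb"  # hypothetical load balancer name

health = elb.describe_instance_health(LoadBalancerName=LB_NAME)
states = {s["InstanceId"]: s["State"] for s in health["InstanceStates"]}

# Map each registered instance to its Availability Zone
resp = ec2.describe_instances(InstanceIds=list(states))
healthy_per_zone = Counter()
for reservation in resp["Reservations"]:
    for inst in reservation["Instances"]:
        if states[inst["InstanceId"]] == "InService":
            healthy_per_zone[inst["Placement"]["AvailabilityZone"]] += 1

for zone, count in sorted(healthy_per_zone.items()):
    print(zone, "has", count, "InService instances")
```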

The Washington area was hit by powerful storms late Friday that left two people dead and more than 1.5 million residents without power. Dominion Power’s outage map showed that sporadic outages continued to affect the Ashburn area. Although the storm was intense, there were no immediate reports of other data centers in the region losing power. Ashburn is one of the busiest data center hubs in the country, and home to key infrastructure for dozens of providers and hundreds of Internet services.

Here’s a look at the Twitter updates from some of the companies affected:

[Embedded Twitter updates from affected companies.]

About the Author

Rich Miller is the founder and editor at large of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.


16 Comments

  1. M.T.Field

    Typically, enterprise-class data centers are doubly protected by both UPS systems and standby generators. I'd love to know if Amazon skipped this step to cut costs / increase profit, or if a more complex chain of events occurred. If the latter, it would be nice to learn from this incident.

  2. jeff

    Unfortunately, all too many companies do not invest in the proper power infrastructure to save a buck. It costs clients a tremendous amount of money. Buying brand over proper engineering is not always the smartest idea either. Do your homework on what is under the covers, not just what logo the company has or their price in a race to the bottom.

  3. Well, Amazon isn't famous for its price. It's rather at the higher end of the market, and this episode shows that isn't justified.

  4. DCDriver

    Hey guys, it's not about UPSs in a data center. Typically, a UPS will hold a server or so up for about 30 minutes, sometimes longer, depending on the overkill factor. Most data center users (also sometimes called colocation sites, or colos) don't have UPSs. I'm in my data center right now, in Ashburn VA, with about 200 servers, and NO UPSs. What the data center is supposed to provide for us is primary power from the grid, or if the grid fails, diesel power to replace it. I'm here because I had one server down. I have not heard about this data center losing power from the grid. I have only 2 cages out of a lot of cages, lots and lots of servers in here. The place is not flooded with techs.

  5. Soenke Ruempler

    You are referring to the US-East-1 region, not an availability zone. There are several AZs in one region. Please correct this in your article. Thx!

  6. Tom

    I wonder if Amazon is utilizing flywheels instead of battery UPSs at that facility? The flywheels are sold as "better for the environment" but only give you about 4 minutes for the genset to come online. That's crazy to me - no time to address issues.

  7. Jim Leach

    Fool me once, shame on you. Fool me twice, shame on me. There are technologies, systems, and procedures to make sure the power is always on in a data center.

  8. Brian james

    DCDriver, it's unlikely your colocation provider does not have a large UPS between your circuits and the generator. They are needed to smooth the cutover between grid and generator, as generators generally take a minimum of about 1 to 2 minutes to be operational and take on full load; in that time your systems would be long down. I have been co-locating servers since 1996, and we currently co-locate servers in 4 data center locations, in both cabinets and private cage space. In all cases our circuits are fed by multiple PDUs, which feed off of separate UPS banks, and finally generators on the other side of the UPSs. If you are co-locating somewhere with utility straight to generator, I hope you have your own UPSs in your racks...

  9. Jon

    Does Amazon own this data center, or do they lease it from someone else?

  10. There should be a generator and UPS to sustain the power breakups. I think Amazon keeps its servers in a godown :)

  11. Karion

    I think Amazon must build crank-driven servers, or the best option is to make pedals like bicycles have, so each server has an employee sitting on it with feet on the pedals. If there is an electric failure, an alarm should be raised and the employee starts pedaling the server ;) :D :D :D

  12. Paulus

    My guess is that the thunderstorm caused serious voltage sags on the grid supplying the Amazon data center. The voltage sags might have happened a few times within a short period, say, 60 minutes. When the voltage dropped outside the flywheel UPS's input range, -10%, the flywheel UPS recognized this as an outage event and started to discharge. The voltage was probably back to normal range within less than 2 seconds, and the flywheel UPS stopped discharging and changed into charging mode without triggering the backup genset to start. However, say, in 10 or 15 minutes, another less-than-2-second sag happened. The flywheel UPS repeated the same protection procedure. But at the end of the discharging, even a partial one, the flywheel was overheated and had to cool down. Since the flywheel is in a closed vacuum space, the heat dissipation takes a long time. Unfortunately, a third sag happened. This time the flywheel UPS was dead and the load was lost.