Lightning can be a hazard to data center uptime, as illustrated by power outages at Amazon and Microsoft data centers in Dublin. (Photo by dagpeak via Flickr)

Outage in Dublin Knocks Amazon, Microsoft Data Centers Offline


A lightning strike has caused power outages at the major cloud computing data hubs for Amazon and Microsoft in Dublin, Ireland. The incident has caused downtime for many sites using Amazon’s EC2 cloud computing platform, as well as users of Microsoft’s BPOS (Business Productivity Online Suite).

Amazon said that lightning struck a transformer near its data center, causing an explosion and fire that knocked out utility service and left it unable to start its generators, resulting in a total power outage. While many sites were restored, Amazon said some sites that rely on one of its storage services may take 24-48 hours to fully recover. The company is bringing additional hardware online to try to speed the recovery process, and is advising customers whose sites are still offline to re-launch them in a different zone of its infrastructure.

UPDATE: On Wednesday, Aug. 10, local utility ESB Networks said that lightning was not the cause of the transformer failure that dropped utility power to both major data centers. The power company said the actual cause of the transformer failure remains under investigation.

Amazon said the event affected one of the EC2 Availability Zones in its Dublin data center, which is the company’s primary European hub for its cloud computing platform.

Generator Systems Disrupted

“Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators,” Amazon said in an update on its status dashboard. “The transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronizes the backup generator plant, disabling some of them.”

“Power sources must be phase-synchronized before they can be brought online to load. Bringing these generators online required manual synchronization. We’ve now restored power to the Availability Zone and are bringing EC2 instances up.”
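Phase synchronization, in general terms, means that before a generator's breaker is closed onto a bus that is already energized (by the utility or by another generator), the generator's voltage, frequency and phase angle must closely match that bus. The Python sketch below shows a simplified version of that permissive check, the comparison that automatic synchronizing equipment makes and that operators make by eye with a synchroscope when synchronizing manually. The `Source` type, the function names and the tolerance values are hypothetical and chosen only for illustration; they are not settings used by Amazon or by any specific switchgear vendor.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Snapshot of an AC source: RMS voltage (V), frequency (Hz), phase angle (degrees)."""
    voltage: float
    frequency: float
    phase_deg: float


# Hypothetical tolerances for illustration only (not Amazon or vendor settings).
MAX_VOLTAGE_DIFF_PCT = 5.0   # allowed voltage mismatch, percent of bus voltage
MAX_FREQ_DIFF_HZ = 0.2       # allowed frequency mismatch ("slip"), in Hz
MAX_PHASE_DIFF_DEG = 10.0    # allowed phase-angle mismatch, in degrees


def phase_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute angle between two phase angles, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def ok_to_close(generator: Source, bus: Source) -> bool:
    """Return True only if voltage, frequency and phase all match the bus within tolerance."""
    voltage_ok = abs(generator.voltage - bus.voltage) <= bus.voltage * MAX_VOLTAGE_DIFF_PCT / 100.0
    freq_ok = abs(generator.frequency - bus.frequency) <= MAX_FREQ_DIFF_HZ
    phase_ok = phase_difference(generator.phase_deg, bus.phase_deg) <= MAX_PHASE_DIFF_DEG
    return voltage_ok and freq_ok and phase_ok


if __name__ == "__main__":
    bus = Source(voltage=400.0, frequency=50.0, phase_deg=0.0)
    nearly_in_step = Source(voltage=398.0, frequency=50.1, phase_deg=355.0)
    out_of_phase = Source(voltage=400.0, frequency=50.0, phase_deg=180.0)
    print(ok_to_close(nearly_in_step, bus))  # True: all three quantities within tolerance
    print(ok_to_close(out_of_phase, bus))    # False: 180 degrees out of phase
```

In this example, a generator running at 50.1 Hz and 5 degrees behind a 50 Hz bus passes the check, while one 180 degrees out of phase fails it; closing a breaker in that state would drive a large current surge through the machines, which is why the check has to pass before a generator is brought online to load.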

Amazon said the power outage began at 10:41 a.m. Pacific time, with instances beginning to recover about three hours later at 1:47 p.m. Pacific time. Recovery is taking longer for some user instances, including those using Amazon Elastic Block Store (EBS), the company said.

UPDATE: As of 10:45 p.m. Eastern, Amazon reported that 60% of the impacted instances have recovered and are available. “Stopping and starting impaired instances will not help you recover your instance,” AWS said. “For those looking for what you can do to recover more quickly, we recommend re-launching your instance in another Availability Zone.”

UPDATE 2: Early Monday, Amazon said that problems with EBS were slowing the recovery. “Restoring these volumes requires that we make an extra copy of all data, which has consumed most spare capacity and slowed our recovery process,” Amazon said in a status update. “We are in the process of installing additional capacity in order to support this process both by adding available capacity currently onsite and by moving capacity from other availability zones to the affected zone. While many volumes will be restored over the next several hours, we anticipate that it will take 24-48 hours until the process is completed.”

Microsoft Outage

The Twitter feed for Microsoft Online Services reported that a European data center power issue had affected access to its BPOS services. Microsoft reported that services were starting to come back online as of 7:30 pm Eastern/4:30 pm Pacific. A Microsoft statement said that a “widespread power outage in Dublin caused connectivity issues for European BPOS customers. Throughout the incident, we updated our customers regularly on the issue via our normal communication channels.”

“We were informed the incident was the result of a lightning strike that caused an explosion on one of the substation’s transformers,” said a Microsoft spokeswoman. “The lightning strike created an electrical surge large enough that it affected a portion of our backup power systems. Our team is highly trained to respond to various types of issues, so they then began to manually transfer to generator power.”

Microsoft said that the Dublin data center’s utility power was restored at 03:10 AM PST Monday and the data center “has returned to normal operating conditions.”

Key European Cloud Computing Hub

Dublin has become a key cloud computing gateway to Europe and beyond for U.S. companies due to several factors, including the city’s location, connectivity, climate and ready supply of IT workers. Dublin’s temperature is ideal for data center cooling, allowing companies to use fresh air to cool servers instead of using huge, power-hungry chillers to refrigerate cooling water.

This allowed Microsoft to design and build one of the world’s most efficient data centers, a huge facility that hosts the company’s cloud services for Europe and operates entirely without chillers. At 550,000 square feet, it is also one of the world’s largest data centers.

Amazon opened a data center in Dublin in December 2008 to house the European availability zones for its EC2 cloud computing services. The company recently acquired a 240,000 square foot building in Dublin, which will be converted into an expansion data center.

The company’s property moves reflect the rapid growth of its European cloud computing operation, which was chronicled by Netcraft in December.

Lightning photo by Dagpeak via Flickr.

About the Author

Rich Miller is the founder and editor at large of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.


58 Comments

  1. We use Amazon cloud-based order management software (linnworks). It went down yesterday (7 Aug 2011) at around 6 PM; it is now 7 AM on 8 Aug 2011 and it is still not working (the software cannot find any available cloud server).

  2. That's pretty bad, but I didn't actually notice it on any of the sites or services I use.

  3. This is quite worrying news as we move more and more of our data online. Hope cloud services can improve their contingency and redundancy plans!

  4. We have also been hit (www.kinderfee.de) and have been offline since last night. It seems Amazon will still take some time to get things running. A large number of European startups relying on the Amazon infrastructure will not generate any sales today... Amazon seems to have repaired its own sites first (e.g. amazon.de)...

  5. Brian

    Round 1: "The Cloud" v Lightning....(winner Lightning)

  6. These are some of the challenges cloud computing is facing, with a lot of collateral damage.

  7. There are definitely odd discrepancies between the apparent facts and the stories Amazon and Microsoft have released: there was no lightning strike yesterday in Dublin. The local power supply company had a short and very isolated outage in one small part of Dublin. There was NO explosion and no fire according to the local power supplier! Not sure where Amazon got their information from, and it is surprising that they can claim their backup generators didn't start because of an explosion if no explosion took place. Microsoft's new and modern data centre is not in the area where the power outage happened, and given the limited information available from Microsoft it cannot be concluded that its outage happened at the same time! Yes, there was a very short (less than a second) power outage, and the local electricity supplier says that the power source was immediately switched over to an alternate supply. But there seem to be some issues at Amazon! Otherwise their backup systems would have kicked in properly!

  8. KeepItHome

    Never trust the online data world. Keep your data at home. If something goes wrong with the cloud, you have no way to save yourself. If it goes wrong at home, you know what to do, what it will take, how to do it, and so on.

  9. Skaperen

    “Power sources must be phase-synchronized before they can be brought online to load." That's only true if you are trying to make a closed-transition switchover (whether mains is live or dead). A closed-transition system is useless for maintaining power from generators, since the mains outage can come as a surprise faster than even the fastest startup generators can kick in. The solution is to use open-transition switching to generators supplying a datacenter that is fully protected by a UPS/battery infrastructure (whether one per datacenter, per row, per rack, or per machine). Stagger the switchover in groups or section the generators for a smoother transition. Don't even try to parallel generators in the design, just section everything apart by generator/switch.

  10. DG

    Skaperen, I think they mean that they had to parallel the generators together onto one bus, not parallel the utility power and the generators, i.e. bring up one generator, bring up the second, watch the synchroscope go around, and hit the breaker when it gets to 12 o'clock. They had UPS. They just didn't have good enough staff to manually bring up the gensets before the UPS power died. Still, a better design would have prevented the downtime.

  11. A common misconception of the cloud is that it is a panacea for everything that ails IT, but it is important to remember that regardless of where your infrastructure is, “everything fails all the time”, and as such, you need to plan for it. Only one (of three) of the AWS AZs were affected by this outage, so if architectural best practices are followed, events such as this can be tolerated with little to no service disruption. Many RightScale customers come to us to discuss this very issue, including HA solutions and DR scenarios. Our best practices in this regard are summarized here: http://bit.ly/rightscalewp

  12. Henrik

    It is just too much; they don't have enough control of backup procedures and so on.

  13. There was a massive lightning strike and my neighborhood (in Dublin) was out for more than 30 minutes, which hasn't happened in many years. Sometimes you get a one-second drop due to a local lightning strike, but this was definitely more serious. So, I'd question the source of the local utility statement.

  14. The local utility has now confirmed that lightning was not the cause of the transformer failure that led to the outage. We've updated our story to reflect this. See this update for details: http://www.datacenterknowledge.com/archives/2011/08/10/dublin-utility-power-outage-not-caused-by-lightning-strike/

  15. It's instances like these that have to remind all of us how vulnerable our industry can be. Something so small can have such a massive international effect.

  16. Asaf Meir

    Perhaps the solution is an atomic shelter, nothing less than that!