Major Amazon Outage Ripples Across Web


When a busy cloud computing platform crashes, the impact is felt widely. That’s the case with today’s extended outage for Amazon Web Services, which is battling latency issues at one of its northern Virginia data centers. The problems are rippling through to customers, causing downtime for many services that use Amazon’s cloud to run their web services.

The sites knocked offline by Amazon’s problems include social media hub Reddit, the HootSuite link-sharing tool, the popular question-and-answer service Quora, and even a Facebook app for Microsoft (see a full list of affected sites).

The issues began at about 1 a.m. Pacific time and are continuing as of 2:30 p.m. Pacific, with Amazon saying it still cannot predict when services will be fully recovered. By mid-afternoon, Amazon said it had limited the problems to a single availability zone in the Eastern U.S., and was attempting to route around the affected infrastructure. The AWS status dashboard shows that the affected services include Elastic Compute Cloud (EC2), Amazon Relational Database Service, and Amazon Elastic MapReduce, with the problems concentrated in the US-East-1 region.

Networking Event Triggers Problems

The problems are focused on Elastic Block Storage (EBS), which provides block level storage volumes for use with Amazon EC2 instances. Latency problems at EBS were cited by Reddit when the site experienced major downtime in March.
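
For readers unfamiliar with how EBS fits into EC2, here is a minimal sketch, using the modern boto3 SDK rather than anything from the article, of what "block level storage volumes for use with EC2 instances" means in practice. The zone, size, and instance ID are hypothetical placeholders.

```python
# Illustrative sketch only: create an EBS volume and attach it to an EC2
# instance as a block device. All identifiers below are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# EBS volumes live in a single Availability Zone and must be created in the
# same zone as the instance they will attach to.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=10,              # GiB
    VolumeType="gp2",
)

# Attach the volume to a running instance; it then appears to the guest OS
# as an ordinary block device.
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",   # hypothetical instance ID
    Device="/dev/sdf",
)
```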

“A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1,” Amazon said in a status update just before 9 am Pacific time. “This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances.

“We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue,” Amazon continued. “We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.”

UPDATE: At 10:30 Pacific, Amazon said it was making “significant progress in stabilizing the affected EBS control plane service,” which was now seeing lower failure rates. “We have also brought additional capacity online in the affected Availability Zone and stuck EBS volumes (those that were being remirrored) are beginning to recover. We cannot yet estimate when these volumes will be completely recovered, but we will provide an estimate as soon as we have sufficient data to estimate the recovery.”

UPDATE 2: At 1:48 p.m. Amazon said a single Availability Zone in the US-EAST-1 region continues to experience problems launching EBS backed instances or creating volumes. “All other Availability Zones are operating normally,” Amazon said. “Customers with snapshots of their affected volumes can re-launch their volumes and instances in another zone. We recommend customers do not target a specific Availability Zone when launching instances. We have updated our service to avoid placing any instances in the impaired zone for untargeted requests.”
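
As a rough sketch of the workaround Amazon describes above, recovery from a snapshot might look something like the following with the boto3 SDK. This is an illustration under assumed conditions, not the procedure any affected customer actually ran; the snapshot and AMI IDs, zone, and instance type are hypothetical.

```python
# Illustrative sketch: recreate an affected volume from its snapshot in a
# healthy Availability Zone, then launch a replacement instance without
# targeting a specific zone, per Amazon's recommendation.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Recreate the volume from its snapshot in a zone other than the impaired one.
new_volume = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",   # hypothetical snapshot ID
    AvailabilityZone="us-east-1b",         # any unaffected zone
)

# Launch the replacement instance with no Placement/AvailabilityZone set,
# letting the service steer the request away from the impaired zone.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # hypothetical AMI
    InstanceType="m1.small",               # period-appropriate placeholder type
    MinCount=1,
    MaxCount=1,
)
```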

The outage has even affected a Microsoft initiative, according to a Facebook post by the company. “For those of you trying to enter our ‘Big Box of Awesome’ sweepstakes…the entry site is currently down, related to a broader problem impacting a number of sites across the internet today,” Microsoft told its Facebook followers. “We’ll let you know when it’s back up.” Microsoft has its own data center infrastructure, but some business units use third-party services. The Big Box of Awesome Facebook app is hosted on EC2.

Multi-Region Failover Option

The outage appears to affect many, but not all, customers using the US-East-1 region. Amazon operates multiple regions, allowing users to add redundancy to their applications by hosting them in several regions. In a multi-region setup, when one region experiences performance problems, customers can shift workloads to an unaffected region.
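
To make the multi-region idea concrete, here is a minimal, hedged sketch of one way a customer might shift capacity to a standby region. It assumes an AMI already copied to both regions and a simple application health-check URL; the regions, AMI IDs, and URL are placeholders, and this is not the pattern Amazon or any affected site is known to have used.

```python
# Illustrative multi-region failover sketch: if the primary region's
# application health check fails, bring up replacement capacity in a
# standby region. All identifiers are hypothetical.
import boto3
import urllib.request

PRIMARY_REGION = "us-east-1"
STANDBY_REGION = "us-west-1"
STANDBY_AMI = "ami-0123456789abcdef0"       # hypothetical AMI copied to the standby region
HEALTH_URL = "https://example.com/health"   # hypothetical application health check

def primary_is_healthy() -> bool:
    """Probe the application running in the primary region."""
    try:
        return urllib.request.urlopen(HEALTH_URL, timeout=5).getcode() == 200
    except OSError:
        return False

if not primary_is_healthy():
    # Launch replacement instances in the unaffected region.
    standby_ec2 = boto3.client("ec2", region_name=STANDBY_REGION)
    standby_ec2.run_instances(
        ImageId=STANDBY_AMI,
        InstanceType="m1.small",   # period-appropriate placeholder type
        MinCount=2,
        MaxCount=2,
    )
    # DNS would then be repointed at the standby region (for example via Route 53)
    # to complete the failover.
```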

Whenever Amazon Web Services experiences outages and performance problems, it typically highlights the multi-region option, which allows customers to avoid having their cloud assets constitute a “single point of failure.” Today’s outage is likely to prompt some customers that rely on Amazon to examine adding regions to their deployments, along with other strategies for working around EC2 outages.

The outage is also likely to prompt discussion of the reliability of cloud computing, and that’s a fair question to raise: today’s outage has affected many customers, highlighting the vulnerability of having a single service host many popular sites.

This has also been true of earlier outages at dedicated hosting providers like The Planet or data center hubs like Fisher Plaza. Companies relying upon those facilities could avoid outages by adding backup installations at other data centers – which is essentially the same principle as adding additional zones at Amazon.

Stuff happens. We write about outages all the time. But real-world downtime is particularly problematic in the context of claims that the cloud “never goes down.” Cloud infrastructure can also fail. The difference is that cloud deployments offer new options for managing redundancy and routing around failures when they happen.

About the Author

Rich Miller is the founder and editor at large of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.

35 Comments

  1. Amazon is right. It isn't that hard to write software for multiple regions. In fact, it makes it so easy for small orgs to have multiple physical data centers it is amazing when they don't take advantage of it: http://smoothspan.wordpress.com/2011/04/21/what-to-do-when-your-cloud-is-down/ Best, BW

  2. John

    NOoOoooo mobipe.com is also down.

  3. I think what we are beginning to see here is the breakdown of the unlimited resource myth that Amazon and the bigger cloud players have been perpetuating for years now. Call it a bubble, pyramid scheme, whatever … It’s classic irrational exuberance that defines them all. Does anyone really believe that Amazon can infinitely scale to support every internet application in the world? It just doesn’t make technical sense. And multiple availability zones? What happens when the I/O to your storage is completely inundated so that you can’t replicate? Cascading failure of shared resources anyone? It really has been surprising to me how many smart technical people have fallen victim to this fallacy, although to be fair, irrational belief systems happen with “bubbles” in finance, housing, etc. Ask anyone who has implemented internet infrastructure, and they will tell you how technically difficult it is to deploy infrastructure at this level. It’s not surprising that Amazon has been fairly opaque when it comes to detailing exactly how their infrastructure is put together. That might have just pulled back the curtain and revealed the truth behind the myth.

  4. This is exactly the kind of helpless "who do I call" and "how much longer until it is back up" hanging that these customers are basing their entire platform on. This is why a lot of these companies need to look at private dedicated and private shared compute clouds with companies that are evenly priced but have much better solutions, support, and SLAs.

  5. Dan Stadler

    Perhaps everyone should just take a break from logging in to websites, go out and enjoy the sunshine, and not freak out so much about this.

  6. The primary factor holding back CIOs and IT managers from more aggressively pursuing cloud solutions is risk management/security. This incident will only heighten those concerns and at the same time improve the chances of private cloud solutions in the near term. It is a little scary. But as we saw and continue to see in Japan, stuff happens.

  7. Dan, that's a perfect suggestion. In fact, I'll do that! Anyway, disaster recovery and continuity plans should be something that every business takes into consideration. A failure like this is always going to be an issue with cloud computing, regardless of the prestige of the company.

  8. KP

    It's time to provide a back-to-on-premises option, or deploy applications across multiple cloud providers and have interoperability between them (Amazon, Azure, Rackspace, etc.).

  9. For all customers affected by EC2 downtime, I would like to recommend ElasticHosts as an alternative cloud service (www.elastichosts.com) - we offer a 5 day free trial for our cloud servers in US or UK, which is likely enough at least to bridge the gap.

  10. srs

    Can't rely on just one cloud vendor. Check out this simple animation that shows how to avoid these types of problems: You want to look at the "Complete in the cloud IT Organization" at the link below. http://www.batblue.com/usecases.php?first=499

  11. Ram

    73+ hours and counting. Perhaps cloud computing is not ready for prime time. As a PhD student looking in on the false promise of the cloud, I am surprised that people smarter than me didn't see that the emperor is naked. Saurik of Cydia fame being one. Perhaps you'd consider hosting on multiple different servers rather than just Amazon?

  12. Found on blog.rightscale.com, where Amazon was quoted as stating the following: "Our services team handled 4x the incident volume last Thursday compared to a normal Thursday. A large number of callers needed help in assessing the situation or in bringing their servers back up. A typical request was: “It looks like my db server is down due to the outage, can you help confirm and assist with a migration?” Unfortunately we also heard from a good number of users who were using a single availability zone or didn’t set up redundancy properly. Hindsight is always 20-20." I assume this has been fixed now.