Amazon Cloud Back Online After Major Christmas Outage


Amazon Web Services suffered an outage that spanned Christmas Eve and Christmas Day and affected the streaming video service from Netflix. (Photo by BCP via Flickr.)

Amazon Web Services says it has recovered from the latest major outage of its cloud computing service, which affected large customers including Netflix and Heroku. The problems with Amazon's Elastic Load Balancing (ELB) service began on Christmas Eve at 1:45 p.m. Pacific time and weren't fully resolved until 9:41 a.m. on Christmas Day, an outage of about 20 hours.

The incident was the latest in a series of outages for Amazon's US-East-1 region, the oldest and most crowded portion of its cloud computing infrastructure. The downtime raised new questions about Amazon's management of the region, and about the prospect that load balancing problems in a single zone can undermine the benefits of hosting assets in multiple availability zones, a scenario that first showed up in an extended outage last summer.

This was the second AWS-related outage in six months for Netflix, one of Amazon's most sophisticated customers, which noted on its Twitter feed that it was "terrible timing." The streaming video service gradually restored service to different devices throughout the night, but it wasn't until 9 a.m. Pacific on Christmas morning, more than 19 hours after the incident began, that Netflix reported full recovery.

The ELB service is important because it is widely used to manage reliability: it lets customers shift capacity between availability zones, a key strategy for preserving uptime when a single data center experiences problems.
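
It's worth a concrete picture of what that shifting looks like from a customer's side. Here is a minimal sketch using the boto3 SDK and the classic ELB API: a load balancer spread across two availability zones, and the call an operator could make to pull an impaired zone out of rotation. The load balancer name, zones, and listener settings are illustrative assumptions, not details from the incident.

    import boto3

    # Classic Elastic Load Balancing client (the ELB service discussed above).
    elb = boto3.client("elb", region_name="us-east-1")

    # Create a load balancer that spreads traffic across two availability zones.
    elb.create_load_balancer(
        LoadBalancerName="example-web-elb",  # hypothetical name
        Listeners=[{
            "Protocol": "HTTP",
            "LoadBalancerPort": 80,
            "InstanceProtocol": "HTTP",
            "InstancePort": 80,
        }],
        AvailabilityZones=["us-east-1a", "us-east-1b"],
    )

    # If one zone degrades, pull it out of rotation so the ELB stops routing
    # traffic there; this is the "shift capacity" step described above.
    elb.disable_availability_zones_for_load_balancer(
        LoadBalancerName="example-web-elb",
        AvailabilityZones=["us-east-1a"],
    )

The catch, as both this incident and the June outage showed, is that these calls go through the same ELB control plane that was struggling, which is exactly when customers most need them to work.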

During a June 29 outage, Amazon said a bug in its Elastic Load Balancing system prevented customers from quickly shifting workloads to other availability zones. This had the effect of magnifying the impact of the outage, as customers that normally use more than one availability zone to improve their reliability (such as Netflix) were unable to shift capacity.

In a July 2 incident report on that event, Amazon outlined steps it would pursue to avoid a repeat of these issues: "As a result of these impacts and our learning from them, we are breaking ELB processing into multiple queues to improve overall throughput and to allow more rapid processing of time-sensitive actions such as traffic shifts. We are also going to immediately develop a backup DNS re-weighting that can very quickly shift all ELB traffic away from an impacted Availability Zone without contacting the control plane."
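
Amazon hasn't published the internals of that backup DNS re-weighting, but the idea of draining a troubled zone by adjusting DNS weights can be sketched with Route 53 weighted records, which expose the same concept to customers. This is only an illustration of the technique; the hosted zone ID, record names, and ELB endpoints below are hypothetical.

    import boto3

    # Route 53 client; weighted record sets split traffic between endpoints,
    # and the split can be changed without touching the endpoints themselves.
    route53 = boto3.client("route53")

    def set_zone_weight(zone_id, record_name, set_identifier, target, weight):
        """Upsert a weighted CNAME; a weight of 0 effectively drains that endpoint."""
        route53.change_resource_record_sets(
            HostedZoneId=zone_id,                     # hypothetical hosted zone
            ChangeBatch={"Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,              # e.g. "www.example.com."
                    "Type": "CNAME",
                    "SetIdentifier": set_identifier,  # one record per zone/ELB
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": target}],
                },
            }]},
        )

    # Drain the record pointing at the impaired zone's endpoint and keep the
    # healthy one at full weight.
    set_zone_weight("Z123EXAMPLE", "www.example.com.", "zone-a", "elb-a.example.com", 0)
    set_zone_weight("Z123EXAMPLE", "www.example.com.", "zone-b", "elb-b.example.com", 100)

Because this works purely at the DNS layer, it does not depend on the ELB control plane being responsive, which appears to be the property Amazon is after in the quoted remediation.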

It will be interesting to see whether Amazon's load balancing problems were related to any of the issues identified in July, and what new solutions are devised to address them. We'll likely see information on that front soon, as the Amazon team has been scrupulous about publishing detailed incident reports.

About the Author

Rich Miller is the founder and editor at large of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.


5 Comments

  1. Blos

    Maybe it's cool to mention Dyn, XDN, Cedexis and others... that have solutions for bypassing this typical fu***ing event.

  2. At Total Uptime our failover and load balancing solutions are external to AWS, so you can actually failover from region to region.

  3. Want to bet at the end of the day this doesn't count as an outage with the Amazon SLA? I'm so glad I'm done using that shit service. Can't believe Netflix hosts on infrastructure from their competitor, just brain dead. ELB itself is just an absolutely terrible service. Netflix made news a while back that, due to DNS caching, millions of Netflix API requests went to other Amazon customers (because ELB randomly changes IPs without notice).

  4. Bill Kleyman

    This is just unfortunate... it really is. I truly felt that AWS had learned its lessons and worked hard to create a more resilient cloud platform. I guess that wasn't the case. I recently wrote this article for DCK: http://www.datacenterknowledge.com/archives/2012/12/05/the-cloudy-side-of-cloud-computing/ -- which outlined the risks of cloud computing. I still like the flexibility and agility of the cloud. But Amazon needs to get their act together... Seriously.

  5. Dan Creswell

    Perhaps we should all wait on the postmortem analysis before wading in? Sure, Amazon has its foibles, but then so does everything else, and considering the scale they operate at, they likely aren't doing too badly. I've been around, and over the long term no hosting provider or datacentre is perfect in my experience. Are Amazon really worse? Well, let's see some real numbers!

    Elsewhere in the comments: "because ELB randomly changes IPs without notice." For which Amazon does this: "works around this being a glaring problem by setting very low TTLs (time-to-live) on the domain name mappings for those machines." Which, unfortunately, gets broken like this: "but clients that are caching the IP address and not honoring the TTL may still be trying to connect to the old ELB after it has been replaced by another one that your app was using." Clients not honouring TTL? Is that truly Amazon's problem then?

    Please don't read me as pro-Amazon. There are issues with AWS for sure, around variable latency in I/O and network for starters. All I'm asking for is a bit of balance, and perhaps not so much self-serving: "At Total Uptime our failover and load balancing solutions are external to AWS, so you can actually failover from region to region."