Here's What Caused Sunday's Amazon Cloud Outage

This article originally appeared at The WHIR

A minor network disruption at Amazon Web Services led to an issue with its NoSQL database service DynamoDB, causing some of the internet’s biggest sites and cloud services to become unavailable on Sunday, and related problems continued to plague the public cloud service with a further disruption on Wednesday.

In a blog post outlining the issue experienced Sunday, AWS explained that the US-East region experienced a brief network disruption impacting DynamoDB’s storage servers.

Understanding How DynamoDB Manages Table Data and Storage Server Requests

DynamoDB’s internal metadata service manages the partitions of sharded table data, keeping track of partitions spread across many servers. The specific assignment of a group of partitions to a given server is called a “membership”.

The storage servers that hold data in partitions periodically send a request to the metadata service to confirm they have the correct membership and to check for new table data present on other partitions. Storage servers also send a membership request after a network disruption or on startup.

If storage servers don’t receive a response from the metadata service within a specific time period, they will retry, but also disqualify themselves from accepting requests. Even before Sunday, membership requests were taking longer to complete because of a new feature that has been expanding the amount of membership data being requested.

The new DynamoDB feature called “Global Secondary Indexes” allows customers to access table data using multiple alternate keys. For tables with many partitions, this means the overall size of a storage server’s membership data could increase two- or three-fold, so the metadata service takes longer to fulfill membership requests. AWS said it wasn’t doing detailed tracking of membership size, and didn’t provide enough capacity to the metadata service to handle these larger requests.

Again, storage servers unable to obtain their membership data within an allotted time retry, and remove themselves from taking requests.
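The renewal behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic as reported, not AWS code: the class name `StorageServer`, the `fetch_membership` callable, and the one-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StorageServer:
    """Hypothetical model of a storage server's membership renewal."""

    # Assumed interface: returns the membership list, or None on timeout.
    fetch_membership: Callable[[float], Optional[list]]
    timeout_s: float = 1.0  # illustrative value, not from AWS
    accepting_requests: bool = True
    membership: Optional[list] = None
    retries: int = 0

    def renew_membership(self) -> bool:
        """Try one renewal; on timeout, stop serving and count a retry."""
        result = self.fetch_membership(self.timeout_s)
        if result is None:
            # No answer in time: the server disqualifies itself from
            # accepting requests and will ask the metadata service again.
            self.accepting_requests = False
            self.retries += 1
            return False
        self.membership = result
        self.accepting_requests = True
        return True
```

The point to notice is that a timeout has two effects at once: the server drops out of service, and it generates another request for the metadata service to handle.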

A Network Disruption Triggers a Meltdown

When the network disruption hit on Sunday morning, storage servers simultaneously requested membership data, including some enormous membership lists, flooding the metadata service with many large requests that timed out. As a result, storage servers could not complete their membership renewal and became unavailable for requests. They nevertheless kept retrying their requests for membership data, further clogging DynamoDB.
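A toy model makes the feedback loop easier to see. The sketch below is a deliberate simplification and rests on assumptions not found in the AWS post: every server retries in the same round, the metadata service shares a fixed `capacity` evenly across all pending requests, and a round either finishes all requests in time or none of them. The function name and the numbers are hypothetical.

```python
def simulate_renewals(num_servers, request_size, capacity, max_rounds=10):
    """Return how many servers are still unavailable after each retry round.

    Assumption: the metadata service can do `capacity` units of work per
    timeout window and shares it evenly, so if the total work exceeds the
    capacity, every request is too slow and all of them time out.
    """
    pending = num_servers
    history = []
    for _ in range(max_rounds):
        total_work = pending * request_size
        if total_work <= capacity:
            pending = 0  # every request is answered within the timeout
        # Otherwise nothing completes, and all servers retry next round.
        history.append(pending)
        if pending == 0:
            break
    return history


# Small membership lists: the service keeps up and everyone recovers.
print(simulate_renewals(100, request_size=1, capacity=150))  # [0]

# Three-fold larger lists: the same service never catches up.
print(simulate_renewals(100, request_size=3, capacity=150, max_rounds=5))
# [100, 100, 100, 100, 100]
```

In this model the overloaded case never recovers on its own, because the retries keep the load exactly as high as it was. Something outside the loop has to change, either the load or the capacity.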

By 2:37 am PDT, around 55 percent of requests to DynamoDB failed.

AWS attempted to fix the issues by adding capacity to the metadata service to deal with the additional load of the membership data requests. However, the service was under such high load that administrative requests couldn’t get through.

At around 5 am, AWS paused metadata service requests, which decreased retry activity and relieved much of the load on the metadata service so that it would respond to administrative requests, allowing admins to add capacity. DynamoDB error rates dropped to acceptable levels by 7:10 am.

Preventing DynamoDB From Causing More Problems

AWS said it is doing four things to ensure a similar event doesn’t happen again: increasing the capacity of the metadata service; implementing stricter monitoring of DynamoDB performance; reducing the rate of storage node membership data requests and allowing more time to process queries; and segmenting DynamoDB so there are essentially many metadata services available for queries.
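Two of these measures map onto well-known techniques, sketched below in Python. AWS did not say how it implemented either one, so this is an assumption-laden illustration: exponential backoff with jitter is one common way to reduce and spread out retry traffic, and hashing each server to a segment is one common way to split a service into many. The function names, the default delays, and the server ID format are hypothetical.

```python
import hashlib
import random


def backoff_delay(attempt, base_s=1.0, cap_s=60.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    Uses "full jitter": a random delay between 0 and an exponentially
    growing ceiling, so servers that failed together do not retry together.
    """
    upper = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, upper)


def metadata_segment(server_id, num_segments):
    """Map a storage server to one of several metadata service segments.

    A stable hash keeps each server on the same segment across restarts,
    and an overload in one segment does not spill into the others.
    """
    digest = hashlib.sha256(server_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


# Example: the retry ceiling grows 1 s, 2 s, 4 s, ... up to the 60 s cap.
for attempt in range(4):
    print(attempt, round(backoff_delay(attempt), 2))

# Example: which of 4 hypothetical segments serves this server.
print(metadata_segment("storage-0042", 4))
```

The random element matters as much as the growing delay. Without it, servers that timed out in the same instant would all come back in the same instant, recreating the original spike.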

Ensuring DynamoDB works correctly is especially important because many services such as Simple Queue Service (SQS), EC2 Auto Scaling, CloudWatch, the AWS Console and others were affected by DynamoDB’s high error rates.

Meanwhile, a second, albeit less critical, issue was reported on Wednesday, with latency and errors affecting the DynamoDB metadata services, along with disruptions to Elastic Block Store (EBS), new instance launches, and Auto Scaling services in the US-East-1 region.

Such errors are a serious concern for enterprises that rely on AWS to run their businesses.

Some of the companies reportedly impacted in the Sunday outage included Netflix, Reddit, Product Hunt, Medium, SocialFlow, Buffer, GroupMe, Pocket, Viber, Amazon Echo, NEST, IMDB, and Heroku.

In its message to customers, AWS apologized for the cloud outage, noting, “For us, availability is the most important feature of DynamoDB, and we will do everything we can to learn from the event and to avoid a recurrence in the future.”

This first ran at http://www.thewhir.com/web-hosting-news/amazon-sheds-light-on-dynamodb-disruption-that-caused-massive-outage
