Software Bug, Cascading Failures Caused Amazon Outage

A software bug in a data collection agent for Amazon's Elastic Block Store (EBS) system triggered last Monday's outage, which affected many services running on Amazon Web Services, the company said today. In a thorough incident report, Amazon said the chain of events began with the hardware failure of a single server handling data collection. The company said it will issue service credits to some customers who were unable to access APIs, acknowledging that its efforts to protect its systems proved too aggressive.

After the server was replaced, DNS records were updated to reflect the new hardware, but the DNS update didn't propagate correctly. "As a result, a fraction of the storage servers did not get the updated server address and continued to attempt to contact the failed data collection server," Amazon reported. Over time, this led to a condition in which a growing number of EBS volumes became “stuck” (i.e., unable to process further I/O requests).

AWS' monitoring didn't properly detect the problem until a growing number of servers were running out of memory, creating a logjam that had spillover effects on other Amazon cloud services, including its EC2 compute cloud, its Relational Database Service (RDS) and Elastic Load Balancing (ELB).
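
Amazon's agent code is not public, so the following is only an illustrative sketch of a general defensive pattern, not a reconstruction of the actual bug or its fix. It assumes a hypothetical ReportingAgent that queues samples for a collector. Because the queue is bounded, the agent's memory use stays fixed when the collector is unreachable: the oldest samples are dropped instead of accumulating. The names ReportingAgent, send and max_pending are invented for this example.

```python
from collections import deque


class ReportingAgent:
    """Hypothetical agent that forwards samples to a collector.

    Unsent samples are kept in a bounded queue, so an unreachable
    collector cannot make the agent's memory grow without limit.
    """

    def __init__(self, send, max_pending=1000):
        self.send = send  # callable that delivers one sample, may raise OSError
        # A deque with maxlen silently discards the oldest item when full.
        self.pending = deque(maxlen=max_pending)

    def report(self, sample):
        """Queue a sample and try to deliver everything that is pending."""
        self.pending.append(sample)
        self.flush()

    def flush(self):
        """Deliver queued samples in order, stopping at the first failure."""
        while self.pending:
            try:
                self.send(self.pending[0])
            except OSError:
                return  # collector unreachable, keep the sample and retry later
            self.pending.popleft()
```

The trade-off in this sketch is deliberate: losing old monitoring samples is treated as preferable to exhausting memory on a server whose main job is handling storage requests.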

As it sought to manage the incident, Amazon implemented "throttling," limiting access to its APIs to protect the system.

"Throttling is a valuable tool for managing the health of our services, and we employ it regularly without significantly affecting customers’ ability to use our services," Amazon said. "While customers need to expect that they will encounter API throttling from time to time, we realize that the throttling policy we used for part of this event had a greater impact on many customers than we understood or intended. While this did not meaningfully affect users running high-availability applications architected to run across multiple Availability Zones with adequate running capacity to failover during Availability Zone disruptions, it did lead to several hours of significant API degradation for many of our customers. This inhibited these customers’ ability to use the APIs to recover from this event, and in some cases, get normal work done. Therefore, AWS will be issuing a credit to any customer whose API calls were throttled by this aggressive throttling policy (i.e. any customer whose API access was throttled between 12:06PM PDT and 2:33PM PDT) for 100% of their EC2, EBS and ELB usage for three hours of their Monday usage (to cover the period the aggressive throttling policy was in place)."
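
Amazon's report, as quoted here, does not say how its API throttling is implemented. As a rough illustration of the general idea, the sketch below uses a token bucket, one common rate-limiting technique, chosen here as an assumption rather than as a description of AWS's mechanism. The class name TokenBucket and its parameters rate, capacity and clock are hypothetical.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum tokens the bucket can hold
        self.clock = clock               # injectable time source
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed, False if it is throttled."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In a scheme like this, lowering rate or capacity during an incident protects the backend but rejects more legitimate calls, which is the kind of trade-off Amazon says it set too aggressively.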

Read the full incident report for additional details.
