Last week's lengthy outage for the Amazon Web Services cloud computing platform was caused by a network configuration error as Amazon was attempting to upgrade capacity on its network. That error triggered a sequence of events that culminated in a "re-mirroring storm" in which automated replication of storage volumes maxed out the capacity of Amazon's servers in a portion of their platform.
Amazon provided a detailed incident report this morning in which it discussed the outage, apologized to customers and outlined plans to make its platform more resilient in the future. The company also issued a 10-day credit to customers using the US East Region at the time of the outage.
Traffic Shift 'Executed Incorrectly'
The incident began at 12:47 a.m. Pacific time on April 21, when Amazon began a network upgrade in a single availability zone in the US East region. "During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS (Elastic Block Storage) network to allow the upgrade to happen," Amazon said. "The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network."
The traffic routing error overloaded the storage network. When network connectivity was restored, volumes stored on EBS began an automated mirroring process designed to preserve data during system failures. "In this case, because the issue affected such a large number of volumes concurrently, the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes 'stuck' in a loop, continuously searching the cluster for free space," Amazon reported. "This quickly led to a 're-mirroring storm,' where a large number of volumes were effectively “stuck” while the nodes searched the cluster for the storage space it needed for its new replica."
The company said it would take steps to improve its customer communication, which was the focus of sharp criticism during the outage. The incident report was released early Friday, as global media attention focused on the royal wedding in England.
Amazon: No 'Significant' Impact to Other Zones
Amazon's report sought to deflect criticism that the outage affected multiple availability zones, a key point of contention for some unhappy customers. In theory, using multiple availability zones should allow customers to continue to operate if a single availability zone experiences a failure. There were numerous reports that EBS access was impaired across multiple availability zones, but Amazon challenged the notion that this was widespread.
"While our monitoring clearly shows the effect of the re-mirroring storm on the EBS control plane and on volumes within the affected Availability Zone, it does not reflect significant impact to existing EBS volumes within other Availability Zones in the Region," the incident report states. "We do see that there were slightly more 'stuck' volumes than we would have expected in the healthy Availability Zones, though still an extremely small number. To put this in perspective, the peak 'stuck' volume percentage we saw in the Region outside of the affected Availability Zone was less than 0.07%."
Amazon also created the AWS Architecture Center to share information on best practices for deploying cloud assets reliably. The company scheduled a series of webinars to educate customers on their options for designing their AWS deployments to survive outages.
The company also apologized for the outage's impact on customers, and vowed to take steps to prevent a recurrence. "As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes," the AWS team wrote.