Amazon Web Services has released a detailed incident report on last week's data center outage in Dublin. It's worthwhile reading for those who troubleshoot data center and cloud computing problems.
Amazon outlined steps it would take to address the issues, and announced a 10-day service credit for all customers that were affected.
Here are some of the notable issues addressed in the incident report.
- Why The Generators Didn't Start: "Normally, when utility power fails, electrical load is seamlessly picked up by backup generators. Programmable Logic Controllers (PLCs) assure that the electrical phase is synchronized between generators before their power is brought online. In this case, one of the PLCs did not complete the connection of a portion of the generators to bring them online. We currently believe (supported by all observations of the state and behavior of this PLC) that a large ground fault detected by the PLC caused it to fail to complete its task. We are working with our supplier and performing further analysis of the device involved to confirm." Amazon said it will add more redundancy and isolation for its PLCs, and is working with vendors to add a backup PLC.
- Problems With Management Software: In several cases, software programs that manage tasks complicated the recovery process. The first case came shortly after the outage. "The management servers which receive requests continued to route requests to management servers in the affected Availability Zone. Because the management servers in the affected Availability Zone were inaccessible, requests routed to those servers failed. Second, the EC2 management servers receiving requests were continuing to accept RunInstances requests targeted at the impacted Availability Zone. Rather than failing these requests immediately, they were queued and our management servers attempted to process them. Fairly quickly, a large number of these requests began to queue up and we overloaded the management servers receiving requests, which were waiting for these queued requests to complete. The combination of these two factors caused long delays in launching instances and higher error rates for the EU West EC2 APIs."
- Problems With EBS Software: The most serious downtime affected customers using Amazon's Elastic Block Store (EBS). A software bug detected prior to the power outage created complications during the recovery process. The report includes a lengthy description of the EBS issues and how they were addressed.
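The management software problem described above comes down to queueing requests for a zone that cannot serve them instead of rejecting them right away. As a rough illustration only, here is a minimal Python sketch of the fail-fast alternative. It is not Amazon's implementation: the `RequestRouter` class, the `ZoneUnavailable` exception, the zone names and the `max_queue` limit are all hypothetical names invented for this example.

```python
from collections import deque


class ZoneUnavailable(Exception):
    """Raised immediately when a request targets a zone marked unhealthy."""


class RequestRouter:
    """Hypothetical request router; an illustration, not Amazon's design."""

    def __init__(self, zones, max_queue=100):
        # Every zone starts healthy with an empty, bounded queue.
        self.healthy = {zone: True for zone in zones}
        self.queues = {zone: deque() for zone in zones}
        self.max_queue = max_queue

    def mark_unhealthy(self, zone):
        """Mark a zone as down and discard its queued work.

        Returns the number of queued requests that were dropped.
        """
        self.healthy[zone] = False
        dropped = len(self.queues[zone])
        self.queues[zone].clear()
        return dropped

    def submit(self, zone, request):
        """Queue a request, or fail fast if the zone cannot take it."""
        if not self.healthy[zone]:
            # Reject now rather than letting the request wait in a queue.
            raise ZoneUnavailable(f"zone {zone} is unavailable")
        if len(self.queues[zone]) >= self.max_queue:
            # A bounded queue keeps a backlog from overloading the router.
            raise OverflowError(f"queue for zone {zone} is full")
        self.queues[zone].append(request)
```

In this sketch, a caller that submits work to a zone after `mark_unhealthy` has been called gets an error immediately and can retry in another zone, while requests for healthy zones are unaffected by the backlog.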
The recap closed with an apology. "We will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes," Amazon wrote.