Amazon has issued a detailed timeline and explanation of last Sunday's lengthy outage for its S3 utility storage service. Here's the root cause:
We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect.
Single-bit corruption was also a key issue in an S3 outage on June 20, in which a single load balancer was cited as the culprit in the file corruption, which affected customers using MD5 checksums to verify data integrity. After that incident, Amazon said it would "improve our logging of requests with MD5s, so that we can look for anomalies in their 400 error rates. Doing this will allow us to provide more proactive notification on potential transmission issues in the future."
But Sunday's file corruption occurred in a "gossip" protocol used to share system state information, rather than the customer data. While Amazon used MD5 to check customer data, "we didn't have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system."
The other big issue in the post-mortem is the length of time it took to recover the system:
At 10:32am PDT, after exploring several options, we determined that we needed to shut down all communication between Amazon S3 servers, shut down all components used for request processing, clear the system's state, and then reactivate the request processing components. By 11:05am PDT, all server-to-server communication was stopped, request processing components shut down, and the system's state cleared. By 2:20pm PDT, we'd restored internal communication between all Amazon S3 servers and began reactivating request processing components concurrently in both the US and EU.
In other words, they had to turn off all the servers and restart the system. Amazon has promised to make several changes to address the problems that made this into a lengthy outage. Chief among them: "We've deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing." The S3 team said it has implemented monitoring and an alarm system for the system state gossip, and added checksums to detect corruption of system state messages.