Posted By Rich Miller On July 26, 2008 @ 7:59 pm In Amazon
We’ve now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect.
Single-bit corruption was also a key issue in an S3 outage on June 20 [2], in which a single load balancer was cited as the culprit in the file corruption, which affected customers using MD5 checksums to verify data integrity. After that incident, Amazon said it would “improve our logging of requests with MD5s, so that we can look for anomalies in their 400 error rates. Doing this will allow us to provide more proactive notification on potential transmission issues in the future.”
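The kind of MD5 check those customers relied on can be sketched roughly as follows. This is an illustration only, not Amazon's API: the helper names (`md5_hex`, `flip_bit`, `matches_expected`) and the sample payload are hypothetical. It simulates a single flipped bit in transit and shows that the digest computed by the sender no longer matches.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hex string."""
    return hashlib.md5(data).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted (simulated corruption)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def matches_expected(received: bytes, expected_md5: str) -> bool:
    """Compare the digest of what arrived with the digest the sender computed."""
    return md5_hex(received) == expected_md5


# Hypothetical payload; the sender computes the digest before transmitting.
payload = b"example object body"
expected = md5_hex(payload)

# Simulate a single corrupted bit somewhere between sender and storage.
damaged = flip_bit(payload, 13)

print(matches_expected(payload, expected))
print(matches_expected(damaged, expected))
```

The intact payload passes the comparison and the damaged one fails it, which is how a client supplying an MD5 with its request could notice corruption that would otherwise be stored silently.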
At 10:32am PDT, after exploring several options, we determined that we needed to shut down all communication between Amazon S3 servers, shut down all components used for request processing, clear the system’s state, and then reactivate the request processing components. By 11:05am PDT, all server-to-server communication was stopped, request processing components shut down, and the system’s state cleared. By 2:20pm PDT, we’d restored internal communication between all Amazon S3 servers and began reactivating request processing components concurrently in both the US and EU.
In other words, they had to turn off all the servers and restart the system. Amazon has promised to make several changes to address the problems that made this into a lengthy outage. Chief among them: “We’ve deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing.” The S3 team said it has implemented monitoring and an alarm system for the system state gossip, and added checksums to detect corruption of system state messages.
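To illustrate why a checksum on state messages helps, here is a minimal sketch. Amazon has not said which checksum or message format it uses, so the CRC32 framing, the JSON body, the field names, and the functions `encode_state` and `decode_state` are all assumptions made for the example. It reproduces the failure described above: one flipped bit leaves the message perfectly readable but with the wrong value, and the checksum lets the receiver reject it.

```python
import json
import struct
import zlib


def encode_state(state: dict) -> bytes:
    """Serialize a state message and prefix it with a 4-byte CRC32 of the body."""
    body = json.dumps(state, sort_keys=True).encode("utf-8")
    return struct.pack(">I", zlib.crc32(body)) + body


def decode_state(frame: bytes) -> dict:
    """Verify the CRC32 prefix before trusting the body; raise if it does not match."""
    if len(frame) < 4:
        raise ValueError("frame too short")
    (expected,) = struct.unpack(">I", frame[:4])
    body = frame[4:]
    if zlib.crc32(body) != expected:
        raise ValueError("checksum mismatch: dropping corrupted state message")
    return json.loads(body.decode("utf-8"))


# Hypothetical state message exchanged between two servers.
state = {"server": "node-17", "free_capacity": 4}
frame = encode_state(state)

# Flip the lowest bit of the ASCII digit "4" in the body, turning it into "5".
# The search starts at offset 4 so the CRC prefix itself is never touched.
position = frame.index(b"4", 4)
corrupted = bytearray(frame)
corrupted[position] ^= 0x01
corrupted = bytes(corrupted)

# Without a checksum the damaged body still parses, just with a wrong value.
print(json.loads(corrupted[4:].decode("utf-8")))

# With the checksum the receiver refuses the message instead of spreading it.
try:
    decode_state(corrupted)
except ValueError as error:
    print(error)
```

CRC32 is used here because it is in the Python standard library and detects every single-bit error; a production system might prefer a stronger hash.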
Article printed from Data Center Knowledge: http://www.datacenterknowledge.com
URL to article: http://www.datacenterknowledge.com/archives/2008/07/26/s3-downtime-more-missing-bites/
URLs in this post:
[1] detailed timeline: http://status.aws.amazon.com/s3-20080720.html
[2] S3 outage on June 20: http://www.datacenterknowledge.com/archives/2008/Jun/27/amazon_s3_issues_load_balancers_and_md5.html
[3] Rich Miller: http://www.datacenterknowledge.com/archives/author/richm/
Copyright © 2011 Data Center Knowledge. All rights reserved.