Friday’s outage on Amazon’s utility computing platform, which rippled out to many sites that use the Amazon S3 storage service to serve images or widgets, has been blamed on a surge in encrypted traffic. Network traffic using the Secure Sockets Layer (SSL) and Transport Layer Security (TLS) encryption protocols uses more resources than unencrypted web traffic because each connection begins with a more complex “handshake” between client and server. This overhead is why many bank web sites have shifted their online banking logins to non-SSL pages in recent years.
In its report on Friday’s outage, Amazon (AMZN) said its servers experienced an unexpected jump in traffic using authentication and encryption. “While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests,” Amazon reported. “Importantly, these cryptographic requests consume more resources per call than other request types.” By about 4 am Pacific time, the level of authenticated requests “pushed the authentication service over its maximum capacity before we could complete putting new capacity in place.”
Amazon said it would take several steps to address the issues raised by Friday’s outage. Many customers criticized Amazon for its lack of communication about what was happening and why. Others focused on the need for a dashboard providing details on service performance, which Amazon said it would develop.
But the key issue in the outage appears to have been a gap in Amazon’s monitoring: it was not tracking encrypted SSL/TLS web traffic or accounting for its impact on network performance. In light of that, the most important response to the outage is Amazon’s promise to improve its monitoring of the proportion of authenticated requests. This will be an important point as Amazon seeks to move beyond its existing customer base of startups and Web 2.0 services and campaigns for enterprise accounts, which typically have more robust requirements for security and encryption.
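To make the monitoring gap concrete, here is a minimal sketch of tracking the share of authenticated requests rather than just total volume. It is an illustration only, not Amazon’s implementation: the class name `AuthShareMonitor`, the window size, and the 50 percent threshold are all assumptions made for the example.

```python
from collections import deque


class AuthShareMonitor:
    """Tracks what fraction of the most recent requests were authenticated.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, window=1000, max_auth_share=0.5):
        self.window = deque(maxlen=window)    # 1 = authenticated, 0 = not
        self.max_auth_share = max_auth_share  # alert threshold (assumed value)
        self.auth_count = 0                   # running count of 1s in the window

    def record(self, authenticated):
        # If the window is full, the oldest entry is about to be evicted,
        # so remove it from the running count first.
        if len(self.window) == self.window.maxlen:
            self.auth_count -= self.window[0]
        flag = 1 if authenticated else 0
        self.window.append(flag)
        self.auth_count += flag

    def auth_share(self):
        # Share of authenticated requests among those currently in the window.
        if not self.window:
            return 0.0
        return self.auth_count / len(self.window)

    def over_threshold(self):
        return self.auth_share() > self.max_auth_share
```

The point of the sketch is that total request volume can stay flat while `auth_share()` climbs, which is the pattern Amazon described: overall volumes were within normal ranges, but the mix had shifted toward the more expensive request type.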
Amazon’s explanation deflated theories that S3 and EC2 may have been hit with a distributed denial of service (DDoS) attack. But the University of Washington’s Computer Security blog had some interesting commentary on this issue:
This incident has exposed a large vulnerability in the authentication system: a competing service could explicitly send large amounts of authenticated calls to S3 in an attempt to overload it. Fortunately, Amazon plans to address this, stating that they will add “additional defensive measures around the authenticated calls.”
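Amazon did not say what those defensive measures would be. One common way to blunt a flood of expensive calls from a single source is per-caller rate limiting, sketched below with a token bucket. This is an assumption about what such a defense might look like, not a description of Amazon’s approach; `TokenBucket`, `AuthenticatedCallLimiter`, and the rate and capacity values are hypothetical.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self, cost=1.0):
        # Refill based on elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class AuthenticatedCallLimiter:
    """Keeps one bucket per caller so one heavy caller cannot starve the rest."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self._buckets = {}

    def allow(self, caller_id):
        bucket = self._buckets.get(caller_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self._buckets[caller_id] = bucket
        return bucket.allow()
```

A check like `limiter.allow(caller_id)` would have to run before the costly cryptographic verification to be of any use, so in practice the caller would be identified by something cheap to read, such as the claimed access key in the request.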
The post-outage blogging around the web also revealed that not all major users of Amazon’s utility computing services were affected. One customer that reported no impact was SmugMug, the photo sharing service that has been one of the most prominent success stories for Amazon. Don MacAskill of SmugMug wrote that although this outage didn’t slow his service, he expects that a future one probably will:
Yes, I believe there will probably be times where SmugMug is seriously affected, possibly even offline completely, because Amazon (or some other web services provider) is having problems. Today wasn’t that day. Nobody likes outages, especially not us, but we’ve decided the tradeoffs are worth it. You should have your eyes wide open when you make the decision to use AWS or any other external service, though. It will fail from time to time.
The savings-to-reliability tradeoff may be somewhat different for enterprise customers. It’s safe to say that the performance of Amazon’s utility computing platform will continue to be closely watched.