Many AWS Sites Recover, Some Face Longer Wait

As the performance problems with the Amazon Web Services cloud computing platform enter a second day, Amazon says it has isolated the issues to a single availability zone in the Eastern United States. As the Amazon team continues to work to restore the remaining services, many AWS customers that were offline Thursday have managed to get their sites up and running (with the notable exception of Reddit, which remains in read-only mode).

In an update Friday morning, Amazon says some "stuck" storage volumes using its Elastic Block Storage service will face a lengthier recovery. "We expect that we'll reach a point where a minority of these stuck volumes will need to be restored with a more time consuming process, using backups made to S3 yesterday (these will have longer recovery times for the affected volumes)," Amazon reports on its AWS service dashboard. "When we get to that point, we'll let folks know."

In the meantime, there's lots of analysis, reaction and commentary on the lengthy outage for Amazon's cloud computing platform. Here's a roundup of some notable links:

  • Single Points of Failure - From Cloudability: "It must be said that as the Cloud actually makes a backup facility much, much cheaper than ever before, today’s downtime was avoidable for most of the affected sites. Deployment technologies such as Puppet, Chef, Capistrano, et al and price effective multi-location database setups (such as nightly backups to a second availability zone or cloud provider) brings a full backup option within the price range of even the most meta of startups."
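The multi-location backup Cloudability describes reduces to simple nightly logic: snapshot the volume in its primary location, then copy that snapshot somewhere independent. A minimal sketch in Python, where the volume and zone names are hypothetical placeholders and the returned action tuples stand in for real cloud API calls (e.g. EC2's snapshot and snapshot-copy operations):

```python
from datetime import date

def plan_nightly_backup(volume_id, primary, secondary, today=None):
    """Return the actions for one nightly backup run.

    The names here are illustrative, not real AWS identifiers; a real
    version would execute each action against the cloud provider's API.
    """
    today = today or date.today()
    snap = f"{volume_id}-{today.isoformat()}"  # date-stamped snapshot name
    return [
        ("snapshot", volume_id, primary, snap),  # snapshot in the primary zone
        ("copy", snap, primary, secondary),      # replicate to a second zone/provider
    ]

actions = plan_nightly_backup("vol-data", "us-east-1a", "us-west-1b",
                              today=date(2011, 4, 22))
for action in actions:
    print(action)
```

Run from cron each night, the copy step is what buys the recovery option: when the primary zone's volumes get "stuck," the secondary location still holds yesterday's data.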
  • AWS is down: Why the sky is falling - From Justin Santa Barbara, founder of FathomDB: "This morning, multiple availability zones failed in the us-east region. AWS broke their promises on the failure scenarios for Availability Zones. It means that AWS have a common single point of failure (assuming it wasn't a winning-the-lottery-while-being-hit-by-a-meteor-odds coincidence). The sites that are down were correctly designing to the 'contract'; the problem is that AWS didn't follow their own specifications. Whether that happened through incompetence or dishonesty or something a lot more forgivable entirely, we simply don't know at this point. But the engineers at quora, foursquare and reddit are very competent, and it's wrong to point the blame in that direction."
  • Amazon’s real problem isn’t the outage, it’s the communication - From Keith Smith, CEO of BigDoor: "We’ve managed to find workarounds to the technical challenges. But it was disconcerting to us that Amazon’s otherwise stellar system was being marred. Not so much by a temporary technical issue, rather by what seemed like an unwillingness to embrace transparency. Today that lack of transparency has continued. As problems continued throughout the day, we experienced the obvious frustration from the system failure. But Amazon’s communication failure was even more alarming."
  • How Netflix handled the AWS problems without interruptions - A discussion on Hacker News about how Netflix, one of Amazon's major customers, handled the outage without interruptions: "Netflix showed some increased latency, internal alarms went off but hasn't had a service outage. ... Netflix is deployed in three zones, sized to lose one and keep going. Cheaper than cost of being down."
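Netflix's "sized to lose one and keep going" is classic N+1 capacity planning: with Z zones, each zone must be provisioned to carry 1/(Z-1) of total load, so the survivors absorb a full-zone failure. A rough sketch of the arithmetic (the load figures are illustrative, not Netflix's actual numbers):

```python
def per_zone_capacity(total_load, zones):
    """Capacity each zone needs so that losing any one zone still serves total_load."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    return total_load / (zones - 1)

# With 3 zones, each zone is sized for 50% of peak load: the 2 survivors
# together cover 100%, at the cost of ~1.5x total provisioned capacity.
cap = per_zone_capacity(total_load=100.0, zones=3)
print(cap, cap * 3)
```

The trade-off in the HN comment falls out directly: the 50% overprovisioning is "cheaper than cost of being down," and it shrinks as you add zones (four zones need only ~33% headroom each).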
  • Why Twilio Wasn’t Affected by Today’s AWS Issues - From the Twilio Engineering Blog: "Twilio’s APIs and service were not impacted by the AWS issues today. As we’ve grown and scaled Twilio on Amazon AWS, we’ve followed a set of architectural design principles to minimize the impact of occasional, but inevitable issues in underlying infrastructure."
  • Heroku, DotCloud, Engine Yard Hit Hard - From Derrick Harris at GigaOm: "(The outage affected) at least three popular platform-as-a-service providers — Heroku, Engine Yard and DotCloud. That’s because many platform-as-a-service (PaaS) offerings are hosted with AWS, essentially adding a developer-friendly layer of abstraction of the AWS infrastructure to make writing and deploying applications even easier than with AWS. Of course, the downfall is that as goes AWS, so goes your PaaS provider."
  • What to Do When Your Cloud is Down - From Bob Warfield at Smoothspan: "Most SaaS companies have to get huge before they can afford multiple physical data centers if they own the data centers. But if you’re using a Cloud that offers multiple physical locations, you have the ability to have the extra security of multiple physical data centers very cheaply. The trick is, you have to make use of it, but it’s just software. A service like Heroku could’ve decided to spread the applications it’s hosting evenly over the two regions or gone even further afield to offshore regions. This is one of the dark sides of multitenancy, and an unnecessary one at that. Architects should be designing not for one single super apartment for all tenants, but for a relatively few apartments, and the operational flexibility to make it easy via dashboard to automatically allocate their tenants to whatever apartments they like, and then change their minds and seamlessly migrate them to new accommodations as needed."
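The "relatively few apartments" model above boils down to a tenant-to-cell mapping that an operator can change at will. A toy sketch of such an allocator, where the cell names and the round-robin placement policy are assumptions for illustration, not any PaaS provider's actual design:

```python
class TenantAllocator:
    """Map tenants to a small set of independent cells (per-region deployments)."""

    def __init__(self, cells):
        self.cells = list(cells)
        self.assignments = {}  # tenant -> cell
        self._next = 0

    def place(self, tenant):
        # Round-robin keeps tenants spread evenly across cells, so one
        # cell's outage only touches the tenants assigned to it.
        cell = self.cells[self._next % len(self.cells)]
        self._next += 1
        self.assignments[tenant] = cell
        return cell

    def migrate(self, tenant, new_cell):
        # "Seamlessly migrate them to new accommodations": here just an
        # assignment update; a real system would also move state and
        # cut traffic over to the new cell.
        self.assignments[tenant] = new_cell

alloc = TenantAllocator(["us-east", "us-west"])
for t in ["app1", "app2", "app3", "app4"]:
    alloc.place(t)
print(alloc.assignments)
```

With this structure, reacting to a regional outage is a dashboard operation, calling `migrate` for the tenants in the failed cell, rather than an architectural rewrite.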