Reddit Ties Outage to Amazon Performance

UPDATE: Reddit has now updated its post from saying that it has “been working to completely move Cassandra off EBS and onto local storage” to say that it is moving Cassandra “off of EBS and onto the local storage which is directly attached to the EC2 instances.” We have updated our post to reflect that Reddit has not reduced its use of AWS, but only the way it deploys resources on it.

The social news site Reddit is revising how it uses Amazon’s cloud computing service following performance problems that contributed to six hours of downtime for the Reddit site this week. The Reddit operations team attributed the outages to problems with Postgres and Cassandra servers deployed on Elastic Block Store (EBS), a storage service offered by Amazon Web Services. Reddit said EBS servers in a single U.S. availability zone for AWS experienced performance problems.

Amazon’s Service Health Dashboard reported “increased latencies for a subset of EBS volumes in a single Availability Zone” in Northern Virginia on Thursday. Several hours after the latencies were reported as fixed, AWS reported connectivity problems related to a “misbehaving network device.”

“Amazon’s Elastic Block Service is an extremely handy technology,” writes Reddit’s Jason Harvey in a blog post recapping the outages. “It allows us to spin up volumes and attach them to any of our systems very quickly. It allows us to migrate data from one cluster to another very quickly. It is also considerably cheaper than getting a similar level of technology out of a SAN.

“Unfortunately, EBS also has reliability issues. Even before the serious outage last night, we suffered random disks degrading multiple times a week. … Over the course of the past few weeks, we have been working to completely move Cassandra off of EBS and onto local storage. This move will be executed within the month. While the local storage has much less functionality than EBS, the reliability of local storage outweighs the benefits of EBS. After the outage today, we are going to be investigating doing the same for our Postgres clusters.”

Harvey said Amazon had been “working very closely with us to try and determine the root cause of the problem and implement a fix.”

About the Author

Rich Miller is the founder and editor at large of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.

8 Comments

  1. Your article headline is inaccurate. We have no plans to leave AWS or EC2, nor to cut back its use.

  2. Jeremy: We have updated our story to clarify the changes in how Reddit is using AWS, consistent with the revisions in the Reddit post.

  3. SNS

    Could you clarify what you mean by local storage? Do you mean non-persistent AMI storage?

  4. Amazon shouldn't allow Reddit to libel them this way. Reddit put all their eggs in one basket and didn't performance test the site. Now they very publicly blame their own failures on Amazon?

  5. Anon

    And reddit shouldn't allow YOU to libel THEM the way you do, LouF.