UPDATE: Reddit has revised its post, which previously said it had "been working to completely move Cassandra off EBS and onto local storage," to say that it is moving Cassandra "off of EBS and onto the local storage which is directly attached to the EC2 instances." We have updated our post to reflect that Reddit has not reduced its use of AWS, but only changed the way it deploys resources on it.
The social news site Reddit is revising how it uses Amazon's cloud computing service following performance problems that contributed to six hours of downtime for the Reddit site this week. The Reddit operations team attributed the outages to problems with Postgres and Cassandra servers deployed on Elastic Block Store (EBS), a service offered by Amazon Web Services. Reddit said EBS servers in a single U.S. availability zone for AWS experienced performance problems.
Amazon's Service Health Dashboard reported "increased latencies for a subset of EBS volumes in a single Availability Zone" in Northern Virginia on Thursday. Several hours after the latencies were reported as fixed, AWS reported connectivity problems related to a "misbehaving network device."
"Amazon's Elastic Block Service is an extremely handy technology," writes Reddit's Jason Harvey in a blog post recapping the outages. "It allows us to spin up volumes and attach them to any of our systems very quickly. It allows us to migrate data from one cluster to another very quickly. It is also considerably cheaper than getting a similar level of technology out of a SAN.
"Unfortunately, EBS also has reliability issues. Even before the serious outage last night, we suffered random disks degrading multiple times a week. ... Over the course of the past few weeks, we have been working to completely move Cassandra off of EBS and onto local storage. This move will be executed within the month. While the local storage has much less functionality than EBS, the reliability of local storage outweighs the benefits of EBS. After the outage today, we are going to be investigating doing the same for our Postgres clusters."
Harvey said Amazon had been "working very closely with us to try and determine the root cause of the problem and implement a fix."