Reddit Ties Outage to Amazon Performance

The social news site Reddit is revising how it uses Amazon's cloud computing service following performance problems that contributed to six hours of downtime for the Reddit site this week.

Data Center Knowledge

March 18, 2011


UPDATE: Reddit has revised its post, which originally said it had "been working to completely move Cassandra off EBS and onto local storage," to say that it is moving Cassandra "off of EBS and onto the local storage which is directly attached to the EC2 instances." We have updated our post to reflect that Reddit has not reduced its use of AWS, only the way it deploys resources on it.

The social news site Reddit is revising how it uses Amazon's cloud computing service following performance problems that contributed to six hours of downtime for the Reddit site this week. The Reddit operations team attributed the outages to problems with Postgres and Cassandra servers whose data was stored on Elastic Block Store (EBS), a storage service offered by Amazon Web Services. Reddit said EBS volumes in a single U.S. availability zone for AWS experienced performance problems.

Amazon's Service Health Dashboard reported "increased latencies for a subset of EBS volumes in a single Availability Zone" in Northern Virginia on Thursday. Several hours after the latencies were reported as fixed, AWS reported connectivity problems related to a "misbehaving network device."

"Amazon's Elastic Block Service is an extremely handy technology," writes Reddit's Jason Harvey in a blog post recapping the outages. "It allows us to spin up volumes and attach them to any of our systems very quickly. It allows us to migrate data from one cluster to another very quickly. It is also considerably cheaper than getting a similar level of technology out of a SAN.

"Unfortunately, EBS also has reliability issues. Even before the serious outage last night, we suffered random disks degrading multiple times a week. ... Over the course of the past few weeks, we have been working to completely move Cassandra off of EBS and onto local storage. This move will be executed within the month. While the local storage has much less functionality than EBS, the reliability of local storage outweighs the benefits of EBS. After the outage today, we are going to be investigating doing the same for our Postgres clusters."

Harvey said Amazon had been "working very closely with us to try and determine the root cause of the problem and implement a fix."

