Twitter’s New Infrastructure Weathers Massive TweetStorm
August 19th, 2013 By: Rich Miller
In an unusual test of Twitter’s infrastructure, Japanese anime fans set an all-time record for Tweets per second earlier this month during the showing of a 27-year-old movie on television. The entire event lasted less than ten seconds, but showcased Twitter’s overhauled back end, which has boosted performance while vastly reducing the number of servers needed to run the microblogging service.
On August 3, Twitter users in Japan watched an airing of “Castle in the Sky,” a 1986 movie from master anime filmmaker Hayao Miyazaki. Fans of the movie observe a tradition of posting a key line of dialogue in online forums at the exact moment it is spoken during the film. The practice has shifted to Twitter, and on August 3 resulted in an all-time one-second peak of 143,199 Tweets per second, shattering the previous record of 33,388 Tweets per second on New Year’s Day 2013.
“To give you some context of how that compares to typical numbers, we normally take in more than 500 million Tweets a day which means about 5,700 Tweets a second, on average,” wrote Raffi Krikorian, Twitter’s VP of Platform Engineering, in a blog post about the event. “This particular spike was around 25 times greater than our steady state. During this spike, our users didn’t experience a blip on Twitter.”
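The arithmetic behind those figures is easy to check. The following is a minimal sketch in Python that uses only the numbers quoted above; the function names are illustrative and do not come from Twitter.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_per_second(tweets_per_day: float) -> float:
    """Convert a daily Tweet count into an average per-second rate."""
    return tweets_per_day / SECONDS_PER_DAY


def spike_ratio(peak_per_second: float, baseline_per_second: float) -> float:
    """Return how many times larger the peak is than the baseline."""
    return peak_per_second / baseline_per_second


# 500 million Tweets a day works out to roughly 5,787 per second,
# in line with the "about 5,700" quoted for "more than 500 million".
average = average_per_second(500_000_000)

# The 143,199 Tweets-per-second peak against the quoted 5,700 average.
versus_average = spike_ratio(143_199, 5_700)

# The same peak against the previous record of 33,388.
versus_previous_record = spike_ratio(143_199, 33_388)

print(round(average))                    # 5787
print(round(versus_average, 1))          # 25.1
print(round(versus_previous_record, 1))  # 4.3
```

The peak comes out at about 25 times the average rate, as Krikorian says, and a little more than four times the New Year’s Day record.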
New Architecture Vanquishes Downtime
So where was the Fail Whale? Twitter weathered the Tweetstorm due to a retooling of its software and infrastructure after a series of high-profile outages during the 2010 World Cup. The service’s Rails-based infrastructure was constrained in storage and throughput, and unable to scale to the massive traffic from soccer enthusiasts.
“We were ‘throwing machines at the problem’ instead of engineering thorough solutions,” Krikorian wrote. “Our front-end Ruby machines were not handling the number of transactions per second that we thought was reasonable, given their horsepower. From previous experiences, we knew that those machines could do a lot more.”
Twitter’s Rails servers were “effectively single-threaded” and only capable of 200 to 300 requests per second per host, Krikorian explained in a detailed technical overview of the project. “Twitter’s usage is always growing rapidly, and doing the math there, it would take a lot of machines to keep up with the growth curve.”
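To see why “doing the math” pointed to a lot of machines, here is a rough capacity sketch in Python. Only the range of 200 to 300 requests per second per host comes from Krikorian; the total load of 100,000 requests per second is a made-up figure for illustration, and the function name is hypothetical.

```python
import math


def hosts_needed(total_rps: float, per_host_rps: float) -> int:
    """Return the minimum number of hosts needed to serve total_rps."""
    return math.ceil(total_rps / per_host_rps)


# ASSUMPTION: an illustrative load, not a figure reported by Twitter.
ASSUMED_LOAD_RPS = 100_000

# Per-host throughput range quoted for the Rails servers.
at_low_end = hosts_needed(ASSUMED_LOAD_RPS, 200)
at_high_end = hosts_needed(ASSUMED_LOAD_RPS, 300)

# If traffic doubles and per-host throughput stays flat, the fleet doubles.
after_doubling = hosts_needed(2 * ASSUMED_LOAD_RPS, 200)

print(at_low_end)      # 500
print(at_high_end)     # 334
print(after_doubling)  # 1000
```

With per-host throughput fixed, the host count grows in direct proportion to traffic, which is the growth curve Krikorian describes.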
Twitter instead began moving workloads to a revamped system running on the JVM (Java Virtual Machine), and set ambitious goals for the project.
“We wanted big infrastructure wins in performance, efficiency, and reliability – we wanted to improve the median latency that users experience on Twitter as well as bring in the outliers to give a uniform experience to Twitter,” writes Krikorian. “We wanted to reduce the number of machines needed to run Twitter by 10x. We also wanted to isolate failures across our infrastructure to prevent large outages – this is especially important as the number of machines we use go up, because it means that the chance of any single machine failing is higher. Failures are also inevitable, so we wanted to have them happen in a much more controllable manner.”
Fewer Servers, Better Reliability
A significant benefit of the overhaul, according to Krikorian, is the reduction in servers required to run Twitter.
“Twitter is more performant, efficient and reliable than ever before,” he writes. “We’ve sped up the site incredibly across the 50th (p50) through 99th (p99) percentile distributions and the number of machines involved in serving the site itself has been decreased anywhere from 5x-12x. Over the last six months, Twitter has flirted with four 9s of availability.”
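For readers unfamiliar with the term, “four 9s” means 99.99 percent availability. A minimal Python sketch of what that allows in downtime follows; the function name is illustrative, and the 182.5-day length used for “six months” is an assumption.

```python
def allowed_downtime_minutes(nines: int, period_days: float) -> float:
    """Return the downtime budget in minutes for an availability of N nines."""
    unavailability = 10 ** -nines  # four nines -> 0.0001, i.e. 0.01 percent
    return unavailability * period_days * 24 * 60


# Downtime budget over a full year at four nines: about 52.6 minutes.
per_year = allowed_downtime_minutes(4, 365)

# ASSUMPTION: "six months" is taken as 182.5 days: about 26.3 minutes.
per_six_months = allowed_downtime_minutes(4, 182.5)

print(round(per_year, 2))
print(round(per_six_months, 2))
```

In other words, a service holding four 9s for six months would have been unavailable for less than half an hour in total over that period.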
Twitter still needs plenty of server space. It recently began a major expansion of its data center infrastructure that includes large footprints at RagingWire in Sacramento and QTS in Atlanta.
For those seeking even more details about Twitter’s back-end overhaul, check out this 2012 video in which Krikorian provides a deeper dive on the project: