Twitter’s Infrastructure Team Matures, Making the Fail Whale a Thing of the Past
Raffi Krikorian, vice president of platform engineering, Twitter (left), speaking with GigaOm writer Derrick Harris at GigaOm Structure 2014


Infrastructure chief says the team can now handle real-time traffic at global scale with finesse

Twitter’s infrastructure team has managed to kill the notorious Fail Whale, the error graphic that signified a Twitter outage. Raffi Krikorian, the company’s vice president of platform engineering, says the team likes to think the creature, which a few years ago routinely showed up on users’ screens instead of their Twitter feeds, is now a thing of the past.

“I think we’re getting to that point that we can say confidently that we know how to do this,” he said during one of the fireside chats on stage at last week’s GigaOm Structure conference in San Francisco.

“Back in the day we didn’t really understand how to do our capacity planning properly,” Krikorian said. Capacity planning at global scale is difficult, and it took the team time to get it right.

The team now operates at a very different level, focused on delivering a smooth experience to the company’s software engineers. Krikorian does not want those engineers to ever worry about the infrastructure layer, since that would distract them from the end-user experience.

Companies like Twitter, which were built around Internet services unlike anything that came before them and grew at breakneck speed, had to learn from scratch how to operate data center infrastructure for their specific applications. The group also includes the likes of Facebook, Google and Yahoo.

As they grew, they developed their own IT management software and a lot of their own hardware. They have open sourced much of the technology they built for their own purposes, and many of those projects have been adopted by others and become central to commercial offerings from a multitude of startups.

Bursting remains a big challenge

The main challenges have remained the same; Twitter has simply gotten better at managing them. The two biggest are speed – or the “real-time constraint,” as Krikorian put it – and burst capacity.

“We have to get tweets out the door as fast as we possibly can,” he said. Doing that during times of high demand requires some careful engineering.

These high-demand periods come quite often. The ongoing FIFA World Cup, for example, generates plenty of them.

“Every time a goal happens … there’s a huge influx of tweets happening,” Krikorian said. Every tweet “just has to come in, get processed and get out the door again.”

To manage capacity bursts, Twitter’s infrastructure team has started breaking services down into tiers based on importance. When something like a big sporting event is taking place, all services other than the core feed automatically degrade in performance and forfeit spare capacity to ensure the core user experience is delivered smoothly.
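Twitter has not published the details of this tiering scheme, but the idea can be sketched roughly: each service carries a tier, and once cluster utilization crosses a threshold during a burst, the least critical tiers shed load first so the core tweet path keeps its headroom. The service names, tiers and thresholds below are hypothetical, not Twitter’s actual configuration.

```python
# Hypothetical sketch of tier-based load shedding during a traffic burst.
# Tier 0 is the core tweet-delivery path; higher tiers are progressively
# less critical and give up capacity first.

SERVICE_TIERS = {
    "tweet-write-path": 0,   # never shed
    "home-timeline":    0,
    "search-indexing":  1,
    "analytics-batch":  2,
    "trends-backfill":  2,
}

# Cluster utilization (fraction of capacity) at which each tier starts
# shedding load, from least critical to most critical.
SHED_THRESHOLDS = {2: 0.70, 1: 0.85}

def services_to_throttle(cluster_utilization: float) -> list[str]:
    """Return the services that should degrade at the current load."""
    shed_tiers = {tier for tier, limit in SHED_THRESHOLDS.items()
                  if cluster_utilization >= limit}
    return [name for name, tier in SERVICE_TIERS.items() if tier in shed_tiers]

# Example: a goal spike pushes the cluster to 88 percent utilization, so
# tiers 1 and 2 degrade while the core feed keeps its headroom.
print(services_to_throttle(0.88))
# ['search-indexing', 'analytics-batch', 'trends-backfill']
```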

Key infrastructure management tools

One of the most important tools in Twitter’s infrastructure management toolbox is Apache Mesos, the open source cluster manager that helps applications share pools of servers. Twitter runs Mesos across hundreds of servers, using it for everything from services to analytics.
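Mesos uses a two-level scheduling model: the cluster manager offers spare resources on pooled machines, and each framework accepts what it needs for its tasks. The toy sketch below mimics that offer-and-accept loop to show how long-running services and analytics jobs can share one pool of machines; it is a simplified illustration, not the actual Mesos API.

```python
# Simplified illustration of Mesos-style scheduling: an allocator offers
# spare resources on pooled machines and tasks are placed onto those
# offers. This is a toy model, not the real Mesos framework API.

from dataclasses import dataclass

@dataclass
class Offer:
    host: str
    cpus: float
    mem_gb: float

@dataclass
class Task:
    name: str
    cpus: float
    mem_gb: float

def launch_on_offers(offers: list[Offer], tasks: list[Task]) -> dict[str, str]:
    """Greedily place each task on the first offer that can hold it."""
    placements = {}
    for task in tasks:
        for offer in offers:
            if offer.cpus >= task.cpus and offer.mem_gb >= task.mem_gb:
                placements[task.name] = offer.host
                offer.cpus -= task.cpus      # shrink the remaining offer
                offer.mem_gb -= task.mem_gb
                break
    return placements

offers = [Offer("host-01", cpus=8.0, mem_gb=32.0),
          Offer("host-02", cpus=4.0, mem_gb=16.0)]
tasks = [Task("timeline-service", cpus=2.0, mem_gb=8.0),
         Task("analytics-worker", cpus=6.0, mem_gb=24.0)]
print(launch_on_offers(offers, tasks))
# {'timeline-service': 'host-01', 'analytics-worker': 'host-01'}
```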

Another key piece of technology it has built is Manhattan, a massive real-time globally distributed multi-tenant database. “We’re migrating everything we possibly generate on Twitter into Manhattan in most cases,” Krikorian said.

Manhattan handles things like SLAs and multi-data center replication. Both are examples of things he does not want Twitter engineers to think about when writing applications.

The system allows an engineer to “throw some data” onto a cluster in one data center once and see it automatically show up everywhere else it is needed.
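Manhattan’s client interface is not public, so the sketch below is only a rough model of the behavior Krikorian describes: an engineer writes a record once to the local cluster and the store fans the write out to the other data centers in the background. The class and method names are hypothetical.

```python
# Hypothetical model of the "write once, shows up everywhere" behavior
# described for Manhattan: the client writes to its local data center and
# cross-data-center replication happens behind the scenes.
# Names are illustrative; Manhattan's real API is not public.

import queue
import threading

class ReplicatedStore:
    def __init__(self, local_dc: str, remote_dcs: list[str]):
        self.local_dc = local_dc
        self.remote_dcs = remote_dcs
        self.data = {dc: {} for dc in [local_dc] + remote_dcs}
        self._replication_queue = queue.Queue()
        threading.Thread(target=self._replicate_forever, daemon=True).start()

    def put(self, key, value):
        """Engineer-facing call: write locally, replication is transparent."""
        self.data[self.local_dc][key] = value
        self._replication_queue.put((key, value))

    def _replicate_forever(self):
        # Background worker that ships each local write to every remote DC.
        while True:
            key, value = self._replication_queue.get()
            for dc in self.remote_dcs:
                self.data[dc][key] = value
            self._replication_queue.task_done()

store = ReplicatedStore("us-west", ["us-east", "asia-pop"])
store.put("tweet:123", {"text": "GOAL!"})
store._replication_queue.join()             # wait for replication in this demo only
print(store.data["asia-pop"]["tweet:123"])  # the write shows up in every region
```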

Efficiency comes over time

This isn’t the way Twitter has always done things. “The easier way to do it is basically just have lots of spare computers laying around,” Krikorian said. “Turns out that’s not very smart.”

The infrastructure team cares a lot about efficiency, which is why it has implemented tiering and global load balancing. “We don’t print money … so I want to make sure that every single CPU is used as much as we possibly can but still provide the headroom for spikes.”

Not only does it matter that a goal has been scored in the World Cup, it also matters which team is playing. “When Japan is playing in the world cup, I know that most of the traffic [during the game] will come from Japan,” Krikorian said.

This means the infrastructure sitting in the company’s West Coast data center and at the other points of presence closest to Japan has to have data shifted around to prepare for the influx of traffic.
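The article does not describe how that shift is carried out, but one simple way to picture it is as adjusting per-region routing weights ahead of a predictable spike. The regions, baseline weights and forecast numbers below are invented for illustration only.

```python
# Hypothetical sketch of pre-positioning capacity for a predictable regional
# spike (e.g. Japan playing in the World Cup). Regions, baseline weights and
# the forecast are illustrative numbers, not Twitter's real topology.

BASELINE_WEIGHTS = {"us-west": 0.40, "us-east": 0.35, "apac-pop": 0.25}

def reweight_for_event(forecast_share_by_region: dict[str, float],
                       blend: float = 0.5) -> dict[str, float]:
    """Blend baseline routing weights toward the forecast traffic mix."""
    blended = {
        region: (1 - blend) * BASELINE_WEIGHTS[region]
                + blend * forecast_share_by_region.get(region, 0.0)
        for region in BASELINE_WEIGHTS
    }
    total = sum(blended.values())
    return {region: round(weight / total, 3) for region, weight in blended.items()}

# Forecast: during a Japan match, most traffic arrives via the West Coast
# data center and the Asia-Pacific points of presence.
print(reweight_for_event({"us-west": 0.45, "us-east": 0.10, "apac-pop": 0.45}))
# {'us-west': 0.425, 'us-east': 0.225, 'apac-pop': 0.35}
```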

Twitter has not officially disclosed its West Coast data center location, but sources have told Data Center Knowledge that it leases space at a RagingWire data center in Sacramento, California. RagingWire was also present at Structure, where it announced a major expansion at its Sacramento campus.

Twitter also leases a lot of capacity at a QTS data center in Atlanta, where it has been since 2011.
