Using Metrics to Vanquish the Fail Whale


John Adams of the Twitter ops team discusses the use of metrics to improve web site performance at Velocity 2009 (Photo by James Duncan Davidson via Flickr)

Few prominent web sites have failed more often and under closer scrutiny than Twitter. But over the past year the microblogging service has rehabilitated its reputation, improving its uptime even as its traffic has grown phenomenally.

That torrid growth continues, despite reports to the contrary based on ComScore data, according to John Adams of Twitter’s operations team, who spoke this morning at the O’Reilly Velocity Conference in San Jose. “There are a lot of reports that our growth is slowing down,” said Adams. “I can’t say what the real numbers are. But it’s just not slowing down at all. All that traffic has led to an insane amount of pain.”

Measuring and analyzing performance data has been the primary weapon in Twitter’s ongoing effort to vanquish the “Fail Whale” – the downtime mascot that appears whenever Twitter is unavailable.

“You really want to instrument everything you have,” Adams told an audience of 700 operations professionals. “The best thing you can do is have more information about your system. We’ve built a process around using these metrics to make decisions. We use science. The way we find the weakest point in our infrastructure is by collecting metrics and making graphs out of them.”

Those metrics are aggregated in a “Lord of the Rings” dashboard (“One dashboard to rule them all”) that brings together more than 1,200 data points for staff to track and analyze. That includes data from Twitter’s in-house monitoring, as well as from its data center and network services provider, NTT America, and from Google Analytics. Interestingly, one of the most useful data points from Google Analytics is the “Fail Whale” page, which includes analytics code to track error data.

The appearance of the Fail Whale indicates a server error known as a 503, which then triggers a “Whale Watcher” script that prompts a review of the last 100,000 lines of server logs to sort out what has happened. When at all possible, Twitter tries to adapt by slowing the site performance as an alternative to a 503, according to Adams, who uses “whale” as a verb. “Our general fail mode has been to delay rather than whale,” he said. “We hate whale.”
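Adams did not describe how the Whale Watcher script works internally. As a rough illustration only, a review of the last 100,000 log lines might start by counting 503 responses per request path. The log format, function names, and sample data below are all assumptions made for this sketch, not Twitter's actual tooling.

```python
from collections import Counter, deque
from typing import Iterable, List


def tail_lines(lines: Iterable[str], limit: int = 100_000) -> List[str]:
    """Keep only the last `limit` lines of a log stream."""
    # A bounded deque discards older lines as newer ones arrive.
    return list(deque(lines, maxlen=limit))


def summarize_503s(lines: Iterable[str], limit: int = 100_000) -> Counter:
    """Count 503 responses per request path in the last `limit` lines.

    Assumes a hypothetical space-separated log format:
    "<timestamp> <path> <status>". Malformed lines are skipped.
    """
    counts: Counter = Counter()
    for line in tail_lines(lines, limit):
        parts = line.split()
        if len(parts) == 3 and parts[2] == "503":
            counts[parts[1]] += 1
    return counts


# Made-up sample log lines, for illustration only.
sample = [
    "t1 /home 200",
    "t2 /timeline 503",
    "t3 /timeline 503",
    "t4 /search 503",
]

# Paths with the most 503s come first.
print(summarize_503s(sample).most_common())
```

In practice the input would be an open log file rather than a list, and the summary would feed whatever review the operations team performs next.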

Adams said the focus on metrics is one of the ways Twitter has matured. “In the beginning we had a lot of cowboy stuff, a lot of changes going on without control,” he said. “We’ve got a handle on that.”

In offering advice to other site operators, Adams cited the importance of an off-site status page to keep users informed about problems. Twitter has a status blog on Tumblr. Adams says keeping users informed can reduce “armchair engineering.”

“And we’ve definitely been a victim of that,” he said.

About the Author

Rich Miller is the founder and editor at large of Data Center Knowledge, and has been reporting on the data center sector since 2000. He has tracked the growing impact of high-density computing on the power and cooling of data centers, and the resulting push for improved energy efficiency in these facilities.

2 Comments

  1. John Adams

    You've gotten most of my words right, one correction: Twitter doesn't slow down site performance instead of sending a whale. We maintain site performance by delaying timelines or background processing tasks. There's no noticeable site delay, just a delay in the timeline (say, a message that is slightly older than real-time.)