Fiber Cut Knocks Wikipedia Offline
August 7th, 2012 By: Rich Miller
The IT team at Wikimedia has been working hard to improve the reliability of Wikipedia, the crowd-sourced online encyclopedia that has grown into the world’s fifth-busiest web. But even the new and improved Wikipedia infrastructure was no match for a backhoe.
Wikimedia’s sites experienced a one-hour outage early Monday when it lost network connectivity between its two U.S. data centers, located in Tampa and Ashburn, Virginia. “Upon checking with our network provider, they informed us that the outage was caused by a fiber cut between the two data centers,” reports CT Woo, Director of Technical Operations for Wikimedia, on the Wikimedia Tech Blog.
“While Ashburn serves most of the traffic, it needs to talk to our Tampa data center for backend services (e.g. database),” Woo wrote. “We do operate two 10-g separate fibers between the data centers. We are now working with our network provider to determine how and why we were impacted by that fiber cut when we are supposed to have redundancy in our network. We are still waiting for their full report.”
The Ashburn site is Wikimedia’s newest data center, and will eventually replace the Tampa facility. The group also has a data center in Amsterdam, and may add a server farm in Asia to support its global traffic.
Low Budget Meets High Traffic
As a non-profit running one of the world’s busiest web destinations, Wikipedia provides an unusual case study of a high-performance site. In an era when Apple can spend $1 billion on a single data center, Wikimedia spent $2.6 million on Internet hosting in 2011, with a server count measured in hundreds, not thousands.
Despite its low-budget approach to supporting a high-traffic site, the Wikimedia Foundation made reliability a priority in 2010, when it expanded its U.S. infrastructure. “Our projects are vulnerable to primary data center failure,” the group said in its annual report. “We will build out a second data center to enable safe failover in the case of disaster. We will also increase uptime by improving site monitoring, capacity planning and operations response.”
That meant a long-range plan to shift its server footprint beyond Tampa to Ashburn, probably America’s busiest intersection of networks. Loudoun County, Virginia is home to 4 million square feet of data center space and many of the Internet’s key interconnection facilities. Placing Wikipedia’s servers in Ashburn improves its connection to high-traffic networks that serve its users.
On the site monitoring front, Wikimedia has also rolled out a status dashboard providing detailed stats on the performance and uptime of its sites and services. Monday’s outage dinged the uptime numbers for some of these metrics, but it’s a far cray from the early days of Wikipedia.
“Down time used to be our most profitable product,” joked Domas Mituzas, a performance engineer at Wikipedia, in a 2008 presentation at the O’Relly Velocity conference.. The gag is that when Wikipedia is offline, the site often displays a page seeking donations for additional servers.
Perhaps an external, cloud load balancer will help fail over from one location to another, or direct traffic to alternate sites. We’re not the only company offering these simple but effective solutions.