“Down time used to be our most profitable product,” jokes Domas Mituzas, a performance engineer at Wikipedia. The gag is that when Wikipedia is offline, the site often displays a page seeking donations for additional servers.
As a non-profit running one of the world’s busiest web destinations, Wikipedia provides an unusual case study of a high-performance site. In an era when Google and Microsoft can spend a half-billion dollars on a single global data center project, Wikipedia runs on fewer than 300 servers, most of them in one data center in Tampa, Fla., with additional servers at the AMS-IX peering exchange in Amsterdam.
“The traditional approach to availability isn’t exactly our way,” said Mituzas, who spoke about Wikipedia’s infrastructure Monday at the O’Reilly Velocity conference. “I’m not suggesting you should follow how we do it. But losing a few seconds of changes doesn’t destroy our business. As long as a crash doesn’t turn into a disaster, there’s no witch hunting or heads rolling.”
The engineers on the Wikipedia team may not take themselves too seriously, but they are serious about performance. That’s in keeping with Wikipedia’s guiding principles, which emphasize community over commerce (the site runs no ads) and getting excellent mileage out of its donations. Wikipedia maintains availability in the high 99-percent range, and its usage data includes some mind-boggling numbers.
Mituzas, who works as a MySQL support engineer for Sun Microsystems in his “day job,” shared the following metrics on Wikipedia’s operations:
- 50,000 HTTP requests per second
- 80,000 SQL queries per second
- 7 million registered users
- 18 million page objects in the English version
- 250 million page links
- 220 million revisions
- 1.5 terabytes of compressed data
The site started as a Perl CGI script running on a single server in 2001. Wikipedia now has 200 application servers, 20 database servers and 70 servers dedicated to Squid caching.
Wikipedia is powered by the MediaWiki software, which was originally written to run Wikipedia and is now an open source project. MediaWiki is written in PHP and runs on a MySQL database; Mituzas said individual MySQL instances range from 200 to 300 gigabytes. In addition to Squid, Wikipedia uses Memcached and the Linux Virtual Server load balancer. Wikipedia also uses database sharding, with each shard replicated in a master-slave configuration.
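The combination of sharding and master-slave replication described above can be sketched roughly as follows — this is an illustrative model, not Wikipedia’s actual code: each shard has one master that takes writes and several replicas that serve reads, with plain dicts standing in for database connections.

```python
# Sketch of sharding + master-slave replication (hypothetical, not MediaWiki code).
# Writes go to the shard's master; reads are spread across its replicas.

import hashlib
import random

class Shard:
    def __init__(self, master, replicas):
        self.master = master      # write target
        self.replicas = replicas  # read targets

class ShardedCluster:
    def __init__(self, shards):
        self.shards = shards

    def _shard_for(self, key):
        # Hash the key so the same key always maps to the same shard.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def write(self, key, value):
        shard = self._shard_for(key)
        shard.master[key] = value
        # Real master-slave replication is asynchronous; for this sketch
        # we copy to the replicas immediately.
        for replica in shard.replicas:
            replica[key] = value

    def read(self, key):
        shard = self._shard_for(key)
        # Spread read load across the shard's replicas.
        return random.choice(shard.replicas).get(key)

cluster = ShardedCluster([
    Shard(master={}, replicas=[{}, {}]),
    Shard(master={}, replicas=[{}, {}]),
])
cluster.write("enwiki:Main_Page", "rev 12345")
print(cluster.read("enwiki:Main_Page"))  # -> rev 12345
```

The key design point is that read traffic — the overwhelming majority of Wikipedia’s load — never has to touch a master.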
Additional technical details on Wikipedia’s infrastructure are available in 2007 presentations by Mituzas and Wikimedia’s Mark Bergsma.
Mituzas summed up his view of Wikipedia’s operations in a blog post 
about his Velocity presentation: “As I see it, in such context Wikipedia is more interesting as a case of operations underdog – non-profit lean budgets, brave approaches in infrastructure, conservative feature development, and lots of cheating and cheap tricks (caching! caching! caching!).”
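The “caching! caching! caching!” trick Mituzas alludes to is, at its core, the cache-aside pattern: check the cache before doing expensive work, and populate it on a miss. The sketch below illustrates the idea with a stand-in Memcached client; the key names and rendering function are hypothetical, not MediaWiki internals.

```python
# Cache-aside sketch (illustrative only). FakeMemcached stands in for a
# real memcached client; render_page simulates an expensive database hit.

class FakeMemcached:
    """In-memory stand-in for a memcached client; TTL is accepted but ignored."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def set(self, key, value, ttl=3600):
        self._store[key] = value

cache = FakeMemcached()
db_queries = []  # track how often we fall through to the "database"

def render_page(title):
    """Expensive wikitext-to-HTML rendering, simulated by a DB hit."""
    db_queries.append(title)
    return f"<html>{title}</html>"

def get_page(title):
    key = f"page:{title}"
    html = cache.get(key)          # 1. try the cache first
    if html is None:
        html = render_page(title)  # 2. miss: do the expensive work
        cache.set(key, html)       # 3. populate the cache for next time
    return html

get_page("Main_Page")   # miss: hits the "database"
get_page("Main_Page")   # hit: served from cache
print(len(db_queries))  # -> 1
```

Squid plays the same role one layer further out, serving whole rendered pages to anonymous readers without invoking PHP at all.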
Article printed from Data Center Knowledge: http://www.datacenterknowledge.com
URL to article: http://www.datacenterknowledge.com/archives/2008/06/24/a-look-inside-wikipedias-infrastructure/
URLs in this post:
 Wikipedia: http://wikipedia.org/
 Squid: http://www.squid-cache.org/
 MediaWiki: http://www.mediawiki.org/wiki/MediaWiki
 Memcached: http://www.danga.com/memcached/
 Linux Virtual Server: http://www.linuxvirtualserver.org/
 database sharding: http://www.datacenterknowledge.com/archives/2007/Apr/27/database_sharding_helps_high-traffic_sites.html
 Mituzas: http://dammit.lt/uc/workbook2007.pdf
 Mark Bergsma: http://www.nedworks.org/~mark/presentations/san/Wikimedia%20architecture.pdf
 blog post : http://dammit.lt/2008/06/19/wikipedia-at-velocity-conference/
 Rich Miller: http://www.datacenterknowledge.com/archives/author/richm/