Google Processing 20 Petabytes a Day

Google (GOOG) engineers Jeffrey Dean and Sanjay Ghemawat have published a new paper providing some details on MapReduce, the company's technology for processing the huge datasets behind its web index. The paper notes that more than 100,000 MapReduce jobs are executed each day, together processing more than 20 petabytes of data. The paper was published in the January 2008 issue of Communications of the ACM. Full text is limited to ACM members (a copy has been posted elsewhere on the web, but we won't link it here).

I first saw this mentioned Sunday night by Greg Linden, who notes that the paper also provides some data points on Google's server configuration, including the fact that each dual-processor box in the Google cluster typically has 4 to 8 gigabytes of memory.

"What is so remarkable about this is how casual it makes large scale data processing," Greg writes. "Anyone at Google can write a MapReduce program that uses hundreds or thousands of machines from their cluster. Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they can iterate on it and try something else if they didn't get what they wanted the first time. It is an amazing tool for Google."

The data from the Google researchers is getting more visibility today after being picked up by a number of tech blogs, most notably Niall Kennedy, who has summarized some of the statistics on the growth of the data processed using MapReduce. Niall's post has since been picked up by TechCrunch and Paul Kedrosky.

For folks who are interested in even more information about MapReduce, there are several videos on Google Video that detail its workings. Google Dataset specialist Barry Brumitt gave a 30-minute presentation on MapReduce at the Seattle Conference on Scalability in June, while a team of Google developers offered a five-part lecture series in August on MapReduce and cluster computing.