Google’s MapReduce Divorce Does Not Mean End of Hadoop is Near

At the Google I/O conference in San Francisco last week, Urs Hölzle, senior vice president of technical infrastructure at Google, said MapReduce was no longer sufficient for the scale of data analytics the company needed and that the company had stopped using it years ago, having replaced it with a much more capable system its engineers had cooked up.

MapReduce is a programming model for distributing massive amounts of data across clusters of commodity servers and running those servers in parallel to process the data quickly. It came out of Google, which published a paper describing the system in 2004 and received a patent for MapReduce in 2010.
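To make the model concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps, using word counting as the example. It is an illustration only, not Google's or Hadoop's implementation: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real framework would spread each phase across many servers.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework does this between the two phases;
    here it is a plain in-memory dictionary (an assumption for brevity).
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data pipelines"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can in principle run in parallel on separate machines, which is the property the model relies on.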

The model did, however, serve as the basis for Apache Hadoop, the open source implementation of MapReduce. As companies try to get value out of all the data they and their customers generate and store, Hadoop has become very popular, and a number of business models (and businesses) have sprung up to offer Hadoop distributions and services around it to enterprises.

Hadoop MapReduce linkage broken

Last week, however, Google said MapReduce was no longer cutting it. Once datasets the company was running queries across reached multi-petabyte scale, the system became too cumbersome, Hölzle said. The announcement naturally raised the question of whether it meant the beginning of the end of the Hadoop ecosystem.

While it is further proof that MapReduce is losing some of the steam it once had (it is a 10-year-old technology after all), the Hadoop ecosystem has grown into something much larger than MapReduce, and it is way too early to declare that the sun is setting on this ecosystem.

As John Fanelli, vice president of marketing at DataTorrent, notes, MapReduce and Hadoop have been separable since the release of Hadoop 2, which is really more like an operating system that can run different kinds of data processing workloads, MapReduce being only one of them.

The second generation of Hadoop introduced YARN (Yet Another Resource Negotiator), which breaks the linkage between MapReduce and the Hadoop Distributed File System and makes it possible for other processing models to be applied to data stored on Hadoop clusters.

Batch processing demand on decline

Arguably the biggest advantage of Hadoop 2 and YARN is real-time data processing, also referred to as stream processing. MapReduce is designed for batch processing, while users increasingly need stream processing.

Google’s replacement for MapReduce, its Cloud Dataflow system, combines batch and stream processing. Its developers and customers (the company is offering it as a service on its cloud platform) can use a unified programming model to create pipelines that include both batch and streaming services.
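The idea of one programming model serving both batch and streaming input can be sketched in a few lines. The example below does not use the Cloud Dataflow API; it is a plain-Python illustration under assumed names (pipeline, event_stream) and an assumed event format, showing the same transform applied to a bounded list and to an unbounded generator.

```python
from itertools import islice


def pipeline(events):
    """Hypothetical transform: yield a running count of error events.

    It only iterates over its input, so it works on bounded (batch)
    and unbounded (streaming) sources alike.
    """
    errors = 0
    for event in events:
        if event["level"] == "error":
            errors += 1
            yield errors


# Batch: a bounded dataset that is fully available up front.
batch = [{"level": "info"}, {"level": "error"}, {"level": "error"}]
batch_result = list(pipeline(batch))


def event_stream():
    """Simulated unbounded source: every second event is an error."""
    n = 0
    while True:
        n += 1
        yield {"level": "error" if n % 2 == 0 else "info"}


# Streaming: results are produced as events arrive; take the first three.
stream_result = list(islice(pipeline(event_stream()), 3))

print(batch_result, stream_result)
```

The transform itself never needs to know whether its source ends, which is the property a unified batch-and-stream model depends on.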

Fanelli doubts Google itself thinks Cloud Dataflow means the end of Hadoop. “I don’t think Google views it as a Hadoop killer,” he says. “It’s an alternative. It actually continues to validate what we’re seeing from customers. They want real-time streaming analytics of their data.”

End of “one-size-fits-all data management”

Perhaps not coincidentally, MapR, one of the leading enterprise Hadoop distribution vendors, announced a $110 million funding round led by Google Ventures less than one week after Hölzle’s keynote at I/O. MapR offers its distro as a service that can be deployed on Google Compute Engine (the giant’s Infrastructure-as-a-Service offering). DataTorrent (Fanelli’s employer) has a Hadoop-based stream processing product also offered on top of Compute Engine.

Yes, Cloud Dataflow is now competing with DataTorrent, but it only adds to the variety of available offerings, each with its own advantages and disadvantages. You can only use Dataflow in the Google cloud, for example, while DataTorrent’s solution can be deployed in different public clouds as well as in a user’s own data center.

As Paul Brown, chief architect at Paradigm4, a company with a sophisticated structured-data analytics platform, puts it, if anything is coming to an end, it is the era of “one-size-fits-all data management.” Instead of being the de facto Big Data platform, Hadoop will become one option within a group of platforms companies will choose from, depending on the specifics of their application, he says.
