Hadoop takes its name from a toy elephant belonging to the son of Doug Cutting, chief architect at Cloudera and one of the engineers behind the open source framework.

Cloudera Aims to Replace MapReduce With Spark as Default Hadoop Framework

Looking to tie the Apache Spark in-memory computing framework much more closely to Apache Hadoop, Cloudera today announced it is leading an effort to make Spark the default data processing framework for Hadoop.

While IT organizations will still be able to layer other data processing frameworks on top of Hadoop clusters, the One Platform Initiative makes the case for replacing MapReduce with Spark as Hadoop's default data processing engine, said Matt Brandwein, director of product marketing for Cloudera.

Most IT organizations consider MapReduce to be a fairly arcane programming tool. For that reason, many have adopted any number of SQL engines as mechanisms for querying Hadoop data.
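To see why MapReduce strikes many developers as arcane, consider what even a word count requires under the model. The following is a minimal plain-Python sketch of the MapReduce programming model (map, shuffle, reduce) — the function names are illustrative, not Hadoop's actual Java API, which additionally demands Mapper/Reducer classes, job configuration, and cluster submission for the same task.

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce programming model in plain Python.
# Real Hadoop MapReduce expresses each phase as Java classes plus job config.

def map_phase(document):
    # Emit (word, 1) pairs, analogous to a Mapper's map() call.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key, analogous to the framework's shuffle/sort step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts per word, analogous to a Reducer's reduce() call.
    return {key: sum(values) for key, values in groups.items()}

docs = ["Spark replaces MapReduce", "Spark runs on Hadoop"]
pairs = (pair for doc in docs for pair in map_phase(doc))
counts = reduce_phase(shuffle(pairs))
print(counts["spark"])  # 2
```

A SQL engine over Hadoop data collapses all of the above into a single `SELECT word, COUNT(*) ... GROUP BY word` query, which is precisely the appeal.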

Last year, Google publicly announced that it had stopped using MapReduce, which it found inadequate for its purposes, and replaced it with its own framework, called Dataflow. The company launched Dataflow as a beta cloud service earlier this year.

When it comes to building analytics applications that reside on top of Hadoop, the Spark framework has been enjoying a fair amount of momentum.

Brandwein noted that there are at least 50 percent more active Spark projects than there are Hadoop projects. The One Platform Initiative would, in effect, formalize what is already rapidly becoming a de facto standard approach to building analytics applications on Hadoop.

“We want to unify Apache Spark and Hadoop,” he said. “We already have over 200 customers running Apache Spark on Hadoop.”

Cloudera, claimed Brandwein, has five times more engineering resources dedicated to Spark than other Hadoop vendors and has contributed more than 370 patches and 43,000 lines of code to the open source project. Cloudera also led the integration of Spark with YARN for shared resource management on Hadoop, as well as integration efforts involving SQL frameworks such as Impala, messaging systems such as Kafka, and data ingestion tools such as Flume.
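The YARN integration mentioned above is what lets Spark jobs share cluster resources with other Hadoop workloads. As a sketch, a Spark application is handed off to a Hadoop cluster with `spark-submit` roughly as follows; the application file and resource sizes are illustrative, not recommendations:

```shell
# Submit a Spark application to a Hadoop cluster via YARN.
# wordcount.py and the input path are hypothetical examples;
# executor counts and memory depend on the cluster.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-memory 4g \
  --executor-cores 2 \
  wordcount.py hdfs:///data/input
```

In `cluster` deploy mode the Spark driver itself runs inside a YARN container, so the whole job is scheduled and accounted for by the same resource manager that governs MapReduce and other Hadoop workloads.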

The long-term goal, said Brandwein, is to make it possible for Spark jobs to scale simultaneously across multi-tenant clusters with over 10,000 nodes, which will require significant improvements in Spark reliability, stability, and performance.

Cloudera, he added, is also committed to making Spark simpler to manage in enterprise production environments and ensuring that Spark Streaming supports at least 80 percent of common stream processing workloads. Finally, Cloudera will look to improve Spark Streaming performance in addition to opening up those real-time workloads to higher-level language extensions.

Exactly how much support for this initiative Cloudera has remains to be seen. The company, for example, has long-standing relationships with both Intel and Oracle. The rest of the IT industry at this juncture appears to be more committed to the Hadoop distribution put forward by Cloudera’s rival Hortonworks.


About the Author

Michael Vizard has been covering enterprise IT issues for more than 25 years, during which time he has been the editorial director for Ziff-Davis Enterprise as well as editor-in-chief of CRN and InfoWorld.

2 Comments

  1. A logical step forward. MapReduce is obsolete.

  2. But what about Apache Flink? I've heard that people compare it with Spark.