Looking to tie the Apache Spark in-memory computing framework more closely to Apache Hadoop, Cloudera today announced it is leading an effort to make Spark the default data processing framework for Hadoop.
While IT organizations will still be able to layer other data processing frameworks on top of Hadoop clusters, the One Platform Initiative makes the case for essentially replacing MapReduce with Spark as the default data processing engine, said Matt Brandwein, director of product marketing for Cloudera.
Most IT organizations consider MapReduce to be a fairly arcane programming tool. For that reason, many have adopted any number of SQL engines as mechanisms for querying Hadoop data.
Last year, Google publicly announced that it had stopped using MapReduce because it was inadequate for its purposes, replacing it with its own framework called Dataflow. The company launched Dataflow as a beta cloud service earlier this year.
When it comes to building analytics applications that reside on top of Hadoop, the Spark framework has been enjoying a fair amount of momentum.
Brandwein noted that there are at least 50 percent more active Spark projects than there are Hadoop projects. The One Platform Initiative would in effect formalize what is already rapidly becoming a de facto standard approach to building analytics applications on Hadoop.
“We want to unify Apache Spark and Hadoop,” he said. “We already have over 200 customers running Apache Spark on Hadoop.”
Cloudera, claimed Brandwein, has five times more engineering resources dedicated to Spark than other Hadoop vendors and has contributed over 370 patches and 43,000 lines of code to the open source stream analytics project. Cloudera also led the integration of Spark with YARN for shared resource management on Hadoop, as well as integration efforts involving SQL frameworks such as Impala; messaging systems such as Kafka; and data ingestion tools such as Flume.
The long-term goal, said Brandwein, is to make it possible for Spark jobs to scale simultaneously across multi-tenant clusters with over 10,000 nodes, which will require significant improvements in Spark reliability, stability, and performance.
Cloudera, he added, is also committed to making Spark simpler to manage in enterprise production environments and to ensuring that Spark Streaming supports at least 80 percent of common stream processing workloads. Finally, Cloudera will look to improve Spark Streaming performance and to open up those real-time workloads to higher-level language extensions.
Exactly how much support for this initiative Cloudera has remains to be seen. The company, for example, has long-standing relationships with both Intel and Oracle. The rest of the IT industry at this juncture appears to be more committed to the Hadoop distribution put forward by Cloudera’s rival Hortonworks.