Why Data Scientists Want More Than Hadoop

Marilyn Matz
Paradigm4

Marilyn Matz is CEO and co-founder of Paradigm4, the creator of SciDB, a computational database management system used to solve large-scale, complex analytics challenges on Big – and Diverse – Data.

Many new analytical uses require more powerful algorithms and computational approaches than what’s possible in Hadoop or relational databases. Data scientists increasingly need to leverage all of their organization’s data sources in novel ways, using tools and analytical infrastructures suitable for the task.

As we found in our survey of data scientists, organizations are increasingly moving from simple SQL aggregates and summary statistics to next-generation complex analytics, including machine learning, clustering, correlation and principal components analysis.

Hadoop missing the mark

Hadoop is well suited for simple parallel problems, but it comes up short for large-scale complex analytics. A growing number of complex analytics use cases are proving to be unworkable in Hadoop. Examples include building recommendation engines based on millions of customers and products, running massive correlations across giant arrays of genetic sequencing data and applying powerful noise reduction algorithms to find actionable information in sensor and image data.
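To make the distinction concrete, here is a minimal Python sketch (an illustration, not code from Hadoop or SciDB) of how a single Pearson correlation splits into a "simple parallel" problem: each chunk of data is reduced to six partial sums, and the partial sums are then added together. The function names `map_chunk`, `combine`, `chunked_pearson` and `pair_count` are hypothetical. The difficulty described above comes from scale: the number of column pairs grows quadratically with the number of columns, which `pair_count` makes visible.

```python
import math
from functools import reduce


def map_chunk(pairs):
    """Map step: reduce one chunk of (x, y) pairs to six partial sums."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return (n, sx, sy, sxx, syy, sxy)


def combine(a, b):
    """Reduce step: partial sums from two chunks simply add element-wise."""
    return tuple(p + q for p, q in zip(a, b))


def pearson_from_sums(sums):
    """Turn the combined sums into a Pearson correlation coefficient."""
    n, sx, sy, sxx, syy, sxy = sums
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    return cov / math.sqrt(var_x * var_y)


def chunked_pearson(chunks):
    """Correlation over data that is split into independent chunks."""
    return pearson_from_sums(reduce(combine, map(map_chunk, chunks)))


def pair_count(num_columns):
    """Number of distinct column pairs in a full correlation matrix."""
    return num_columns * (num_columns - 1) // 2


# Two chunks of a perfectly linear relationship (y = 2x).
chunks = [[(1, 2), (2, 4)], [(3, 6), (4, 8)]]
r = chunked_pearson(chunks)
```

One pair of columns is easy to parallelize this way. A full correlation matrix over a million columns would need `pair_count(1_000_000)` such computations, which is roughly 5 × 10^11 pairs.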

Currently, first-wave Hadoop adopters like Google, Facebook and LinkedIn need a small army of developers to program and maintain Hadoop. But many organizations either don’t have Hadoop and MapReduce programming expertise in-house or face complex analytics use cases that can’t be readily solved with Hadoop. Because Hadoop does not natively support SQL, joins and other key functions required for managing and manipulating data are not available to data scientists.

Addressing significant shortcomings

Hadoop vendors have also recognized the limitations. They are adding SQL functionality to their products to accommodate data scientists’ preference for a higher-level query language over low-level programming languages like Java, and to address the limitations of MapReduce.

For example, Cloudera is offering Impala, which bypasses MapReduce to provide SQL on top of the Hadoop Distributed File System (HDFS). Other vendors are adding SQL-on-Hadoop solutions to address Hadoop’s significant shortcomings. While these approaches make programming easier, they are limited in how far they can take you because they operate on a file system, not a database management system. They also lack the atomicity, consistency, isolation and durability (ACID) capabilities that are highly desirable for some applications, and they are slow.

Beyond SQL functionality, leveraging skill sets

In addition to lacking SQL functionality, Hadoop doesn’t effectively leverage data scientists’ skill sets. In a Hadoop environment, end users typically write MapReduce jobs in Java as their primary programming approach. But data scientists prefer to work in powerful and familiar high-level languages such as R and Python.

As a result, data stored in Hadoop tends to get exported to a data scientist’s preferred analytical environment, injecting time-intensive, low-value data movement into analytical workflows. Moving data out of Hadoop for analysis, summarization and aggregation and then having to move results back to Hadoop destroys data provenance and makes it difficult for data scientists to seamlessly explore and analyze their data across a spectrum of granularity and aggregations.

Rethinking Hadoop-based strategies

Many organizations are drawn to Hadoop because the Hadoop Distributed File System enables a low-cost storage strategy for a broad range of data types without having to pre-define table schemas or determine what the data will eventually be used for. While this is convenient, it’s a terribly inefficient approach for storing and analyzing massive volumes of structured data.

The move from simple to complex analytics on Big Data signals an emerging need for analytics that scale beyond single-server memory limits and handle sparsity, missing values and mixed sampling frequencies appropriately. These complex analytics methods can also provide data scientists with unsupervised and assumption-free approaches, letting all the data do the talking. Storage and analytics solutions that leverage inherent data structure produce significantly better performance than Hadoop.
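As a small illustration of what handling missing values appropriately can mean, the Python sketch below computes a correlation using only the observations where both series have a value (often called pairwise-complete observations). This is one common convention, chosen here as an assumption. It is not a description of how any particular product treats missing data, and the function names are hypothetical.

```python
import math


def pairwise_complete(xs, ys):
    """Keep only positions where both series have an observed value."""
    return [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]


def pearson_with_missing(xs, ys):
    """Pearson correlation over pairwise-complete observations.

    Missing values are represented as None. Returns None when fewer than
    two complete pairs remain or when either series is constant.
    """
    pairs = pairwise_complete(xs, ys)
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


# Two sensor series with gaps at different positions.
series_a = [1.0, None, 3.0, 4.0, 5.0]
series_b = [2.0, 5.0, None, 8.0, 10.0]
r_missing = pearson_with_missing(series_a, series_b)
```

Here only three of the five positions contribute to the result. Silently treating the gaps as zeros would instead distort the correlation, which is why the choice of convention matters at scale.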

While Hadoop is a useful and pervasive technology, it’s hardly a golden hammer. Hadoop and MapReduce environments require significant development resources and fail to leverage the power of popular high-level languages like R and Python preferred by data scientists.

Too slow for interactive data exploration and not suited for complex analysis, Hadoop forces data scientists to move data from the Hadoop Distributed File System to analytical environments, a time-consuming and low-value activity. As data scientists increasingly turn to complex analytics for help solving their most difficult problems, organizations are rethinking their Hadoop-based strategies. 

Corrected: A previous version of this article erroneously mentioned Splice Machine and its database solution. The article has been updated to correct that error.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating. View previously published Industry Perspectives in our Knowledge Library.

2 Comments

  1. This is Rich from Splice Machine. We are an ANSI-99 operational RDBMS that can support full ACID transactions across multiple rows and tables. We use distributed snapshot isolation to provide lockless, high-concurrency transactions. Can you update the article to reflect this? Thanks. Rich

  2. Joe

    "While Hadoop is a useful and pervasive technology, it’s hardly a golden hammer. Hadoop and MapReduce environments require significant development resources and fail to leverage the power of popular high level languages like R and Python preferred by data scientists." A couple of pieces of feedback: Hadoop is used all over for data science work. Python over Hadoop streaming is great and there are several plug-ins for R on Hadoop. SQL: There's a bunch of tools in the space for SQL on Hadoop. Hive, Drill, Impala, Presto, Spark/Shark just to name a few. MapReduce: Totally agree it's not the right solution for much of what data science needs. At the same time it's not the only game in town. MR is one way of processing data in Hadoop, but not the only way. Use it when it makes sense, but there are many other tools.