Hadoop and Big Data Storage: The Challenge of Overcoming the Science Project

Andrew Warfield is the CTO of Coho Data and an associate professor of computer science at the University of British Columbia.

This is Part I of a two-part series. Come back to read Part II next week.

About two years ago, I started talking to Fortune 500 companies about their use of tools like Apache Hadoop and Spark to deal with big data in their organizations. I specifically sought out these large enterprises because I expected they would all have huge deployments, sophisticated analytics apps, and teams that were taking huge advantage of data at scale.

As the CTO of an enterprise infrastructure startup, I wanted to understand how these large-scale big data deployments were integrating with existing enterprise IT, especially from a storage perspective, and to get a sense of the pain points.

What I found was quite surprising. With the exception of a small number of large installs — and there were some of these, with thousands of compute nodes in at least two cases — the use of big data tools in most of the large organizations I met with had a number of similar properties:

Many Small Big Data Clusters

The deployment of analytics tools inside the enterprise has been incredibly organic. In many situations, when I asked to talk to the “big data owner,” I wound up with a list of people, each of whom ran an 8- to 12-node cluster. Organizational IT owners and CIOs referred to this as “analytics sprawl,” and several IT directors jokingly mentioned that the packaging and delivery of Cloudera CDH in Docker packages was making it “too easy” for people to stand up new ad hoc clusters. They had the sense that this sprawl of small clusters was actually accelerating within their companies.

Non-standard Installs, Even on Standard Distributions

The well-known big data distributions, especially Cloudera and Hortonworks, are broadly deployed in these small clusters, as they do a great job of combining a wide set of analytics tools into a single documented and manageable environment. Interestingly, these distributions are generally used as a “base image,” into which all sorts of other tools are hand-installed. As an example, customizations for ETL (extract, transform, load), which pull data out of existing enterprise data sources, are common. So are additions of new analytics engines (H2O, Naiad and several graph analytics tools) that aren’t included in standard distributions. The software ecosystem around big data is moving so fast that developers are actively trying out new things and extending these standard distributions with additional tools. While agile, this practice makes it difficult to deploy and maintain a single central cluster for an entire organization.

Inefficiencies and Reinvention

Whether large or small in scale, analytics environments are typically deployed as completely separate silos alongside traditional IT, in their own racks and on their own switches. Data is bulk copied out of enterprise storage and into HDFS, jobs are run, and the results are then copied out of HDFS and back to enterprise storage. Separate compute infrastructure is deployed to run analytics jobs, which wastes capacity and effectively doubles both capital and operational costs. Finally, business continuity concerns, such as the availability of clusters and the protection of data, are addressed by installing duplicate clusters at multiple physical sites and running the exact same computation at each one.

More Than One Way to Build Big Data

It’s important to point out that none of these things are necessarily wrong: As companies are in the early stages of exploring big data tools, it makes complete sense that things happen in an organic and grassroots manner. However, as these tools start to bear fruit and become critical parts of business logic, their operational needs change quickly. It was remarkable to me that in many of my conversations, both with analytics cluster owners and with traditional IT owners, the state of big data in their organization was described as “a bit of a science project”; not in a necessarily negative way, but certainly as a way of characterizing the isolated and ad hoc nature of cluster deployments.

As a result of all this, one of the most significant challenges facing enterprise IT teams today is how to efficiently support and enable the “science” of big data, while providing the confidence and maturity of more traditional (and often better understood) infrastructure services. Big data needs to become a reliable and repeatable product offering – not to mention efficient and affordable – within IT organizations in the same way that storage, virtual machine hosting, databases and related infrastructure services are today.

From Science Project to Data Science Product

So how do we get there? One thing to make clear is that this isn’t simply a matter of choosing an appropriate big data distribution. The fluidity of big data software stacks and the hands-on nature of development practices are not going to change any time soon.

For big data science projects to evolve into viable, efficient solutions, it will take as much rethinking from vendors as it will from the companies deploying these solutions. This evolution is happening quite rapidly as vendors provide infrastructure solutions that bridge the gap between web-scale approaches and traditional data center architectures.

I’m eager to see how these changes allow companies of all shapes and sizes to further leverage big data to grow their businesses and better serve and understand customers.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.
