Under the Hood of Hadoop: Its Journey from Open Source to Self-Service

Raymie Stata is CEO of Altiscale.

Apache Hadoop has transitioned from a Silicon Valley innovation to a critical technology for businesses around the world.

What started as an in-house project at Yahoo has become the de facto standard for Big Data processing. It is making deep impacts in industries that historically haven’t had Big Data embedded into their DNA, such as finance, healthcare and energy. What’s even more impressive is the staggering amount of innovation surrounding Hadoop with the growing ecosystem of software built on top of it.

In the Hadoop ecosystem, evolution is encouraged, and Hadoop itself is evolving as it continues to move beyond its Internet roots.

The question “What is Hadoop?” has become something of a trick question, given that it can be defined in a number of ways.

Hadoop is both an Apache project and an ecosystem of technologies. Hadoop the project was the catalyst for an entire ecosystem of Big Data-related projects, which fall under the umbrella broadly called the “Hadoop ecosystem.” This ecosystem does not stand still. There exists an incredible network of software built on top of Hadoop that exemplifies the staggering innovation surrounding Hadoop and is a testament to its value in today’s data-driven world. This software includes Hive, HBase, and, more recently, Spark and Kafka. And that just scratches the surface of what’s out there.

With discussions around Hadoop comes the question of Spark and whether it is a replacement for or successor to Hadoop. Spark is certainly a potential successor for MapReduce. While it still has a way to go in terms of scalability and robustness, it is making fast progress and will eventually completely replace MapReduce for any new code.

But in Hadoop 2.0, MapReduce is just a small part of Hadoop. The main components of the Hadoop project today are the YARN resource manager and the HDFS storage system. Spark does not provide a storage system of its own and is commonly run under an external resource manager such as YARN, yet it needs both to operate. Hadoop and Spark are tremendous complements to each other, and they will co-evolve for years to come.

In terms of adoption, Hadoop has matured beyond early-adopter Internet companies to span enterprises considered more mainstream, such as those in finance and healthcare. As Hadoop infiltrates the enterprise, more and more organizations are putting business-critical processes on top of it and committing significant budgets toward it.

While Hadoop adoption is on the rise, those using it spend far too much time doing the equivalent of janitorial work with large amounts of data. Data scientists are bogged down by administrative tasks, including wrangling data and wrestling with Hadoop clusters. The next big wave of innovation in the Hadoop ecosystem will be in self-service, and we’re already starting to see solutions that help data scientists manage the complexities of Hadoop so that they can focus on uncovering meaningful insights from their data.

Hadoop-based data analytics lends itself to self-service more readily than traditional Enterprise Data Warehousing (EDW) does. Self-service gives programmers, data scientists, and business analysts direct access to all of the data in the enterprise, bypassing the high priests of data that slow everything down and drive up costs. This trend toward self-service will continue to accelerate, which in turn will drive the emergence of new tools for data management and engineering, as well as ongoing work in security and new approaches to data governance.

Like all of IT, Big Data is migrating to the cloud. Historically, companies were cautious about putting sensitive data in the cloud. However, more and more IT leaders today are growing comfortable and confident in the cloud and are testing the waters with hybrid deployments of Hadoop.

Hadoop adoption is unmistakably trending up. While it may take time for Hadoop to supplant existing data investments in more conservative organizations, it’s likely that any enterprise today building a data infrastructure from the ground up will consider Hadoop. And as its popularity grows, so will its journey toward self-service and cloud-based deployments. What started as a small open source project is now a critical component in today’s data-driven landscape.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.
