While popular, Hadoop is a notoriously difficult framework to deploy. Standing up a Hadoop cluster in a company data center is a complex and lengthy process.
There is a market for helping companies do that, which is why so many startups and incumbent IT vendors have flooded the space in recent years.
One startup, BlueData, founded by a pair of VMware alumni, is tackling one of the hardest problems: presenting data stored in many different formats on many different systems to Hadoop in a uniform way that Hadoop understands, and doing it quickly.
Enterprise analytics systems increasingly need to process a wide variety of data. More and more companies need to process data from internal and external sources together, according to market research firm Gartner. The role of data generated by the Internet of Things is also growing in importance.
Going against the orthodox notion that Hadoop should run on bare-metal servers, BlueData’s platform is based on OpenStack and uses KVM virtual machines. Convincing the market that you can stand up a Hadoop cluster using VMs and still get the performance right is one of the biggest hurdles in meetings with customers and investors, Jason Schroedl, BlueData’s VP of marketing, said.
But they did manage to convince a handful of big-name customers, including Comcast, Orange, and Symantec, and a group of VCs who have pumped $19 million into the Mountain View, California-based startup since it was founded three years ago.
“When we go to pitch our stuff, it’s always, ‘Why don’t I move to Amazon? How do you beat bare metal?’” the company’s co-founder and chief architect Tom Phelan, a 10-year VMware veteran, said.
Translating Everything to HDFS in Real Time
The pitch is simple. BlueData’s platform, called Epic, presents data from a set of disparate file systems or object stores as if it were coming from the Hadoop Distributed File System (HDFS). Using proprietary technology, it serves that data over the HDFS protocol in real time, directly to the VMs in a Hadoop cluster.
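To make the idea of a single namespace over disparate stores concrete, here is a minimal Python sketch. It is not BlueData's implementation, which is proprietary, and it does not speak the HDFS wire protocol. Every name in it (`StorageBackend`, `InMemoryFileSystem`, `InMemoryObjectStore`, `UnifiedNamespace`, and the mount prefixes) is hypothetical, and the two backends are in-memory stand-ins for real storage systems.

```python
from abc import ABC, abstractmethod
from typing import Dict


class StorageBackend(ABC):
    """Hypothetical interface that each underlying storage system implements."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        ...


class InMemoryFileSystem(StorageBackend):
    """Stand-in for a file system; keys are absolute file paths."""

    def __init__(self, files: Dict[str, bytes]):
        self._files = files

    def read(self, path: str) -> bytes:
        return self._files[path]


class InMemoryObjectStore(StorageBackend):
    """Stand-in for an object store; keys have no leading slash."""

    def __init__(self, objects: Dict[str, bytes]):
        self._objects = objects

    def read(self, path: str) -> bytes:
        return self._objects[path.lstrip("/")]


class UnifiedNamespace:
    """Routes one path namespace to whichever backend is mounted at a prefix."""

    def __init__(self) -> None:
        self._mounts: Dict[str, StorageBackend] = {}

    def mount(self, prefix: str, backend: StorageBackend) -> None:
        self._mounts[prefix.rstrip("/")] = backend

    def read(self, path: str) -> bytes:
        # Try longer prefixes first so the most specific mount wins.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if path.startswith(prefix + "/"):
                # Hand the backend the path relative to its mount point.
                return self._mounts[prefix].read(path[len(prefix):])
        raise FileNotFoundError(path)


# The caller uses one API regardless of where the bytes actually live.
namespace = UnifiedNamespace()
namespace.mount("/nfs", InMemoryFileSystem({"/logs/a.txt": b"from file system"}))
namespace.mount("/objects", InMemoryObjectStore({"sales/q1.csv": b"from object store"}))
print(namespace.read("/nfs/logs/a.txt"))
print(namespace.read("/objects/sales/q1.csv"))
```

The point of the sketch is only the routing: the analytics job sees one path scheme, and the translation to each store's native addressing happens behind it.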
It supports a variety of Hadoop distributions, including numerous versions of Cloudera and Hortonworks. It supports both native Spark and Spark on top of Hadoop. Spark is an open source distributed-processing framework for real-time analytics.
One big reason the system is so fast is BlueData’s own caching architecture. Its VMs are fast and lightweight, and it runs HDFS outside of the cluster.
Customers access Epic through a RESTful API or through a user interface. The user selects the number of nodes in the cluster and a Hadoop application, and a few minutes later they have a virtual cluster ready to run their analytics job.
Because the cluster is virtual, a user can stand it up temporarily, run the job, and dismantle it after, freeing up IT resources for something else. It can be used to generate infrequent reports, for example, or for testing applications that rely on Hadoop.
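The stand-up, run, and dismantle pattern can be sketched in a few lines of Python. This is an illustration only: `ResourcePool`, `virtual_cluster`, and `word_count` are hypothetical names, not part of Epic's API, and the "job" is a toy function rather than a real Hadoop or Spark submission.

```python
from contextlib import contextmanager


class ResourcePool:
    """Hypothetical pool of nodes that virtual clusters draw from."""

    def __init__(self, total_nodes: int):
        self.free_nodes = total_nodes

    def allocate(self, count: int) -> None:
        if count > self.free_nodes:
            raise RuntimeError("not enough free nodes")
        self.free_nodes -= count

    def release(self, count: int) -> None:
        self.free_nodes += count


@contextmanager
def virtual_cluster(pool: ResourcePool, nodes: int):
    """Stand up a temporary cluster and always dismantle it afterwards."""
    pool.allocate(nodes)
    try:
        yield {"nodes": nodes}
    finally:
        # Runs even if the job fails, so nodes return to the pool.
        pool.release(nodes)


def word_count(cluster: dict, lines: list) -> dict:
    """Toy job; a real one would be submitted to the cluster."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


pool = ResourcePool(total_nodes=10)
with virtual_cluster(pool, nodes=4) as cluster:
    report = word_count(cluster, ["error warn", "error"])
print(report)
print(pool.free_nodes)
```

Because the release happens in a `finally` clause, the nodes are freed for other work whether the job succeeds or raises an error.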
Hadoop on AWS in Docker Containers
BlueData focuses primarily on on-premises enterprise deployments, but the company recently introduced a cloud-based version of the platform called Epic Lite. It runs in cloud VMs on Amazon Web Services or on a laptop and uses Docker containers.
Cloud-based deployments are not really BlueData’s sweet spot, Phelan said. Epic cannot deliver the high performance it is engineered for without backend access to the data center systems it runs on.
The company introduced Epic Lite mainly because Docker containers are so popular nowadays, but also because containers can potentially provide a performance improvement over using virtual machines when deployed on-prem. “We do pay a penalty for virtual machines,” Phelan said. “They have a CPU overhead.”
Epic consumes more CPU cycles to run Hadoop jobs than bare-metal solutions, and there are customers that need to consolidate as many Hadoop workloads into as few CPU cycles as possible. “Containers are the best solution for them,” he said.
There are also customers that simply don’t have dedicated bare-metal servers in their data centers to put Epic on. They can only allocate VMs, and you cannot run BlueData’s VMs inside other VMs. These customers have to use the container-based version too.
Will Containers Replace VMs?
Asked whether the full enterprise version of Epic can be entirely container-based, Phelan said it would not be feasible, at least today. “State-of-the-art containers still have security liability issues,” he said.
Epic is a multi-tenant system, and Docker containers cannot really isolate applications from one another all that well when sharing a host. Secure separation between different users is a must for many enterprise customers, and there is “still a pretty good attack surface within containers,” Phelan said.
Whatever form virtualization ultimately takes, companies have access to more data today than they have ever had, and many of them are anxious to put it to work. As enterprise use of data analytics continues to grow, so does the size of the opportunity for companies like BlueData.
Companies look at Big Data as a way to grow business value, and the easier it is for their analytics engines to access wide pools of data, the more effective those engines will be.