This month, we focus on the open source data center. From innovation at every physical layer of the data center coming out of Facebook's Open Compute Project to the revolution in the way developers treat IT infrastructure that's being driven by application containers, open source is changing the data center throughout the entire stack. This March, we'll zero in on some of those changes to get a better understanding of the pervasive open source data center.
Running a successful internet business without using the data you accumulate to your advantage is clearly impossible in this day and age. Until about one year ago, Baidu, the web company behind the largest Chinese-language search engine and the country’s answer to Google, had a major technology problem on its hands.
The queries Baidu product managers ran against its databases took hours to complete because of the huge amount of data stored in the company’s data centers. Baidu needed a solution, and its engineers were given the goal of creating an ad-hoc query engine that would manage petabytes of data and finish queries in 30 seconds or less.
The first step was to get rid of MapReduce, the open source distributed computing framework that’s part of Apache Hadoop and that has for years been the most popular framework for batch analytics. Google, the creator of MapReduce, stopped using it years ago because it couldn’t handle the amount of data the company needed to process, replacing it with a new proprietary framework called Cloud Dataflow.
Baidu switched to Spark SQL, a query engine that enables you to run SQL queries against Apache Spark, the open source Big Data processing engine that replaces the batch analytics approach of MapReduce and Hadoop with real-time, or stream, analytics. It helped, but not to the extent the team had hoped.
Spark SQL queries were about four times faster, but each query still took about 10 minutes to complete. Upon closer inspection, the team found that it wasn’t the way the CPUs behaved that slowed the process down. The problem was the network, and, more specifically, the way the query engine used the network to access stored data. Baidu’s data lives in multiple data centers, and to run a query, the system would have to transfer data between sites, stressing the networks and causing big delays.
In-Memory Performance With any Storage
To solve this problem, they turned to a relatively young open source project born at the University of California, Berkeley. At the time, the project was called Tachyon. The software turns a set of disparate storage systems into a single virtual storage pool accessed via a single API. More importantly for Baidu, it makes that virtual storage pool behave the way an in-memory system does: data that’s being processed sits in the memory of the clustered servers that are processing it, which makes the process dramatically faster.
The project was recently renamed Alluxio, which is also the name of a venture capital-backed startup founded by one of its original creators. The company’s founder and CEO, Haoyuan Li, was one of the founding contributors to Spark, which also came out of the AMPLab at UC Berkeley, where he was a PhD candidate at the time.
AMPLab is a Berkeley research hub that focuses on computing challenges presented by modern-day Big Data analytics and distributed hyperscale systems. Ion Stoica, AMPLab’s co-director and co-creator of Spark, was one of Li’s PhD advisors.
Last year, Alluxio received $7.5 million in funding from Andreessen Horowitz, one of Silicon Valley’s most prominent venture capital firms. Since then, the startup has attracted distributed computing experts from Google, VMware, Palantir, and Carnegie Mellon University, as well as other AMPLab participants.
Switching to Alluxio as the underlying storage management layer for its query engine worked for Baidu. A query that took between 100 and 150 seconds to complete using Spark SQL alone now takes 10 to 15 seconds, according to a recently published case study, and that’s for data stored remotely. Similar queries across data stored on local Alluxio nodes take up to five seconds.
To get that in-memory performance, Alluxio moves frequently accessed, or hot, data from the underlying storage systems to the computing nodes it runs on.
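The idea of keeping hot data close to compute can be illustrated with a toy read-through cache. This is only a sketch of the general technique, not Alluxio's implementation: the `ReadThroughCache` class, the block names, and the least-recently-used eviction rule are all assumptions made for illustration.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy model: recently read blocks stay in memory; misses go to slow storage."""

    def __init__(self, backing_store, capacity):
        self.backing_store = backing_store  # stands in for remote storage
        self.capacity = capacity            # max number of blocks held in memory
        self.memory = OrderedDict()         # least recently used block comes first
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.memory:
            # Hot block: serve from memory and mark it as most recently used.
            self.memory.move_to_end(key)
            self.hits += 1
            return self.memory[key]
        # Cold block: fetch from the slow store, then keep a copy in memory.
        self.misses += 1
        value = self.backing_store[key]
        self.memory[key] = value
        if len(self.memory) > self.capacity:
            # Evict the least recently used block to stay within capacity.
            self.memory.popitem(last=False)
        return value


# Hypothetical remote store with three blocks; memory holds only two.
remote = {"block-a": b"aaa", "block-b": b"bbb", "block-c": b"ccc"}
cache = ReadThroughCache(remote, capacity=2)
for key in ["block-a", "block-b", "block-a", "block-c", "block-a"]:
    cache.read(key)
print(cache.hits, cache.misses)  # 2 3
```

In this run, the repeatedly requested "block-a" is served from memory after its first read, while the less popular "block-b" is the one evicted.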
Baidu is now in the process of deploying Alluxio more broadly, gradually moving more and more of its workloads into Alluxio clusters. Some of the first workloads to move are systems for serving images to online users and for offline image analysis. Currently, images for each system sit on separate storage systems. Alluxio will enable Baidu to store them on a single system, where they can be accessed for both online serving and offline analysis. The company expects this change to cut its development and operation costs substantially.
Storage Needs a Revolution
Baidu isn’t the only high-profile case study for Alluxio. British banking giant Barclays runs it as part of the stack used to build machine learning and data-driven applications. The open source project has enjoyed endorsements and investments from giants like China’s Alibaba Group, IBM, Intel, EMC, and its subsidiary Pivotal.
One of its most attractive features is the ability to present a mix of storage resources underneath, be they on-premises or in the public cloud, physical or virtual, spinning disk or SSD, as a single storage resource to the compute layer via an API. It does for storage what Apache Mesos and the Datacenter Operating System, built by the startup Mesosphere, do for compute. Mesos abstracts disparate computing resources in the data center, presenting them to the application as a single computer.
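One common way to present several storage systems as one is a mount table that maps path prefixes to backends. The sketch below shows that general pattern under stated assumptions: `UnifiedNamespace`, the mount points, and the dictionary-based backends are hypothetical and are not taken from Alluxio's API.

```python
class UnifiedNamespace:
    """Toy model: one path-based read API in front of several storage backends."""

    def __init__(self):
        self.mounts = {}  # mount prefix -> backend (a plain dict in this sketch)

    def mount(self, prefix, backend):
        self.mounts[prefix.rstrip("/")] = backend

    def resolve(self, path):
        # Pick the longest mount prefix that matches the path.
        matches = [
            prefix for prefix in self.mounts
            if path == prefix or path.startswith(prefix + "/")
        ]
        if not matches:
            raise KeyError(f"no storage mounted for {path}")
        prefix = max(matches, key=len)
        relative = path[len(prefix):].lstrip("/")
        return self.mounts[prefix], relative

    def read(self, path):
        backend, relative = self.resolve(path)
        return backend[relative]


# Hypothetical backends standing in for an on-premises cluster and a cloud bucket.
on_prem = {"logs/a.txt": "from on-prem storage"}
cloud = {"img/cat.png": "from cloud storage"}

ns = UnifiedNamespace()
ns.mount("/onprem", on_prem)
ns.mount("/cloud", cloud)

print(ns.read("/onprem/logs/a.txt"))  # from on-prem storage
print(ns.read("/cloud/img/cat.png"))  # from cloud storage
```

The caller only ever sees paths; which system actually holds the bytes is decided inside `resolve`.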
Another important capability in Alluxio is tiered storage. Users can, for example, assign the top tier to the in-memory storage layer Alluxio creates on the compute side, the second tier to flash arrays, and the third tier to disk drives. Different workloads have different performance requirements, and many users may be satisfied with the performance and cost of regular disk storage in some cases while using the faster in-memory capabilities in others. Alluxio unifies it all into a single system.
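A rough way to picture tiering is a stack of stores ordered from fastest to slowest, where blocks that no longer fit in one tier spill into the next. The following is a minimal sketch of that idea only: the `TieredStore` class, the tier names and capacities, and the oldest-first demotion rule are assumptions for illustration, not a description of how Alluxio manages its tiers.

```python
from collections import OrderedDict


class TieredStore:
    """Toy model: new blocks land in the fastest tier; overflow is demoted downward."""

    def __init__(self, tiers):
        # tiers: list of (name, capacity) pairs, ordered fastest first.
        self.tiers = [(name, capacity, OrderedDict()) for name, capacity in tiers]

    def write(self, key, value):
        # Drop any stale copy so a block lives in exactly one tier.
        for _, _, blocks in self.tiers:
            blocks.pop(key, None)
        self._place(0, key, value)

    def _place(self, level, key, value):
        if level >= len(self.tiers):
            raise RuntimeError("all tiers are full")
        _, capacity, blocks = self.tiers[level]
        blocks[key] = value
        if len(blocks) > capacity:
            # Demote the oldest block in this tier to the next, slower tier.
            old_key, old_value = blocks.popitem(last=False)
            self._place(level + 1, old_key, old_value)

    def locate(self, key):
        for name, _, blocks in self.tiers:
            if key in blocks:
                return name
        return None


# Hypothetical three-tier layout: memory, flash, then disk.
store = TieredStore([("MEM", 1), ("SSD", 1), ("HDD", 2)])
for block in ["b1", "b2", "b3"]:
    store.write(block, b"data")

print(store.locate("b3"), store.locate("b2"), store.locate("b1"))  # MEM SSD HDD
```

The newest block ends up in memory while older blocks settle into slower, cheaper tiers, which mirrors the performance and cost trade-off described above.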
It’s not limited to Spark. Any framework can access data on any storage system, Li said.
He declined to go into detail about his startup’s business plan, saying Alluxio the company was still in stealth mode. “We have a direction,” he said. “We cannot share it for now.”
At the moment, all the focus is on making the open source technology better. Spark revolutionized computation frameworks, while Mesos is revolutionizing the way data center resources are managed, Li said. “We’re missing revolution in the storage layer.”