Understanding Hadoop-as-a-Service Offerings

Raymie Stata is CEO and founder of Altiscale, Inc. Raymie previously served as Chief Technical Officer at Yahoo!, where he played an instrumental role in algorithmic search, display advertising and cloud computing. He also helped set Yahoo’s Open Source strategy and initiated its participation in the Apache Hadoop project.

Hadoop has clearly become the leading platform for big data analytics today. But in spite of its immense promise, not all organizations are ready for, or capable of, implementing and maintaining a successful Hadoop environment. As a result, the need for Hadoop, coupled with a lack of expertise in managing large, parallel systems, has given rise to a multitude of Hadoop-as-a-Service (HaaS) providers. HaaS providers present an outstanding opportunity for overwhelmed data center admins who need to incorporate Hadoop but don’t have the in-house resources or expertise to do so.

But what kind of HaaS provider do you need? The differences among service offerings are dramatic. HaaS providers offer a range of features and support, from basic access to Hadoop software and virtual machines, to preconfigured software in a “run it yourself” (RIY) environment, to full-service options that include job monitoring and tuning support.

Any evaluation of HaaS should take into account how well each service enables you to meet your business objectives while minimizing Hadoop and infrastructure management issues. Here are five criteria that help distinguish the variety of HaaS options.

HaaS Should Satisfy the Needs of Both Data Scientists and Data Center Administrators

Data scientists spend significant amounts of time manipulating data, integrating data sets and applying statistical analyses. These users typically want a functionally rich and powerful environment. Ideally, data scientists should be able to run Hadoop YARN jobs through Hive, Pig, R, Mahout and other data science tools. Compute operations should be immediately available when the data scientist logs into the service to begin work. Delays in starting clusters and reloading data are inefficient and unnecessary. “Always on” Hadoop services avoid the frustrating delays that occur when a data scientist must deploy a cluster and load data from non-HDFS data stores before starting work.

For systems administrators, less is more. Their job typically entails a set of related management tasks. Management consoles should be streamlined so that administrators can perform these tasks quickly and in a minimal number of steps. The console should expose only the parameters the administrator must configure and hide those that are managed by the HaaS provider. Similarly, low-level monitoring details should be left to the HaaS provider. The administration interface should simply report on the overall health and SLA compliance of the service.

HaaS Should Store “Data at Rest” in HDFS

HDFS is the native file system for storing data in Hadoop. When data is persisted in another storage system, it must be loaded into HDFS before Hadoop can work with it. Storing data persistently in HDFS avoids the delay and cost of moving data from another store into HDFS.

After initial data loads, users should not have to manage data in storage systems that are not native to Hadoop or be required to move data into and out of HDFS as they do their work. HDFS is industry-tested to provide cost-effective, reliable storage at scale. It is optimized to work efficiently with MapReduce and YARN-based applications, is well suited to interactive use by analysts and data scientists, and is compatible with Hadoop’s growing ecosystem of third-party applications. HaaS solutions should offer “always on” HDFS so users can easily leverage these advantages.

HaaS Should Provide Elasticity

Elasticity should be a central consideration when evaluating HaaS providers: how easily does the service manage elastic demand? In particular, consider how transparently the service handles changing demands for compute and storage resources. For example, Hadoop jobs can generate interim results that must be stored temporarily. Does the HaaS transparently expand and contract storage without system administrator intervention? If not, Hadoop administrators may need to be on call to adjust storage parameters or risk delaying jobs.

Also consider how well the HaaS manages workloads. Environments that support both production jobs and ad hoc analysis by data scientists will experience a wide range of mixed workloads. How easily does the service adjust to these varying workloads? Can it effectively manage YARN capacity and related CPU capacity?

Ideally, the elastic expansion and contraction of resources should require few, if any, configuration and administration tasks.

HaaS Should Support Non-stop Operations

In production environments with fixed workloads, system administrators can tune operating systems and applications to optimize the processing of those workloads. They can achieve non-stop operations by crafting the best set of configuration parameters and monitoring key operation metrics to ensure jobs run as expected. Hadoop environments are rarely so predictable.

Big data environments are large, complex, distributed, and parallel systems. Such systems present more challenging operating conditions than one finds in non-parallel applications, including:

The need to restart failed subprocesses of a large job to avoid restarting the entire job.
Jobs that starve for resources and finish late (or not at all), even when resources are available.
Deadlock, which occurs when one process must wait for a resource held by another process while the second process simultaneously waits for a resource held by the first process.
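
To make the deadlock condition concrete, here is a minimal Python sketch. It is not taken from Hadoop or any HaaS product. It assumes a simplified, hypothetical "wait-for" map in which each blocked process waits on exactly one other process, and the function name `find_wait_cycle` and the job names are invented for illustration. A deadlock then shows up as a cycle in that map.

```python
from typing import Dict, List, Optional


def find_wait_cycle(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the processes that form a wait cycle, or None if there is none.

    waits_for maps a blocked process to the process that holds the
    resource it needs (hypothetical, simplified model: one wait per process).
    """
    for start in waits_for:
        path: List[str] = []
        seen = set()
        current = start
        # Follow the chain of waits until it ends or revisits a process.
        while current in waits_for and current not in seen:
            seen.add(current)
            path.append(current)
            current = waits_for[current]
        if current in seen:
            # The chain looped back on itself: keep only the looping part.
            return path[path.index(current):]
    return None


# Two jobs each waiting on a resource held by the other: a deadlock.
print(find_wait_cycle({"job-1": "job-2", "job-2": "job-1"}))  # ['job-1', 'job-2']
# One job waiting on a job that is not blocked: no deadlock.
print(find_wait_cycle({"job-1": "job-2"}))  # None
```

A real cluster scheduler tracks far more state than this, but the sketch shows why such conditions are hard to spot by hand: the cycle only becomes visible when the waits of all processes are examined together.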

Non-stop Hadoop operations address these and other problems unique to the Hadoop environment. In-house and RIY environments are especially prone to problems maintaining non-stop operations because doing so requires deep Hadoop expertise and tooling.

HaaS Should Be Self-Configuring

One of the advantages of using a HaaS is that it minimizes the need for Hadoop expertise. A HaaS should configure itself for optimal numbers and types of nodes. Data scientists well versed in statistics and machine learning may have deep knowledge about when to apply a particular statistical test or use a specific machine learning algorithm but may have no foundation for deciding on the configuration of a Hadoop cluster needed to run their workflows.

System administrators rarely have too little to do, and few have deep expertise in every area of systems management. A HaaS that provides self-configuration allows system administrators to focus their time and efforts on tasks that cannot be easily automated.

HaaS solutions should dynamically configure the optimal number and type of nodes and automatically determine tuning parameters based on the type of workload and storage required. These optimized environments dramatically reduce human error, reduce administration time, and provide results faster than customer-tuned environments.

Look Before You Leap

Hadoop as a Service is a promising alternative to building and maintaining Hadoop clusters on premises. Existing HaaS providers offer a wide range of features and support. Carefully evaluate HaaS providers before committing to one so you have the best chance of selecting a HaaS that meets both your Hadoop management and data science support needs.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.
