Raymie Stata is CEO and founder of Altiscale, Inc. Raymie previously served as Chief Technical Officer at Yahoo!, where he played an instrumental role in algorithmic search, display advertising and cloud computing. He also helped set Yahoo’s Open Source strategy and initiated its participation in the Apache Hadoop project.
Hadoop has clearly become the leading platform for big data analytics today. But in spite of its immense promise, not all organizations are ready for, or capable of, implementing and maintaining a successful Hadoop environment. The need for Hadoop, coupled with a lack of expertise in managing large, parallel systems, has given rise to a multitude of Hadoop-as-a-Service (HaaS) providers. HaaS providers present an outstanding opportunity for overwhelmed data center admins who need to incorporate Hadoop but don’t have the in-house resources or expertise to do so.
But what kind of HaaS provider do you need? The differences among service offerings are dramatic. HaaS providers offer a range of features and support, from basic access to Hadoop software and virtual machines, to preconfigured software in a “run it yourself” (RIY) environment, to full-service support options that include job monitoring and tuning support.
Any evaluation of HaaS should take into account how well each service enables you to meet your business objectives while minimizing Hadoop and infrastructure management issues. Here are five criteria that help distinguish the variety of HaaS options.
HaaS Should Satisfy the Needs of Both Data Scientists and Data Center Administrators
Data scientists spend significant amounts of time manipulating data, integrating data sets and applying statistical analyses. These users typically want a functionally rich and powerful environment. Ideally, data scientists should be able to run Hadoop YARN jobs through Hive, Pig, R, Mahout and other data science tools. Compute operations should be immediately available when the data scientist logs into the service to begin work. Delays in starting clusters and reloading data are inefficient and unnecessary. “Always on” Hadoop services avoid the frustrating delays that occur when a data scientist must deploy a cluster and load data from non-HDFS data stores before starting work.
For systems administrators, less is more. Their job typically entails a set of related management tasks. Management consoles should be streamlined so that administrators can perform these tasks quickly and in a minimal number of steps. If the administrator must configure a set of parameters, the console should expose those parameters and hide the ones that are managed by the HaaS provider. Similarly, low-level monitoring details should be left to the HaaS provider. The administration interface should simply report on the overall health and SLA compliance of the service.
HaaS Should Store “Data at Rest” in HDFS
HDFS is the native file system for storing data in Hadoop. When data is persisted in another storage system, it must be loaded into HDFS before Hadoop can work on it. Storing data persistently in HDFS avoids the delays and the cost of moving data from another store into HDFS.
After initial data loads, users should not have to manage data in storage systems that are not native to Hadoop or be required to move data into and out of HDFS as they do their work. HDFS is industry-tested to provide cost-effective, reliable storage at scale. It is optimized to work efficiently with MapReduce and YARN-based applications, is well suited to interactive use by analysts and data scientists, and is compatible with Hadoop’s growing ecosystem of third-party applications. HaaS solutions should offer “always on” HDFS so users can easily leverage these advantages.
HaaS Should Provide Elasticity
Elasticity should be a central consideration when evaluating HaaS providers.
Consider the ease with which the service manages elastic demand, and in particular how transparently it handles changing demands for compute and storage resources. For example, Hadoop jobs can generate interim results that must be stored temporarily. Does the HaaS transparently expand and contract storage without system administrator intervention? If not, Hadoop administrators may need to be on call to adjust storage parameters or risk delaying jobs.
Also consider how well the HaaS manages workloads. Environments that support both production jobs and ad hoc analysis by data scientists will experience a wide range of mixed workloads. How easily does the service adjust to these varying workloads? Can it effectively manage YARN capacity and related CPU capacity?