Andy Warfield is the CTO and Co-founder of Coho Data.
This is Part II of the two-part series. You can read Part I here.
Big data is a big opportunity. However, in talking to both large enterprises and big data distribution companies over the past year, I was surprised to learn exactly how nascent these technologies are. Many large enterprise IT organizations are faced with the challenge of tracking a proliferation of small, ad hoc analytics clusters, or of having to plan for a larger central deployment with mixed technology requirements across multiple stakeholders.
While many believe there is untapped value to be had through harnessing their unstructured data, the path to a toolset that is as reliable, scalable and integrated as the rest of their enterprise IT environment is far from clear.
So how do we get there? This isn’t simply a matter of choosing an appropriate big data distribution. The fluidity of big data software stacks and the hands-on nature of development practices are not going to change any time soon.
In the face of this fluidity, what needs to change in order to support big data workflows in production IT environments? How can we help big data projects succeed and continue to succeed in the face of growth, evolution and disruptive events?
Big Data’s Infrastructure Problems Need Infrastructure Solutions
It’s fine that part of the job description of a big data developer or data scientist is the ability to adapt and work with new tools. However, a Wall Street bank or global healthcare firm isn’t in the business of experimenting with new and potentially fragile software tools as part of core IT. An infrastructure solution for big data must let developers deploy the diverse tools they need. It must also meet the efficiency, reliability and security requirements that the enterprise applies to the rest of its data. In short, it’s time for analytics environments to be brought into the fold, instead of being treated as a totally separate silo.
Unfortunately, the incorporation of big data into traditional IT has proven more difficult than anyone anticipated. This is because the compute and IO requirements of big data systems are significantly different from what traditional enterprise systems have been designed to support.
The Storage Tussle
Probably the biggest mismatch between traditional enterprise IT and big data centers on storage. Enterprise storage companies use space-efficient techniques to protect data from device failure without wasting space on extra copies. HDFS, by contrast, defaults to keeping three copies of each file, and stores them on local disks within the compute nodes themselves. These copies aren’t just for durability: they provide flexibility in scheduling compute jobs close to data.
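To make the space trade-off concrete, here is a rough back-of-the-envelope sketch. The 600 TB data set and the 6+2 parity layout are illustrative assumptions, not figures from any particular product; the only number taken from the discussion above is the HDFS default of three copies.

```python
def raw_capacity_replicated(logical_tb: float, copies: int = 3) -> float:
    """Raw capacity needed when every block is stored `copies` times."""
    return logical_tb * copies


def raw_capacity_parity(logical_tb: float, data_units: int, parity_units: int) -> float:
    """Raw capacity needed when each stripe holds data_units of data
    plus parity_units of parity (hypothetical layout)."""
    return logical_tb * (data_units + parity_units) / data_units


logical_tb = 600.0  # hypothetical data set size, chosen for round numbers

replicated = raw_capacity_replicated(logical_tb)  # three full copies
parity = raw_capacity_parity(logical_tb, 6, 2)    # assumed 6+2 parity layout

print(f"3x replication: {replicated:.0f} TB raw")
print(f"6+2 parity:     {parity:.0f} TB raw")
```

Under these assumptions, both schemes survive the loss of two devices holding a given piece of data, but replication consumes 1,800 TB of raw capacity where the parity layout consumes 800 TB. What the extra copies buy is three different nodes on which a job can run next to its data.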
For years, traditional storage companies have tried to sell into big data environments, only to be defeated by this aspect of the architecture: the narrow connectivity into a network-attached storage controller runs directly counter to HDFS’s design, which scales out I/O connectivity in step with compute.
But there’s a counterpoint to this concern. As much as big data vendors would like to position HDFS as a “big data hub,” the file system falls far short of traditional enterprise storage on many counts. In fact, HDFS is not designed as a traditional file system at all. It doesn’t support the modification of existing files and the file system also struggles to scale beyond a million objects. However, more than all of this, HDFS lacks the rich set of data services that we have become accustomed to with enterprise storage systems, including things like snapshots, replication for disaster recovery and tiering across layers of performance.
A really big change is taking place on this front right now. Big data distributions and enterprise storage vendors alike are starting to acknowledge that HDFS is really more of a protocol than a file system.
Several traditional enterprise storage vendors have announced support for direct HDFS protocol-based access to data even though that data isn’t stored in HDFS (the file system) at all. Moreover, analytics distributions are acknowledging that data may be stored on external HDFS-interfaced storage.
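One way to picture “HDFS as a protocol” is as an interface that more than one storage backend can satisfy. The sketch below is purely illustrative: the class and method names are hypothetical and do not correspond to the real HDFS client API or to any vendor’s product.

```python
from typing import Dict, List, Protocol


class FileStore(Protocol):
    """Hypothetical stand-in for the protocol an analytics job speaks."""

    def create(self, path: str, data: bytes) -> None: ...
    def read(self, path: str) -> bytes: ...


class LocalReplicatedStore:
    """Stands in for HDFS the file system: write-once files, several copies."""

    def __init__(self, copies: int = 3) -> None:
        self.replicas: List[Dict[str, bytes]] = [{} for _ in range(copies)]

    def create(self, path: str, data: bytes) -> None:
        if path in self.replicas[0]:
            raise FileExistsError(path)  # existing files cannot be modified
        for replica in self.replicas:
            replica[path] = data

    def read(self, path: str) -> bytes:
        return self.replicas[0][path]


class ExternalArrayStore:
    """Stands in for external storage that speaks the same protocol
    and layers a data service (snapshots) on top."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self.snapshots: Dict[str, Dict[str, bytes]] = {}

    def create(self, path: str, data: bytes) -> None:
        if path in self.files:
            raise FileExistsError(path)
        self.files[path] = data

    def read(self, path: str) -> bytes:
        return self.files[path]

    def snapshot(self, name: str) -> None:
        self.snapshots[name] = dict(self.files)


def word_count(store: FileStore, path: str) -> int:
    """An analytics job that depends only on the protocol, not the backend."""
    return len(store.read(path).decode().split())


for backend in (LocalReplicatedStore(), ExternalArrayStore()):
    backend.create("/logs/day1", b"error warn error info")
    print(type(backend).__name__, word_count(backend, "/logs/day1"))
```

The job in `word_count` never learns which backend it is talking to, which is the sense in which data can sit on external storage while analytics tools continue to address it as if it were in HDFS.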
This approach doesn’t solve the issues around scaling IO to actually achieve efficient “big data,” but it does allow customers to gain access to data services and, importantly, to large volumes of incumbent data that are already stored in those legacy systems.
At the end of the day, the storage tussle points to the fact that what is really needed for big data is to fully integrate enterprise storage and big data storage into a single, unified storage system. The recent turn toward scale-out storage architectures in enterprises holds strong promise here, because scale-out systems do not have the connectivity and controller bottleneck problems that are endemic to older enterprise storage.
In fact, there is a strong possibility that protocol-level integrations for HDFS (both NameNode and DataNode services) may result in systems that far exceed both the performance and functionality of HDFS as it is implemented today, offering the performance of scale-out systems (like HDFS) and the advanced features of enterprise storage systems.
Bringing Big Data into the Fold
Big data tools and environments have requirements that make them challenging to support with traditional IT infrastructure. However, at the end of the day, big data is still data, and enterprises have come to expect a rich set of capabilities in terms of managing, integrating and protecting that data. Many of these are not yet provided by big data systems that elect to add an entirely new IT silo to the datacenter environment.
This situation is changing rapidly. As big data deployments continue to move from exploratory projects to business-critical systems, IT systems will need to adapt and evolve to meet the needs of large-scale analytics and business intelligence systems, while still providing the rich set of capabilities that is expected of all data within the enterprise.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.