The Role of Cloud in Overcoming the Analytics Deluge

Lakshmi Randall is Director of Product Marketing at Denodo.

Most enterprise workloads are poised to run in the cloud within a year. According to a recent survey conducted by 451 Research, the share of workloads running in private or public clouds will increase from 41 percent to 60 percent by mid-2018. Among survey respondents, 38 percent have already adopted a cloud-first policy, which prioritizes cloud solutions for all workload deployments. This is not surprising given the agility, flexibility, scalability, and perceived reduction in total cost of ownership (TCO) that cloud computing offers, along with the growing volume of cloud-born data. Cloud pricing is also a key driver of cloud workloads. As the cost of cloud computing continues to fall, enterprises are increasingly reluctant to pursue costly expansions of their on-premises data centers or even of appliances such as a data warehouse.

On top of cloud pricing and inherent computing advantages, cloud providers continue to add services such as data warehouse, data integration, data preparation, and analytics that are essential for accelerating the delivery of analytics to both internal and external customers. It's no wonder that the center of gravity for both data and compute capacity is increasingly shifting from the traditional on-premises data center to the cloud, as companies take advantage of its inherent flexibility.

Why is Data Gravity Important?

Data gravity, which moves processing and analysis closer to where the data resides, is gaining momentum among organizations for the simple reason that the alternative, moving data to a processing or compute layer, is costlier and more time-consuming. The volumes of data involved in modern analytics are too large to rely on archaic approaches that require copying masses of data from one system to another for processing. Moving data in and out of the cloud for processing will not resolve this dilemma and can instead exacerbate it.
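To make the cost of bulk data movement concrete, here is a minimal back-of-the-envelope sketch. The function name `transfer_hours`, the 80 percent link efficiency, and the data sizes are illustrative assumptions, not figures from the survey or from any vendor.

```python
def transfer_hours(data_terabytes: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move a data set over a network link.

    data_terabytes: size of the data to move, in decimal terabytes (assumed).
    link_gbps:      nominal link speed in gigabits per second (assumed).
    efficiency:     fraction of nominal speed actually achieved (assumed 0.8).
    """
    bits_to_move = data_terabytes * 1e12 * 8          # terabytes -> bits
    effective_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / effective_bits_per_second / 3600


if __name__ == "__main__":
    # Copying a full 100 TB data set versus a 1 TB working set over a 1 Gbps link.
    print(f"full copy:   {transfer_hours(100, 1.0):.1f} hours")
    print(f"working set: {transfer_hours(1, 1.0):.1f} hours")
```

Under these assumptions, the full copy takes roughly 278 hours (more than 11 days), while the 1 TB working set takes under 3 hours, which is why moving the processing or moving only a subset is usually the more practical choice.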

The processing engine must be intelligent in order to move the processing to where the data resides, and minimize data movement across the network. Data lives everywhere, including edge, near-edge, and colocation. If there is a need to move data, it is prudent to move only the subset of data that is required to support the analysis (e.g., from on-premises to cloud, cloud to cloud, or edge to cloud). Filtering, reducing, and retrieving only the necessary data minimizes data movement regardless of where the data resides.
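The difference between copying everything and retrieving only the required subset can be sketched as follows. This is an illustration under stated assumptions: an in-memory SQLite database stands in for a remote data source, and the `readings` table, its columns, and the filter values are hypothetical.

```python
import sqlite3


def build_source() -> sqlite3.Connection:
    """Create a stand-in 'remote' source with 1,000 hypothetical sensor rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (site TEXT, sensor_id INTEGER, value REAL)")
    rows = [("edge-a" if i % 4 == 0 else "core", i, float(i % 50)) for i in range(1000)]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def copy_then_filter(conn: sqlite3.Connection):
    """Move every row to the consumer, then filter there."""
    moved = conn.execute("SELECT site, sensor_id, value FROM readings").fetchall()
    subset = [row for row in moved if row[0] == "edge-a" and row[2] > 40]
    return subset, len(moved)


def filter_at_source(conn: sqlite3.Connection):
    """Let the source apply the filter so only the needed rows cross the network."""
    moved = conn.execute(
        "SELECT site, sensor_id, value FROM readings WHERE site = ? AND value > ?",
        ("edge-a", 40),
    ).fetchall()
    return moved, len(moved)


if __name__ == "__main__":
    source = build_source()
    subset_a, rows_moved_a = copy_then_filter(source)
    subset_b, rows_moved_b = filter_at_source(source)
    print("same result:", sorted(subset_a) == sorted(subset_b))
    print("rows moved (copy then filter):", rows_moved_a)
    print("rows moved (filter at source):", rows_moved_b)
```

Both approaches return the same subset, but the second one transfers only the rows the analysis needs.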

On-premises data centers can be mission-critical to enterprises and are not expected to disappear anytime soon. However, workloads are increasingly distributed and hybrid in nature. Enterprises are demanding colocation, hyper-scale cloud, and new cloud computing and networking models, and they are mapping each workload to the optimal data center, which could be at the edge, the near edge, the core, or even a remote location. Factors that determine the optimal location for storing, processing, aggregating, and filtering data include, but are not limited to, the following:

Performance and latency requirements
Criticality of access to data (e.g., consider a remote data center operated by battery)
Acceptable level of down-time (e.g., network connectivity down)
Bandwidth limitations (e.g., on-premises to cloud, cloud-to-cloud)
Security, compliance, governance requirements (e.g., necessity to maintain sensitive data on-premises)
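As a rough illustration of how factors like these might be combined, the sketch below filters candidate locations against a workload's constraints. The `Workload` and `Site` classes, their fields, and all numbers are hypothetical, and a real placement decision would weigh many more considerations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    max_latency_ms: float            # performance and latency requirement
    must_stay_on_premises: bool      # security, compliance, governance
    tolerates_network_outage: bool   # acceptable level of downtime
    daily_data_gb: float             # data volume that must reach the site


@dataclass
class Site:
    name: str
    latency_ms: float
    on_premises: bool
    needs_wan_link: bool             # reachable only over a WAN connection
    uplink_gb_per_day: float         # bandwidth limit of that connection


def eligible_sites(workload: Workload, sites: List[Site]) -> List[str]:
    """Return names of sites that satisfy every constraint, lowest latency first."""
    matches = []
    for site in sites:
        if site.latency_ms > workload.max_latency_ms:
            continue
        if workload.must_stay_on_premises and not site.on_premises:
            continue
        if site.needs_wan_link and not workload.tolerates_network_outage:
            continue
        if site.needs_wan_link and workload.daily_data_gb > site.uplink_gb_per_day:
            continue
        matches.append(site)
    return [site.name for site in sorted(matches, key=lambda s: s.latency_ms)]


if __name__ == "__main__":
    sites = [
        Site("edge", 5, True, False, 0),
        Site("core", 20, True, False, 0),
        Site("cloud", 60, False, True, 500),
    ]
    strict = Workload(50, False, False, 100)
    relaxed = Workload(100, False, True, 100)
    print(eligible_sites(strict, sites))
    print(eligible_sites(relaxed, sites))
```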

Data gravity is therefore essential to meeting the performance and latency requirements of analytics initiatives.

Compute Gravity Increases Compute Potential

Some scenarios require compute gravity to complement data gravity, including compute-intensive use cases in science, healthcare, and transportation. Compute-intensive secondary analysis of a subset of data (the working set) can be performed by leveraging appropriate compute resources regardless of where the data resides, whether in cloud object storage or on-premises. For example, a subset targeted for secondary analysis, possibly drawn from an original data set of many petabytes held on-premises, can be processed in the cloud.

It's easy to build transient clusters in the cloud to support these types of workloads, especially when underlying on-premises infrastructure is reaching its maximum compute capacity. Alternatively, a targeted data subset of original data residing in cloud storage can be processed in an on-premises environment that meets the computing needs of secondary analysis. Optimizing network bandwidth between an enterprise data center and cloud is an option available from cloud providers, but may be contingent upon the enterprise budget.

Another solution for compute-intensive use cases incorporates a data access layer that provides data caching and embedded MPP in-memory fabric for data processing.

Data Virtualization and Data Gravity

Data virtualization supports data gravity by design. It brings agility, abstraction, and unified security to the modern analytics paradigm. Optimal performance is achieved by designing the data virtualization query optimizer specifically to minimize network traffic in logical architectures; minimally-adapted conventional optimizers are not sufficient. More importantly, the query optimizer should leverage in-memory parallel processing to facilitate further optimization.

To achieve the best performance, the data virtualization platform must:

Apply automatic optimizations to minimize network traffic, pushing down as much processing as possible to the data sources.
Use parallel in-memory computation to perform the post-processing operations at the data virtualization layer when this processing cannot be pushed down to the data sources.
Fulfill the requirements of scenarios that necessitate increased compute potential via parallel in-memory processing and data caching capabilities.
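The first two requirements can be sketched in miniature as follows. This is not how any particular data virtualization product is implemented: two in-memory SQLite databases stand in for an on-premises source and a cloud source, a thread pool stands in for a parallel in-memory engine, and the `sales` table and helper names are hypothetical. Each source computes a partial aggregate (the pushed-down part), and only those small partial results are merged at the virtualization layer.

```python
import sqlite3
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def make_source(rows):
    """Create a stand-in data source holding (region, amount) rows."""
    # check_same_thread=False lets a worker thread query this connection;
    # each connection is still used by only one thread at a time.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


def pushed_down_partial(conn):
    """Run the aggregation at the source; return one small row per region."""
    return conn.execute(
        "SELECT region, SUM(amount), COUNT(*) FROM sales GROUP BY region"
    ).fetchall()


def federated_average(sources):
    """Query all sources in parallel, then merge partial sums and counts."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        partials = list(pool.map(pushed_down_partial, sources))

    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for region, amount_sum, row_count in partial:
            totals[region][0] += amount_sum
            totals[region][1] += row_count
    return {region: s / c for region, (s, c) in totals.items()}


if __name__ == "__main__":
    on_premises = make_source([("emea", 10.0), ("emea", 30.0), ("apac", 20.0)])
    cloud = make_source([("emea", 20.0), ("apac", 40.0), ("amer", 5.0)])
    print(federated_average([on_premises, cloud]))
```

Averages cannot simply be averaged across sources, so the sketch pushes down sums and counts and combines them afterward. The same pattern of decomposing an operation into source-side and layer-side parts is what keeps network traffic proportional to the result rather than to the raw data.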

As part of an information infrastructure modernization strategy, enterprises should consider how best to leverage the cloud to complement their physical data centers. The goal is to meet the compute demands of various analytical workloads while respecting regulatory and compliance requirements. Data gravity plays an important role in accelerating analytic workloads, and it can be complemented by compute gravity in appropriate scenarios.

Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Informa.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.

Lakshmi Randall is Director of Product Marketing at Denodo, the leader in data virtualization software. Previously, she was a Research Director at Gartner covering Data Warehousing, Data Integration, Big Data, Information Management, and Analytics practices. To learn more, visit www.denodo.com or follow the company @denodo or the author on Twitter: @LakshmiLJ
