As data lakes grow, it becomes harder to analyze the massive amount of data inside them and glean insights from it. With global data volume projected to reach 175 zettabytes by 2025, that’s no small challenge. Data lakes can quickly become data swamps, where data grows harder to find and identify as volume scales upward.
For data center operators, this is unwieldy, time-consuming, and costly. Teams may not be able to find what they need -- and they might not even know where to look in the first place. For the end user, valuable insights may stay buried in the swamp -- insights that could significantly impact the task at hand, be it medical research, financial transactions, retail reporting or simply running ecommerce systems more efficiently.
Traditionally, teams created data warehouses using database management systems. In addition, since many databases were not well suited for unstructured data, a separate file system repository might also be used to associate related files, images, logs and other big data. Unfortunately, this burdened data center operators with managing two data repositories and keeping them in sync as data changed.
When building data lakes, teams too often prioritize the fit and capabilities of their analytics tools. Instead, they should look carefully at the storage repository that houses the data, to ensure it can:
- Process data from various sources
- Scale performance and capacity, and
- Make data accessible to the right users and applications.
File systems vs. object storage for data lakes
As noted above, traditional relational database management systems (RDBMS) imposed a strict, rigid structure on data and required data center operators to perform complex Extract/Transform/Load (ETL) steps to fit data into the database model. Today, the main appeal of a data lake is that developers can ingest data from any external source, in any format.
The addition of a file system posed two major disadvantages for data lakes:
- No support for extensible user or application metadata: This forces teams to add a separate database system to capture the tags and attributes needed to add taxonomy and enrich the data stored in the file system, and to enable index-optimized queries. Managing two systems is a major burden for data center operators.
- A fixed, rigid structure imposed by the usual folder hierarchies: There’s really only one way to access the data -- navigate through the file system hierarchy until you find what you need. That’s inefficient and, more importantly, static.
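The single-access-path limitation can be sketched in a few lines. This is a minimal illustration, not tied to any particular product: with only a folder hierarchy, finding data means walking the tree from the top, and the layout chosen at write time fixes how everything is found later.

```python
import os
import tempfile

# Build a small folder hierarchy, like a file-system-based data lake.
# The directory layout (year/region/type) is fixed at write time.
root = tempfile.mkdtemp()
path = os.path.join(root, "2024", "us-east", "sensors")
os.makedirs(path)
with open(os.path.join(path, "telemetry-001.log"), "w") as f:
    f.write("temp=71")

def find_by_suffix(root, suffix):
    # The only access path: navigate the whole hierarchy and
    # inspect every file name until a match turns up.
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                hits.append(os.path.join(dirpath, name))
    return hits

hits = find_by_suffix(root, ".log")
```

Any query the hierarchy wasn’t designed for -- say, “all telemetry for one device, across all regions” -- still requires a full walk of the tree.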
In contrast, object stores offer compelling advantages for data lakes, namely:
- Eliminating the need for a separate database through extensible metadata: Object stores hold both the data payload and extensible metadata (user- or application-defined) stored with each object. This eliminates the need for a database separate from the storage solution, as a file system requires. Metadata can be used dynamically over time to add context, semantics and taxonomy to data. Think of the difference between the old MS Outlook email system and Gmail, which offers tags and labels to add structure to email content.
- Increased performance: Some enterprise object storage systems also support integrated metadata search with index-optimized query capabilities -- reducing query times from hours to minutes, depending on the data set size, by replacing time-consuming data scans with fast index lookups.
- Single-system management for data center operators: Collapsing the data lake storage from a database plus a file system into one object store simplifies management. User and performance management, monitoring, and scaling are consolidated. The data lake can grow seamlessly as needed, with no downtime or disruption.
- Unlimited access paths to data: Object storage enables access to data in time order, by key prefix or by metadata ordering. Users can access single objects directly by key, list objects -- even with filters based on tags -- or search based on metadata.
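The access paths above can be sketched with a toy in-memory object store. This is a hedged illustration of the concepts only -- the keys, tags and function names are invented for the example, not any vendor’s API. Each object carries its own metadata, and the same data is reachable by key, by prefix listing, or by metadata search.

```python
from dataclasses import dataclass, field

@dataclass
class Obj:
    payload: bytes
    metadata: dict = field(default_factory=dict)

store = {}  # key -> Obj

def put(key, payload, **metadata):
    # Metadata travels with the object -- no separate database needed.
    store[key] = Obj(payload, metadata)

def list_by_prefix(prefix):
    # Access path 2: listing by key prefix, in key order.
    return sorted(k for k in store if k.startswith(prefix))

def search_by_metadata(**tags):
    # Access path 3: search on user-defined metadata.
    return sorted(k for k, o in store.items()
                  if all(o.metadata.get(t) == v for t, v in tags.items()))

put("sensors/2024/dev-17/001", b"temp=71", region="us-east", kind="telemetry")
put("sensors/2024/dev-17/002", b"temp=72", region="us-east", kind="telemetry")
put("logs/2024/app/001", b"boot ok", region="eu-west", kind="log")

obj = store["logs/2024/app/001"]          # access path 1: direct by key
keys = list_by_prefix("sensors/")
hits = search_by_metadata(region="us-east")
```

Unlike the folder-hierarchy case, the metadata search cuts across any key layout: the same objects answer queries the original naming scheme never anticipated.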
Use case: sensor and event stream data
One large IT systems provider replaced file system-based storage in a customer support system for a significant uptick in resiliency, access, and performance. One file system stored an ongoing stream of remote telemetry from thousands of systems deployed worldwide in customers’ data centers, which had created a flood of new sensor, log and event stream data and metadata. Separately, a big data analytics solution, used to inspect repository data for anomalous patterns that could indicate pending or possible faults, had its own file storage system. The organization sought faster queries for records and system information -- speed that could also improve proactive customer service.
The new data lake, built on object storage, eliminated the legacy two-system solution, converging both into a single, easy-to-manage object storage platform with integrated metadata and query capabilities. It accommodates four to five terabytes of new data ingested per day, up 52% from the previous system. Scaling is simplified and queries run up to 1,000 times faster.
Creating the optimal data lake
Object storage helps optimize data lakes over time because it organizes information into containers of flexible sizes -- a.k.a. objects. Each object includes the data itself as well as the associated metadata, and it has a globally unique identifier rather than a file name and path. Objects can be augmented with custom attributes to carry additional context, which makes finding the needed information much easier. There’s no limit on data volume -- important, considering that data lakes can quickly reach petabyte scale and beyond.
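Two of these ideas -- globally unique identifiers instead of paths, and the index-optimized metadata queries mentioned earlier -- can be shown together in a small sketch. Again, this is an illustrative toy under assumed names, not a real system: each `put` mints a UUID and updates a tag index, so a query is a single index lookup rather than a scan of every object.

```python
import uuid
from collections import defaultdict

objects = {}                  # globally unique id -> (payload, attributes)
tag_index = defaultdict(set)  # (attribute, value) -> set of object ids

def put(payload, **attributes):
    oid = str(uuid.uuid4())   # globally unique identifier, not a path
    objects[oid] = (payload, attributes)
    for attr, value in attributes.items():
        tag_index[(attr, value)].add(oid)   # maintain the index at write time
    return oid

def query(attr, value):
    # Index lookup instead of a time-consuming scan of every object.
    return tag_index.get((attr, value), set())

a = put(b"scan-001", modality="MRI", patient="p-42")
b = put(b"scan-002", modality="CT", patient="p-42")
```

The index is what replaces full data scans with fast lookups: `query("patient", "p-42")` touches only the matching ids, no matter how many objects the store holds.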
With object storage, data center operators can handle increasing capacity and scale as data continues to proliferate and be pulled in from various sources. They are no longer trying to wade through a thick soup of mud and gunk, metaphorically speaking -- and instead have a platform on which to structure an agile, modern data lake for optimal performance.
Giorgio Regni is chief technology officer and co-founder of Scality, provider of storage software that helps companies unify data management and protect data on-premises or in hybrid and multi-cloud environments. He is an expert in distributed infrastructure software at web scale with multiple US patents for distributed systems to his credit.