Sebastiao Correia is director of product development for data quality at Talend.
More than a decade ago, we entered an era of data deluge. One reason for this big data deluge is the steady decrease in the cost per gigabyte, which has made it possible to store more and more data for the same price. Another reason is the expansion of the Web, which has allowed everyone to create content and companies like Google, Yahoo, Facebook and others to collect increasing amounts of data. I'd like to explore some of the paradigm shifts caused by the data deluge and its impact on data quality.
The Birth of a Distributed Operating System
With the advent of the Hadoop Distributed File System (HDFS) and the resource manager called YARN, a distributed data platform was born. With HDFS, very large amounts of data can now be placed in a single virtual place and, with YARN, the processing of this data can be done by several engines such as SQL interactive engines, batch engines or real-time streaming engines.
The ability to store and process data in one location is an ideal framework to manage big data. Such a central store of raw data is commonly called a data lake. While data lakes are a tremendous step forward in helping companies leverage big data, they can introduce several quality issues, as outlined in an article by Barry Devlin. In summary, as the old adage goes, “garbage in, garbage out.” Being able to store petabytes of data does not guarantee that all of that information is useful or even usable.
Another similar concept to data lakes that the industry is discussing is the idea of a data reservoir. The premise is to perform quality checks and data cleansing prior to inserting the data into the distributed system. Therefore, rather than being raw, the data is ready-to-use.
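To make the data reservoir idea concrete, here is a minimal sketch of a quality gate that runs before data is loaded. The rule names, the `cleanse` step and the `load_into_reservoir` function are all hypothetical and chosen for illustration; a real reservoir would write the accepted records to the distributed store and route the rejected ones to a quarantine area.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]
Rule = Callable[[Record], bool]

# Hypothetical quality rules: each returns True when the record passes.
RULES: Dict[str, Rule] = {
    "id_present": lambda r: bool(r.get("id")),
    "email_has_at": lambda r: "@" in str(r.get("email", "")),
}


def cleanse(record: Record) -> Record:
    """Trim whitespace from string fields (a stand-in for real cleansing)."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def load_into_reservoir(
    records: Iterable[Record],
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Cleanse and check each record before it is accepted.

    Returns the accepted records and, for each rejected record,
    the names of the rules it failed.
    """
    accepted, rejected = [], []
    for raw in records:
        rec = cleanse(raw)
        failures = [name for name, rule in RULES.items() if not rule(rec)]
        if failures:
            rejected.append((rec, failures))
        else:
            accepted.append(rec)
    return accepted, rejected
```

In a data lake, by contrast, both records would be stored as they arrive and the same checks would have to run later, at read time.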
Accessibility is one data quality dimension that benefits from the data lake and data reservoir concepts. Hadoop makes data, even legacy data, accessible: everything can be stored in the data lake, and tapes or other dedicated storage systems, with which accessibility was a known issue, are no longer required.
But distributed systems also have an intrinsic limitation, described by the CAP theorem. The theorem states that a distributed system that must tolerate network partitions can't guarantee both data consistency and data availability at the same time. Therefore, with the Hadoop Distributed File System, a partitioned system that guarantees consistency, the availability dimension of data quality can’t be guaranteed. This means that the data may not be accessible until all copies of it on different nodes are synchronized (consistent). Clearly, this is a major stumbling block for organizations that need to scale and want to use insights derived from their data immediately.
Colocation of Data and Processing
Before Hadoop, organizations analyzed data stored in a database by sending it out of the database to another tool or database. With Hadoop, the data remains in Hadoop. The processing algorithm to be applied to the data can be sent to the Hadoop MapReduce framework, where it has direct access to the raw data. For data quality, this is a significant improvement, as you no longer need to extract data in order to profile it. You can work with the whole data set rather than with samples or selections. In-place profiling combined with big data systems opens new doors for data quality. It's even possible to imagine data cleansing processes that take place inside the big data framework rather than outside it.
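As an illustration of in-place profiling, the following sketch imitates the map and reduce steps in plain Python rather than using the actual Hadoop API. Each partition is profiled where it lives and only small summaries are merged. The function names and the choice of metrics (row count, empty values, distinct values) are assumptions made for the example.

```python
from collections import Counter
from functools import reduce


def map_profile(partition):
    """Map step: profile one partition locally and return a small summary."""
    stats = {"rows": 0, "nulls": Counter(), "distinct": {}}
    for row in partition:
        stats["rows"] += 1
        for column, value in row.items():
            if value is None or value == "":
                stats["nulls"][column] += 1
            else:
                stats["distinct"].setdefault(column, set()).add(value)
    return stats


def reduce_profiles(a, b):
    """Reduce step: merge two partial profiles into one."""
    merged = {
        "rows": a["rows"] + b["rows"],
        "nulls": a["nulls"] + b["nulls"],
        "distinct": {},
    }
    for part in (a, b):
        for column, values in part["distinct"].items():
            merged["distinct"].setdefault(column, set()).update(values)
    return merged


def profile(partitions):
    """Profile the whole data set, one partition at a time."""
    empty = {"rows": 0, "nulls": Counter(), "distinct": {}}
    return reduce(reduce_profiles, map(map_profile, partitions), empty)


# Two partitions standing in for blocks stored on different nodes.
partitions = [
    [{"name": "Ann", "city": "Paris"}, {"name": "", "city": "Paris"}],
    [{"name": "Bob", "city": None}],
]
result = profile(partitions)
```

Keeping full sets of distinct values would not scale to real volumes; it is used here only to keep the sketch short.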
With traditional databases, the schema of the tables is predefined and fixed. Ensuring constraints with this kind of “schema-on-write” approach surely helps to improve the data quality, as the system is safeguarded against data that doesn’t conform to the constraints. However, very often, constraints are relaxed for one reason or another and bad data can still enter the system. Big data systems such as HDFS have a different strategy. They use a “schema-on-read” approach. This means that there is no constraint on the data going into the system. The schema of the data is defined as the data is being read. It's like a “view” in a database. We may define several views on the very raw data, which makes the schema-on-read approach very flexible.
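A small sketch may help show what schema-on-read means in practice. The raw text below is stored without any constraint, and two hypothetical "views" (`customer_view` and `sales_view`) impose different schemas on it only when it is read. The sample data, the view names and the decision to map unparseable values to `None` are all assumptions for the example.

```python
import csv
import io
from datetime import date

# Raw data as it was written: no schema and no constraint was enforced.
RAW = "1,Ann,2015-03-01,42.5\n2,Bob,not-a-date,\n"


def read_with_schema(raw_text, schema):
    """Apply a list of (column name, parser) pairs to each raw line at read time."""
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, parse), value in zip(schema, fields):
            try:
                row[name] = parse(value)
            except ValueError:
                row[name] = None  # the value does not fit this view's schema
        yield row


# Two hypothetical views over the same raw data.
customer_view = [("id", int), ("name", str)]
sales_view = [
    ("id", int),
    ("name", str),
    ("date", date.fromisoformat),
    ("amount", float),
]

customers = list(read_with_schema(RAW, customer_view))
sales = list(read_with_schema(RAW, sales_view))
```

The second raw line reads cleanly through `customer_view`, but it yields missing values for the date and amount through `sales_view`. The bad data was accepted at write time and only surfaces when a stricter schema is applied.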
However, in terms of data quality, it's probably not viable to let any kind of data enter the system. Accepting a variety of data formats requires a processing algorithm that defines an appropriate schema-on-read to serve the data. The more complex the input data becomes over time, the more complex the algorithm that parses, extracts and fixes it becomes, to the point where it is impossible to maintain.
Pushing this reasoning to its limits, some of the transformations executed by the algorithm can be seen as data quality transformations. Data quality then becomes a cornerstone of any big data management process, while the data governance team may have to manage “data quality services” and not only focus on data.
On the other hand, the data that is read through the “views” would still need to obey most of the standard data quality dimensions. A data governance team would also define data quality rules on this data retrieved from the views. It raises the question of the data lake versus the data reservoir. Indeed, the schema-on-read brings huge flexibility to data management, but controlling the quality and accuracy of data can then become extremely complex and difficult. There is a clear need to find the right compromise.
We see here that data quality is pervasive at all stages in Hadoop systems and not only involves the raw data, but also the transformations done in Hadoop on this data. This shows the importance of well-defined data governance programs when working with big data frameworks.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.