Kirit Basu is Director of Product Management for StreamSets.
Before big data and fast data, the challenge of data movement was simple: move fields from fairly static databases to an appropriate home in a data warehouse, or move data between databases and apps in a standardized fashion. The process resembled a factory assembly line.
In contrast, the emerging world is many-to-many, with streaming, batch or micro-batched data coming from numerous sources and being consumed by multiple applications. Big data processing operations are more like a city traffic grid — a network of shared resources — than the linear path taken by traditional data. In addition, the sources and applications are controlled by separate parties, perhaps even third parties. So when the schema or semantics inevitably change — something known as data drift — it can wreak havoc with downstream analysis.
Because modern data is so dynamic, dealing with data in motion is not just a design-time problem for developers, but also a run-time problem requiring an operational perspective that must be managed day to day and evolve over time. In this new world, organizations must architect for change and continually monitor and tune the performance of their data movement system.
Today, data movement should be treated as a continuous, ever-changing operation with its performance actively managed. This two-part series offers 12 best practices as practical advice to help you manage the performance of data movement as a system and elicit maximum value from your data. Part I covers the first six.
1. Limit Hand Coding as Much as Possible
It has been commonplace to write custom code to ingest data from sources into your data store. This practice is dangerous given the dynamic nature of big data. Custom code makes dataflows brittle: minor changes to the data schema can cause the pipeline to drop data or fail altogether. Also, since instrumentation must be explicitly designed in and often isn’t, dataflows can become black boxes offering no visibility into pipeline health. Lastly, low-level coding leads to tighter coupling between components, making it difficult to upgrade your infrastructure and stifling organizational agility.
Today, modern data ingest systems create code-free plug-and-play connectivity between data source types, intermediate processing systems (such as Kafka and other message queues) and your data store. The benefits you get from such a system are flexibility instead of brittleness, visibility instead of opacity, and the ability to upgrade data processing components independently. If you’re worried about customization or extensibility, these tools usually augment their built-in connectors with support for powerful expression languages or the ability to plug in custom code.
2. Minimize Schema Specification; Be Intent-Driven
While it is a standard requirement in the traditional data world, full schema specification of big data leads to wasted engineering time and resources. Consuming applications often make use of only a few key fields for analysis, and big data sources often have poorly controlled schemas that change over time, forcing ongoing maintenance.
Rather than relying on full schema specification, dataflow systems should be intent-driven, whereby you specify conditions for, and transformations on, only those fields that matter to downstream analysis. This minimalist approach reduces the work and time required to develop and implement pipelines. It also makes dataflows more reliable, as there is less to go wrong.
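As a rough illustration of the intent-driven idea, the Python sketch below applies conditions and transformations only to the fields a downstream consumer cares about and passes every other field through untouched. The field names (`user_id`, `amount`), the `RULES` table and the `apply_intent` function are hypothetical, chosen only for this example; they do not correspond to any particular product.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Record = Dict[str, Any]
# A rule pairs a condition the value must satisfy with a transformation.
Rule = Tuple[Callable[[Any], bool], Callable[[Any], Any]]

# Hypothetical rules: only the fields that matter downstream are listed.
RULES: Dict[str, Rule] = {
    "user_id": (lambda v: v is not None, str),
    "amount": (lambda v: v is not None, float),
}


def apply_intent(record: Record) -> Optional[Record]:
    """Validate and transform only the fields named in RULES.

    Fields not named in RULES are passed through unchanged, so a new
    column in the source does not break the flow. Returns None when a
    field named in RULES is missing or fails its condition.
    """
    out = dict(record)  # copy so the caller's record is not mutated
    for field, (condition, transform) in RULES.items():
        if field not in record or not condition(record[field]):
            return None
        out[field] = transform(record[field])
    return out
```

Because the rules never enumerate the full schema, a source that adds or reorders unrelated fields requires no change to this code.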
3. Plan for Both Streams and Batch
Despite all of the hubbub about streaming analytics, enterprise data is still a batch-driven world based on applications and source databases developed over the past 30 years. So while you are planning for cybersecurity, IoT and other new-age applications that capitalize on streams, you must account for the fact that this data often must be joined with or analyzed against batch sources such as master or transactional data. Rather than setting up a streaming-only framework, practical needs demand that you incorporate streaming into the legacy batch-driven fabric, while maintaining or improving performance and reliability of the overall data operation.
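One common way streaming and batch data meet is enrichment: each streaming event is looked up against reference data that was loaded in batch. The sketch below shows that pattern in plain Python under simplifying assumptions: the batch master data fits in an in-memory dictionary, and the names `enrich_stream`, `customer_id` and `customer_name` are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]


def enrich_stream(
    events: Iterable[Record], master: Dict[str, Record]
) -> Iterator[Record]:
    """Join each streaming event with batch-loaded master data.

    `master` maps a customer id to its master record. Events with no
    match are still emitted, with customer_name set to None, so the
    stream is never blocked by gaps in the batch source.
    """
    for event in events:
        customer = master.get(event.get("customer_id"))
        enriched = dict(event)  # copy so the input event is not mutated
        enriched["customer_name"] = customer["name"] if customer else None
        yield enriched
```

Because `enrich_stream` is a generator, it processes events one at a time as they arrive, while the master dictionary can be refreshed on whatever batch schedule the source system supports.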
4. Sanitize Raw Data Upon Ingest
The original mantra for early Hadoop users was that you should store only immutable raw data in your store. As technology meets the real world, we have learned that there are some serious downsides to not sanitizing your data upon ingest. Raw data, like untreated water, can make you sick. Storing unsanitized raw data is what spawned the “data swamp” metaphor from Gartner and others. Having data scientists clean the data for each consumption activity is a common way to remove this risk, but it is clearly an inefficient use of resources. Plus, storing raw inputs invariably leads you to have personal data and otherwise sensitive information in your data lake, which increases your security and compliance risk.
With modern dataflow systems you can and should sanitize your data upon ingest. Basic sanitization includes simple “row in, row out” transformations that enforce corporate data policies and normalize or standardize data formats. More advanced sanitization includes rolling average and other time-series computations, the results of which can be leveraged broadly by data scientists and business analysts.
Sanitizing data as close to the data source as possible makes data scientists much more productive, allowing them to focus on use case-specific “data wrangling” rather than reinventing generic transformations that should be centralized and automated.
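To make the two levels of sanitization concrete, here is a minimal Python sketch. `sanitize_row` is a "row in, row out" transformation that hashes an email field and normalizes an epoch timestamp to ISO 8601 in UTC; `rolling_average` is a simple time-series computation. The field names (`email`, `ts`) and the policy of hashing emails are assumptions made for illustration, and hashing alone is pseudonymization, not a complete compliance measure.

```python
import hashlib
from collections import deque
from datetime import datetime, timezone
from typing import Any, Deque, Dict, Iterable, Iterator

Record = Dict[str, Any]


def sanitize_row(row: Record) -> Record:
    """Row in, row out: apply a data policy and normalize formats."""
    out = dict(row)  # copy so the input row is not mutated
    # Assumed policy: never store raw email addresses in the data store.
    if out.get("email"):
        canonical = out["email"].strip().lower()
        out["email"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Assumed input format: "ts" holds epoch seconds; emit ISO 8601 UTC.
    if "ts" in out:
        moment = datetime.fromtimestamp(float(out["ts"]), tz=timezone.utc)
        out["ts"] = moment.isoformat()
    return out


def rolling_average(values: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the most recent `window` values seen so far."""
    recent: Deque[float] = deque(maxlen=window)
    for value in values:
        recent.append(value)  # the oldest value drops out automatically
        yield sum(recent) / len(recent)
```

Centralizing transformations like these at ingest means every consumer sees the same masked and normalized fields, rather than each team reimplementing them.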
5. Address Data Drift to Ensure Consumption-Ready Data
An insidious challenge of big data management is dealing with data drift: the unpredictable, unavoidable and continuous mutation of data characteristics caused by the operations, maintenance and modernization of source systems. It shows up in three forms: structural drift (changes to schema), semantic drift (changes to meaning) and infrastructure drift (changes to data processing software, including virtualization, data center and cloud migration).
Data drift erodes data fidelity, data operations reliability, and ultimately the productivity of your data scientists and engineers. It increases your costs, delays your time to analysis and leads to poor decision making based on polluted or incomplete data.
If your end goal is to democratize data access by having as much data as possible available to as many users as possible, say through Hive or Impala queries, then you should look for data movement tools and systems that can detect and react to changes in schema and keep the Hive metastore in sync, or at the very least alert you to changes.
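Detecting structural drift can be as simple as comparing each incoming record against the last known schema and reporting the differences. The Python sketch below does only that: it flags added, removed and retyped fields so a caller can raise an alert. It does not talk to the Hive metastore or any other catalog, and the function names `infer_schema` and `detect_drift` are hypothetical.

```python
from typing import Any, Dict, List

Schema = Dict[str, str]  # field name -> Python type name


def infer_schema(record: Dict[str, Any]) -> Schema:
    """Describe a record as a mapping of field name to type name."""
    return {name: type(value).__name__ for name, value in record.items()}


def detect_drift(known: Schema, record: Dict[str, Any]) -> Dict[str, List[str]]:
    """Compare a record against the known schema.

    Returns three sorted lists: fields that appeared, fields that
    disappeared, and fields whose type changed. All three are empty
    when the record matches the known schema.
    """
    observed = infer_schema(record)
    shared = set(known) & set(observed)
    return {
        "added": sorted(set(observed) - set(known)),
        "removed": sorted(set(known) - set(observed)),
        "retyped": sorted(f for f in shared if known[f] != observed[f]),
    }
```

A caller could check `any(detect_drift(known, record).values())` on each record and, at the very least, send an alert when it is true. Semantic drift, where a field keeps its name and type but changes meaning, is not caught by a structural comparison like this one.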
6. Don’t Count on Managed File Transfer
New data sets are often unbounded and continuous, such as ever-changing logs, clickstreams and IoT sensor output. Use of managed file transfers or other rudimentary mechanisms for these dynamic source types creates a fragile architecture that will require constant maintenance to remain viable.
Files, because their contents vary in size, structure and format, are challenging to introspect on the fly. This means you lose visibility into changes that should be communicated to consuming systems and applications.
If you’re intent on relying on a file transfer mechanism, consider pre-processing the files to standardize the data format to simplify inspection and profiling, or adopt an ingestion tool or framework that does this for you.
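As one example of such pre-processing, the sketch below converts a delimited text file into JSON Lines with normalized field names, so that every downstream inspection step sees one self-describing record per line. It uses only the Python standard library; the function name `csv_to_jsonl` and the header normalization rule (lowercase, spaces to underscores) are assumptions for illustration.

```python
import csv
import io
import json
from typing import TextIO


def csv_to_jsonl(src: TextIO, dst: TextIO, delimiter: str = ",") -> int:
    """Rewrite delimited text as JSON Lines; return the row count.

    Header names are trimmed, lowercased and given underscores in place
    of spaces. Values are left as strings; typing them is a separate step.
    """
    reader = csv.DictReader(src, delimiter=delimiter)
    count = 0
    for row in reader:
        normalized = {
            key.strip().lower().replace(" ", "_"): value
            for key, value in row.items()
            if key is not None  # skip overflow cells beyond the header
        }
        dst.write(json.dumps(normalized, sort_keys=True) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # In-memory demo; real use would pass open file handles instead.
    source = io.StringIO("User ID,Amount\n1,9.5\n2,3\n")
    target = io.StringIO()
    csv_to_jsonl(source, target)
    print(target.getvalue(), end="")
```

Once files arrive in a single line-oriented format, profiling them for changes becomes a matter of reading records rather than guessing at each file's layout.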
Today’s data is more dynamic than ever before, and as a result must be managed in an entirely different way than it has been in the past. Deploying these best practices for managing today’s continuously streaming data, along with the best practices in Part II of this series, will enable you to get the maximum benefit from your big data investment.
Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Informa.