Kirit Basu is Director of Product Management for StreamSets.
In Part I of “Best Practices for Managing Enterprise Data Streams,” I discussed the most prominent issues in modern data flow processes and presented six practices to better manage and maintain open source stream processes. In Part II, we continue that conversation and provide six more practices for modernizing data flow processes on an enterprise level.
7. Instrument Everything in Your Dataflows
You can never have enough visibility in a complex dataflow system. End-to-end instrumentation of your data movement gives you a window into performance as you contend with the challenge of evolving sources and systems. This instrumentation is not just needed for time-series analysis of a single dataflow to tease out changes over time. It can — more importantly — help you correlate data across flows to identify interesting events in real time.
Organizations should endeavor to capture details of every aspect of the overall dataflow architecture while minimizing overhead or tight coupling between systems. A well-instrumented approach will asynchronously communicate the measured values to external management systems and allow you to drill down from coarse metrics used for monitoring to the fine-grained measurements ideal for diagnosis, root-cause analysis and remediation of issues.
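As an illustration of this pattern, the sketch below (an assumption, not any particular vendor's implementation) wraps each pipeline stage so it is timed and counted, while a background thread ships measurements off to an external monitoring system without blocking the dataflow itself. The `StageMetrics` class and the in-memory `exported` list are hypothetical stand-ins for a real metrics backend such as StatsD or Prometheus.

```python
import queue
import threading
import time
from collections import defaultdict

class StageMetrics:
    """Collects per-stage counters and latencies, and ships them
    asynchronously so instrumentation never blocks the dataflow."""

    def __init__(self, flush_interval=0.1):
        self.counts = defaultdict(int)
        self.latencies = defaultdict(list)
        self._outbox = queue.Queue()       # decouples emission from export
        self._stop = threading.Event()
        self._flush_interval = flush_interval
        self.exported = []                 # stand-in for an external monitoring system
        self._worker = threading.Thread(target=self._export, daemon=True)
        self._worker.start()

    def record(self, stage, elapsed):
        self.counts[stage] += 1
        self.latencies[stage].append(elapsed)
        self._outbox.put((stage, elapsed))

    def _export(self):
        # Drain measurements in the background; in practice this would
        # push to StatsD, Prometheus, or a similar management system.
        while not self._stop.is_set() or not self._outbox.empty():
            try:
                self.exported.append(self._outbox.get(timeout=self._flush_interval))
            except queue.Empty:
                pass

    def close(self):
        self._stop.set()
        self._worker.join()

def instrumented(metrics, stage, fn):
    """Wrap a pipeline stage so every call is timed and counted."""
    def wrapper(record):
        start = time.perf_counter()
        result = fn(record)
        metrics.record(stage, time.perf_counter() - start)
        return result
    return wrapper

metrics = StageMetrics()
parse = instrumented(metrics, "parse", lambda r: r.strip())
enrich = instrumented(metrics, "enrich", lambda r: r.upper())

for raw in [" a ", " b ", " c "]:
    enrich(parse(raw))

metrics.close()
print(metrics.counts["parse"], metrics.counts["enrich"], len(metrics.exported))
```

Because the coarse counters and the fine-grained per-record latencies are kept side by side, an operator can monitor at the summary level and drill down into individual measurements when diagnosing an issue.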
8. Don’t Just Count Packages; Inspect Contents
Would you feel secure if airport security merely counted passengers and bags rather than actually scanning luggage for unusual contents? Of course not, yet the traditional metrics for data ingestion are throughput and latency. The reality of data drift means that you’re much better off if you profile and understand the values of the data itself as it flows through your infrastructure. Otherwise, you leave yourself at risk of unannounced changes in data format or meaning. A major change in data values might indicate a true change in the real world that is interesting to the business, or it might indicate undetected data drift that is polluting your downstream analysis.
An additional benefit of data introspection is that it allows you to identify personal or otherwise sensitive data transiting your infrastructure. Many industries and geographies have strict requirements around storage of personal data, such as the “right to be forgotten” requirements of the EU’s GDPR, which took effect in 2018. Continually monitoring incoming data for patterns helps companies comply by providing real-time detection and tracking of any personal data they are collecting and storing.
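A minimal sketch of what such content inspection might look like: profile each batch of records against an expected set of fields, flagging both drifted fields and values that match a personal-data pattern. The field names, the baseline schema, and the email-only PII check are illustrative assumptions; a production profiler would cover many more patterns and data types.

```python
import re

# A simple email pattern stands in for a fuller library of PII detectors.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def profile_batch(records, baseline_fields):
    """Inspect record contents rather than just counting them:
    report fields that drifted from the expected schema, and
    values that look like personal data."""
    drifted = set()
    pii_hits = []
    for rec in records:
        drifted |= set(rec) - baseline_fields   # fields the baseline doesn't know about
        for field, value in rec.items():
            if isinstance(value, str) and EMAIL.search(value):
                pii_hits.append((field, value))
    return drifted, pii_hits

baseline = {"id", "amount"}
batch = [
    {"id": 1, "amount": 9.99},
    {"id": 2, "amount": 5.00, "contact": "jane@example.com"},  # drift + PII
]
drifted, pii = profile_batch(batch, baseline)
print(sorted(drifted))   # the unexpected 'contact' field
print(pii)               # the value that looks like personal data
```

Run continuously, a check like this turns an unannounced upstream schema change from a silent pollutant of downstream analysis into an alert the moment it appears.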
9. Implement a DevOps Approach to Data Movement
The DevOps sensibility of an agile workflow with tight linkages between those who design a system and those who run it is well-suited to big data movement operations. Data pipelines will need to be adjusted frequently in a world where there is a continual evolution of data sources, consumption use cases and data-processing systems.
Traditional data integration systems date back to when the waterfall development methodology was king, and tools from that era tend to focus almost exclusively on the design-time problem. This is also true of the early big data ingest developer frameworks such as Apache Sqoop and Apache Flume. Fortunately, modern dataflow tools now provide an integrated development environment (IDE) for continual use through the evolving dataflow life cycle.
10. Decouple Data Movement From Your Infrastructure
Unlike monolithic solutions built for traditional data architectures, big data infrastructure requires coordination across best-of-breed — and often open source — components for specialized functions such as ingest, message queues, storage, search, analytics and machine learning. These components evolve at their own pace and must be upgraded based on business needs. Thus, the large and expensive lockstep upgrades you’re used to in the traditional world are being supplanted by an ongoing series of one-by-one changes to componentry.
To keep your data operation up to date in this brave new world, you should use a data movement system that acts as a middleware layer and keeps each system in the data movement chain loosely coupled from its neighbors. This enables you to modernize a la carte without having to re-implement foundational pieces of infrastructure.
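One way to sketch this loose coupling (an assumed design, not a specific product's API) is to have the pipeline depend only on abstract source and sink contracts, so either end of the chain can be swapped for a new component without re-implementing the movement logic in between. The `ListSource`/`ListSink` classes here are toy stand-ins for, say, a Kafka consumer or an object-store writer.

```python
from abc import ABC, abstractmethod

class Source(ABC):
    """Contract for anything the pipeline reads from."""
    @abstractmethod
    def read(self):
        ...

class Sink(ABC):
    """Contract for anything the pipeline writes to."""
    @abstractmethod
    def write(self, record):
        ...

class ListSource(Source):
    # Toy source; a real one might wrap Kafka, a database, or files.
    def __init__(self, items):
        self.items = items
    def read(self):
        yield from self.items

class ListSink(Sink):
    # Toy sink; a real one might wrap a data lake or search index.
    def __init__(self):
        self.out = []
    def write(self, record):
        self.out.append(record)

def run_pipeline(source: Source, sink: Sink, transform=lambda r: r):
    """The movement layer depends only on the Source/Sink contracts,
    so each neighbor can be upgraded independently."""
    for record in source.read():
        sink.write(transform(record))

src = ListSource([1, 2, 3])
dst = ListSink()
run_pipeline(src, dst, transform=lambda r: r * 10)
print(dst.out)
```

Upgrading the storage layer then means writing one new `Sink` implementation, not reworking every pipeline that feeds it, which is exactly the a la carte modernization described above.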
11. Engineer for Complex Deployment Patterns
Not only have dataflows become complex, but they now span a range of deployment alternatives. Industry surveys confirm that enterprises expect to deploy data across multiple clouds while still retaining on-premises data operations. And edge operations are morphing from simple collection to processing of varying complexity, depending on device constraints, urgency and the robustness of connectivity. Since each deployment option has its own advantages, it is a mistake to expect a single approach to work now and forever. Realistically, business requirements will dictate an enterprise architecture that combines many of them.
Regardless of where you are in your journey, it is best to assume a world where you have data stored in many different environments and build an architecture based on complete “workload portability” where you can move data to the point of analysis based on the best price and performance characteristics for the job, and do so with minimal friction. Also, you should assume that the constellation that describes your multi-cloud will change over time as cloud offerings and your business needs evolve.
12. Create a Center of Excellence for Data in Motion
The movement of data is evolving from a stovepipe model to one that resembles a traffic grid. You can no longer get by with a fire-and-forget approach to building data ingestion pipelines. In such a world you must formalize the management (people, processes and systems) of the overall operation to ensure it functions reliably and meets internal SLAs on a continual basis. This means adding tools that provide real-time visibility into the state of traffic flows, with the ability to receive warnings and act on issues that may violate contracts around data delivery, completeness and integrity.
Otherwise, you are trying to navigate a busy city traffic grid with ever-changing conditions using a paper map, with the risk that the data feeding your critical business processes and applications arrives late, incomplete or not at all.
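The SLA enforcement described above can be sketched as a small rule check: each dataflow carries a delivery contract, and every batch is evaluated against it so violations surface as warnings rather than as late or missing data discovered downstream. The `DataSLA` fields and thresholds here are hypothetical examples, not a standard.

```python
from dataclasses import dataclass

@dataclass
class DataSLA:
    """A delivery contract: at least `min_records` must arrive,
    no later than `max_latency_s` after the batch was due."""
    name: str
    min_records: int
    max_latency_s: float

def check_sla(sla, records_delivered, latency_s):
    """Return human-readable violations for one batch (empty list = healthy)."""
    violations = []
    if records_delivered < sla.min_records:
        violations.append(
            f"{sla.name}: only {records_delivered}/{sla.min_records} records delivered"
        )
    if latency_s > sla.max_latency_s:
        violations.append(
            f"{sla.name}: latency {latency_s:.0f}s exceeds {sla.max_latency_s:.0f}s"
        )
    return violations

sla = DataSLA("orders-feed", min_records=1000, max_latency_s=300)
alerts = check_sla(sla, records_delivered=950, latency_s=420)
for alert in alerts:
    print(alert)   # both the completeness and the timeliness rules fire
```

In a real deployment these warnings would feed the monitoring and alerting tools mentioned above, so the team can act before a delivery contract is breached rather than after.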
Use a Data Operations Platform to Build a High-Performance Data Ingestion Practice
To implement many of the best practices discussed above, enterprises should consider adopting a data operations platform. Such a platform helps them master the life cycle of their data movement, including efficient development, operational visibility and tight control over performance.
Key features of a data operations platform include:
- Smart pipelines to conquer data drift – inspect data while it is in motion, detecting and resolving unexpected changes on the fly.
- A living data map to conquer data sprawl – displays all data movement on a single canvas and auto-updates, bringing continuous integration and continuous deployment (CI/CD) methods to dataflows.
- Data SLAs to conquer data urgency – set and enforce rules around dataflow performance to ensure that business rules for quality and timeliness are met.
Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Informa.