Nitin Donde is CEO of Talena.
The rapid growth of data has enabled exciting new opportunities and presented big challenges for businesses of all types. The opportunity has been to use this vast swath of human- and machine-generated data to better personalize e-commerce sites, identify fraud patterns more closely, and even sequence genomes more efficiently. And the NoSQL movement has been instrumental in delivering data platforms like Cassandra, MongoDB and Couchbase that enable rapid, distributed processing for these web-scale applications.
Data Management Challenges in a Data-Rich World
However, the downstream challenge of this new application class is that traditional data management techniques used in the world of relational databases no longer work. Data management includes the concepts of backup and recovery, archiving, and test/dev management in which engineering groups can test new application functionality with subsets of production data. So why do traditional processes fall short?
Data management capabilities now have to handle hundreds of terabytes, if not petabytes, of data in a scale-out manner on commodity hardware and storage. Traditional data management products are built on a scale-up architecture that cannot handle petabyte-scale applications, and their cost model does not fit deployments built on open source technologies.
The advent of DevOps and other agile methodologies has led to the need for rapid application iteration, which implies that data management products need to help these teams refresh data sets to enable iterations. When was the last time you heard the words “DevOps” and “Veritas” in the same sentence? I thought so.
Even within the world of NoSQL there are a variety of data formats, making it difficult for data management products to handle the storage optimization needs of each data platform. If your company uses both Cassandra and HBase, then your data management architecture needs to handle each of these unique application formats when it comes to backup, recovery, archiving and other key processes.
NoSQL and Data Availability
How should companies think about ensuring universal data availability for their NoSQL and other Big Data applications?
At the volumes that Big Data applications handle, backups have to be incremental-forever after the first full backup. Otherwise you’ll constantly be scrambling for more storage, typically on your production cluster. This is often the origin of operational fire-drills and a big drain on resources. In addition, trying to do a full weekly backup on, for example, a 1-petabyte data set will never meet any corporate service-level agreement.
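To make the arithmetic behind that claim concrete, here is a rough sizing sketch. The sustained throughput (10 Gbit/s) and the daily change rate (1 percent) are illustrative assumptions, not figures from any particular deployment, and the function names are hypothetical.

```python
# Back-of-the-envelope backup sizing. The throughput and change rate used
# below are illustrative assumptions, not measurements.

PETABYTE = 10**15  # bytes (decimal petabyte)
HOURS_PER_WEEK = 7 * 24


def transfer_hours(num_bytes: float, throughput_bytes_per_sec: float) -> float:
    """Hours needed to move num_bytes at a sustained throughput."""
    return num_bytes / throughput_bytes_per_sec / 3600


def weekly_bytes_full(dataset_bytes: float) -> float:
    """A weekly full backup copies the whole data set once per week."""
    return dataset_bytes


def weekly_bytes_incremental(dataset_bytes: float, daily_change_rate: float) -> float:
    """Incremental-forever copies only the changed data, once per day."""
    return dataset_bytes * daily_change_rate * 7


if __name__ == "__main__":
    throughput = 1.25e9  # assumed: 10 Gbit/s sustained = 1.25 GB/s
    change_rate = 0.01   # assumed: 1% of the data changes per day

    full = transfer_hours(weekly_bytes_full(PETABYTE), throughput)
    incr = transfer_hours(weekly_bytes_incremental(PETABYTE, change_rate), throughput)
    print(f"weekly full:        {full:.1f} hours (a week has {HOURS_PER_WEEK})")
    print(f"weekly incremental: {incr:.1f} hours")
```

Under these assumptions, a single full copy of 1 petabyte takes roughly 222 hours, which is longer than the 168 hours in a week, while a week of incrementals moves about 70 terabytes in under 16 hours. Different throughput or change rates shift the numbers, but the gap is the point.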
With NoSQL implementations often running into the hundreds if not thousands of nodes, you have to think about a data management architecture that is agentless. The overhead of managing agents across production nodes that are constantly being commissioned or decommissioned is simply overwhelming.
Your data science or DevOps teams will need access to production data to support ongoing analytics efforts or to iterate on new application functionality. However, production data may contain confidential information, the leakage of which can compromise your brand and reputation. It’s important to think through a data masking architecture that is irreversible (one-way) and consistent, so that the same input always maps to the same masked value and your analytics yield the same results on the masked data.
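One common way to get masking that is both one-way and consistent is a keyed hash. The sketch below uses HMAC-SHA256 from Python's standard library; this is an illustrative technique chosen for the example, not a description of any specific product, and the helper names, field names and key are hypothetical.

```python
import hashlib
import hmac


def mask_value(value: str, secret_key: bytes) -> str:
    """Return a deterministic, keyed, one-way token for a sensitive value."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def mask_record(record: dict, sensitive_fields: set, secret_key: bytes) -> dict:
    """Copy a record, replacing only the sensitive fields with masked tokens."""
    return {
        name: mask_value(str(value), secret_key) if name in sensitive_fields else value
        for name, value in record.items()
    }


if __name__ == "__main__":
    key = b"example-key-keep-the-real-one-out-of-test-environments"  # placeholder
    row = {"customer_email": "alice@example.com", "order_total": 42.50}
    masked = mask_record(row, {"customer_email"}, key)
    # The same email always yields the same token, so joins and group-bys
    # on the masked column still line up across tables.
    print(masked)
```

Because the token depends on a secret key, the mapping stays consistent across data sets masked with that key, yet the original value cannot be read back from the token. The key itself has to be protected, since anyone holding it could test guesses against low-variety fields.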
Your backup architecture needs to be aware of the different data abstractions across the universe of NoSQL applications. For example, workflows associated with Cassandra have to be defined in terms of keyspaces and tables. This applies to both the actual data and the metadata layer.
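As a minimal sketch of what "aware of keyspaces and tables" could look like in practice, the following models a backup scope in Cassandra's own terms. The class, its fields and the catalog dictionary are hypothetical constructs for illustration; they do not correspond to any real tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CassandraBackupScope:
    """Hypothetical description of what one backup workflow covers."""

    keyspace: str
    tables: tuple = ()           # empty tuple means every table in the keyspace
    include_schema: bool = True  # capture metadata (schema) alongside the data

    def covers(self, keyspace: str, table: str) -> bool:
        """True if the given keyspace.table falls inside this scope."""
        if keyspace != self.keyspace:
            return False
        return not self.tables or table in self.tables


def plan_backup(scopes, catalog):
    """Select the (keyspace, table) pairs that at least one scope covers.

    catalog maps keyspace name -> list of table names. Here it is a plain
    dict standing in for whatever the cluster's metadata layer reports.
    """
    return [
        (keyspace, table)
        for keyspace, tables in catalog.items()
        for table in tables
        if any(scope.covers(keyspace, table) for scope in scopes)
    ]


if __name__ == "__main__":
    catalog = {"shop": ["orders", "customers"], "metrics": ["events"]}
    scopes = [
        CassandraBackupScope("shop", tables=("orders",)),
        CassandraBackupScope("metrics"),  # whole keyspace
    ]
    print(plan_backup(scopes, catalog))
```

A platform with different abstractions, such as HBase, would need its own scope type, which is the reason a single generic file-level policy tends to fall short.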
These are just a few of the ways the advent of NoSQL and other Big Data platforms has altered the thinking around traditional data management. Paying attention to these architectural considerations will ensure that the data that powers your new applications will always be available to the consumers of those applications.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.