Insight and analysis on the data center space from industry thought leaders.

Making a Hash of Database Deduplication

IT managers in big data environments know they need deduplication. However, few are aware that inline/hash-based deduplication technologies are not able to efficiently deduplicate multi-stream database data, multiplexed data or data in progressive incremental backup environments, writes Jeff Tofano of SEPATON.

Industry Perspectives

April 24, 2012


Jeff Tofano, the chief technology officer at SEPATON, Inc., has more than 30 years of experience in the data protection, storage and high-availability industries. He leads SEPATON’s technical direction, addressing the data protection needs of large enterprises.

IT managers in big data environments are aware of the important role that deduplication plays in their ongoing struggle to back up, replicate, retain and restore massive, fast-growing data volumes. However, few are aware that inline/hash-based deduplication technologies are not able to efficiently deduplicate multi-stream database data, multiplexed data or data in progressive incremental backup environments. Understanding the limitations of inline/hash-based deduplication and the impact of these limitations on data protection can save "big data" organizations hundreds of thousands of dollars annually in disk capacity, power and cooling.

Hash vs Content Aware Deduplication

While several different deduplication technologies are suitable for smaller data volumes, IT managers in big data environments have two broad categories of deduplication to choose from: inline/hash-based deduplication and content aware byte differential deduplication.

Inline/hash-based technologies are designed to find matches in data before it is written to disk. They analyze segments of data as they are being backed up and assign a unique identifier called a fingerprint to each segment. Most of these technologies use an algorithm that computes a cryptographic hash value from a fixed or variable segment of data in the backup stream, regardless of the data type. The fingerprints are stored in an index. As each backup is performed, the fingerprints of incoming data are compared to those already in the index. If the fingerprint exists in the index, the incoming data is replaced with a pointer to the data already stored. If the fingerprint does not exist, the data is written to the disk as a new unique chunk. The fingerprint assignment, index lookup, and pointer replacement steps must all be performed before the data is written to disk. To contain the size of the index, inline/hash-based deduplication technologies are purposely designed for small-to-medium sized enterprises whose data volumes and change rates are small enough to be deduplicated without causing a bottleneck in backup “ingest” performance. And even in these solutions designed for smaller data sets, most hash-based technologies rely on large duplicate chunk sizes and ignore small duplicates to achieve reasonable performance.
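The fingerprint, index lookup and pointer steps described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the class name HashDedupStore, the fixed 8 KB chunk size and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


class HashDedupStore:
    """Toy inline dedup store (hypothetical): fixed-size chunks, SHA-256 fingerprints."""

    def __init__(self, chunk_size=8192):
        self.chunk_size = chunk_size
        self.index = {}   # fingerprint -> position of the chunk in self.chunks
        self.chunks = []  # unique chunks, standing in for data written to disk

    def ingest(self, stream: bytes):
        """Deduplicate a backup stream and return its list of chunk pointers."""
        recipe = []
        for off in range(0, len(stream), self.chunk_size):
            chunk = stream[off:off + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).digest()
            pos = self.index.get(fingerprint)
            if pos is None:
                # Unknown fingerprint: store the chunk as new unique data.
                pos = len(self.chunks)
                self.chunks.append(chunk)
                self.index[fingerprint] = pos
            # Known or not, the stream only keeps a pointer to the chunk.
            recipe.append(pos)
        return recipe

    def restore(self, recipe):
        """Reassemble a stream by following its pointers."""
        return b"".join(self.chunks[pos] for pos in recipe)
```

Every chunk passes through the index before anything is stored, which is why the index lookup sits directly on the ingest path in this style of design.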

Content aware technologies (including Sepaton's ContentAware byte differential deduplication) work in a fundamentally different way. These schemes extract metadata from the incoming data stream and use it to identify duplicate data. They then analyze this small subset of data that contains duplicates at the byte level for optimal capacity reduction. The deduplication process is performed outside the ingest operation (concurrent processing) so that it does not slow backup or restore processes. Because there is no index, and because the analysis of suspect duplicates can be done in parallel, these technologies are able to scale processing across many nodes and to scale capacity to store tens of petabytes in a single system with deduplication.
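As a rough illustration of the idea, the sketch below uses a metadata key to pick the previous version of an object and then compares only that pair at the byte level. It is not Sepaton's algorithm, whose internals are not described here. The class ContentAwareSketch, the (host, object) key and the use of Python's difflib for the byte comparison are all assumptions chosen to keep the example short.

```python
from difflib import SequenceMatcher


def byte_delta(old: bytes, new: bytes):
    """Describe `new` as copies from `old` plus literal bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # bytes already held in old
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))   # bytes that are genuinely new
        # "delete" needs no entry: those old bytes are simply not copied
    return ops


def apply_delta(old: bytes, ops) -> bytes:
    """Rebuild the new version from the old version and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


class ContentAwareSketch:
    """Hypothetical catalog: metadata key -> most recent version of that object."""

    def __init__(self):
        self.latest = {}

    def ingest(self, key, data: bytes):
        previous = self.latest.get(key)
        self.latest[key] = data
        if previous is None:
            return [("data", data)]            # first backup: nothing to compare
        return byte_delta(previous, data)      # compare only against the likely match
```

The metadata lookup plays the role of the triage step: it narrows the comparison to one likely match, so no global fingerprint index is consulted. A production system would use a far faster differencing method than difflib.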

Poor Capacity Reduction for Databases and Progressive Incremental Backup Environments

Databases, such as Oracle, SAP, DB2, and Exchange, as well as data streams found in big data environments, typically change data in segments of 8 KB or smaller. This granularity of storing structured and semi-structured data poses a significant problem for inline deduplication technologies simply because the granularity of change is too small for them to deal with effectively. Also, because hash-based schemes are unaware of content and data types, there is no way to control what data is compared against what: every incoming chunk is compared against the entire index regardless of the probability of duplication. Most importantly, in these technologies, examining data in sections smaller than 8 KB typically causes a severe performance bottleneck and prevents full use of the capacity provided. As a result, hash-based schemes leave a large volume of duplicate data from critical databases and analytical tools completely unhandled.
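A small experiment makes the granularity problem concrete. The sketch below builds a synthetic "database" of 8 KB pages, flips one byte in every fourth page, and measures how many chunks of the second backup a fixed-size hash scheme would treat as new. The page layout, helper names and chunk sizes are assumptions chosen for illustration, not measurements from any product.

```python
import hashlib

PAGE = 8 * 1024  # assumed database page size


def make_backup(pages: int) -> bytes:
    """Synthetic backup: each page holds distinct, repeating content."""
    return b"".join(i.to_bytes(4, "big") * (PAGE // 4) for i in range(pages))


def touch_every_nth_page(data: bytes, n: int) -> bytes:
    """Flip a single byte in every n-th page to mimic small, scattered updates."""
    buf = bytearray(data)
    for page in range(0, len(buf) // PAGE, n):
        buf[page * PAGE + 100] ^= 0xFF
    return bytes(buf)


def new_chunk_fraction(old: bytes, new: bytes, chunk_size: int) -> float:
    """Fraction of chunks in `new` whose fingerprint is absent from `old`."""
    seen = {
        hashlib.sha256(old[i:i + chunk_size]).digest()
        for i in range(0, len(old), chunk_size)
    }
    chunks = [new[i:i + chunk_size] for i in range(0, len(new), chunk_size)]
    fresh = sum(1 for c in chunks if hashlib.sha256(c).digest() not in seen)
    return fresh / len(chunks)


old = make_backup(64)                 # 512 KB first backup
new = touch_every_nth_page(old, 4)    # only 16 bytes differ in the second backup

print(new_chunk_fraction(old, new, 32 * 1024))  # 32 KB chunks
print(new_chunk_fraction(old, new, 8 * 1024))   # 8 KB chunks
```

With 32 KB chunks every chunk contains a touched page, so the whole second backup is stored again even though only 16 bytes changed. With 8 KB chunks only the touched pages, a quarter of the total, are stored as new, at the cost of four times as many index lookups.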

The sub-8 KB limitation of hash-based deduplication is also a problem in the progressive incremental backup environments commonly used in big data enterprises, including non-file backups, TSM progressive incremental backups and backups from applications that fragment their data, such as NetWorker and HP Data Protector. Just as in database deduplication, inline/hash-based deduplication technologies cannot examine data in these common big data backup environments at a fine enough level of granularity, which reduces capacity reduction efficiency.

The “triage” approach used by ContentAware technologies enables them to focus their process-intensive deduplication examination on the subset of data that contains duplicates and to examine that data at the individual byte level for maximum efficiency.

Scalability Limitations

Another limitation of hash-based deduplication is the lack of scalability. Providing a single global coherent index across multiple nodes is extremely hard to do and beyond the abilities of all current hash-based engines. Relatedly, hash-based schemes that support multiple federated indexes further exacerbate the issue of unhandled regions of duplicate data.

Big data enterprises need to move massive and fast-growing data volumes to the safety of a backup environment within a fixed backup window. Without the ability to scale, hash-based deduplication technologies force big data environments to create “sprawl” by dividing backups onto numerous individual, single-node backup systems. Introducing each of these systems requires significant load balancing and adjustment. The big data enterprise then has multiple individual systems, often dozens, that need ongoing tuning, upgrading and maintenance as well as space, power and cooling in the data center. Fragmenting the backup among these systems results in inherently less efficient deduplication and in overbuying when new systems are added for performance before added capacity is needed.

As discussed above, inline hash-based solutions are not scalable and cannot handle large data volumes efficiently. Restore performance is also a challenge for hash-based technologies because a single node must perform the compute-intensive task of reassembling the most recent backup while it continues to perform all backup, deduplication, and replication processes. This unpredictable performance makes restore time objectives, including those for vaulting to tape, difficult to achieve with any measure of confidence. ContentAware byte differential deduplication, in contrast, keeps a fully hydrated copy of the most recent backup data intact for immediate restores/tape vaulting and can apply as many as eight processing nodes simultaneously to all data protection processes to sustain deterministic high performance.

Replication Challenges

Most hash-based deduplication technologies try to replicate data efficiently across a WAN by engaging in complex fingerprint negotiations in an attempt to avoid sending duplicate data. The fingerprint negotiation phase often generates significant transfer latency and resultant “dead-time” on the wire. Consequently, many hash-based replication schemes don’t run faster than non-deduplicated schemes unless the deduplication rate is very high, and they often run significantly slower when database data or other high-change-rate data is replicated. Content aware byte differential technologies solve these replication problems by streaming delta requests to target systems. Data is only pulled from the source system when the delta rules can’t be applied, effectively minimizing the amount of data transferred, while also effectively utilizing the full bandwidth of the wire and avoiding costly latency gaps.
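A back-of-the-envelope model shows why negotiation latency can erase the benefit of sending less data. The sketch below assumes a simple stop-and-wait exchange in which each batch of fingerprints costs one round trip before any missing chunks are sent. The function names, batch size, fingerprint size and link figures are hypothetical, and real protocols may pipeline these exchanges.

```python
def plain_transfer_seconds(total_bytes: int, bandwidth_bps: float) -> float:
    """Time to stream everything with no deduplication and no negotiation."""
    return total_bytes * 8 / bandwidth_bps


def negotiated_transfer_seconds(
    total_bytes: int,
    duplicate_fraction: float,   # share of the data the target already holds
    bandwidth_bps: float,
    rtt_s: float,                # WAN round-trip time in seconds
    chunk_size: int = 8192,
    chunks_per_batch: int = 128,
    fingerprint_bytes: int = 32,
) -> float:
    """Time for a stop-and-wait fingerprint negotiation (simplified model)."""
    chunks = -(-total_bytes // chunk_size)        # ceiling division
    batches = -(-chunks // chunks_per_batch)
    # Fingerprints for every chunk plus the chunks the target is missing.
    wire_bytes = chunks * fingerprint_bytes + total_bytes * (1 - duplicate_fraction)
    # Each batch waits one round trip for the target's reply.
    return batches * rtt_s + wire_bytes * 8 / bandwidth_bps


GIB = 2**30
LINK = 1e9  # assumed 1 Gbit/s WAN link

# High-change data over a long-haul link: the waiting time dominates.
print(negotiated_transfer_seconds(GIB, 0.10, LINK, rtt_s=0.050))
print(plain_transfer_seconds(GIB, LINK))

# Highly duplicate data over a short link: negotiation pays off.
print(negotiated_transfer_seconds(GIB, 0.95, LINK, rtt_s=0.001))
```

In this model the round trips are pure idle time on the wire, so the negotiated transfer only wins when the duplicate fraction is high and the latency is low, which is consistent with the behavior described above.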

For large enterprise organizations with big data environments to protect, many deduplication solutions that are well-suited to smaller environments fall short. The sheer volume and overall complexity of the big data backup environment require a higher level of deduplication efficiency, performance, scalability and flexibility than hash-based deduplication technologies can deliver. Content aware deduplication technologies that were designed specifically for large, data-intensive environments offer a more efficient and cost-effective alternative.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating. View previously published Industry Perspectives in our Knowledge Library.
