Making a Hash of Database Deduplication

Jeff Tofano, the chief technology officer at SEPATON, Inc., has more than 30 years of experience in the data protection, storage and high-availability industries. He leads SEPATON’s technical direction, addressing the data protection needs of large enterprises.

IT managers in big data environments are aware of the important role that deduplication plays in their ongoing struggle to back up, replicate, retain and restore massive, fast-growing data volumes. However, few are aware that inline/hash-based deduplication technologies cannot efficiently deduplicate multi-stream database data, multiplexed data or data in progressive incremental backup environments. Understanding the limitations of inline/hash-based deduplication and the impact of these limitations on data protection can save “big data” organizations hundreds of thousands of dollars annually in disk capacity, power and cooling.

Hash vs Content Aware Deduplication

While several different deduplication technologies are suitable for smaller data volumes, IT managers in big data environments have two broad categories of deduplication to choose from: inline/hash-based deduplication and content aware byte differential deduplication.

Inline/hash-based technologies are designed to find matches in data before it is written to disk. They analyze segments of data as they are being backed up and assign a unique identifier called a fingerprint to each segment. Most of these technologies use an algorithm that computes a cryptographic hash value from a fixed or variable segment of data in the backup stream, regardless of the data type. The fingerprints are stored in an index. As each backup is performed, the fingerprints of incoming data are compared to those already in the index. If the fingerprint exists in the index, the incoming data is replaced with a pointer to the data already stored. If the fingerprint does not exist, the data is written to disk as a new unique chunk. The fingerprint assignment, index lookup and pointer replacement steps must all be performed before the data is written to disk. To contain the size of the index, inline/hash-based deduplication technologies are purposely designed for small-to-medium sized enterprises whose data volumes and change rates are small enough to be deduplicated without causing a bottleneck in backup “ingest” performance. And even in these solutions designed for smaller data sets, most hash-based technologies rely on large duplicate chunk sizes and ignore small duplicates to achieve reasonable performance.
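The fingerprint, index lookup and pointer replacement steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any particular product: the class name `HashDedupStore`, the fixed 8 KB segment size, the choice of SHA-256 as the fingerprint and the in-memory dictionary standing in for the index are all hypothetical.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # assumed fixed segment size; real products vary


class HashDedupStore:
    """Toy inline deduplication store (hypothetical, for illustration only)."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.index = {}   # fingerprint -> slot number in self.chunks
        self.chunks = []  # unique chunks, standing in for data on disk

    def ingest(self, stream: bytes):
        """Deduplicate a backup stream and return its list of pointers."""
        pointers = []
        for offset in range(0, len(stream), self.chunk_size):
            segment = stream[offset:offset + self.chunk_size]
            fingerprint = hashlib.sha256(segment).digest()
            slot = self.index.get(fingerprint)
            if slot is None:
                # Unknown fingerprint: write the segment as a new unique chunk.
                slot = len(self.chunks)
                self.chunks.append(segment)
                self.index[fingerprint] = slot
            # Known or new, the backup only records a pointer to the chunk.
            pointers.append(slot)
        return pointers

    def restore(self, pointers):
        """Reassemble a backup by following its pointers."""
        return b"".join(self.chunks[slot] for slot in pointers)


if __name__ == "__main__":
    store = HashDedupStore(chunk_size=4)
    backup = store.ingest(b"AAAABBBBAAAA")
    print(backup, len(store.chunks))
```

Note that every segment requires a hash computation and an index lookup before anything is written, which is why the index size and lookup cost sit directly on the ingest path.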

Content aware technologies (including SEPATON’s ContentAware byte differential deduplication) work in a fundamentally different way. These schemes extract metadata from the incoming data stream and use it to identify the data that contains duplicates. They then analyze this small subset of data at the byte level for optimal capacity reduction. The deduplication process is performed outside the ingest operation (concurrent processing) so that it does not slow backup or restore processes. Because there is no index, and because the analysis of suspect duplicates can be done in parallel, these technologies are able to scale processing across multiple nodes and to scale capacity to store tens of petabytes in a single system with deduplication.
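The general idea of metadata-driven triage followed by a byte-level comparison can be sketched as follows. This is an assumption-laden toy, not SEPATON's algorithm: the object name stands in for the extracted metadata, the function names `byte_delta`, `apply_delta` and `ingest` are hypothetical, and Python's standard `difflib.SequenceMatcher` is swapped in as a simple (and slow) byte differencing engine.

```python
from difflib import SequenceMatcher


def byte_delta(reference: bytes, incoming: bytes):
    """Encode `incoming` as copy/insert operations against `reference`."""
    ops = []
    matcher = SequenceMatcher(None, reference, incoming, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))             # bytes already stored
        elif tag in ("replace", "insert"):
            ops.append(("insert", incoming[j1:j2]))  # genuinely new bytes
        # "delete": reference bytes that are simply not copied
    return ops


def apply_delta(reference: bytes, ops):
    """Rebuild the incoming data from the reference and its delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += reference[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def ingest(catalog: dict, name: str, data: bytes):
    """Triage by metadata: compare only against the prior copy of `name`."""
    previous = catalog.get(name)
    catalog[name] = data  # keep the most recent copy whole
    if previous is None:
        return None       # nothing to compare against yet
    return byte_delta(previous, data)


if __name__ == "__main__":
    catalog = {}
    ingest(catalog, "orders.dbf", b"row1=alice;row2=bob;row3=carol;")
    delta = ingest(catalog, "orders.dbf", b"row1=alice;row2=bobby;row3=carol;")
    print(delta)
```

Because each object is compared only with its own earlier version, there is no global index to consult, and independent objects could be processed in parallel.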

Poor Capacity Reduction for Databases and Progressive Incremental Backup Environments

Databases such as Oracle, SAP, DB2 and Exchange, as well as data streams found in big data environments, typically change data in segments of 8 KB or smaller. This granularity of storing structured and semi-structured data poses a significant problem for inline deduplication technologies, simply because the granularity of change is too small for them to deal with effectively. Also, because hash-based schemes are unaware of content and data types, there is no way to control what data is compared against what: every incoming chunk is compared against the entire index regardless of the probability of duplication. Most importantly, in these technologies, examining data in sections smaller than 8 KB typically causes a severe performance bottleneck and prevents the use of all the capacity provided. As a result, hash-based schemes leave a large volume of duplicate data from critical databases and analytical tools completely unhandled.
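The effect of change granularity can be illustrated with a small experiment. The sketch below is a toy under stated assumptions, not a benchmark of any product: it builds a synthetic "database" of random 8 KB pages, flips a single byte in every fourth page to mimic small scattered updates, and measures what fraction of the second backup a fixed-size hash scheme would recognize as duplicate at several chunk sizes. The function names, page count and change pattern are all hypothetical.

```python
import hashlib
import random

PAGE = 8 * 1024  # assumed database page size


def make_backups(pages=256, changed_every=4, seed=0):
    """Return two backups; the second has one byte changed per Nth page."""
    rng = random.Random(seed)
    size = pages * PAGE
    first = rng.getrandbits(size * 8).to_bytes(size, "big")
    second = bytearray(first)
    for page in range(0, pages, changed_every):
        second[page * PAGE] ^= 0xFF  # a one-byte change inside the page
    return first, bytes(second)


def duplicate_fraction(first: bytes, second: bytes, chunk_size: int) -> float:
    """Fraction of fixed-size chunks in `second` already seen in `first`."""
    seen = {
        hashlib.sha256(first[o:o + chunk_size]).digest()
        for o in range(0, len(first), chunk_size)
    }
    offsets = range(0, len(second), chunk_size)
    hits = sum(
        1 for o in offsets
        if hashlib.sha256(second[o:o + chunk_size]).digest() in seen
    )
    return hits / len(offsets)


if __name__ == "__main__":
    first, second = make_backups()
    for kb in (4, 8, 16, 32):
        print(kb, "KB chunks:", duplicate_fraction(first, second, kb * 1024))
```

In this synthetic case only 64 bytes out of 2 MB differ, yet the share of data recognized as duplicate falls as the chunk size grows, and at 32 KB every chunk contains a changed page so nothing is deduplicated.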

The sub-8 KB limitation of hash-based deduplication is also a problem in the progressive incremental backup environments commonly used in big data enterprises, including non-file backups, TSM progressive incremental backups and backups from applications that fragment their data, such as NetWorker and HP Data Protector. Just as in database deduplication, inline/hash-based deduplication technologies cannot examine data in these common big data backup environments at a fine enough level of granularity, which reduces their capacity reduction efficiency.

The “triage” approach used by ContentAware technologies enables them to focus their process-intensive deduplication examination on the subset of data that contains duplicates and to examine that data at the individual byte level for maximum efficiency.

Scalability Limitations

Another limitation of hash-based deduplication is the lack of scalability. Providing a single global coherent index across multiple nodes is extremely hard to do and beyond the abilities of all current hash-based engines. Relatedly, hash-based schemes that support multiple federated indexes further exacerbate the issue of unhandled regions of duplicate data.

Big data enterprises need to move massive and fast-growing data volumes to the safety of a backup environment within a fixed backup window. Without the ability to scale, hash-based deduplication technologies force big data environments to create “sprawl” by dividing backups onto numerous individual, single-node backup systems. Introducing each of these systems requires significant load balancing and adjustment. The big data enterprise then has multiple (often dozens of) individual systems that need ongoing tuning, upgrading and maintenance, as well as space, power and cooling in the data center. Fragmenting the backup among these systems results in inherently less efficient deduplication, and in overbuying when new systems are added for performance before added capacity is needed.

As discussed above, these inline hash-based solutions are not scalable and cannot handle large data volumes efficiently. Restore performance is also a challenge for hash-based technologies, as they have only a single node to perform the compute-intensive task of reassembling the most recent backup while it continues to perform all backup, deduplication and replication processes. Unpredictable performance makes restore time objectives, including those for vaulting to tape, difficult to achieve with any measure of confidence. ContentAware byte differential deduplication, in contrast, keeps a fully hydrated copy of the most recent backup data intact for immediate restores and tape vaulting, and can apply as many as eight processing nodes simultaneously to all data protection processes to sustain deterministic high performance.

Replication Challenges

Most hash-based deduplication technologies enable efficient replication of data across a WAN by engaging in complex fingerprint negotiations in an attempt to avoid sending duplicate data. The fingerprint negotiation phase often generates significant transfer latency and resultant “dead-time” on the wire. Consequently, many hash-based replication schemes don’t run faster than non-deduplicated schemes unless the deduplication rate is very high; they struggle, and often run significantly slower, when database data or other high-change-rate data is replicated. Content aware byte differential technologies solve these replication problems by streaming delta requests to target systems. Data is only pulled from the source system when the delta rules can’t be applied, which minimizes the amount of data transferred while also using the full bandwidth of the wire and avoiding costly latency gaps.
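The cost of negotiation can be made concrete with a back-of-the-envelope model. The sketch below is a deliberately simplified assumption, not a description of any vendor's protocol: it supposes that the source sends fingerprints in batches, waits one full round trip per batch for the target to report which chunks are missing, and only then sends those chunks. The function names, the batch size of 1,000 and the 32-byte fingerprint are hypothetical parameters.

```python
def plain_transfer_seconds(chunks, chunk_bytes, bandwidth_bps):
    """Time to send every chunk with no deduplication at all."""
    return chunks * chunk_bytes * 8 / bandwidth_bps


def negotiated_transfer_seconds(chunks, missing_fraction, chunk_bytes,
                                bandwidth_bps, rtt_s,
                                batch=1000, fp_bytes=32):
    """Time to replicate with batched, non-pipelined fingerprint negotiation.

    Assumes one idle round trip per batch while the source waits for the
    target to say which chunks it lacks.
    """
    batches = -(-chunks // batch)  # ceiling division
    fingerprint_bytes = chunks * fp_bytes
    data_bytes = chunks * missing_fraction * chunk_bytes
    wire_time = (fingerprint_bytes + data_bytes) * 8 / bandwidth_bps
    dead_time = batches * rtt_s    # the link carries no data while waiting
    return wire_time + dead_time


if __name__ == "__main__":
    args = dict(chunks=1_000_000, chunk_bytes=8192, bandwidth_bps=100e6)
    print("plain:", plain_transfer_seconds(**args))
    for missing in (0.02, 0.5, 1.0):
        t = negotiated_transfer_seconds(missing_fraction=missing,
                                        rtt_s=0.05, **args)
        print("negotiated, missing", missing, ":", t)
```

Under these assumptions negotiation pays off handsomely when few chunks are missing, but when almost every chunk is new the fingerprint traffic and the idle round trips are pure overhead, so the negotiated transfer is slower than simply sending the data.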

For large enterprise organizations with big data environments to protect, many deduplication solutions that are well-suited to smaller environments fall short. The sheer volume and overall complexity of the big data backup environment require a higher level of deduplication efficiency, performance, scalability and flexibility than hash-based deduplication technologies can deliver. Content aware deduplication technologies that were designed specifically for large, data-intensive environments offer a more efficient and cost-effective alternative.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating. View previously published Industry Perspectives in our Knowledge Library.

2 Comments

  1. Calle

    Inline hash based systems don't scale? http://thebackupblog.typepad.com/thebackupblog/2011/01/scaling-up-and-out-the-new-data-domain-890-and-gda.html "EMC DD890 scales to 14.7 TB/hr backup ingest, and up to 285 TB of useable capacity. The GDA scales to 26.3 TB/hr and up to 570 TB of useable capacity. For the average customer, that would equate to the ability to backup more than 200 TB in an eight hour backup window, and retain more than 10 PB of data. "

  2. Calle, the numbers quoted for the DD890 and GDA are best-case numbers with minimal change rates. Throw in a new backup policy with no commonality with existing data, or encrypted data, or a video file, and the ingest slows down considerably because the dedupe engine controls ingest speed. The hash dictionary suddenly swells with unique hashes. You may say that customers should send only well-behaved data that meets the DD profile to the target appliance. The problem is that you don't discover this data until you miss your backup window and production is affected. Post-process dedupe systems like the SEPATON S2100 provide predictable ingest to help you meet the backup window requirement. Our performance is also consistent for the initial backup, which Data Domain seldom talks about because it is so slow. Jeff notes that multi-stream multiplexed database backups are particularly difficult for inline hash systems to handle. DBAs want to multi-stream the backups to reduce the backup window, and the DB agent (e.g. RMAN) wants to do parallel asynchronous reads to avoid disk delays, but this results in multiplexed files across the multiple streams. Record update or insert transactions produce small changes that are often in sub-8KB blocks. Inline systems using a larger hash size have a really hard time meeting the performance goals and achieving reasonable dedupe ratios in these large database environments. That's why Data Domain best practices recommend FILESPERSET=1 (no multiplexing) and maximum streams between 3 and 5. Note that software compression and encryption on DD systems starve the CPU of resources needed for cryptographic hashing and replication. This again speaks to a design center focused on very low change rates and relatively small data volumes. Regarding total capacity, your number of 285TB of usable data on the DD890 has to be de-rated somewhat. Since space reclamation affects performance, it is typically deferred to a post-process cleaning phase. Until cleaning is done, space freed by relabeling cartridges or overwriting cartridges is not available for use. Cleaning is like a defrag process and needs extra space to work effectively. The rule of thumb is that you shouldn't use more than 75%-80% of usable space. Therefore the usable space on a DD890 is closer to 200TB. This suggests that the maximum database you might want to back up is closer to 100TB. If you had a 200TB database to back up, the initial backup could well fill up the entire box, and it wouldn't back up at even a fraction of the stated performance. You may say no one backs up 200TB databases to a Data Domain system. That's why SEPATON consistently wins BIG database backup deals. One of our customers is doing Oracle incremental backups nightly which are about 40TB in size. Either the database is colossal, or the change rate is huge. Either situation is beyond the capabilities of the Data Domain system.