Understanding Data De-duplication

Druvaa is a Pune-based startup that sells fast, efficient, and cheap backup software for enterprises and SMEs. (Update: see the comments section for Druvaa’s comments on my use of the word “cheap” here; apparently they sell even in cases where their product is priced above the competing offerings.) It makes heavy use of data de-duplication technology to deliver on the promise of speed and low bandwidth consumption. In this article, reproduced with permission from their blog, they explain what exactly data de-duplication is and how it works.

Definition of Data De-duplication

Data deduplication or Single Instancing essentially refers to the elimination of redundant data. In the deduplication process, duplicate data is deleted, leaving only one copy (single instance) of the data to be stored. However, indexing of all data is still retained should that data ever be required.

Example
A typical email system might contain 100 instances of the same 1 MB file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just a reference back to the one saved copy, reducing storage and bandwidth demand to only 1 MB.
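The attachment example can be sketched in a few lines of Python. This is only an illustration, not Druvaa's implementation: the class name `SingleInstanceStore` and its methods are made up for this sketch, and SHA-256 is assumed as the fingerprint.

```python
import hashlib


class SingleInstanceStore:
    """Keeps one copy of each distinct payload, keyed by its SHA-256 digest.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        self.blobs = {}        # digest -> payload bytes (stored once)
        self.references = []   # one digest per logical item (the retained index)

    def put(self, payload: bytes) -> str:
        digest = hashlib.sha256(payload).hexdigest()
        # Store the bytes only if this content has not been seen before.
        self.blobs.setdefault(digest, payload)
        # Always record a reference, so every instance can be restored.
        self.references.append(digest)
        return digest

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


attachment = b"x" * (1024 * 1024)   # a 1 MB attachment
store = SingleInstanceStore()
for _ in range(100):                # the same attachment in 100 mailboxes
    store.put(attachment)

logical_bytes = len(store.references) * len(attachment)  # what a plain backup would store
physical_bytes = store.stored_bytes()                    # what the deduplicated store holds
```

Here `logical_bytes` is 100 MB while `physical_bytes` is 1 MB, and the list of references plays the role of the index that is retained for every instance.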

Technological Classification

The practical benefits of this technology depend upon various factors like –

Point of Application – Source Vs Target
Time of Application – Inline vs Post-Process
Granularity – File vs Sub-File level
Algorithm – Fixed size blocks Vs Variable length data segments

A simple relation between these factors can be explained using the diagram below –

Target Vs Source based Deduplication

Target based deduplication acts on the target data storage media. In this case the client is unmodified and not aware of any deduplication. The deduplication engine can be embedded in the hardware array, which can then be used as a NAS/SAN device with deduplication capabilities. Alternatively, it can be offered as an independent software or hardware appliance which acts as an intermediary between the backup server and the storage arrays. In both cases it improves only the storage utilization.

In contrast, source based deduplication acts on the data at the source before it’s moved. A deduplication-aware backup agent is installed on the client, which backs up only unique data. The result is improved bandwidth and storage utilization, but this imposes additional computational load on the backup client.
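To see why source based deduplication saves bandwidth as well as storage, here is a minimal sketch of the exchange between a backup agent and a server. The names `BackupServer`, `missing`, `upload` and `backup` are hypothetical, the "network" is just a method call, and the small cost of sending the digests themselves is ignored.

```python
import hashlib


class BackupServer:
    """Toy target that stores chunks by digest (hypothetical, in-memory)."""

    def __init__(self):
        self.chunks = {}

    def missing(self, digests):
        # The server tells the client which digests it has never seen.
        return {d for d in digests if d not in self.chunks}

    def upload(self, digest, data):
        self.chunks[digest] = data


def backup(server, chunks):
    """Send only chunks the server lacks; return the payload bytes sent."""
    # Hashing happens on the client: this is the extra computational load.
    by_digest = {hashlib.sha256(c).hexdigest(): c for c in chunks}
    sent = 0
    for digest in server.missing(by_digest.keys()):
        server.upload(digest, by_digest[digest])
        sent += len(by_digest[digest])
    return sent


server = BackupServer()
first_run = backup(server, [b"aaaa", b"bbbb", b"aaaa"])   # two unique chunks travel
second_run = backup(server, [b"aaaa", b"cccc"])           # only the new chunk travels
```

In a target based setup the client would have sent all 12 bytes in the first run and all 8 in the second, and the duplicates would have been discarded only after arriving.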

Inline Vs Post-process Deduplication

In target based deduplication, the deduplication engine can either process data for duplicates in real time (i.e. as and when it’s sent to the target) or after it’s been stored in the target storage.

The former is called inline deduplication. The obvious advantages are –

Increase in overall efficiency as data is only passed and processed once
The processed data is instantaneously available for post-storage processes like recovery and replication, reducing the RPO and RTO window.

The disadvantages are –

Decrease in write throughput
Extent of deduplication is less – only the fixed-length block deduplication approach can be used

Inline deduplication only processes incoming raw blocks and does not have any knowledge of the files or file structure. This forces it to use the fixed-length block approach (discussed in detail later).

Post-process deduplication acts asynchronously on the stored data, and its advantages and disadvantages are the exact opposite of those listed above for inline deduplication.

File vs Sub-file Level Deduplication

The duplicate removal algorithm can be applied at the full-file or sub-file level. Full-file duplicates can be easily eliminated by calculating a single checksum of the complete file data and comparing it against the existing checksums of already backed up files. It’s simple and fast, but the extent of deduplication is quite limited, as it does not address the problem of duplicate content found inside different files or data-sets (e.g. emails).
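A rough sketch of full-file deduplication, and of its main limitation, might look like this. The function name `file_level_dedup` and the file names are invented for the example, and SHA-256 stands in for whatever checksum a real product uses.

```python
import hashlib


def file_level_dedup(files):
    """Group file names by the checksum of their complete contents.

    files: dict mapping name -> bytes. Returns dict: checksum -> list of names.
    """
    groups = {}
    for name, data in files.items():
        checksum = hashlib.sha256(data).hexdigest()
        groups.setdefault(checksum, []).append(name)
    return groups


body = b"quarterly report " * 1000
files = {
    "a.doc": body,
    "copy_of_a.doc": body,          # exact duplicate: caught
    "a_edited.doc": body + b"!",    # one extra byte: treated as entirely new
}
groups = file_level_dedup(files)
unique_files = len(groups)
```

The exact copy collapses into the same group as the original, but the edited file, which shares all but one byte with it, gets its own checksum and would be stored in full.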

Sub-file level deduplication breaks the file into smaller fixed or variable size blocks, and then uses a standard hash based algorithm to find identical blocks.

Fixed-Length Blocks v/s Variable-Length Data Segments

The fixed-length block approach, as the name suggests, divides the files into fixed-size blocks and uses a simple checksum (MD5/SHA etc.) based approach to find duplicates. Although it’s possible to look for repeated blocks this way, the approach provides very limited effectiveness. The reason is that the primary opportunity for data reduction is in finding duplicate blocks in two transmitted datasets that are made up mostly, but not completely, of the same data segments.

For example, similar data blocks may be present at different offsets in two different datasets. In other words, the block boundaries of similar data may be different. This is very common when some bytes are inserted in a file: when the changed file is processed again and divided into fixed-length blocks, all blocks from the insertion point onward appear to have changed.

Therefore, two datasets with a small amount of difference are likely to have very few identical fixed length blocks.
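The boundary-shift problem is easy to reproduce. The sketch below is an illustration under assumed parameters (64-byte blocks, SHA-256 digests, pseudo-random test data), with helper names chosen for this example. It compares an in-place overwrite with a one-byte insertion at the start of the data.

```python
import hashlib
import random


def fixed_blocks(data, size=64):
    """Cut data into consecutive blocks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digests(blocks):
    return {hashlib.sha256(block).hexdigest() for block in blocks}


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 100))   # 100 full blocks

# Case 1: one byte is overwritten in place, so the offsets stay the same.
overwritten = bytes([original[0] ^ 0xFF]) + original[1:]
# Case 2: one byte is inserted at the front, so every later byte shifts by one.
inserted = b"X" + original

old = digests(fixed_blocks(original))
shared_after_overwrite = len(old & digests(fixed_blocks(overwritten)))
shared_after_insert = len(old & digests(fixed_blocks(inserted)))
```

After the overwrite, every block except the first still matches. After the insertion, every block boundary falls one byte later in the content, so none of the old blocks should be found again even though the data is almost identical.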

Variable-Length Data Segment technology divides the data stream into variable length data segments using a methodology that can find the same block boundaries in different locations and contexts. This allows the boundaries to “float” within the data stream so that changes in one part of the dataset have little or no impact on the boundaries in other locations of the dataset.
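One known way to get such floating boundaries is content-defined chunking with a rolling hash. The article does not say which method Druvaa uses, so the sketch below is an assumption: it uses a simple Gear-style rolling hash with a random lookup table, and cuts a segment wherever the top bits of the hash are all zero. The names `GEAR` and `content_defined_chunks` and the parameter `bits` are chosen for this example.

```python
import hashlib
import random

MASK64 = (1 << 64) - 1
_table_rng = random.Random(1)
# One fixed pseudo-random 64-bit value per possible byte value.
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]


def content_defined_chunks(data, bits=6):
    """Cut data wherever the top `bits` bits of a rolling hash are all zero.

    Each step shifts the hash left by one bit, so a byte's influence drops out
    after 64 steps. A cut therefore depends only on the most recent 64 bytes of
    content, not on the absolute offset in the stream.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64
        if h >> (64 - bits) == 0:          # expected once every 2**bits bytes
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])        # trailing bytes after the last cut
    return chunks


def digests(chunks):
    return {hashlib.sha256(chunk).hexdigest() for chunk in chunks}


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
inserted = b"X" + original                 # one byte inserted at the front

old = digests(content_defined_chunks(original))
new = digests(content_defined_chunks(inserted))
shared_fraction = len(old & new) / len(old)
```

Because the cut points are chosen by local content, only the segments near the inserted byte change and the rest are found again, so `shared_fraction` stays close to 1. Production implementations typically also enforce minimum and maximum segment sizes, which this sketch leaves out for brevity.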

ROI Benefits

Each organization has a capacity to generate data. The extent of savings depends upon, but is not directly proportional to, the number of applications or end users generating data. Overall, the deduplication savings depend upon the following parameters –

No. of applications or end users generating data
Total data
Daily change in data
Type of data (emails/ documents/ media etc.)
Backup policy (weekly-full – daily-incremental or daily-full)
Retention period (90 days, 1 year etc.)
Deduplication technology in place

The actual benefits of deduplication are realized once the same dataset is processed multiple times over a span of time for weekly/daily backups. This is especially true for variable length data segment technology which has a much better capability for dealing with arbitrary byte insertions.
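As a back-of-the-envelope illustration of how retention and daily change interact, consider a deliberately idealized model: a daily-full backup policy, a constant daily change rate, and a deduplication engine that stores exactly the changed data and nothing more. The function `dedup_ratio` and its parameters are assumptions made for this sketch, not a formula from Druvaa.

```python
def dedup_ratio(total_gb, daily_change, days):
    """Ratio of logical data backed up to data actually stored.

    total_gb:     size of the dataset
    daily_change: fraction of the dataset that changes per day (0.0 to 1.0)
    days:         number of daily-full backups retained
    """
    logical = total_gb * days                                   # what is backed up
    stored = total_gb + (days - 1) * daily_change * total_gb    # first full + daily changes
    return logical / stored


ratio_90_days = dedup_ratio(total_gb=100, daily_change=0.02, days=90)
ratio_7_days = dedup_ratio(total_gb=100, daily_change=0.02, days=7)
```

With 100 GB of data, a 2% daily change and 90 days of retention, the model stores 278 GB in place of 9,000 GB, roughly 32:1. With only 7 days of retention the ratio falls to about 6:1, which is why the savings grow as the same dataset is processed again and again.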

Numbers
Some vendors claim bandwidth/storage saving ratios of 1:300, but our customer statistics show that the results are between 1:4 and 1:50 for source based deduplication.

11 thoughts on “Understanding Data De-duplication”

Dhananjay Nene says:

January 15, 2009 at 9:57 am

Nice post. Especially liked the clean breakdown and a granular explanation of each item. Wasn’t aware of sub-file level deduplication so that’s a new learning for me.

As an aside, deduplication is a term which is used in multiple contexts and therefore has multiple meanings. A couple of additional contexts I am aware of are data mining (which attempts to do the same on database records) and application integration, which primarily focuses on message level deduplication to support message idempotency.
Jaspreet says:

January 15, 2009 at 10:12 am

Just a small correction. The flagship product is priced 3 times the cost of competing veritas product. Still it sells like hot cakes 🙂 .. because we can proudly succeed where veritas fails.
Jaspreet says:

January 15, 2009 at 10:14 am

Dhananjay,

Fyi, at Druvaa we use source based sub-file level deduplication with variable-sized data segment approach.

I know only 2 companies besides Druvaa which used source based dedup.
Dhananjay Nene says:

January 15, 2009 at 10:23 am

In response to a comment. Correction to what ? Seemed like an introduction of a new thought. Did I just smell a pitch on a technical article ?
Jaspreet says:

January 15, 2009 at 11:00 am

Dhananjay,

The introduction to company says – “Druvaa is a Pune-based startup that sells fast, efficient, and __cheap__ backup software for … ”

The word “cheap” automatically gets associated with any Indian startup. And I don’t quite like it. So I felt like correcting it.

Sorry if that sounded like a pitch. That wasn’t the intention; it’s not the market we intend to sell in.

Jaspreet
navin says:

January 15, 2009 at 7:03 pm

@Jaspreet, I was not aware of the fact that your flagship product was priced higher than competing offering from Veritas. At a cursory glance, I thought you were priced much lower than them. But in any case, software product lines and prices these days are so complicated that I can’t really figure out what the cost is in any case. For products from some of the larger companies, I seriously need a consultant just to figure which version of the product to buy 🙁

But in any case, I guess the point you are trying to make is that you are selling on features and TCO rather than just the price of the software.
Jaspreet says:

January 15, 2009 at 7:11 pm

Navin,

No issue, I guess it was a communication gap from my side.

Thanks a lot for covering the article,
Jaspreet
Bijayalaxmi Nanda says:

July 17, 2009 at 2:53 am

Here is a detailed paper on de-duplication in Data Domain on Usenix .
Amit says:

July 21, 2009 at 7:43 pm

Very well written document, and it gives a short overall introduction to ‘de-duplication’.
You can also refer to the article below for a clearer picture of de-duplication –
“http://www.snia.org/education/tutorials/2009/spring/data-management/DanielBudiansky_Understanding_Data_Deduplication.pdf”