Best Practices for Deduplication of Data in Storage

When it comes to deduplication of data, there are several factors that will impact effectiveness.

Brien Posey

July 29, 2020

4 Min Read

Few technologies have the potential to drive down storage costs the way data deduplication can. Even so, the way you implement deduplication plays a major role in its overall effectiveness. With that in mind, here are some best practices for data deduplication.

1. Choose hardware over software.

One of the questions I am asked most often about data deduplication is whether it is better to let the operating system handle the deduplication process or to perform deduplication at the storage hardware level.

If you can choose between native file system deduplication and letting your storage appliance deduplicate the data, it is usually better to opt for hardware-level deduplication. The reason is that the deduplication process consumes resources beyond just storage IOPS; memory and CPU cycles are also used. Offloading the deduplication process to storage hardware keeps your servers from having to spend CPU and memory on data deduplication. Those resources can instead be used for running business workloads.
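To see why software-side deduplication eats server resources, consider a minimal sketch of block-level deduplication in Python. It hashes fixed-size blocks and keeps an in-memory index of unique blocks; the hashing costs CPU cycles and the index costs memory, which is exactly the work a hardware appliance takes off the server. The block size and hashing scheme here are illustrative assumptions, not a description of how any particular product works.

```python
import hashlib

BLOCK_SIZE = 64 * 1024  # fixed 64 KB chunks; real engines often use variable-size chunking

def deduplicate(path, index=None):
    """Split a file into blocks and hash each one, storing only unseen blocks.

    The SHA-256 work burns CPU cycles and the block index lives in memory --
    precisely the resources a hardware appliance would take off the server.
    """
    index = {} if index is None else index    # hash -> stored block
    layout = []                               # sequence of hashes needed to rebuild the file
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            if digest not in index:
                index[digest] = block         # unique block: store it
            layout.append(digest)             # duplicate block: store a reference only
    return layout, index
```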

2. Be aware of the deduplication engine’s limitations.

Not all deduplication engines are created equal, and some have significant limitations that can impact performance or functionality. In Windows Server 2012 and Windows Server 2012 R2, for instance, the NTFS deduplication engine had trouble with large volumes and large files. At the time, Microsoft recommended that native deduplication be enabled only for high-churn volumes of up to about 7 TB in size, or for low-churn volumes of up to about 10 TB in size. Microsoft also cautioned its customers about using deduplication on any NTFS volume containing very large files.

In Windows Server 2016, Microsoft made some major architectural changes to its data deduplication job pipeline. These changes made it practical to deduplicate volumes of up to 64 TB in size, containing individual files of up to 1 TB in size.

Before you enable deduplication of data, it is important to check the documentation for whatever deduplication engine you happen to be using to see if it is a good match for the volume that you are planning on deduplicating.
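As an illustration only, the following Python sketch walks a volume and flags anything that exceeds the limits quoted above for Windows Server 2016 (64 TB volumes, 1 TB files). The thresholds and the scan itself are placeholders; substitute the documented figures for whatever deduplication engine you are actually using.

```python
import os
import shutil

# Limits quoted above for Windows Server 2016; substitute your engine's documented figures.
MAX_VOLUME_BYTES = 64 * 2**40   # 64 TB
MAX_FILE_BYTES = 1 * 2**40      # 1 TB

def check_dedup_fit(volume_root):
    """Flag a volume that exceeds the documented size limits of the dedup engine."""
    total, _, _ = shutil.disk_usage(volume_root)
    warnings = []
    if total > MAX_VOLUME_BYTES:
        warnings.append(f"Volume is {total / 2**40:.1f} TB, above the supported volume size.")
    for root, _, files in os.walk(volume_root):
        for name in files:
            try:
                size = os.path.getsize(os.path.join(root, name))
            except OSError:
                continue  # skip files that disappear or can't be read
            if size > MAX_FILE_BYTES:
                warnings.append(f"{name} is {size / 2**40:.1f} TB, above the supported file size.")
    return warnings
```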

3. Consider deduplication ratios.

For a while there was something of an arms race among storage vendors, with each trying to achieve the highest possible deduplication ratio. Although a higher deduplication ratio often translates to a smaller storage footprint, it is important to understand what the ratios actually mean.

The deduplication ratio is essentially the factor by which you will be able to shrink your data under ideal conditions. A ratio of 20:1, for example, means that you could conceivably decrease your data’s physical storage consumption by a factor of 20. However, actually being able to achieve this ratio in the real world is anything but guaranteed. While the deduplication engine’s efficiency plays a big role in how much you will be able to shrink your data, the data itself plays a far larger role. Some data simply cannot be deduplicated.

The other thing that needs to be understood about deduplication ratios is that higher ratios yield diminishing returns.

Suppose that you are deduplicating 1 TB of data and that you achieve a ratio of 20:1. At that ratio, you have decreased the size of your data by 95%. The data has gone from consuming 1 TB of storage to 51.2 GB. If, however, you manage to achieve a deduplication ratio of 25:1, the higher ratio shrinks the data by only about one additional percentage point, to roughly 41 GB. The additional storage space saved by going from 20:1 to 25:1 is only about 10 GB.
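The diminishing returns are easy to verify with a few lines of arithmetic. The short Python sketch below computes the remaining physical size and percentage saved for a handful of ratios applied to 1 TB (1,024 GB) of data; the specific ratios chosen are just examples.

```python
def space_after_dedup(logical_gb, ratio):
    """Physical space (in GB) that remains at a given deduplication ratio."""
    return logical_gb / ratio

ONE_TB = 1024  # work in GB for readability

for ratio in (5, 10, 20, 25, 50):
    remaining = space_after_dedup(ONE_TB, ratio)
    saved_pct = (1 - 1 / ratio) * 100
    print(f"{ratio:>2}:1 -> {remaining:6.1f} GB remaining ({saved_pct:.1f}% saved)")

# 20:1 leaves 51.2 GB (95.0% saved); 25:1 leaves 41.0 GB (96.0% saved) --
# the extra five points of ratio buys only about 10 GB more.
```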

4. Be aware that deduplication of data can actually increase storage consumption.

As odd as it may sound, there are situations in which deduplicating a data set can actually increase its physical storage consumption. This can happen when the data set is compressed and cannot be deduplicated. If you try to deduplicate such a data set, the deduplication engine cannot do anything to shrink the data because it is already compressed. Even so, the deduplication engine may create a hash table or other structure in preparation for attempting to deduplicate the data. This table has to be stored somewhere, and the end result can be that the data set actually consumes slightly more storage space after it has been deduplicated.

The lesson here is to use deduplication of data only where it will be effective. If a volume is filled with compressed archives (ZIP files, CAB files, etc.) or compressed media files (MP3, MP4, etc.), then that volume might not be a good candidate for deduplication. It is worth noting, however, that some deduplication products are able to analyze a volume and estimate the space savings before you commit to using data deduplication.
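As a rough illustration of that kind of pre-check, the Python sketch below estimates what share of a volume’s bytes already sit in compressed formats, judging purely by file extension. This is a crude heuristic of my own choosing, not how any vendor’s estimation tool actually works, but a volume dominated by such files is unlikely to deduplicate well.

```python
import os

# Rough heuristic only: formats that are typically already compressed and
# unlikely to deduplicate well. Real products sample and hash the data instead.
COMPRESSED_EXTS = {".zip", ".cab", ".7z", ".gz", ".rar", ".mp3", ".mp4", ".jpg", ".mkv"}

def compressed_share(volume_root):
    """Return the fraction of bytes on the volume held in already-compressed files."""
    compressed = total = 0
    for root, _, files in os.walk(volume_root):
        for name in files:
            try:
                size = os.path.getsize(os.path.join(root, name))
            except OSError:
                continue  # skip unreadable or vanished files
            total += size
            if os.path.splitext(name)[1].lower() in COMPRESSED_EXTS:
                compressed += size
    return compressed / total if total else 0.0

# Example policy: skip deduplication when, say, more than 80% of the bytes are already compressed.
```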

About the Author

Brien Posey

Brien Posey is a bestselling technology author, a speaker, and a 20X Microsoft MVP. In addition to his ongoing work in IT, Posey has spent the last several years training as a commercial astronaut candidate in preparation to fly on a mission to study polar mesospheric clouds from space.

https://brienposey.com/
