Deduplicated Storage, Years after the Salesperson Leaves

So you are thinking about purchasing a deduplication storage system for your backups. Everyone tells you that, by removing redundancies, it will save storage space, and reduce costs, power, rack space, and network bandwidth requirements. Perhaps a vendor even did a quick analysis of your current data to estimate those savings. But, many months from now, what will your experience really be? We decided to investigate that question.

I was fortunate enough to collaborate with researchers studying long-term deduplication patterns, and our work was recently published at the Conference on Massive Storage Systems and Technology (MSST 2016)[1].

The Study

How do you perform a long-term study of deduplication patterns?

  • Step one: Create tools to gather data.
  • Step two: Gather data for years. And years. And years.
  • Step three: Analyze, analyze, analyze. Then upgrade your tools to analyze some more.

Our data collection tools were designed to process dozens of user home directories much as an actual deduplicated storage system would. The tools have run daily from 2011 through today, with infrequent gaps due to weather-related power outages, hardware failures, etc. Every day, the tools read all the files, break each file into chunks of various sizes (2KB to 128KB, plus the full file), calculate a hash for each chunk, and record the file’s metadata.
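To make the daily scan concrete, here is a minimal sketch of the read, chunk, hash, and record loop for a single file. It is not our actual collection tool: the function names, the SHA-1 hash, the 8KB default, and the fixed-size chunking are all assumptions chosen to keep the example short.

```python
import hashlib
import os

def chunk_hashes(path, chunk_size=8 * 1024):
    """Yield (offset, length, sha1_hex) for each fixed-size chunk of a file.

    Fixed-size chunking and SHA-1 are assumptions made for brevity.
    """
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            yield offset, len(block), hashlib.sha1(block).hexdigest()
            offset += len(block)

def scan_file(path, chunk_size=8 * 1024):
    """Return one record per file: basic metadata plus its chunk hashes."""
    st = os.stat(path)
    return {
        "path": path,
        "size": st.st_size,
        "mtime": st.st_mtime,
        "chunks": list(chunk_hashes(path, chunk_size)),
    }
```

Running `scan_file` over every file in a home directory once per day, and once per chunk size of interest, would produce one snapshot's worth of records in this simplified model.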

In this paper, we analyzed a 21-month subset of the data covering 33 users, 4,100 daily snapshots, and a total of 456TB of data. If this isn’t the largest and longest-running deduplication-oriented data set ever gathered, it is certainly among the top few. Analyzing the collected data was a gargantuan task involving sorting and merging algorithms that ran in phases, writing incremental progress to hard disk since the data was larger than memory. For anyone who wants to perform their own data collection or their own analysis of this immense data set, the tools and data set are publicly available.


The Results

What did we discover? Starting with general deduplication characteristics, we found that whole-file deduplication works well for small files (<1 MB) but poorly for large files. Since large files consume most of the bytes in the storage system, this result supports the industry trend towards deduplicating chunks of files. We then looked at the optimal deduplication chunk size for each type of backup schedule.

  • Weekly full and daily incremental backups: optimal chunk size is 4-8KB
  • Daily full backups: optimal chunk size is 32KB
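For readers who want to try a chunk-size comparison on their own data, the sketch below computes a raw deduplication ratio (logical bytes divided by unique chunk bytes) for a list of snapshots. The function name and the tiny synthetic data are hypothetical, the chunking is fixed-size, and metadata is ignored entirely, so this will not reproduce the optimal sizes listed above. It only shows the mechanics of the measurement.

```python
import hashlib
import random

def raw_dedup_ratio(snapshots, chunk_size=None):
    """Logical bytes / unique chunk bytes; chunk_size=None means whole-file.

    Each snapshot is a list of bytes objects (one per file).
    Metadata overhead is deliberately ignored in this sketch.
    """
    logical = 0
    unique = {}  # chunk hash -> chunk length
    for snapshot in snapshots:
        for data in snapshot:
            logical += len(data)
            step = chunk_size or max(len(data), 1)
            for i in range(0, len(data), step):
                chunk = data[i:i + step]
                unique[hashlib.sha1(chunk).digest()] = len(chunk)
    stored = sum(unique.values())
    return logical / stored if stored else 1.0

# Synthetic example: one 64KB file, then the same file with one byte changed.
rng = random.Random(0)
base = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
edited = bytearray(base)
edited[100] ^= 0xFF
snapshots = [[base], [bytes(edited)]]

for size in (None, 4 * 1024, 32 * 1024):
    label = "whole file" if size is None else f"{size // 1024}KB"
    print(label, round(raw_dedup_ratio(snapshots, size), 2))
```

In this toy case whole-file deduplication saves nothing, because the one-byte edit makes the two files differ, while smaller chunks confine the edit to a smaller amount of newly stored data.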

Deduplication ratios also varied by file type. Unsurprisingly, compressed files such as .gz, .zip, and .mp3 deduplicate poorly no matter the chunk size, while source code had the highest space savings because only small portions are modified between snapshots. Interestingly, VMDK files took up the bulk of the space, highlighting that storage systems need to be VM-aware.

Looking at the snapshots for our 33 users, we found significant differences that are hidden by grouping all the users together. Per-user deduplication ratios varied widely from a low of 40X (meaning their data fits in 1/40th of the original space) to a high of 2,400X!  That outlier user had large and redundant data files that could be compacted dramatically through deduplication.

While much of the redundancy is related to the large number of user snapshots, increasing the number of preserved snapshots (i.e., the retention window) does not linearly increase the deduplication ratio because of metadata. For example, we saw that the deduplication ratio was as high as 5,000X for the outlier user before considering the impact of file recipes. A file recipe is internal metadata needed to represent and reconstruct the files in their deduplicated format. The recipe consumes a tiny percentage of the space compared to the actual file data. However, even when data isn’t changing, we still need a recipe for each file in each snapshot. This led to an interesting conclusion. We found that deduplication ratios tend to grow rapidly for the first few dozen snapshots and then grow more slowly because file recipes become a larger fraction of what is stored. There are even cases where deduplication ratios dropped because a user added novel content.
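A simplified model shows why recipes flatten the curve. Assume, purely for illustration, that a user's data never changes, so the unique data is stored once while every snapshot still adds its own recipes. The function name and the numbers (100GB of data, recipes equal to 0.05% of it) are hypothetical and are not measurements from the study.

```python
def dedup_ratio_with_recipes(snapshots, logical_bytes, unique_bytes, recipe_bytes):
    """Deduplication ratio after `snapshots` identical snapshots.

    Simplifying assumptions: the data never changes, unique data is
    stored once, and each snapshot adds `recipe_bytes` of recipes.
    """
    logical = snapshots * logical_bytes
    stored = unique_bytes + snapshots * recipe_bytes
    return logical / stored

DATA = 100 * 10**9    # hypothetical: 100GB per snapshot
RECIPES = 50 * 10**6  # hypothetical: recipes are 0.05% of the data
CEILING = DATA / RECIPES  # the ratio can never exceed data/recipe size

for n in (1, 10, 100, 1000, 10000):
    ratio = dedup_ratio_with_recipes(n, DATA, DATA, RECIPES)
    print(f"{n:>6} snapshots: {ratio:8.1f}X (ceiling {CEILING:.0f}X)")
```

Under these assumptions the ratio is n / (1 + n·r), where r is the recipe fraction. It grows almost linearly while n·r is small and then approaches 1/r, which matches the shape we observed: fast growth early, then a plateau as recipes come to dominate what is stored.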

Finally, we looked at grouping users together to see if there are advantages to deduplicating similar users together. We found that there are pairs of users whose data overlapped by as much as 44%, which can directly turn into space savings by placing those users on the same deduplication system. Intelligent grouping and data placement can improve customers’ deduplication results.
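One way to estimate the benefit of co-locating two users is to compare their sets of unique chunk hashes. The sketch below assumes each user is summarized as a dictionary mapping chunk hash to chunk size. The helper names are hypothetical, and the savings metric (shared bytes divided by the bytes needed to store the users separately) is an illustrative choice that may differ from the overlap definition used in the paper.

```python
def unique_bytes(chunks):
    """Total bytes a user needs when stored alone (hash -> size mapping)."""
    return sum(chunks.values())

def shared_bytes(a, b):
    """Bytes of chunks that appear in both users' unique chunk sets."""
    return sum(size for h, size in a.items() if h in b)

def colocation_savings(a, b):
    """Fraction of space saved by storing both users on one system."""
    separate = unique_bytes(a) + unique_bytes(b)
    return shared_bytes(a, b) / separate if separate else 0.0

# Hypothetical users with made-up chunk hashes and sizes.
alice = {"h1": 4096, "h2": 4096, "h3": 8192}
bob = {"h2": 4096, "h3": 8192, "h4": 4096}

print(shared_bytes(alice, bob))        # 12288
print(colocation_savings(alice, bob))  # 0.375
```

Computing this for every pair of users gives a similarity matrix that a placement policy could use to decide which users to group on the same system.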

These findings can provide guidance to storage designers as well as customers wondering how their system will respond to long-term deduplication patterns. Customers and designers should think about matching chunk sizes to backup policies, metadata management at scale, and grouping together similar users’ data. There are many more details in the paper, and researchers are welcome to use our tools and data set to perform their own analysis.

[1] Zhen Sun, Geoff Kuenning, Sonam Mandal, Philip Shilane, Vasily Tarasov, Nong Xiao, and Erez Zadok. “A Long-Term User-Centric Analysis of Deduplication Patterns.” In the Proceedings of the 32nd International Conference on Massive Storage Systems and Technology (MSST 2016).


-Philip Shilane @philipshilane
