We recently published a paper at the 15th USENIX Conference on File and Storage Technologies, describing how Dell EMC Data Domain’s method to reclaim free space has changed in the face of new workloads.
Readers of the Data Cortex are likely familiar with Data Domain and the way the Data Domain File System (DDFS) deduplicates redundant data. The original technical paper about DDFS gave a lot of information about deduplication, but it said little about how dead data gets reclaimed during Garbage Collection (GC). Nearly a decade later, we’ve filled in that gap while also describing how and why that process has changed in recent years.
In DDFS, there are two types of data that should be cleaned up by GC: unreferenced chunks (called segments in the DDFS paper and much other Data Domain literature, but chunks elsewhere), belonging to deleted files; and duplicate chunks, which have been written to storage multiple times when a single copy is sufficient. (The reason for duplicates being written is performance: it is generally faster to write a duplicate than to look up an arbitrary entry in the on-disk index to decide it’s already there, so the system limits how often it does index lookups.)
Both unreferenced and duplicate chunks can be identified via a mark-and-sweep garbage collection process. First, DDFS marks all the chunks that are referenced by any file, noting any chunks that appear multiple times. Then DDFS sweeps the unique, referenced chunks into new locations and frees up the original storage. Since chunks are grouped into larger units called storage containers, largely dead containers can be cleaned up with low overhead (i.e. copying the still-live data to new containers), while containers with lots of live data are left unchanged (i.e. the sweep process does not happen). This process is much like the early log-structured file system work, except that liveness is complicated by deduplication.
Originally, and for many years, DDFS performed the mark phase by going through every file in the file system and marking all the chunks reached by that file. This included both data chunks (which DDFS calls L0 chunks) and metadata (chunks containing fingerprints of other data or metadata chunks in the file, which DDFS calls L1-L6 chunks). Collectively this representation is known as a Merkle tree. We call this type of GC “logical garbage collection” because it operates on the logical representation of the file system, i.e., the way the file system appears to a client.
Logical GC worked well for quite some time, but recent changes to workloads caused problems. Some systems used a form of backups that created many files that all referenced the same underlying data, driving up the system’s deduplication ratio. The total compression, which is the cumulative effect of deduplication and intra-file compression, might be 100-1000X on such systems, compared to 10-20X on typical systems in the past. Revisiting the same data hundreds of times, with the random I/O that entailed, slowed the mark phase of GC. Another new workload, having many (e.g., hundreds of millions) small files rather than a small number of very large files, similarly ran slowly when processing a file at a time.
Data Domain engineers reimplemented GC to do the mark phase using the physical layout of the storage containers, rather than the files. Every L1-L6 chunk gets processed exactly once, starting from the higher levels of the Merkle tree (L6) to flag the live chunks in the next level below. This physical GC avoids the random I/O and repeated traversals of the earlier logical GC procedure. Instead of scanning the file trees and jumping around the containers, the physical GC scans the containers sequentially. (Note: It may scan the same container multiple times as it moves from L6 to L1 blocks because each time through it only looks for blocks of one level. However, there are not that many L1-L6 containers compared to the actual L0 data containers: the metadata is only about 2-10% at most, with less metadata for traditional backups and more for the new high-deduplication usage patterns).
Physical GC requires a new data structure, a “perfect hash,” which is similar to a Bloom filter (representing the presence of a value in just a few bits) but requires about half the memory and has no false positives. In exchange for these two great advantages, the perfect hash requires extra overhead to preprocess all the chunk fingerprints: it creates a one-to-one mapping of fingerprint values to bits in the array, with the additional space needed to identify which bit matches a value. Analyzing the fingerprints at the start of the mark phase is somewhat time-consuming; however, using the perfect hash ensures both that no chunks are missed and that no false positives result in large amounts of data being retained needlessly.
We learned that physical GC dramatically improved performance for the new workloads. However, it was slightly slower for the traditional workloads. Because of other changes made in parallel with the move to physical GC, it was hard to determine how much of this slower performance was due to the perfect hash overhead, and how much might be due to the other changes.
We needed to make GC faster overall. One of the causes of the slow mark phase was the need to make two passes through the file system much of the time. This was necessary because there was insufficient memory to track all chunks at once. Instead, GC would do much of the work of traversing the file system, but only sampling to get a sense of which containers should be focused on for cleaning. Then GC would identify which chunks are stored in those containers, and traverse the file system a second time while focusing only on those chunks and containers.
Phase-optimized Physical GC (PGC+) reduces the memory requirements by using a perfect hash in place of one Bloom filter and eliminating the need for another Bloom filter completely. This allows PGC+ to run in a single GC phase rather than with two passes. Further optimizations also improved performance dramatically. Now GC is at least as fast as the original logical GC for all workloads and is about twice as fast for those that required two passes of LGC or PGC. Like PGC, PGC+ is orders of magnitude better than LGC for the new problematic workloads.
Data Domain continues to evolve, as do the applications using it. Aspects of the system, such as garbage collection, have to evolve with it. Logical GC was initially a very intuitive way to identify which chunks were referenced and which ones could be reclaimed. Doing it by looking at the individual storage containers is, by comparison, very elaborate. Physical GC may seem like a complex redesign of what was a fairly intuitive algorithm, but in practice it’s a carefully designed optimization to cope with the random-access penalty of spinning disks while ensuring the stringent guarantees of the Data Domain Data Invulnerability Architecture.
Because after all, slow garbage collection … just isn’t logical!
~Fred Douglis @freddouglis
 Fred Douglis, Abhinav Duggal, Philip Shilane, Tony Wong, Shiqin Yan, and Fabiano Botelho. “The Logic of Physical Garbage Collection in Deduplicating Storage.” In 15th USENIX Conference on File and Storage Technologies (FAST 17), pp. 29-44. USENIX Association, 2017.