Cleaning Up Is Hard To Do

Cleaning Up Is Hard To Do

We recently published a paper[1] at the 15th USENIX Conference on File and Storage Technologies, describing how Dell EMC Data Domain’s method to reclaim free space has changed in the face of new workloads.

Readers of the Data Cortex are likely familiar with Data Domain and the way the Data Domain File System (DDFS) deduplicates redundant data. The original technical paper about DDFS gave a lot of information about deduplication, but it said little about how dead data gets reclaimed during Garbage Collection (GC). Nearly a decade later, we’ve filled in that gap while also describing how and why that process has changed in recent years.

In DDFS, there are two types of data that should be cleaned up by GC: unreferenced chunks (called segments in the DDFS paper and much other Data Domain literature, but chunks elsewhere), belonging to deleted files; and duplicate chunks, which have been written to storage multiple times when a single copy is sufficient. (The reason for duplicates being written is performance: it is generally faster to write a duplicate than to look up an arbitrary entry in the on-disk index to decide it’s already there, so the system limits how often it does index lookups.)

Both unreferenced and duplicate chunks can be identified via a mark-and-sweep garbage collection process. First, DDFS marks all the chunks that are referenced by any file, noting any chunks that appear multiple times. Then DDFS sweeps the unique, referenced chunks into new locations and frees up the original storage. Since chunks are grouped into larger units called storage containers, largely dead containers can be cleaned up with low overhead (i.e. copying the still-live data to new containers), while containers with lots of live data are left unchanged (i.e. the sweep process does not happen). This process is much like the early log-structured file system work, except that liveness is complicated by deduplication.

Originally, and for many years, DDFS performed the mark phase by going through every file in the file system and marking all the chunks reached by that file. This included both data chunks (which DDFS calls L0 chunks) and metadata (chunks containing fingerprints of other data or metadata chunks in the file, which DDFS calls L1-L6 chunks). Collectively this representation is known as a Merkle tree. We call this type of GC “logical garbage collection” because it operates on the logical representation of the file system, i.e., the way the file system appears to a client.

Logical GC worked well for quite some time, but recent changes to workloads caused problems. Some systems used a form of backups that created many files that all referenced the same underlying data, driving up the system’s deduplication ratio. The total compression, which is the cumulative effect of deduplication and intra-file compression, might be 100-1000X on such systems, compared to 10-20X on typical systems in the past. Revisiting the same data hundreds of times, with the random I/O that entailed, slowed the mark phase of GC.   Another new workload, having many (e.g., hundreds of millions) small files rather than a small number of very large files, similarly ran slowly when processing a file at a time.

Data Domain engineers reimplemented GC to do the mark phase using the physical layout of the storage containers, rather than the files. Every L1-L6 chunk gets processed exactly once, starting from the higher levels of the Merkle tree (L6) to flag the live chunks in the next level below. This physical GC avoids the random I/O and repeated traversals of the earlier logical GC procedure. Instead of scanning the file trees and jumping around the containers, the physical GC scans the containers sequentially. (Note: It may scan the same container multiple times as it moves from L6 to L1 blocks because each time through it only looks for blocks of one level. However, there are not that many L1-L6 containers compared to the actual L0 data containers: the metadata is only about 2-10% at most, with less metadata for traditional backups and more for the new high-deduplication usage patterns).

Physical GC requires a new data structure, a “perfect hash,” which is similar to a Bloom filter (representing the presence of a value in just a few bits) but requires about half the memory and has no false positives. In exchange for these two great advantages, the perfect hash requires extra overhead to preprocess all the chunk fingerprints: it creates a one-to-one mapping of fingerprint values to bits in the array, with the additional space needed to identify which bit matches a value. Analyzing the fingerprints at the start of the mark phase is somewhat time-consuming; however, using the perfect hash ensures both that no chunks are missed and that no false positives result in large amounts of data being retained needlessly.

We learned that physical GC dramatically improved performance for the new workloads. However, it was slightly slower for the traditional workloads. Because of other changes made in parallel with the move to physical GC, it was hard to determine how much of this slower performance was due to the perfect hash overhead, and how much might be due to the other changes.

We needed to make GC faster overall. One of the causes of the slow mark phase was the need to make two passes through the file system much of the time. This was necessary because there was insufficient memory to track all chunks at once. Instead, GC would do much of the work of traversing the file system, but only sampling to get a sense of which containers should be focused on for cleaning. Then GC would identify which chunks are stored in those containers, and traverse the file system a second time while focusing only on those chunks and containers.

Phase-optimized Physical GC (PGC+) reduces the memory requirements by using a perfect hash in place of one Bloom filter and eliminating the need for another Bloom filter completely. This allows PGC+ to run in a single GC phase rather than with two passes. Further optimizations also improved performance dramatically. Now GC is at least as fast as the original logical GC for all workloads and is about twice as fast for those that required two passes of LGC or PGC. Like PGC, PGC+ is orders of magnitude better than LGC for the new problematic workloads.

Data Domain continues to evolve, as do the applications using it. Aspects of the system, such as garbage collection, have to evolve with it. Logical GC was initially a very intuitive way to identify which chunks were referenced and which ones could be reclaimed. Doing it by looking at the individual storage containers is, by comparison, very elaborate. Physical GC may seem like a complex redesign of what was a fairly intuitive algorithm, but in practice it’s a carefully designed optimization to cope with the random-access penalty of spinning disks while ensuring the stringent guarantees of the Data Domain Data Invulnerability Architecture.

Because after all, slow garbage collection … just isn’t logical!

~Fred Douglis @freddouglis

[1] Fred Douglis, Abhinav Duggal, Philip Shilane, Tony Wong, Shiqin Yan, and Fabiano Botelho. “The Logic of Physical Garbage Collection in Deduplicating Storage.” In 15th USENIX Conference on File and Storage Technologies (FAST 17), pp. 29-44. USENIX Association, 2017.

Professional Organizations for Computing – More than the Elks’ Lodge

Professional Organizations for Computing – More than the Elks’ Lodge

There are many professional organizations, serving all sorts of purposes. For instance, the American Bar Association and American Medical Association help to represent lawyers and doctors, respectively, when setting standards, policies, and laws.

Within the field of computing, there are a number of professional organizations of note. Some are specific to certain roles, such as the League of Professional System Administrators. Here I will focus on three that serve software engineers, Computer Science (CS) researchers, CS academics, and those of similar professional interests. Mostly I’m doing this to try and impress upon readers the benefits of membership and participation in these organizations.

I first joined the Association for Computing Machinery (ACM) and the Computer Society of the Institute of Electrical and Electronics Engineers (IEEE) when I was a Ph.D. student. By becoming a student member, I subscribed to their monthly magazines, which contained numerous articles of interest. Shortly after finishing my degree I added the USENIX Association to the list.

Initially, the primary motivation for joining (or continuing membership in) these organizations was the significant discounts offered to members when attending conferences sponsored by one of them. Often the savings would more than compensate for the membership fee. In addition, there were personal benefits, such as the IEEE’s group life insurance plan.

The three professional organizations all run conferences, but beyond that they quickly diverge in their services.


I’ll start with the simplest first. USENIX basically exists to run computer-related conferences. They also have a quarterly newsletter, and many years ago published a Computing Systems journal, but the conferences are the reason USENIX exists … and they do a great job of it. The top systems conferences include such events as OSDI, NSDI, FAST, the USENIX Annual Technical Conference, and the USENIX Security conference.   I’ve chaired a couple of conferences for them many years ago, and USENIX makes it incredibly easy for the conference organizers. Instead of depending on the chair to manage volunteers to handle logistics, the chair simply is responsible for selecting content. In addition, USENIX has enacted a policy of making all conference publications freely available over the Internet.


ACM conducts a broader set of activities than Usenix. ACM runs a number of conferences, many of which are among the most prestigious conferences within their domain, but it does much more. ACM is organized into “Special Interest Groups” such as the SIG on Operating Systems (SIGOPS) or the SIG on Data Communications (SIGCOMM). The SIGs run conferences, such as the Symposium on Operating Systems Principals, known as SOSP (SIGOPS) or the SIGCOMM annual conference. Each SIG typically publishes a regular newsletter with a combination of news and technical content (with little or no peer review).   ACM also publishes a number of journals, which provide archival-quality content, often extended versions of conference papers. For example, Transactions on Storage publishes a number of articles that extend papers from FAST, including the papers selected as “best papers” for the conference. Finally, ACM has a number of awards, such as membership levels (fellows, distinguished members, and senior members) and for exceptional achievements (such as the Mark Weiser award).

IEEE Computer Society

IEEE-CS (“CS”) is the largest society within IEEE, though there are other computer-related societies and councils, such as IEEE Communications Society. I’ll focus on CS.

Like ACM, CS runs conferences and publishes journals and magazines. Many of their magazines are much closer to the journals in style and quality than to the newsletters run by ACM SIGs or their CS counterparts, technical committees (TCs). Compared to journals, the magazines tend to have shorter articles, as well as columns and other technical content of general interest. Each issue tends to have a “theme” focusing articles on a particular topic. I was editor-in-chief of Internet Computing for four years, so I led the decisions about what themes for which to request submissions, and I would assign other submissions to associate editors to gather peer reviews and make recommendations. I highly recommend CS magazines for those interested in high-level material in general (Computer, which comes with CS membership, or specific areas such as Cloud Computing or Security & Privacy.

IEEE-CS also sponsors many conferences across a variety of subdisciplines. I mention these after the periodicals because I feel like CS stands out more because of its magazines than its journals or conferences, which are roughly analogous to those from ACM. Additionally, many conferences are sponsored jointly by two or more societies, blurring that boundary further. Conferences are sponsored by Technical Committees, which are similar to ACM SIGs.

Finally, it is worth pointing out that both IEEE and ACM make a number of contributions in other important areas, such as education and standards. The societies cooperate on things like curricula guidelines; in addition, CS produces bodies of knowledge, which are handbooks on specific topics such as software engineering. IEEE has entire Standards Association, which produces such things as the 802.11 WiFi standard. The societies have local chapters as well, which sponsor invited talks, local conferences, and other ways to reach out to the immediate community.

My Own Role

I started as a volunteer with CS by serving as general chair of the Workshop on Workstation Operating Systems, which we later renamed the Workshop on Hot Topics on Operating Systems. I chaired the Technical Committee on Operating Systems, then created and formed the Technical Committee on the Internet. At that point I was asked to join the Internet Computing editorial board as liaison to the TC, but when my term expired I was kept on the board anyway and became associate editor in chief, then EIC. In 2015 I was elected to a three-year term on the CS Board of Governors. From there, I help set CS policies and decide on the next generation of volunteers such as periodical editors.

In parallel, I’ve also been active with USENIX. In addition to serving on many technical program committees, I was the program chair for the USENIX Annual Technical Conference in 1998 and USENIX Symposium on Internet Technologies and Systems (later NSDI) in 1999. I’ve served on the steering committee for the Workshop on Hot Topics in Cloud Computing since 2015.

What’s In It for You?

By now I hope I’ve given you an idea what the three societies do for their members and the community at large. Even if you don’t tend to participate in the major technical conferences, there are local opportunities to network with colleagues and learn about new technologies. The magazines offered by IEEE-CS, as well as Communications of the ACM, are extremely informative. And don’t forget about those great insurance discounts!


~Fred Douglis @freddouglis

12th USENIX Symposium on Operating Systems Design and Implementation

12th USENIX Symposium on Operating Systems Design and Implementation

OSDI’16 was held in early November in Savannah, GA. It’s a very competitive conference, accepting 18% of what is already by and large a set of very strong papers. They shortened the talks and lengthened the conference to fit in 47 papers, which is well over twice the size of the conference when it started with 21 papers in 1994. (Fun fact: I had a paper in the first conference, but by the time we submitted the paper, not a single author was still affiliated with the company where the work was performed.) This year there were over 500 attendees, which is a pretty good number for a systems conference, and as usual it was like “old home week” running into past colleagues, students, and faculty.


There are too many papers at the conference to say much about many of them, but I will highlight a few, as well as some of the other awards.


Best Papers

There were three best paper selections. The first two are pretty theoretical as OSDI papers go, though verification and trust are certainly recurring themes.


Push-Button Verification of File Systems via Crash Refinement

Helgi Sigurbjarnarson, James Bornholt, Emina Torlak, and Xi Wang, University of Washington

This work uses a theorem prover to try and verify file systems. The “push-button verification” refers to letting the system automatically reason about correctness without manual intervention. The idea of “crash refinement” is to add states that are allowed in the specification.


Ryoan: A Distributed Sandbox for Untrusted Computation on Secret Data

Tyler Hunt, Zhiting Zhu, Yuanzhong Xu, Simon Peter, and Emmett Witchel, The University of Texas at Austin

Ryoan leverages the Intel secure processing enclave to try and build a system that enables private data to be computed upon in the cloud without leaking it to other applications.


Early Detection of Configuration Errors to Reduce Failure Damage

Tianyin Xu, Xinxin Jin, Peng Huang, and Yuanyuan Zhou, University of California, San Diego; Shan Lu, University of Chicago; Long Jin, University of California, San Diego; Shankar Pasupathy, NetApp, Inc.

PCHECK is a tool that stresses systems to try to uncover “latent” errors that otherwise would not manifest themselves for a long period of time. In particular, configuration errors are often not caught because they don’t involve the common execution path. PCHECK can analyze the code to add checkers to run at initialization time, and it has been found empirically to identify a high fraction of latent configuration errors.


Some of the Others

Here are a few other papers I thought either might be of particular interest to readers of this blog, or which I found particularly cool.


TensorFlow: A System for Large-Scale Machine Learning

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, Google Brain

TensorFlow is a tool Google uses for machine learning, using dataflow graphs. Google has open-sourced the tool ( so it’s gaining traction in the research community. The talk was primarily about the model and performance. Since I know nothing about machine learning, I include this here only because it had a lot of hype at the conference, and not because I have much to say about it. (Read the paper.)


Shuffler: Fast and Deployable Continuous Code Re-Randomization

David Williams-King and Graham Gobieski, Columbia University; Kent Williams-King, University of British Columbia; James P. Blake and Xinhao Yuan, Columbia University; Patrick Colp, University of British Columbia; Michelle Zheng, Columbia University; Vasileios P. Kemerlis, Brown University; Junfeng Yang, Columbia University; William Aiello, University of British Columbia

This is another security-focused paper, but it was focused on a very specific attack vector. (And I have to give the presenter credit for making it understandable even to someone with no background in this sort of issue.) The idea behind return-oriented programming is that an attacker finds snippets of code to string together to turn into a bad set of instructions. The idea here is to move the code around faster than the attacker can do this. It uses a function pointer table to indirect so one can find functions via an index, but the index isn’t disclosable in user space.

Interestingly, the shuffler runs in the same address space, so has to shuffle its own code to protect it. In all, a neat idea, and an excellent talk.


EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure CodingV.

Rashmi, University of California, Berkeley; Mosharaf Chowdhury and Jack Kosaian, University of Michigan; Ion Stoica and Kannan Ramchandran, University of California, Berkeley

I’ll start by pointing out this is the one talk that was presented via recording (the primary author couldn’t travel). The technology for the presentation was excellent: the image of the speaker appeared in a corner of the video, integrated into the field of vision much better than what I’ve seen in things like Webex. However, rather than that person then taking questions by audio, there was a coauthor in person to handle questions.

EC-cache gains the benefits of both increased reliability and improved performance via erasure coding (EC) rather than full replicas. It gets better read performance by reading K+delta units when it needs only K to reconstruct an object, then it uses the first K that arrive. (Eric Brewer spoke of a similar process in Google at his FAST’17 keynote.)   Even with delta just equal to 1, this improves tail latency considerably.

One of the other benefits of EC over replication is that replication creates integral multiples of data, while EC allows fractional overhead. Note, though, that this is for read-mostly data – the overhead of EC for read-write data would be another story.


To Waffinity and Beyond: A Scalable Architecture for Incremental Parallelization of File System Code

Matthew Curtis-Maury, Vinay Devadas, Vania Fang, and Aditya Kulkarni, NetApp, Inc.

This work was done by the FS performance team at NetApp and was IMHO the most applied paper as well as the one nearest and dearest to Dell EMC. Because NetApp is a competitor, I hesitate to go into too many details for fear of mischaracterizing something.   The gist of the paper was that NetApp needed to take better advantage of multiprocessing in a system that wasn’t initially geared for that. Over time, the system evolved to break files into smaller stripes that could be operated on independently, then additional data structures were partitioned for increased parallelism, then finally finer-grained locking was added to work in conjunction with the partitioning.


Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services

Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and Yee Jiun Song, Facebook Inc.

This was one of my favorite talks. Facebook updates their system multiple times per day. They need to safely determine the peak capacity across different granularities (web server, cluster, or region) and back off when experiencing degradation. The use this to identify things like inefficient load balancing. After identifying hundreds of bottlenecks, they could serve 20% more customers with the same infrastructure.



It is worth a quick shout-out to the various people recognized with other awards at the conference. Ant Rowstron at Microsoft Cambridge won the Weiser award for best young researcher. Vijay Chidambaram, a past student of Andrea and Remzi Arpaci-Dusseau at the University of Wisconsin–Madison,  won the Richie thesis award for “Orderless and Eventually Durable File Systems”. Charles M. Curtsinger won Honorable Mention. Finally, BigTable won the “test of time” award 10 years after it was published.


~Fred Douglis @FredDouglis