OSDI’16 was held in early November in Savannah, GA. It’s a very competitive conference, accepting 18% of what is already by and large a set of very strong papers. They shortened the talks and lengthened the conference to fit in 47 papers, which is well over twice the size of the conference when it started with 21 papers in 1994. (Fun fact: I had a paper in the first conference, but by the time we submitted the paper, not a single author was still affiliated with the company where the work was performed.) This year there were over 500 attendees, which is a pretty good number for a systems conference, and as usual it was like “old home week” running into past colleagues, students, and faculty.
There are too many papers at the conference to say much about many of them, but I will highlight a few, as well as some of the other awards.
There were three best paper selections. The first two are pretty theoretical as OSDI papers go, though verification and trust are certainly recurring themes.
Push-Button Verification of File Systems via Crash Refinement
Helgi Sigurbjarnarson, James Bornholt, Emina Torlak, and Xi Wang, University of Washington
This work uses a theorem prover to try to verify file systems. The “push-button verification” refers to letting the system automatically reason about correctness without manual intervention. The idea of “crash refinement” is to add states that are allowed in the specification.
Ryoan: A Distributed Sandbox for Untrusted Computation on Secret Data
Tyler Hunt, Zhiting Zhu, Yuanzhong Xu, Simon Peter, and Emmett Witchel, The University of Texas at Austin
Ryoan leverages the Intel secure processing enclave to build a system that enables private data to be computed upon in the cloud without leaking it to other applications.
Early Detection of Configuration Errors to Reduce Failure Damage
Tianyin Xu, Xinxin Jin, Peng Huang, and Yuanyuan Zhou, University of California, San Diego; Shan Lu, University of Chicago; Long Jin, University of California, San Diego; Shankar Pasupathy, NetApp, Inc.
PCHECK is a tool that stresses systems to try to uncover “latent” errors that otherwise would not manifest themselves for a long period of time. In particular, configuration errors are often not caught because they don’t involve the common execution path. PCHECK can analyze the code to add checkers to run at initialization time, and it has been found empirically to identify a high fraction of latent configuration errors.
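To make the idea of an initialization-time check concrete, here is a minimal hand-written sketch. PCHECK generates its checkers automatically by analyzing the code; this example only illustrates the effect, and the parameter names `backup_dir` and `failover_timeout_s` are hypothetical, not taken from the paper.

```python
import os


def check_config_at_startup(config):
    """Exercise rarely used config values early; return a list of problems found."""
    problems = []

    # Hypothetical parameter: a directory that is only touched when a backup runs.
    backup_dir = config.get("backup_dir")
    if backup_dir is None or not os.path.isdir(backup_dir):
        problems.append(f"backup_dir is not an existing directory: {backup_dir!r}")

    # Hypothetical parameter: a timeout that is only read on the failover path.
    try:
        timeout = int(config.get("failover_timeout_s", ""))
        if timeout <= 0:
            problems.append("failover_timeout_s must be positive")
    except (TypeError, ValueError):
        problems.append("failover_timeout_s is not an integer")

    return problems
```

Running such a check when the service starts surfaces a bad value right away, rather than hours or days later when the backup or failover path finally executes.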
Some of the Others
Here are a few other papers I thought either might be of particular interest to readers of this blog, or which I found particularly cool.
TensorFlow: A System for Large-Scale Machine Learning
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, Google Brain
TensorFlow is a tool Google uses for machine learning, using dataflow graphs. Google has open-sourced the tool (www.tensorflow.org) so it’s gaining traction in the research community. The talk was primarily about the model and performance. Since I know nothing about machine learning, I include this here only because it had a lot of hype at the conference, and not because I have much to say about it. (Read the paper.)
Shuffler: Fast and Deployable Continuous Code Re-Randomization
David Williams-King and Graham Gobieski, Columbia University; Kent Williams-King, University of British Columbia; James P. Blake and Xinhao Yuan, Columbia University; Patrick Colp, University of British Columbia; Michelle Zheng, Columbia University; Vasileios P. Kemerlis, Brown University; Junfeng Yang, Columbia University; William Aiello, University of British Columbia
This is another security-focused paper, but it was focused on a very specific attack vector. (And I have to give the presenter credit for making it understandable even to someone with no background in this sort of issue.) The idea behind return-oriented programming is that an attacker finds snippets of code to string together to turn into a bad set of instructions. The idea here is to move the code around faster than the attacker can do this. It uses a function pointer table to indirect so one can find functions via an index, but the index isn’t disclosable in user space.
Interestingly, the shuffler runs in the same address space, so has to shuffle its own code to protect it. In all, a neat idea, and an excellent talk.
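The indirection idea can be sketched with a toy model. This is not how Shuffler is implemented (it works on real machine code); the class `ToyShuffler` and its methods are hypothetical names, and "addresses" here are just random integers. Callers hold only a table index, so a leaked address stops being useful after the next shuffle.

```python
import random


class ToyShuffler:
    """Toy model: functions are reached via a table index, never a fixed address."""

    def __init__(self, functions, seed=None):
        self._rng = random.Random(seed)
        self._functions = list(functions)
        self._table = []    # index -> current "address"
        self._memory = {}   # "address" -> function
        self.shuffle()

    def shuffle(self):
        """Move every function to a fresh address that was not in use before."""
        old = set(self._memory)
        addresses = []
        while len(addresses) < len(self._functions):
            candidate = self._rng.randrange(1 << 32)
            if candidate not in old and candidate not in addresses:
                addresses.append(candidate)
        self._memory = dict(zip(addresses, self._functions))
        self._table = addresses

    def call(self, index, *args):
        """Legitimate call path: look up the current address through the table."""
        return self._memory[self._table[index]](*args)

    def leak_address(self, index):
        """Models an attacker learning where a function currently lives."""
        return self._table[index]

    def is_valid(self, address):
        """True if the address still points at code after the latest shuffle."""
        return address in self._memory
```

In this model, an address obtained through `leak_address` fails `is_valid` after one call to `shuffle`, while `call` by index keeps working. The real system's protection depends on shuffling faster than an attacker can use a leak.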
EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding
K. V. Rashmi, University of California, Berkeley; Mosharaf Chowdhury and Jack Kosaian, University of Michigan; Ion Stoica and Kannan Ramchandran, University of California, Berkeley
I’ll start by pointing out this is the one talk that was presented via recording (the primary author couldn’t travel). The technology for the presentation was excellent: the image of the speaker appeared in a corner of the video, integrated into the field of vision much better than what I’ve seen in things like Webex. However, rather than that person then taking questions by audio, there was a coauthor in person to handle questions.
EC-Cache gains the benefits of both increased reliability and improved performance by using erasure coding (EC) rather than full replicas. It gets better read performance by requesting K+delta units when it needs only K to reconstruct an object, then using the first K that arrive. (Eric Brewer spoke of a similar process in Google at his FAST’17 keynote.) Even with delta just equal to 1, this improves tail latency considerably.
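A small sketch shows why the extra read helps. The function name `read_latency` and the latency numbers are made up for illustration; the point is only that the read finishes when the K-th fastest response arrives, so one straggler no longer determines the total.

```python
def read_latency(chunk_latencies, k):
    """Time at which the first k of the requested chunks have arrived."""
    if len(chunk_latencies) < k:
        raise ValueError("need at least k chunk reads to reconstruct the object")
    return sorted(chunk_latencies)[k - 1]


# K = 4, no extra reads: the one slow server (95 ms) sets the latency.
print(read_latency([10, 12, 11, 95], k=4))      # prints 95

# K = 4, delta = 1: the fifth read lets us ignore the straggler.
print(read_latency([10, 12, 11, 95, 13], k=4))  # prints 13
```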
One of the other benefits of EC over replication is that replication creates integral multiples of data, while EC allows fractional overhead. Note, though, that this is for read-mostly data – the overhead of EC for read-write data would be another story.
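The storage arithmetic is easy to check. The sketch below assumes a code that splits an object into k data units and adds r parity units; the parameters 10 and 4 are illustrative and not taken from the paper.

```python
from fractions import Fraction


def replication_overhead(copies):
    """Storage used per byte of data when keeping whole copies."""
    return Fraction(copies)


def erasure_overhead(k, r):
    """Storage used per byte of data with k data units plus r parity units."""
    return Fraction(k + r, k)


print(replication_overhead(3))        # 3   (always a whole multiple)
print(float(erasure_overhead(10, 4)))  # 1.4 (a fractional multiple)
```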
To Waffinity and Beyond: A Scalable Architecture for Incremental Parallelization of File System Code
Matthew Curtis-Maury, Vinay Devadas, Vania Fang, and Aditya Kulkarni, NetApp, Inc.
This work was done by the FS performance team at NetApp and was IMHO the most applied paper as well as the one nearest and dearest to Dell EMC. Because NetApp is a competitor, I hesitate to go into too many details for fear of mischaracterizing something. The gist of the paper was that NetApp needed to take better advantage of multiprocessing in a system that wasn’t initially geared for that. The system evolved in stages: first files were broken into smaller stripes that could be operated on independently, then additional data structures were partitioned for increased parallelism, and finally finer-grained locking was added to work in conjunction with the partitioning.
Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services
Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and Yee Jiun Song, Facebook Inc.
This was one of my favorite talks. Facebook updates their system multiple times per day. They need to safely determine the peak capacity across different granularities (web server, cluster, or region) and back off when experiencing degradation. They use this to identify things like inefficient load balancing. After identifying hundreds of bottlenecks, they could serve 20% more customers with the same infrastructure.
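The "ramp up, then back off" loop can be sketched as follows. This is a toy under stated assumptions, not Kraken's actual control logic: the function `find_peak_load`, the single latency metric, and the fixed step size are all hypothetical, and the real system shifts live user traffic and watches its own health metrics.

```python
def find_peak_load(measure_latency_ms, start, step, latency_limit_ms, max_load):
    """Increase load in steps; return the highest load that stayed healthy."""
    peak = None
    load = start
    while load <= max_load:
        if measure_latency_ms(load) > latency_limit_ms:
            break  # degradation seen: back off to the last healthy load
        peak = load
        load += step
    return peak


def toy_latency_ms(load):
    """Made-up service model: flat at 50 ms until 700 requests/s, then climbing."""
    return 50 if load <= 700 else 50 + (load - 700)


print(find_peak_load(toy_latency_ms, start=100, step=100,
                     latency_limit_ms=100, max_load=2000))  # prints 700
```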
It is worth a quick shout-out to the various people recognized with other awards at the conference. Ant Rowstron at Microsoft Cambridge won the Weiser award for best young researcher. Vijay Chidambaram, a past student of Andrea and Remzi Arpaci-Dusseau at the University of Wisconsin–Madison, won the Ritchie thesis award for “Orderless and Eventually Durable File Systems”. Charles M. Curtsinger won Honorable Mention. Finally, BigTable won the “test of time” award 10 years after it was published.
~Fred Douglis @FredDouglis