Honoring Dell EMC’s Data Protection and Storage Technical Directors

Everything changes. Organizational structures, company names, and, of course, technology. For technical companies to survive, however, one thing cannot change. We need technical leaders who can turn changing technology into new products that solve our customers’ problems. Dell EMC has replaced the Core Technologies Division with the Data Protection Division and the Storage Division, but we are still building the core of our customers’ data infrastructure.

Therefore, every quarter, we recognize the newest Dell EMC Data Protection and Storage Technical Directors. These are senior technical leaders who have delivered sustained business impact through products, solutions, and customer experience. They are the engine of the company. The previous recipients are detailed here and here.

John Adams – John helps deliver VMAX performance that matters – performance for our customers’ most important applications. He has demonstrated and optimized performance in the most demanding customer environments, and he drives customer-critical performance work into the engineering team – from evaluating large flash drives to host LUN scaling to dynamic cache partitioning. His expertise spans from Unisys to EPIC (a leading health care database/application). John is the go-to person who connects customers, customer support, and engineering for their performance needs.

Michael Barber – Michael Barber is the rare quality engineer who truly is the customers’ advocate. First, Michael understands that customers buy VMAX replication to ensure that their mission-critical data is always safe. Since customer environments are constantly under load, facing all manner of unusual circumstances (especially in a disaster), Michael has built a tool that validates data consistency while generating all of those unusual situations. The tool is used across VMAX and much of the rest of the company. Michael also reviews and influences core features to ensure they meet customers’ standards and needs. VMAX customers have Michael Barber on their side.

Fred Douglis – Fred has led Dell EMC’s Data Protection Division’s academic strategy, while also driving research into the product. Under Fred’s guidance, Dell EMC has consistently published in prestigious conferences and journals. This work has helped advance the state of the art in deduplication research and development. He has also built strong relationships with leading universities like the University of Chicago, Northeastern, and the University of Wisconsin–Madison. His contributions to the industry have also been recognized: Fred is an IEEE Fellow and is currently serving on the Board of Governors for the IEEE Computer Society. Finally, the innovation of Data Domain BOOST-FS has enabled customers to more easily and rapidly protect custom and big data apps.

Martin Feeney – Martin helps the largest customers in the world run their most mission-critical applications. As an expert in both FICON and VMAX, Martin has helped our mainframe customers get reliable access and predictable performance from their storage. He was instrumental in unifying the data format and storage algorithms for VMAX Mainframe and Open Systems support, which enables our customers to get better performance, functionality, and reliability more quickly. Martin optimized VMAX2 performance while also delivering the Mainframe Host Adapter optimizations for the VMAX3 platform. As customers continue to run their most important workloads on Mainframe, Martin keeps those applications running optimally.

Simon Gordon – Simon has been the Product Management lead for ProtectPoint for VMAX and XtremIO. Our most innovative customers deploy ProtectPoint to protect, refresh, and test some of their largest and most mission critical databases – like Oracle, DB2, Microsoft SQL, and EPIC. Simon has been instrumental in connecting customers, the field, application partners, and our engineering teams so that we can deliver a comprehensive protection solution built on top of revolutionary technology.

Colin Johnson – Colin, an expert in user experience design, has been the UX leader for Data Domain Management Console, Data Domain Long Term Retention in the Cloud, and ProtectPoint for XtremIO. Colin’s expertise in user experience, visual design, customer interaction, and data protection has enabled the Data Protection Division to deliver products that are easier for our customers to use across cloud, replication, multi-tenancy, and next-generation data protection.

Jerry Jourdain – Jerry has been the driving technical force behind Dell EMC’s email archiving solutions. Jerry co-founded Dell EMC’s initial industry-leading email archiving product, EmailXtender, and then was Chief Architect of the follow-on SourceOne product. Thousands of customers depend on Dell EMC to protect their most critical information for compliance, legal, or business needs. Jerry ensures that we can address their most challenging compliance and retention needs.

Amit Khanna – Amit has been modernizing data protection for NetWorker customers. He was the force behind NetWorker’s vProxy support – standalone, re-purposable, fully RESTful protection for VMware. Amit began by integrating Data Domain BOOST into NetWorker and tying together NetWorker and Data Domain replication. He then delivered the policy-based management for NetWorker 9, which allows customers to move toward Backup as a Service. His work on CloudBoost allows customers to back up both to the cloud and in the cloud. Amit’s work has made NetWorker a core part of modern data protection.

Ilya Liubovich – Over the past couple of years, VMAX customers have raved about how much easier it is to manage their systems. Ilya led one of the biggest optimizations, Unisphere/360 for VMAX. It is already attached to the majority of new VMAX systems, simplifying the management of their most critical storage. Furthermore, as security becomes an even more important issue in the world, Ilya has led the security standards for the management software – ensuring compliance to the highest standards, without intruding on the customer experience. With Ilya’s work, VMAX delivers high-end storage functionality with greater simplicity.

Prasanna Malaiyandi – Prasanna, a Data Protection Solution Architect, has led both ProtectPoint and ECDM from inception to delivery. ProtectPoint directly backs up VMAX and XtremIO systems to Data Domain, delivering up to 20x faster data protection than any other method. ECDM enables IT organizations to deliver Data Protection as a Service. Protection teams centrally control data protection, while allowing application, VM, and storage administrators to back up and recover data on their own, using high performance technologies like ProtectPoint and DD BOOST. Prasanna connected disparate products to bring Dell EMC products together somewhere other than the purchase order.

Jeremy O’Hare – Jeremy has delivered core VMAX functionality that separates it from every other product in the marketplace. Most recently, Jeremy led the creation of VMAX compression that delivers space savings with unparalleled performance in the industry. He was also instrumental in Virtual LUNs (VLUN), which enabled the groundbreaking FAST functionality. As a technical leader, Jeremy stands out for being able to bring solutions across teams. Compression touches virtually every part of the VMAX, and Jeremy drove development and QA efforts across all of the groups so that our customers enjoy compression without compromise on their VMAX systems.

Kedar Patwardhan – Kedar enables Avamar customers to solve their biggest, most challenging backup problems. First, Kedar created the only traditional file-system backup that doesn’t need to scan the file system. Customers with large file servers can scale their backups without compromising on functionality. Second, Kedar delivered OpenStack integration to protect some of our largest customers’ data. Third, the integration with vRA enables Dell EMC’s customers to manage their protection from VMware interfaces. From the largest file systems to OpenStack to large VMware deployments, Kedar’s work enables us to deliver simple, scalable data protection.

Rong Yu – Rong is responsible for key algorithmic and architectural improvements to Symmetrix systems. First, he delivered a Quality of Service (QoS) framework that meets customer-defined Service Level Objectives while also meeting the needs of internal operations like cloning and drive rebuild. He overhauled the prefetching model to leverage knowledge of host/application access patterns. He continues to help optimize RDF performance. Most recently, he developed the new middleware layer in the VMAX system that has enabled new features (like compression) and performance optimizations (such as optimizing cache read misses). Customers depend on VMAX for reliable, predictable high performance regardless of the situation. Rong’s work helps ensure that VMAX meets and exceeds those expectations.

Congratulations and thanks to the new and existing Dell EMC Technical Directors. You are the engine of Dell EMC!

~Stephen Manley @makitadremel

Cleaning Up Is Hard To Do

We recently published a paper[1] at the 15th USENIX Conference on File and Storage Technologies, describing how Dell EMC Data Domain’s method to reclaim free space has changed in the face of new workloads.

Readers of the Data Cortex are likely familiar with Data Domain and the way the Data Domain File System (DDFS) deduplicates redundant data. The original technical paper about DDFS gave a lot of information about deduplication, but it said little about how dead data gets reclaimed during Garbage Collection (GC). Nearly a decade later, we’ve filled in that gap while also describing how and why that process has changed in recent years.

In DDFS, there are two types of data that should be cleaned up by GC: unreferenced chunks (called segments in the DDFS paper and much of the Data Domain literature, but chunks elsewhere), belonging to deleted files; and duplicate chunks, which have been written to storage multiple times when a single copy is sufficient. (The reason duplicates get written is performance: it is generally faster to write a duplicate than to look up an arbitrary entry in the on-disk index to decide whether it’s already there, so the system limits how often it does index lookups.)

Both unreferenced and duplicate chunks can be identified via a mark-and-sweep garbage collection process. First, DDFS marks all the chunks that are referenced by any file, noting any chunks that appear multiple times. Then DDFS sweeps the unique, referenced chunks into new locations and frees up the original storage. Since chunks are grouped into larger units called storage containers, largely dead containers can be cleaned up with low overhead (i.e. copying the still-live data to new containers), while containers with lots of live data are left unchanged (i.e. the sweep process does not happen). This process is much like the early log-structured file system work, except that liveness is complicated by deduplication.
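
For intuition, here is a minimal sketch of the copy-forward idea, using a toy in-memory model in which a container is just a list of (fingerprint, data) chunks. The liveness threshold, container size, and function name are invented for the example; they are not DDFS internals.

```python
# Toy mark-and-sweep cleaning over storage containers (illustrative only).
# Given the set of live fingerprints from the mark phase, copy the unique
# live chunks out of mostly-dead containers and drop the originals.

def sweep(containers, live_fingerprints, liveness_threshold=0.2, container_size=1024):
    kept, copied, seen = [], [], set()
    for container in containers:
        live = [(fp, data) for fp, data in container if fp in live_fingerprints]
        if len(live) >= liveness_threshold * len(container):
            kept.append(container)            # mostly live: not worth cleaning
        else:
            for fp, data in live:             # mostly dead: copy forward only the
                if fp not in seen:            # first copy of each live chunk
                    seen.add(fp)
                    copied.append((fp, data))
    for i in range(0, len(copied), container_size):
        kept.append(copied[i:i + container_size])   # pack into fresh containers
    return kept
```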

Originally, and for many years, DDFS performed the mark phase by going through every file in the file system and marking all the chunks reached by that file. This included both data chunks (which DDFS calls L0 chunks) and metadata (chunks containing fingerprints of other data or metadata chunks in the file, which DDFS calls L1-L6 chunks). Collectively this representation is known as a Merkle tree. We call this type of GC “logical garbage collection” because it operates on the logical representation of the file system, i.e., the way the file system appears to a client.
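
A sketch of that logical mark phase under the same toy model (the chunk_store mapping stands in for the on-disk containers and is an assumption of the example, not the DDFS layout):

```python
# Toy logical mark phase: walk each file's Merkle tree from its top
# metadata chunk down to the L0 data chunks, marking everything reached.
# chunk_store maps fingerprint -> (level, child_fingerprints); each lookup
# models a container read that can land anywhere on disk.

def logical_mark(files, chunk_store):
    live = set()
    for top_fp in files:                      # one Merkle-tree walk per file
        stack = [top_fp]
        while stack:
            fp = stack.pop()
            if fp in live:
                continue                      # shared subtree, already marked
            live.add(fp)
            level, children = chunk_store[fp]
            if level > 0:                     # L1-L6 chunks list child fingerprints
                stack.extend(children)
    return live

# With heavy deduplication, many files share the same metadata chunks, so a
# per-file walk keeps returning to the same containers in a random-access
# pattern -- the behavior that becomes a problem for the workloads below.
```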

Logical GC worked well for quite some time, but recent changes to workloads caused problems. Some systems used a form of backups that created many files that all referenced the same underlying data, driving up the system’s deduplication ratio. The total compression, which is the cumulative effect of deduplication and intra-file compression, might be 100-1000X on such systems, compared to 10-20X on typical systems in the past. Revisiting the same data hundreds of times, with the random I/O that this entailed, slowed the mark phase of GC. Another new workload, having many (e.g., hundreds of millions) small files rather than a small number of very large files, similarly ran slowly when processing a file at a time.

Data Domain engineers reimplemented GC to do the mark phase using the physical layout of the storage containers, rather than the files. Every L1-L6 chunk gets processed exactly once, starting from the higher levels of the Merkle tree (L6) to flag the live chunks in the next level below. This physical GC avoids the random I/O and repeated traversals of the earlier logical GC procedure. Instead of scanning the file trees and jumping around the containers, the physical GC scans the containers sequentially. (Note: It may scan the same container multiple times as it moves from L6 to L1 blocks because each time through it only looks for blocks of one level. However, there are not that many L1-L6 containers compared to the actual L0 data containers: the metadata is only about 2-10% at most, with less metadata for traditional backups and more for the new high-deduplication usage patterns).
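
Using the same toy model, the physical mark phase flips the loop order: instead of one walk per file, it makes one sequential pass over the containers per Merkle level. Again, the data layout here is invented for illustration.

```python
# Toy physical mark phase: scan the containers sequentially, one Merkle
# level at a time (L6 down to L1).  Each container holds records of the
# form (fingerprint, level, child_fingerprints).

def physical_mark(containers, namespace_top_fps, max_level=6):
    live = set(namespace_top_fps)               # L6 chunks referenced by the namespace
    for level in range(max_level, 0, -1):        # L6, L5, ..., L1
        for container in containers:             # sequential scan per level
            for fp, chunk_level, children in container:
                if chunk_level == level and fp in live:
                    live.update(children)         # flag the next level down as live
    return live                                   # now includes the live L0 fingerprints
```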

Physical GC requires a new data structure, a “perfect hash,” which is similar to a Bloom filter (representing the presence of a value in just a few bits) but requires about half the memory and has no false positives. In exchange for these two great advantages, the perfect hash requires extra overhead to preprocess all the chunk fingerprints: it creates a one-to-one mapping of fingerprint values to bits in the array, with the additional space needed to identify which bit matches a value. Analyzing the fingerprints at the start of the mark phase is somewhat time-consuming; however, using the perfect hash ensures both that no chunks are missed and that no false positives result in large amounts of data being retained needlessly.
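
A real perfect hash packs that mapping far more compactly, but a toy stand-in shows the behavior that matters to GC: a preprocessing pass assigns each known fingerprint its own bit, after which liveness tests are exact, with no false positives. The class below is purely illustrative.

```python
# Toy stand-in for a perfect-hash bit vector.  The preprocessing pass builds
# a one-to-one fingerprint-to-bit mapping (here an ordinary dict, which a
# real perfect hash would encode far more compactly); membership is then a
# single bit test with no false positives.

class ToyPerfectHashVector:
    def __init__(self, all_fingerprints):
        self._slot = {fp: i for i, fp in enumerate(set(all_fingerprints))}
        self._bits = 0

    def mark(self, fp):
        self._bits |= 1 << self._slot[fp]        # fp must be one of the known fingerprints

    def is_marked(self, fp):
        slot = self._slot.get(fp)
        return slot is not None and bool((self._bits >> slot) & 1)
```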

We learned that physical GC dramatically improved performance for the new workloads. However, it was slightly slower for the traditional workloads. Because of other changes made in parallel with the move to physical GC, it was hard to determine how much of this slower performance was due to the perfect hash overhead, and how much might be due to the other changes.

We needed to make GC faster overall. One of the causes of the slow mark phase was the need to make two passes through the file system much of the time. This was necessary because there was insufficient memory to track all chunks at once. Instead, GC would do much of the work of traversing the file system, but only sample chunks to get a sense of which containers should be the focus of cleaning. Then GC would identify which chunks are stored in those containers, and traverse the file system a second time while focusing only on those chunks and containers.

Phase-optimized Physical GC (PGC+) reduces the memory requirements by using a perfect hash in place of one Bloom filter and eliminating the need for another Bloom filter completely. This allows PGC+ to run in a single GC phase rather than with two passes. Further optimizations also improved performance dramatically. Now GC is at least as fast as the original logical GC for all workloads and is about twice as fast for those that required two passes of LGC or PGC. Like PGC, PGC+ is orders of magnitude better than LGC for the new problematic workloads.

Data Domain continues to evolve, as do the applications using it. Aspects of the system, such as garbage collection, have to evolve with it. Logical GC was initially a very intuitive way to identify which chunks were referenced and which ones could be reclaimed. Doing it by looking at the individual storage containers is, by comparison, very elaborate. Physical GC may seem like a complex redesign of what was a fairly intuitive algorithm, but in practice it’s a carefully designed optimization to cope with the random-access penalty of spinning disks while ensuring the stringent guarantees of the Data Domain Data Invulnerability Architecture.

Because after all, slow garbage collection … just isn’t logical!

~Fred Douglis @freddouglis

[1] Fred Douglis, Abhinav Duggal, Philip Shilane, Tony Wong, Shiqin Yan, and Fabiano Botelho. “The Logic of Physical Garbage Collection in Deduplicating Storage.” In 15th USENIX Conference on File and Storage Technologies (FAST 17), pp. 29-44. USENIX Association, 2017.

Professional Organizations for Computing – More than the Elks’ Lodge

There are many professional organizations, serving all sorts of purposes. For instance, the American Bar Association and American Medical Association help to represent lawyers and doctors, respectively, when setting standards, policies, and laws.

Within the field of computing, there are a number of professional organizations of note. Some are specific to certain roles, such as the League of Professional System Administrators. Here I will focus on three that serve software engineers, Computer Science (CS) researchers, CS academics, and those of similar professional interests. Mostly I’m doing this to try to impress upon readers the benefits of membership and participation in these organizations.

I first joined the Association for Computing Machinery (ACM) and the Computer Society of the Institute of Electrical and Electronics Engineers (IEEE) when I was a Ph.D. student. By becoming a student member, I subscribed to their monthly magazines, which contained numerous articles of interest. Shortly after finishing my degree I added the USENIX Association to the list.

Initially, the primary motivation for joining (or continuing membership in) these organizations was the significant discounts offered to members when attending conferences sponsored by one of them. Often the savings would more than compensate for the membership fee. In addition, there were personal benefits, such as the IEEE’s group life insurance plan.

The three professional organizations all run conferences, but beyond that they quickly diverge in their services.

USENIX

I’ll start with the simplest first. USENIX basically exists to run computer-related conferences. They also have a quarterly newsletter, and many years ago published the Computing Systems journal, but the conferences are the reason USENIX exists … and they do a great job of it. The top systems conferences include such events as OSDI, NSDI, FAST, the USENIX Annual Technical Conference, and the USENIX Security Symposium. I chaired a couple of their conferences many years ago, and USENIX makes it incredibly easy for the conference organizers. Instead of depending on the chair to manage volunteers to handle logistics, the chair is simply responsible for selecting content. In addition, USENIX has enacted a policy of making all conference publications freely available over the Internet.

ACM

ACM conducts a broader set of activities than USENIX. ACM runs a number of conferences, many of which are among the most prestigious within their domain, but it does much more. ACM is organized into “Special Interest Groups” such as the SIG on Operating Systems (SIGOPS) or the SIG on Data Communications (SIGCOMM). The SIGs run conferences, such as the Symposium on Operating Systems Principles, known as SOSP (SIGOPS), or the SIGCOMM annual conference. Each SIG typically publishes a regular newsletter with a combination of news and technical content (with little or no peer review). ACM also publishes a number of journals, which provide archival-quality content, often extended versions of conference papers. For example, Transactions on Storage publishes a number of articles that extend papers from FAST, including the papers selected as “best papers” for the conference. Finally, ACM recognizes its members through membership grades (Fellow, Distinguished Member, and Senior Member) and through awards for exceptional achievements (such as the Mark Weiser Award).

IEEE Computer Society

IEEE-CS (“CS”) is the largest society within IEEE, though there are other computer-related societies and councils, such as IEEE Communications Society. I’ll focus on CS.

Like ACM, CS runs conferences and publishes journals and magazines. Many of their magazines are much closer to the journals in style and quality than to the newsletters run by ACM SIGs or their CS counterparts, technical committees (TCs). Compared to journals, the magazines tend to have shorter articles, as well as columns and other technical content of general interest. Each issue tends to have a “theme” focusing articles on a particular topic. I was editor-in-chief of Internet Computing for four years, so I led the decisions about which themes to solicit submissions for, and I would assign other submissions to associate editors to gather peer reviews and make recommendations. I highly recommend CS magazines for those interested in high-level material, whether in general (Computer, which comes with CS membership) or in specific areas (such as Cloud Computing or Security & Privacy).

IEEE-CS also sponsors many conferences across a variety of subdisciplines. I mention these after the periodicals because I feel like CS stands out more because of its magazines than its journals or conferences, which are roughly analogous to those from ACM. Additionally, many conferences are sponsored jointly by two or more societies, blurring that boundary further. Conferences are sponsored by Technical Committees, which are similar to ACM SIGs.

Finally, it is worth pointing out that both IEEE and ACM make a number of contributions in other important areas, such as education and standards. The societies cooperate on things like curriculum guidelines; in addition, CS produces bodies of knowledge, which are handbooks on specific topics such as software engineering. IEEE has an entire Standards Association, which produces such things as the 802.11 WiFi standard. The societies have local chapters as well, which sponsor invited talks, local conferences, and other ways to reach out to the immediate community.

My Own Role

I started as a volunteer with CS by serving as general chair of the Workshop on Workstation Operating Systems, which we later renamed the Workshop on Hot Topics in Operating Systems. I chaired the Technical Committee on Operating Systems, then created the Technical Committee on the Internet. At that point I was asked to join the Internet Computing editorial board as liaison to the TC, but when my term expired I was kept on the board anyway and became associate editor in chief, then EIC. In 2015 I was elected to a three-year term on the CS Board of Governors. From there, I help set CS policies and decide on the next generation of volunteers, such as periodical editors.

In parallel, I’ve also been active with USENIX. In addition to serving on many technical program committees, I was the program chair for the USENIX Annual Technical Conference in 1998 and USENIX Symposium on Internet Technologies and Systems (later NSDI) in 1999. I’ve served on the steering committee for the Workshop on Hot Topics in Cloud Computing since 2015.

What’s In It for You?

By now I hope I’ve given you an idea what the three societies do for their members and the community at large. Even if you don’t tend to participate in the major technical conferences, there are local opportunities to network with colleagues and learn about new technologies. The magazines offered by IEEE-CS, as well as Communications of the ACM, are extremely informative. And don’t forget about those great insurance discounts!

 

~Fred Douglis @freddouglis

Making Math Delicious: The Research Cortex

 

Last time I posted, I was the grouchy mathematician “telling data scientists to get off my lawn” as I attempted to persuade you that eating Brussels sprouts of Math is just as cool as eating that thick porterhouse steak named Data Science. (Disclaimer: I recognize all diets as equally valid, and WLOG operate in the space where Brussels sprouts are uncool and steaks are cool.) Data Science gets to be that porterhouse because its practitioners not only demonstrate its nutritional value to a business, but found a way to make it delectable, satisfying, and visually appealing to a wide audience. We all know that Brussels sprouts are nutritious, but in that way that tastes nutritious. With all this in mind, I’d like to provide a recipe for preparing those Brussels sprouts in a way that doesn’t feel like you are forcing them down while your mother glares at you.

 

Meet the Research Cortex at www.theresearchcortex.com.

 

The Research Cortex has the lofty goal of doing for mathematics what so many others have done for data science—make it rigorous, yet accessible to a wide audience that spans disciplines and industries. This new sibling of The Data Cortex serves as the unofficial hub for academic research of Dell EMC’s Data Protection Division CTO Team. Initially, our focus will be primarily mathematics.

 

The work we’ll publish is original, rigorous content… with a twist. Shortly after publishing a new paper, we add video overviews about the work and the key results. We also feature video microcontent (Math Snacks) that spans various topics and metatopics in mathematics. Our first series of Math Snacks looks at types of mathematical proofs, beginning with direct proofs, in order to give some insight into how mathematicians approach problems.

 

The scope of the research is broad; no echo chambers here. We want full exploration of all branches of mathematics, pure and applied. Our first published work, by yours truly, examines sequences of dependent random variables and constructs a new probability distribution that analytically handles correlated categorical random variables. The next paper is the first part of a Master’s thesis by Jonathan Johnson, currently a PhD student at the University of Texas at Austin, discussing summation chains of sequences. Future work will touch on queueing theory, reliability theory, algebraic statistics, and anything else that needs a home and an audience.

 

Mathematics is that underground river that nurtures every other branch of science and engineering. My hope is that, by making these theoretical and foundational works accessible and enjoyable to consume, we can spark innovative ideas and applications by our readers in any area they can think of.

 

I also want to take the time to acknowledge those who helped the Research Cortex go from a mathematician’s lofty ideal to a tangible (sort of) object. Mariah Arevalo, a software engineer in the ELDP program at Dell EMC, is the site administrator, designer, social media manager, and holder of other titles I’m sure I’ve missed. I’ll also throw a quick shout-out to Jason Hathcock for his assistance in video design and production, and music composition.

 

We are very proud of the Data Cortex’s new brother, and hope you will bookmark www.theresearchcortex.com and visit regularly to check out all our new content.

 

~Rachel Traylor @ Mathopocalypse

12th USENIX Symposium on Operating Systems Design and Implementation

OSDI’16 was held in early November in Savannah, GA. It’s a very competitive conference, accepting 18% of what is already by and large a set of very strong papers. They shortened the talks and lengthened the conference to fit in 47 papers, which is well over twice the size of the conference when it started with 21 papers in 1994. (Fun fact: I had a paper in the first conference, but by the time we submitted the paper, not a single author was still affiliated with the company where the work was performed.) This year there were over 500 attendees, which is a pretty good number for a systems conference, and as usual it was like “old home week” running into past colleagues, students, and faculty.

 

There are too many papers at the conference to say much about many of them, but I will highlight a few, as well as some of the other awards.

 

Best Papers

There were three best paper selections. The first two are pretty theoretical as OSDI papers go, though verification and trust are certainly recurring themes.

 

Push-Button Verification of File Systems via Crash Refinement

Helgi Sigurbjarnarson, James Bornholt, Emina Torlak, and Xi Wang, University of Washington

This work uses a theorem prover to verify file systems. The “push-button verification” refers to letting the system automatically reason about correctness without manual intervention. The idea of “crash refinement” is that every state the implementation can reach, including through a crash, must be a state the specification allows.

 

Ryoan: A Distributed Sandbox for Untrusted Computation on Secret Data

Tyler Hunt, Zhiting Zhu, Yuanzhong Xu, Simon Peter, and Emmett Witchel, The University of Texas at Austin

Ryoan leverages Intel’s SGX secure enclaves to build a system that enables private data to be computed upon in the cloud without leaking it to other applications.

 

Early Detection of Configuration Errors to Reduce Failure Damage

Tianyin Xu, Xinxin Jin, Peng Huang, and Yuanyuan Zhou, University of California, San Diego; Shan Lu, University of Chicago; Long Jin, University of California, San Diego; Shankar Pasupathy, NetApp, Inc.

PCHECK is a tool that stresses systems to try to uncover “latent” errors that otherwise would not manifest themselves for a long period of time. In particular, configuration errors are often not caught because they don’t involve the common execution path. PCHECK can analyze the code to add checkers to run at initialization time, and it has been found empirically to identify a high fraction of latent configuration errors.

 

Some of the Others

Here are a few other papers I thought either might be of particular interest to readers of this blog, or which I found particularly cool.

 

TensorFlow: A System for Large-Scale Machine Learning

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, Google Brain

TensorFlow is a tool Google uses for machine learning, using dataflow graphs. Google has open-sourced the tool (www.tensorflow.org) so it’s gaining traction in the research community. The talk was primarily about the model and performance. Since I know nothing about machine learning, I include this here only because it had a lot of hype at the conference, and not because I have much to say about it. (Read the paper.)

 

Shuffler: Fast and Deployable Continuous Code Re-Randomization

David Williams-King and Graham Gobieski, Columbia University; Kent Williams-King, University of British Columbia; James P. Blake and Xinhao Yuan, Columbia University; Patrick Colp, University of British Columbia; Michelle Zheng, Columbia University; Vasileios P. Kemerlis, Brown University; Junfeng Yang, Columbia University; William Aiello, University of British Columbia

This is another security-focused paper, but it was focused on a very specific attack vector. (And I have to give the presenter credit for making it understandable even to someone with no background in this sort of issue.) The idea behind return-oriented programming is that an attacker finds snippets of code to string together to turn into a bad set of instructions. The idea here is to move the code around faster than the attacker can do this. It uses a function pointer table to indirect so one can find functions via an index, but the index isn’t disclosable in user space.

Interestingly, the shuffler runs in the same address space, so has to shuffle its own code to protect it. In all, a neat idea, and an excellent talk.

 

EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding

K. V. Rashmi, University of California, Berkeley; Mosharaf Chowdhury and Jack Kosaian, University of Michigan; Ion Stoica and Kannan Ramchandran, University of California, Berkeley

I’ll start by pointing out this is the one talk that was presented via recording (the primary author couldn’t travel). The technology for the presentation was excellent: the image of the speaker appeared in a corner of the video, integrated into the field of vision much better than what I’ve seen in things like Webex. However, rather than that person then taking questions by audio, there was a coauthor in person to handle questions.

EC-Cache gains the benefits of both increased reliability and improved performance via erasure coding (EC) rather than full replicas. It gets better read performance by reading K+delta units when it needs only K to reconstruct an object, then using the first K that arrive. (Eric Brewer spoke of a similar process at Google in his FAST’17 keynote.) Even with delta equal to just 1, this improves tail latency considerably.

One of the other benefits of EC over replication is that replication creates integral multiples of data, while EC allows fractional overhead. Note, though, that this is for read-mostly data – the overhead of EC for read-write data would be another story.
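
A quick back-of-the-envelope simulation shows why the extra unit helps the tail. The latency distribution and the K and delta values below are made up for illustration; only the “use the first K of K+delta reads” idea comes from the paper.

```python
import random

# Toy tail-latency model: an object is split into K coded units, any K of
# which suffice to reconstruct it.  Reading K+1 units and using the first K
# to arrive means a single straggler never delays the read.

def read_latency(k, extra, rng):
    latencies = sorted(rng.expovariate(1.0) + 1.0 for _ in range(k + extra))
    return latencies[k - 1]              # done when the k-th fastest unit arrives

def percentile(samples, p):
    return sorted(samples)[int(p * (len(samples) - 1))]

rng = random.Random(42)
plain = [read_latency(10, 0, rng) for _ in range(10_000)]     # must wait for all K
ec = [read_latency(10, 1, rng) for _ in range(10_000)]        # read K+1, keep first K
print("p99 waiting for K of K:    ", round(percentile(plain, 0.99), 2))
print("p99 waiting for K of K + 1:", round(percentile(ec, 0.99), 2))
```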

 

To Waffinity and Beyond: A Scalable Architecture for Incremental Parallelization of File System Code

Matthew Curtis-Maury, Vinay Devadas, Vania Fang, and Aditya Kulkarni, NetApp, Inc.

This work was done by the FS performance team at NetApp and was IMHO the most applied paper as well as the one nearest and dearest to Dell EMC. Because NetApp is a competitor, I hesitate to go into too many details for fear of mischaracterizing something. The gist of the paper was that NetApp needed to take better advantage of multiprocessing in a system that wasn’t initially geared for that. Over time, the system evolved to break files into smaller stripes that could be operated on independently, then additional data structures were partitioned for increased parallelism, and finally finer-grained locking was added to work in conjunction with the partitioning.

 

Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services

Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and Yee Jiun Song, Facebook Inc.

This was one of my favorite talks. Facebook updates their system multiple times per day. They need to safely determine the peak capacity across different granularities (web server, cluster, or region) and back off when experiencing degradation. They use this to identify problems like inefficient load balancing. After identifying hundreds of bottlenecks, they could serve 20% more customers with the same infrastructure.

 

Awards

It is worth a quick shout-out to the various people recognized with other awards at the conference. Ant Rowstron at Microsoft Cambridge won the Weiser award for best young researcher. Vijay Chidambaram, a past student of Andrea and Remzi Arpaci-Dusseau at the University of Wisconsin–Madison, won the Ritchie thesis award for “Orderless and Eventually Durable File Systems”. Charles M. Curtsinger won Honorable Mention. Finally, BigTable won the “test of time” award 10 years after it was published.

 

~Fred Douglis @FredDouglis

Managing your computing ecosystem Pt. 3

Overview

The prospect of universal and interoperable management interfaces is closer to reality than ever. Not only is infrastructure converging, but so is the control and management plane. Last time, we discussed Redfish for managing hardware platforms. This time we will talk about Swordfish for managing storage.

Swordfish

The goal of Swordfish is to provide scalable storage management interfaces. The interfaces are designed to provide efficient, low-footprint management for simple direct-attached storage, while scaling up to provide easy-to-use management across cooperating enterprise-class storage servers in a storage network.

The Swordfish Scalable Storage Management API specification defines extensions to the Redfish API; thus a Swordfish service is at the same time a Redfish service. These extensions enable simple, scalable, and interoperable management of storage resources, ranging from direct-attached storage to complex enterprise-class storage servers. The extensions are collectively named Swordfish and are defined by the Storage Networking Industry Association (SNIA) as open industry standards.
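
As a taste of what that looks like on the wire, here is a hedged discovery sketch in Python. The host, credentials, and the exact property names hung off the service root are assumptions to verify against the specification versions in use; only the /redfish/v1/ service root itself is fixed by Redfish.

```python
import requests

# Hypothetical management endpoint and credentials; management controllers
# often use self-signed certificates, hence verify=False in this sketch.
BASE = "https://mgmt.example.com"
AUTH = ("admin", "password")

# The Redfish service root.  A Swordfish service is also a Redfish service,
# so the storage collections hang off this same root.
root = requests.get(f"{BASE}/redfish/v1/", auth=AUTH, verify=False).json()
print("Service:", root.get("Name"), "Redfish", root.get("RedfishVersion"))

# Swordfish adds storage-oriented collections (shown here as StorageServices;
# treat the property name as an assumption to check against your spec version).
link = root.get("StorageServices", {}).get("@odata.id")
if link:
    services = requests.get(f"{BASE}{link}", auth=AUTH, verify=False).json()
    for member in services.get("Members", []):
        print("Storage service:", member.get("@odata.id"))
```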

Swordfish extends Redfish in two principal areas. The first is the introduction of management and configuration based on service levels. The other is the addition of management interfaces for higher-level storage resources. The following sections provide more detail on each.

Service based management

Swordfish interfaces allow clients to get what they want without having to know how the implementation produces the results. As an example, a client might want storage protected so that no more than 5 seconds of data is lost in the event of some failure. Instead of specifying implementation details like mirroring, clones, snapshots, or journaling, the interface allows the client to request storage with a recovery point objective of 5 seconds. The implementation then chooses how to accomplish this requirement.

The basic ideas are borrowed from ITIL (a set of practices for IT service management that focuses on aligning IT services with the needs of business) and are consistent with ISO/IEC 20000.

A Swordfish line of service describes a category of requirements. Each instance of a line of service describes a service requirement within that category. The management service will typically be configured with a small number of supported choices for each line of service. The service may allow an administrator to create new choices if it is able to implement and enforce that choice. To take an example from airlines, you have seating as one line of service with choices of first, business, and steerage. Another line of service could be meals, with choices like regular, vegetarian, and gluten free. Lines of service are meant to be independent from each other. So, in our airline example, we can mix any meal choice with any seating choice.

Swordfish provides three lines of service covering requirements for data storage (protection, security, and storage) and two lines of service covering requirements for access to data storage (connectivity and performance). Swordfish leaves the specification of specific choices within each of these lines of service to management service implementations.

A Swordfish class of service resource describes a service level agreement (SLA). If an SLA is specified for a resource, the service implementation is responsible for assuring that level of service is provided. For that reason, the management service will typically advertise only a small number of SLAs. The service may allow an administrator to create new SLAs if it is able to implement and enforce that agreement. The requirements of an SLA represented by a class of service resource are defined by a small set of line of service choices.
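
To make the composition concrete, here is a small model of lines of service being bundled into a class of service, including the five-second recovery point objective from the example above. It is purely illustrative; the property names are invented rather than taken from the Swordfish schema.

```python
from dataclasses import dataclass

# Illustrative only: a class of service is a named bundle of independently
# chosen line-of-service options that the implementation promises to honor.

@dataclass(frozen=True)
class DataProtectionLoS:
    recovery_point_objective_seconds: int     # how much data loss is acceptable
    recovery_time_objective_seconds: int      # how long recovery may take

@dataclass(frozen=True)
class PerformanceLoS:
    max_latency_ms: int
    min_iops: int

@dataclass(frozen=True)
class ClassOfService:
    name: str
    protection: DataProtectionLoS
    performance: PerformanceLoS

gold = ClassOfService(
    name="Gold",
    protection=DataProtectionLoS(recovery_point_objective_seconds=5,
                                 recovery_time_objective_seconds=300),
    performance=PerformanceLoS(max_latency_ms=2, min_iops=50_000),
)
print(f"{gold.name}: lose at most {gold.protection.recovery_point_objective_seconds}s of data")
```

How the service meets the Gold agreement (mirroring, snapshots, journaling) stays on the implementation side of the interface.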

Swordfish storage

Swordfish starts with Redfish definitions and then extends them. Redfish specifies drive and memory resources from a hardware-centric point of view. Redfish also specifies volumes as block-addressable storage composed from drives. Redfish volumes may be encrypted. Swordfish then extends volumes and adds filesystems, file shares, storage pools, storage groups, and a storage service. (Object stores are intended to be added in the future.)

A storage service provides a focus for management and discovery of the storage resources of a system.  Two principal resources of the storage service are storage pools and storage groups.

A storage pool is a container of data storage capable of providing capacity that conforms to a specified class of service. A storage pool does not support IO to its data storage. The storage pool acts as a factory to provide storage resources (volumes, file systems, and other storage pools) that have a specified class of service. The capacity of a storage pool may come from multiple sources, which are not all required to be of the same type. The storage pool tracks allocated capacity and may provide alerts when space is low.
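
The “pool as factory” idea reduces to something like the toy below, which allocates volumes against a requested class of service, tracks allocated capacity, and raises an alert when space runs low. Everything here is invented for illustration; it is not the Swordfish resource model.

```python
# Pure illustration of a storage pool acting as a factory for resources that
# conform to a requested class of service, while tracking allocated capacity.

class ToyStoragePool:
    def __init__(self, name, capacity_gib, supported_classes, low_space_fraction=0.9):
        self.name = name
        self.capacity_gib = capacity_gib
        self.allocated_gib = 0
        self.supported_classes = set(supported_classes)
        self.low_space_fraction = low_space_fraction

    def allocate_volume(self, size_gib, class_of_service):
        if class_of_service not in self.supported_classes:
            raise ValueError(f"{self.name} cannot provide class {class_of_service!r}")
        if self.allocated_gib + size_gib > self.capacity_gib:
            raise RuntimeError(f"{self.name} has insufficient free capacity")
        self.allocated_gib += size_gib
        if self.allocated_gib > self.low_space_fraction * self.capacity_gib:
            print(f"alert: {self.name} is over {self.low_space_fraction:.0%} allocated")
        return {"pool": self.name, "size_gib": size_gib, "class_of_service": class_of_service}

pool = ToyStoragePool("Pool-1", capacity_gib=10_000, supported_classes={"Gold", "Silver"})
volume = pool.allocate_volume(500, "Gold")
```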

A storage group is an administrative collection of storage resources (volumes or file shares) that are managed as a group. Typically, the storage group would be associated with one or more client applications. The storage group can be used to specify that all of its resources share the characteristics of a specified class of service. For example, a class of service specifying data protection requirements might be applied to all of the resources in the storage group.

One primary purpose of a storage group is to support exposing or hiding all of the volumes associated with a particular application. When exposed, all clients can access the storage in the group via the specified server and client endpoints. The storage group also supports storage (or crash) consistency across the resources in the storage group.

Swordfish extends volumes and adds support for file systems and file shares, including support for both local and remote replication. Each type supports provisioning and monitoring by class of service. The resulting SLA based interface is a significant improvement for clients over the current practice where the client must know the individual configuration requirements of each product in the client’s ecosystem. Each storage service lists the filesystems, endpoints, storage pools, storage groups, drives and volumes that are managed by the storage service.

Recommendations

These three specifications should form the basis for any Restful system management solution.

As a starting point, OData provides a uniform interface suitable for any data service. It is agnostic to the functions of the service, but it supports inspection of an entity data model via an OData conformant metadata document provided by the service. Because of the generic functionality of the Restful style and with the help of inspection of the metadata document, any OData client can have both syntactic and semantic access to most of the functionality of an OData service implementation. OData is recommended as the basis for any Restful service.

Redfish defines an OData data service that provides a number of basic utility functions as well as hardware discovery and basic system management functions. A Redfish implementation can be very lightweight. All computing systems should implement a Redfish management service. This recommendation runs the gamut from very simple devices in the IoT space up to enterprise-class systems.

Finally, Swordfish extends the Redfish service to provide service based storage management. A Swordfish management service is recommended for all systems that provide advanced storage services, whether host based or network based.

Universal, interoperable management based on well-defined, supported standards. It may still seem like an impossible hope to some. Every day, however, we move closer to a more standard, more manageable infrastructure environment.

~George Ericson @GEricson

Security vs Protection – The Same, but Different

Though the words “security” and “protection” are mostly interchangeable in regular use of the English language, when talking about data, it’s a different story.

When we talk about data security, we are referring to securing data from becoming compromised due to an external, premeditated attack. The most well-known examples are malware and ransomware attacks.

Data protection, however, refers to protecting data against corruption usually caused by an internal factor such as human error or hardware failures. We generally address data protection by way of backup or replication – creating accessible versions of the data that may be stored on different media and in various locations.

Of course, these backups can be used for data recovery in either scenario.

 

Under attack

We have seen a dramatic rise in ransomware attacks in recent years, with startling results. According to the FBI, in Q1 of 2016, victims paid $209M to ransomware criminals. Intermedia reported that 72% of companies infected with ransomware cannot access their data for at least 2 days, and 32% lose access for 5 days or more. According to a July 2016 Osterman Research Survey, nearly 80 percent of organizations breached have had high-value data held for ransom.

 

So what is ransomware?

Ransomware is a form of malware that is covertly installed on a victim’s computer and adversely affects it, often by encrypting the data and making it unavailable until a ransom is paid to receive the decryption key or prevent the information from being published.

Most infamously, Sony fell victim two years ago to a crippling attack that shut down its computers and email systems and sensitive information was published on the web. The Sony breach was a watershed moment in the history of cyber attacks. It is believed that the attackers were inside Sony’s network for over 6 months, giving them plenty of time to map the network and identify where the most critical data was stored.

The attack unfolded over a 48 hour period. It began by destroying Sony’s recovery capability. Backup media targets and the associated master and media servers were destroyed first. Then the attack moved to the DR and Production environments. Only after it had crippled the recovery capabilities did the attack target the production environment. After Sony recognized the attack, they turned to their Data Protection infrastructure to restore the damaged systems. However, they had lost their ability to recover. Sony was down for over 28 days and never recovered much of its data.

In Israel, the Nazareth Illit municipality was recently paralyzed by ransomware, its critical data locked until the municipality pays the attackers’ ransom.

 

What do we propose?

While Dell EMC offers a range of products and solutions for backup and recovery on traditional media such as tape and disk, data increasingly sits in publicly accessible domains such as networks, heightening the threat to data security. To address this shift in data storage, in particular the growing trend toward application development and storage in the cloud, Dell EMC is drawing on its decades of experience securing data under the most stringent requirements, and on the most robust and secure technology set in the market, to architect and implement solutions. The new technologies will lock hackers out of critical data sets and secure a path to quick business recovery. One such solution is the Isolated Recovery Solution (IRS).

IRS 101

Essentially, IRS creates an isolated environment to protect data from deletion and corruption while allowing for a quick recovery time. It comprises the following concepts:

  • Isolated systems so that the environment is disconnected from the network and restricted from users other than those with proper clearance.
  • Periodic data copying whereby software automates data copies to secondary storage and backup targets. Procedures are put in place to schedule the copy over an air gap* between the production environment and the isolated recovery area.
  • Workflows to stage copied data in an isolated recovery zone and periodic integrity checks to rule out malware attacks.
  • Mechanisms to trigger alerts in the event of a security breach.
  • Procedures to perform recovery or remediation after an incident.

*What is an air gap?

An air gap is a security measure that isolates a computer or network and prevents it from establishing an external connection. An air-gapped computer is neither connected to the Internet nor any systems that are connected to the Internet. Generally, air gaps are implemented where the system or network requires extra security, such as classified military networks, payment networks, and so on.

Let’s compare an air gap to a water lock used for raising and lowering boats between stretches of water of different levels on a waterway. A boat that is traveling upstream enters the lock, the lower gates are closed, the lock is filled with water from upstream causing the boat to rise, the upper gates are opened and the boat exits the lock.

In order to transfer data securely, air gaps are opened for scheduled periods of time during actual copy operations to allow data to move from the primary storage to the isolated storage location. Once the replication is completed, the air gap is closed.
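
Conceptually, the copy window looks something like the sketch below. The link-control and replication calls are hypothetical placeholders standing in for whatever actually opens and closes the gap (a switch port, a firewall rule) and drives the copy; this is not a Dell EMC API.

```python
import logging

log = logging.getLogger("air_gap_window")

def enable_replication_link():
    # Placeholder: in practice this might enable a switch port or firewall rule.
    log.info("air gap opened")

def disable_replication_link():
    # Placeholder: re-isolate the recovery environment.
    log.info("air gap closed")

def replicate_latest_copies(timeout_seconds):
    # Placeholder for the actual copy from production to the isolated vault.
    log.info("replicating latest protection copies (timeout=%ss)", timeout_seconds)

def run_air_gap_window(max_window_seconds=3600):
    """Open the air gap, replicate, then close it again -- even on failure."""
    enable_replication_link()
    try:
        replicate_latest_copies(timeout_seconds=max_window_seconds)
    finally:
        disable_replication_link()   # never leave the vault connected

# A scheduler (cron, for example) would invoke run_air_gap_window() at the
# agreed copy times.
```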

 

Dell EMC’s Data Domain product currently offers a retention lock feature that prevents the deletion of files until a predefined date. IRS takes such capabilities further. The solution will continue to evolve to simplify deployment and provide security against an even broader range of attacks (rogue IT admins, for example). IRS solutions will make life more difficult for hackers and data more secure. In IT, “security” and “protection” have been treated as two independent, orthogonal concepts. The new, destructive style of attacks changes that relationship. The two teams must partner to make a coherent solution.

 

~Assaf Natanzon @ANatanzon

Managing Your Computing Ecosystem

Overview

There is a very real opportunity to take a giant step towards universal and interoperable management interfaces that are defined in terms of what your clients want to achieve. In the process, the industry can evolve away from the current complex, proprietary and product specific interfaces.

You’ve heard this promise before, but it’s never come to pass. What’s different this time? Major players are converging storage and servers. Functionality is commoditizing. Customers are demanding it more than ever.

Three industry-led open standards efforts have come together to collectively provide an easy to use and comprehensive API for managing all of the elements in your computing ecosystem, ranging from simple laptops to geographically distributed data centers.

This API is specified by:

  • the Open Data Protocol (OData) from OASIS
  • the Redfish Scalable Platforms Management API from the DMTF
  • the Swordfish Scalable Storage Management API from the SNIA

One can build a management service that is conformant to the Redfish or Swordfish specifications that provides a comprehensive interface for the discovery of the managed physical infrastructure, as well as for the provisioning, monitoring, and management of the environmental, compute, networking, and storage resources provided by that infrastructure. That management service is an OData conformant data service.

These specifications are evolving and certainly are not complete in all aspects. Nevertheless, they are already sufficient to provide comprehensive management of most features of products in the computing ecosystem.

This post and the following two will provide a short overview of each.

OData

The first effort is the definition of the Open Data Protocol (OData). OData v4 specifications are OASIS standards that have also begun the international standardization process with ISO.

Simply asserting that a data service has a Restful API does nothing to assure that it is interoperable with any other data service. More importantly, Rest by itself makes no guarantees that a client of one Restful data service will be able to discover or know how to even navigate around the Restful API presented by some other data service.

OData enables interoperable utilization of Restful data services. Such services allow resources, identified using Uniform Resource Locators (URLs) and defined in an Entity Data Model (EDM), to be published and edited by Web clients using simple HTTP messages.  In addition to Redfish and Swordfish described below, a growing number of applications support OData data services, e.g. Microsoft Azure, SAP NetWeaver, IBM WebSphere, and Salesforce.

The OData Common Schema Definition Language (CSDL) specifies a standard metamodel used to define an Entity Data Model over which an OData service acts. The metamodel defined by CSDL is consistent with common elements of the UML v2.5 metamodel. This fact enables reliable translation to the programming language of your choice.

OData standardizes the construction of Restful APIs. OData provides standards for navigation between resources, for request and response payloads, and for operation syntax. It specifies the discovery of the entity data model for the accessed data service. It also specifies how resources defined by the entity data model can be discovered. While it does not standardize the APIs themselves, OData does standardize how payloads are constructed, a set of query options, and many other items that often differ across current Restful data services. OData specifications utilize standard HTTP, AtomPub, and JSON. Also, standard URIs are used to address and access resources.
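
As a hedged example of what that buys a generic client: with nothing but the service URL, it can fetch the $metadata document and use the standard system query options. The host and the Systems entity set below are invented; $metadata, $filter, $select, and $top are defined by OData v4.

```python
import requests

# Invented service URL and entity set name; the query options themselves are
# standard OData v4 and behave the same way against any conformant service.
SERVICE = "https://example.com/odata/v4/management"

# 1. Discover the entity data model the service exposes.
metadata_xml = requests.get(f"{SERVICE}/$metadata").text

# 2. Query an entity set using standard system query options.
resp = requests.get(
    f"{SERVICE}/Systems",
    params={
        "$filter": "Status eq 'OK'",
        "$select": "Id,Name,Status",
        "$top": "10",
    },
)
for entity in resp.json().get("value", []):   # OData JSON wraps collections in "value"
    print(entity["Id"], entity["Name"])
```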

The use of the OData protocol enables a client to access information from a variety of sources including relational databases, servers, storage systems, file systems, content management systems, traditional Web sites, and more.

Ubiquitous use will break down information silos and will enable interoperability between producers and consumers. This will significantly increase the ability to provide new and richer functionality on top of the OData services.

The OData specifications define:

  • the protocol itself (request and response semantics over standard HTTP)
  • the Common Schema Definition Language (CSDL) used to describe a service’s entity data model
  • URL conventions for addressing resources and expressing query options
  • the JSON (and Atom) payload formats

Conclusion

While Rest is a useful architectural style, it is not a “standard” and the variances in Restful APIs to express similar functions means that there is no standard way to interact with different systems. OData is laying the groundwork for interoperable management by standardizing the construction of Restful APIs. Next up – Redfish.

 

~George Ericson @GEricson

 

Data Protection for Public Cloud Environments

In late 2015 I was researching the options available to protect application workloads running in public cloud environments. In this post I will discuss my findings, and what we are doing at Dell EMC to bring Enterprise grade Data Protection Solutions to workloads running in public cloud environments.

 

To understand how Data Protection applies to public cloud environments, we need to recognize that Data Protection can occur at different layers in the infrastructure. These include the server, storage, hypervisor (if virtualized), application and platform layer. When we implement Data Protection for on premises environments, our ability to exercise Data Protection functions at any one of these layers depends upon the technologies in use.

 

At the server layer, we typically deploy an agent-based solution that manages the creation of Data Protection copies of the running environment. This method can be used for virtualized, bare metal and even containerized environments that persist data.

 

At the application layer we typically rely on the applications’ native data protection functions to generate copies (usually to file system or pseudo media targets). Examples of this can include database dumps to local or remote disk storage. We can go a step further and embed control-and-data path plugins into the application layer to enable the application’s native data protection methods to interface with Data Protection storage for efficient data transfer and Data Protection management software for policy, scheduling, reporting and audit purposes.

 

Like the server approach, the application native approach is agnostic to the platform the application is running on, be it virtualized, bare metal or containerized, in public or private cloud environments. Where things get interesting is when we start leveraging infrastructure layers to support Data Protection requirements. The most common infrastructure layers used are the hypervisor and storage-centric Data Protection methods.

 

A by-product of infrastructure methods is that they require privileged access to the infrastructure’s interfaces to create protection copies. In private cloud environments this requires coordination and trust between the Backup or Protection Administrator and the Storage or Virtualization Administrator. This access is often negotiated when the service is first established. In Public Cloud environments there is no Storage or Virtualization Administrator we can talk with to negotiate access; these layers are off limits to consumers of the Public Cloud. If we want to exercise Data Protection at these layers, we have to rely on the services that the Public Cloud provider makes available. These services are often referred to as Cloud-based Data Protection.

 

For example, Amazon Web Services (AWS) offers snapshots of Elastic Block Store (EBS) volumes to S3 storage, which provides block-level protection of volumes. Microsoft Azure offers snapshots of VMs to Azure Blob Storage, as well as the Azure Backup service for VM instances running the Windows operating system.
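A consumer can drive the AWS snapshot service directly through its public API. The sketch below (Python with boto3, using a hypothetical volume ID and region) creates an EBS snapshot and waits for it to complete; comparable calls exist in the Azure SDKs.

```python
import boto3

# Hypothetical volume ID and region; substitute your own.
VOLUME_ID = "vol-0123456789abcdef0"

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request a crash-consistent snapshot of the EBS volume.
snapshot = ec2.create_snapshot(
    VolumeId=VOLUME_ID,
    Description="Nightly protection copy",
)
snapshot_id = snapshot["SnapshotId"]

# Block until AWS reports the snapshot as completed.
waiter = ec2.get_waiter("snapshot_completed")
waiter.wait(SnapshotIds=[snapshot_id])

print(f"Snapshot {snapshot_id} completed")
```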

 

A common property of Cloud-based Data Protection services, and of infrastructure-centric protection methods for that matter, is that they are tightly coupled. Tight coupling means the technologies and methods are highly dependent on one another, which allows the method to perform at peak efficiency. For example, the method can track the data that is changing inside a virtual machine instance and, when appropriate, copy only the data that has changed since the previous copy.
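To illustrate the kind of change tracking a tightly coupled method relies on, here is a conceptual sketch in plain Python (not any vendor's implementation, and the image paths are hypothetical). It compares fixed-size blocks of two copies by checksum and returns only the offsets that differ, i.e. the data an incremental copy would need to transfer.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB blocks, a common snapshot granularity

def block_digests(path: str) -> list[bytes]:
    """Return a checksum per fixed-size block of the given image file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digests.append(hashlib.sha256(block).digest())
    return digests

def changed_blocks(previous_copy: str, current_copy: str) -> list[int]:
    """Byte offsets of blocks that differ between two copies of a volume image."""
    old, new = block_digests(previous_copy), block_digests(current_copy)
    changed = [i * BLOCK_SIZE for i, (a, b) in enumerate(zip(old, new)) if a != b]
    # Blocks appended since the previous copy also count as changed.
    changed.extend(i * BLOCK_SIZE for i in range(len(old), len(new)))
    return changed

# Example (hypothetical paths): only these offsets go into the incremental copy.
# print(changed_blocks("/copies/vol-monday.img", "/copies/vol-tuesday.img"))
```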

 

Tightly coupled methods have gained popularity in recent years simply because data volumes continue to grow to the extent that traditional methods are struggling to keep up. However, there are some important trade-offs being made when we bet the business solely on tightly coupled Data Protection methods.

 

Tight coupling trades flexibility for efficiency. In other words, we get a very efficient capability, but a highly inflexible one. In the case of Data Protection, a solution focused on flexibility frees the data copies from the underlying infrastructure; with AWS snapshot copies to S3, by contrast, the copies are forever tied to the public cloud platform. This is a critical point that requires careful attention when devising a Public Cloud Data Protection strategy.

 

The best way I can describe the implications is to compare the situation to traditional on-premises Data Protection methods. With on-premises solutions, you are in full control of the creation, storage, and recovery processes. For example, assume you have implemented a protection solution using a vendor's product. That product would normally implement and manage the process of creating copies and storing them on media in the vendor's data format (which in modern products is native to the application being protected). The property we usually take for granted here is that we can move these copies from one media format or location to another, and recover them to different systems and platforms. This heterogeneity offers flexibility, which enables choice: the choice to change our mind or adjust our approach to managing copies as conditions change. With loosely coupled copies, for example, we can migrate them from one public cloud provider's object storage (e.g. AWS S3) to another's (Azure Blob Storage), or even back to private cloud object storage (Elastic Cloud Storage) if we decide to bring certain workloads on premises.
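As a sketch of what "freeing" a loosely coupled copy can look like, the Python snippet below (using boto3 and the azure-storage-blob SDK, with hypothetical bucket, container, key, and credential names) moves one protection copy from S3 into Azure Blob Storage. Production tooling would add chunked streaming, integrity checks, and catalog updates.

```python
import os

import boto3
from azure.storage.blob import BlobServiceClient

# Hypothetical source and destination; substitute your own names and credentials.
S3_BUCKET = "protection-copies"
S3_KEY = "orders/orders-20170101.dump"
AZURE_CONTAINER = "protection-copies"
AZURE_CONN_STR = os.environ["AZURE_STORAGE_CONNECTION_STRING"]

# Read the copy from S3 (into memory for simplicity; real tooling would stream in chunks).
s3 = boto3.client("s3")
data = s3.get_object(Bucket=S3_BUCKET, Key=S3_KEY)["Body"].read()

# Write the same object into Azure Blob Storage under the same key.
blob_service = BlobServiceClient.from_connection_string(AZURE_CONN_STR)
blob_client = blob_service.get_blob_client(container=AZURE_CONTAINER, blob=S3_KEY)
blob_client.upload_blob(data, overwrite=True)

print(f"Migrated s3://{S3_BUCKET}/{S3_KEY} to Azure container {AZURE_CONTAINER}")
```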

 

Despite these trade-offs, there are very good reasons to use a public cloud provider's native Data Protection functions. For example, if we want fast full-VM recovery back to the source, we would be hard pressed to find a faster solution. However, cloud-native solutions do not address all recovery scenarios and lack flexibility. To mitigate these risks, a dual approach is often pursued that addresses the efficiency, speed, and flexibility required by Enterprise applications in public, private, or hybrid cloud models.

 

My general advice to customers is to use tightly coupled Data Protection methods for short-lived Data Protection requirements, alongside loosely coupled methods. In Public Cloud models, this requires deploying software technologies (or hardware, via services like Direct Connect and ExpressRoute) that are not tied to the Public Cloud provider's platform or data formats. As a consumer of Public Cloud services, you then retain the flexibility to free your data copies if the need arises in the future.

 

Our Strategy

 

At Dell EMC we recognize that customers will deploy workloads across a variety of cloud delivery models. These workloads will require multiple forms of Data Protection, based on the value of the data and the desire to maintain independence from the underlying infrastructure or platform hosting the workloads.

 

Our strategy is to provide customers Data Protection everywhere. This protection will be delivered via multiple avenues, including orchestrating the control path of Public Cloud providers' native solutions while allowing the Public Cloud to host and manage the data path and storage. For workloads that require ultimate flexibility and independent Data Protection copies, we will also manage the data path and storage, so that copies remain agnostic to the cloud vendor. Furthermore, for customers that choose to consume SaaS-based solutions, we will continue to work with SaaS vendors to expand our existing SaaS Data Protection offering, exporting and managing data copies through the vendors' available APIs to the extent possible.

 

Ultimately, customers will choose which path they take. Our strategy is to ensure our Data Protection solutions allow customers to take any path available to them.

 

~Peter Marelas @pmarelas

Mathematics, Big Data, and Joss Whedon

Mathematics, Big Data, and Joss Whedon

Definition 1: The symmetric difference of two sets A and B, denoted A \Delta B, is the set of elements in either A or B, but not in their intersection.
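For a concrete instance (a quick Python illustration, nothing more, with made-up set members), the symmetric difference keeps exactly the elements that live in one set but not the other:

```python
A = {"measure theory", "real analysis", "Markov chains", "algorithms"}
B = {"Hadoop", "dashboards", "algorithms"}

# Symmetric difference: everything in exactly one of the two sets.
print(A ^ B)                          # all elements except the shared "algorithms"
print(A ^ B == (A | B) - (A & B))     # True: the union minus the intersection
```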

Let A be “Mathematics”, and let B be “Data Science”. This is certainly not the first article vying for attention with the latter buzzword, so I’ll go ahead and insert a few more here to help boost traffic and readership:

Analytics, Machine Learning, Algorithm,

Neural Networks, Bayesian, Big Data

These formerly technical words (except that last one) used to live solidly in the dingy faculty lounge of set A. They have since been distorted into vague corporate buzzwords, shunning their well-defined mathematical roots for the sexier company of "synergy", "leverage", and "swim lanes" at refined business luncheons. All of the above words have allowed themselves to become elements of the nebulous set B: "Data Science". As the entire corporate and academic world scrambles to rebrand itself as a member of Big Data™, allow me to pause the chaos in order to reclaim set A.

This isn't to say that set B is without its merits. Data Science is Joss Whedon, making the uncool comic books so hip that Target sells T-shirts now. The advent of powerful computational resources and a worldwide saturation of data have sparked a mathematical revival of sorts. (It is actually possible for university mathematics departments to receive funding now.) Data Science has inspired the development of methods for quantifying every aspect of life and business, many of which were forged in mathematical crucibles. Data science has built bridges between research disciplines, and sparked some taste for a subject that was previously about as appetizing to most as dry Thanksgiving turkey without gravy. Data science has driven billions of dollars in sales across every industry, customized our lives to our particular tastes, and advanced medical technology, to name a few.

Moreover, the techniques employed by data scientists have mathematical roots. Good data scientists have some mathematical background, and my buzzwords above are certainly in both sets. Clearly, A \cap B is nonempty, and the two sets are not disjoint. However, the symmetric difference between the two sets is large. Symbolically, |A \Delta B| \gg |A \cap B|. To avoid repeating the plethora of articles about Data Science, our focus will be on the elements of mathematics that data science lacks. In mathematical symbols, we investigate the set A \setminus B.

Mathematics is simplification. Mathematicians seek to strip a problem bare. Just as every building has a foundation and a frame, every “applied” problem has a premise and a structure. Abstracting the problem into a mathematical realm identifies the facade of the problem that previously seemed necessary. An architect can design an entire subdivision with one floor plan, and introduce variation in cosmetic features to produce a hundred seemingly different homes. Mathematicians reverse this process, ignoring the unnecessary variation in building materials to find the underlying structure of the houses. A mathematician can solve several business problems with one good model by studying the anatomy of the problems.

Mathematics is rigor. My real analysis professor in graduate school told us that a mathematician’s job is two-fold: to break things and to build unbreakable things. We work in proofs, not judgment. Many of the data science algorithms and statistical tests that get name dropped at parties today are actually quite rigorous, if the assumptions are met. It is disingenuous to scorn statistics as merely a tool to lie; one doesn’t blame the screwdriver that is being misused as a hammer. Mathematicians focus on these assumptions. A longer list of assumptions prior to a statement indicates a weak statement; our goal is to strip assumptions one by one to see when the statement (or algorithm) breaks. Once we break it, we recraft it into a stronger statement with fewer assumptions, giving it more power.

Mathematics is elegance. Ultimately, this statement is a linear combination of the previous two, but it still provides an insightful contrast. Data science has become a tool crib of "black box" algorithms that one employs in his language of choice. Many of these models have become uninterpretable blobs that churn out an answer (even a good one by many measures of performance; pick your favorite: p-values, Euclidean distance, prediction error). They solve the specific problem given wonderfully, molding themselves to the given data like a good pair of spandex leggings. However, they provide no structure, no insight beyond that particular type of data. Understanding the problem takes a back seat to predictions, because predictions make money, especially before the end of the quarter. Vision is long-term and expensive. This type of thinking is short-sighted; with some investment, that singular dataset may reveal a structure that is isomorphic to another problem in an unrelated department, one that may even be exceedingly simple in nature. In this case, mathematics can provide an interpretable, elegant solution that solves multiple problems, provides insight into behavior, and still retains predictive power.

As an example, let us examine the saturated research field of disk failures. There is certainly no shortage of papers that develop complex algorithms for disk failure prediction; typically the best-performing ones are an ensemble method of some kind. Certain errors, for instance medium errors and reallocated sectors, are good predictors of disk failure. These errors evolve randomly, but their counts only increase. A Markov chain fits this behavior perfectly, and we have developed a method to model these errors that way. Parameter estimation is a challenge, but the idea is simple, elegant, and interpretable. Because the mathematics is so versatile, with just one transition matrix a user can answer almost any question he likes without needing to rerun the model. This approach allows for both predictive analytics and behavior monitoring, is quick to implement, and is analytically (in the mathematical sense) sound. The only estimation needed is in the parameters, not in the model structure itself, and effective parameter estimation all but guarantees good performance.
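To give a flavor of the idea, here is a minimal sketch (with a made-up transition matrix rather than parameters estimated from real drive telemetry, and not the authors' actual model) of a Markov chain over monotonically increasing error counts with an absorbing failure state. Several different questions can be read off the same matrix without re-running anything.

```python
import numpy as np

# States: 0 errors, 1 error, 2 errors, failed (absorbing).
# Made-up probabilities; in practice these are estimated from drive telemetry.
# Error counts never decrease, so the matrix is upper triangular.
P = np.array([
    [0.90, 0.08, 0.015, 0.005],
    [0.00, 0.85, 0.10,  0.05 ],
    [0.00, 0.00, 0.80,  0.20 ],
    [0.00, 0.00, 0.00,  1.00 ],
])

def failure_probability(start_state: int, steps: int) -> float:
    """Probability a disk starting in `start_state` has failed within `steps` periods."""
    return float(np.linalg.matrix_power(P, steps)[start_state, -1])

# One transition matrix answers many questions:
print(failure_probability(0, 30))        # healthy disk, 30 periods out
print(failure_probability(2, 5))         # disk already showing 2 errors, 5 periods out
print(np.linalg.matrix_power(P, 10)[0])  # full distribution over states after 10 periods
```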

There is room for both data scientists and mathematicians; the relationship between a data scientist and a mathematician is a symbiotic one. Practicality forces a quick or canned solution at times, and sometimes the time investment needed to “reinvent the wheel” when we have (almost) infinite storage and processing power at hand is not good business. Both data science and mathematics require extensive study to be effective; one course on Coursera does not make one a data scientist, just as calculus knowledge does not make one a mathematician. But ultimately, mathematics is the foundation of all science; we shouldn’t forget to build that foundation in the quest to be industry Big Data™ leaders.

 

~Rachel Traylor @mathpocalypse