Copy Data Management – What About Unstructured Data?

There’s a comfort in certainty. When you know where you’re going… when you know what’s going to happen… when you know you’re right… the future can’t come fast enough. For the past twenty years, I’ve felt that way. In the past six months, however, I’ve found out what it means to have questions that I can’t confidently answer. Since misery loves company, I’ll share my questions and internal debate.

This week: What does Copy Data Management mean for Unstructured Data?

What is Copy Data Management?

Copy Data Management (CDM) is the centralized management of the copies of data that an organization creates. These copies can be used for backup, disaster recovery, test & development, data analytics, and other custom uses.

Each CDM product has a unique target today. Some CDM products focus on reducing the number of copies. Others emphasize meeting SLAs and compliance regulations. Still others try to optimize specific workflows (e.g. test & development). The products also split on the layer of copy management: storage, VM, application, or cloud.

Despite the product diversity, they all have one thing in common: application focus. The products all try to streamline copy management for applications (whether physical, virtual, or cloud). The decision makes sense. Applications are valuable. Application developers create multiple data copies. Application developers are technically savvy enough to understand the value of CDM and pay for it.

Still, unstructured data continues to grow exponentially. Much of that data is business critical and a source of security and compliance concerns. Traditional backup and archive techniques have not scaled with unstructured data growth and use cases, so companies need new answers.

What should CDM products do about unstructured data (i.e. files and objects)?

Affirmative Side: CDM should Include Unstructured Data

Copy Data Management should include unstructured data because there is more commonality than difference.

First, the core use cases are common. Customers need to protect their unstructured data in accordance with SLAs and compliance regulations, just as they do their applications. While customers may not run test & development with most of their unstructured data (except for file-based applications), many are interested in running analytics against that data. With that much overlap in function, CDM should aggressively incorporate unstructured data.

Second, unstructured data is part of applications. While some applications are built with only containerized processes and a centralized database, many apps leverage unstructured data (file and object) to store information. Thus, the “application” vs. “unstructured data” dichotomy is an illusion.

Third, the data path will be common. Customers use files for their applications already, like running Oracle and VMware over NFS. Since CDM products are already managing files for their application data, why not extend to unstructured data?

Finally, today most customers use one tool to manage all of their copies – their backup application. CDM is an upstart compared to backup software. In a world where everybody is attempting to streamline operations and become more agile, why are we splitting one tool into two?

The use cases and data path are similar, CDM needs to support files no matter what, and customers don’t want multiple products. CDM must support unstructured data – case closed.

Negative Side: CDM should Not Include Unstructured Data

My adversary lives in a legacy, backup-centric world. (Yes, I often resort to ad hominem attacks when debating myself.) Copy Data Management and Unstructured Data Management are evolving into two very different things, and need to be handled separately. The use cases are already very distinct and the divergence is increasing. The underlying technical flows are also from different worlds.

First, the requirements for application vs. unstructured protection are as different as their target audiences. Application owners recover application data; end-users recover unstructured data. Application owners, a small and technically savvy group, want an application-integrated interface from which to browse and recover (usually by rolling back in time) their application. In contrast, end-users need a simple, secure interface to enable them to search for and recover lost or damaged files. Protection means very different things to these two very different audiences.

Second, the requirements for compliance also vary because of the audience. Since there are relatively few applications (even with today’s application sprawl), application compliance focuses on securing access, encrypting the data, and ensuring the application does not store/distribute private information. Unstructured data truly is the Wild West. Since users create it, there is little ability to oversee what is created, where it is shared, and what happens to it. As a result, companies use brute force mechanisms (e.g. backup) to copy all the data, store those copies for years, and then employ tools (or contractors) to try to find information in response to an event. When you have no control over what’s happening, it’s hard not to be reactive. With applications, you can be proactive.

Third, test and development is becoming as important as protection. The application world is moving to a DevOps model. Teams automate their testing and deployment, constantly update their applications, and roll forward (and back) through application versions faster than backup and recovery ever dreamed of. As a result, the test and development use cases will become more common and more critical than protection. Over time, they may even absorb much of what was considered “protection” in the past.

Finally, the data flows are very different. To support the application flows, the data needs to stay in its original, native format. You cannot run test and development against a tar image of an application. Fortunately, the application infrastructure has built data movement mechanisms (generally based on snapshots or filters) to enable that movement. Even better, since the application has already encapsulated the data, it becomes possible to just copy the LUN, VM, or container without needing to know what’s inside. In contrast, protecting unstructured data is messy. Backup software agents run on Windows, Linux, and Unix file servers to generate proprietary backup images. NAS servers generate their own proprietary NDMP backup streams, or replicate only to compatible NAS servers and generate no catalog information. There are few high-performance data movement mechanisms, and since each file system is unique, there is no elegant encapsulation. The data flows between application and unstructured data could not be more different.

Due to the differences in use cases, users, and underlying technology, it is unrealistic to expect a single CDM product to effectively cover both worlds.

Verdict: Confusion

I don’t see an obvious answer. The use cases, workflows, and technology demonstrate that application data CDM is not the same as unstructured data CDM. Of course, the overlap in general needs (protection policies, support for file/object) combined with the preference/expectation for centralized support demonstrates that integrated CDM has significant value.

The question comes down to: Is the value of an integrated product greater than the compromises required to build an “all-in-one”? The market is moving toward infrastructure convergence, but is the same going to happen with data management software?

I don’t have the answer, yet. But just as Tyler Stone is sharing “how products are built”, I’ll take you behind the scenes on how strategic decisions are made. Just wait until you see the great coin flip algorithm we employ…

Stephen Manley @makitadremel

The Origins of eCDM

EMC recently announced Enterprise Copy Data Management (eCDM), a product that enables global management and monitoring of copy data across primary and protection storage. Perhaps just as interesting as the product itself is the way that the product was conceived, designed, developed, and taken to market. Like the trailblazers of the west, the engineering team behind eCDM was faced with the daunting challenge of exploring uncharted territory. The team created a product from scratch using agile methodologies, open source technology, a brand new UI, and an entirely custom go-to-market strategy.

This is the first in a series of posts detailing the challenges and successes of the product team from conception to release.

John sat at his desk with a pen and an 11”x17” sheet of paper, outlining quadrants for each of the topics that I suggested we discuss. He was a software engineer after all, and he was approaching this interview with the same comprehensiveness that would be expected when designing feature specifications for a product. However, John is no ordinary software engineer – he is the chief architect for the eCDM v1.0 release, and he’s been working on this concept for years.

Nearly seven years ago, EMC sponsored the ideation of next-generation data protection concepts. John participated in the conception of “objective-based management” – managing data copies based on the outcome you want rather than the operations needed to produce it.

For example, suppose an SLA requires one copy of production data on primary storage, one copy on protection storage updated every 24 hours, and another copy in the cloud updated every week. Traditionally, the backup administrator configures the backup application to create copies that align with that SLA. With objective-based management, the administrator instead provides the SLA to the software, and the software configures and automates the operations needed to meet it. While the concept is simple, it fundamentally shifts the way traditional backup software is designed and used.
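The shift from imperative to objective-based management can be sketched in a few lines. This is purely an illustration – the class and function names are hypothetical, not eCDM’s actual API: the administrator declares the SLA, and the engine derives whatever operations are needed to satisfy it.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Objective:
    """One clause of a declared SLA: where a copy must live and how fresh it must be."""
    target: str
    interval_hours: int

@dataclass
class CopyState:
    """An observed copy: where it lives and how long since its last refresh."""
    target: str
    age_hours: int

def plan_operations(sla: List[Objective], copies: List[CopyState]) -> List[str]:
    """Compare the declared SLA against observed copies and emit the
    operations needed to bring the environment back into compliance."""
    observed = {c.target: c.age_hours for c in copies}
    ops = []
    for obj in sla:
        age = observed.get(obj.target)
        if age is None:
            ops.append(f"create copy on {obj.target}")
        elif age >= obj.interval_hours:
            ops.append(f"refresh copy on {obj.target}")
    return ops

# The SLA from the example: protection storage every 24 hours,
# cloud every week (168 hours).
sla = [Objective("protection-storage", 24), Objective("cloud", 168)]
copies = [CopyState("protection-storage", 30)]  # stale; cloud copy missing

print(plan_operations(sla, copies))
# → ['refresh copy on protection-storage', 'create copy on cloud']
```

The administrator never specifies the operations; changing the declared objectives changes the engine’s plan automatically.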

“There is a difference between performing protection and knowing that you’re protected,” John explained to me while discussing the benefits of eCDM and objective-based management. When backup administrators perform a backup, they are simply confirming that the backup application completed a protection action. For example, if the application backs up production data to protection storage, it will flash a happy green light. They’re protected now, right? But what if something happens to that copy on the protection storage? The backup application doesn’t actively monitor protection copies; it simply provides an interface to perform and report on backup actions.

eCDM does not simply perform protection actions; it enables users to know that they are protected. The objective-based management model also has implications beyond traditional backup and recovery. As the use cases for copies expand and application self-service grows in popularity, the need for a global interface that enables automated compliance and governance has grown. After a service plan is designed, eCDM monitors and automates compliance operations to ensure that copies meet their SLAs – providing far more insight than a green light when a backup completes.
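John’s distinction between performing protection and knowing you’re protected can be captured as two different questions asked of the same environment. Again, this is a hypothetical sketch, not eCDM’s implementation: one function answers what a traditional backup tool reports (did the last job succeed?), the other answers what objective-based monitoring reports (is every SLA objective met right now?).

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SlaObjective:
    """Where a copy must live and its maximum allowed age, in hours."""
    target: str
    max_age_hours: int

def last_job_succeeded(job_log: List[bool]) -> bool:
    """What traditional backup reports: did the most recent job finish cleanly?"""
    return bool(job_log) and job_log[-1]

def currently_compliant(sla: List[SlaObjective],
                        copy_ages: Dict[str, int]) -> List[str]:
    """What objective-based monitoring reports: which objectives are
    violated at this moment, regardless of past job status."""
    violations = []
    for obj in sla:
        age = copy_ages.get(obj.target)
        if age is None:
            violations.append(f"{obj.target}: copy missing")
        elif age > obj.max_age_hours:
            violations.append(f"{obj.target}: copy {age}h old (max {obj.max_age_hours}h)")
    return violations

sla = [SlaObjective("protection-storage", 24), SlaObjective("cloud", 168)]

# The last backup job reported success...
print(last_job_succeeded([True, True]))          # → True
# ...yet the protection copy has since been lost, so the SLA is violated.
print(currently_compliant(sla, {"cloud": 100}))  # → ['protection-storage: copy missing']
```

The first check can be green while the second reports a violation – which is precisely the gap between a completed backup action and actual, ongoing protection.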

This concept is certainly intriguing, but what struck me was the passion John exuded while discussing these topics. That enthusiasm is common on the eCDM team; it’s clear that the team shares John’s vision of objective-based management and the experience eCDM promises. The members of the eCDM team are technical experts in storage and data protection, each with 15-20 years in the industry. Their experience spans varying business units within EMC and across the industry. When this many technical experts are so passionate about a product, it deserves notice.

John Rokicki is a 25+ year veteran of the storage management and data protection industries, having spent the last 11 years with EMC as a consulting engineer and senior manager. He has been primarily responsible for storage integration within EMC’s data protection products, including EMC NetWorker, with technologies such as EMC RecoverPoint, EMC Isilon NAS, and EMC VMAX storage. John currently serves as a product owner, chief architect, and developer for the eCDM v1 release. He holds several United States patents related to his past and current work.

-Tyler Stone @tyler_stone_