Copy Data Management – What About Unstructured Data?

There’s a comfort in certainty. When you know where you’re going… when you know what’s going to happen… when you know you’re right… the future can’t come fast enough. For the past twenty years, I’ve felt that way. In the past six months, however, I’ve found out what it means to have questions that I can’t confidently answer. Since misery loves company, I’ll share my questions and internal debate.

This week: What does Copy Data Management mean for Unstructured Data?

What is Copy Data Management?

Copy Data Management (CDM) is the centralized management of the copies of data that an organization creates. These copies can be used for backup, disaster recovery, test & development, data analytics, and other custom uses.

Each CDM product has a unique target today. Some CDM products focus on reducing the number of copies. Others emphasize meeting SLAs and compliance regulations. Still others try to optimize specific workflows (e.g. test & development). The products also split on the layer of copy management: storage, VM, application, or cloud.

Despite the product diversity, they all have one thing in common: application focus. The products all try to streamline copy management for applications (whether physical, virtual, or cloud). The decision makes sense. Applications are valuable. Application developers create multiple data copies. Application developers are technically savvy enough to understand the value of CDM and pay for it.

Still, unstructured data continues to grow exponentially. Much of that data is business critical and a source of security and compliance concerns. Traditional backup and archive techniques have not scaled with unstructured data growth and use cases, so companies need new answers.

What should CDM products do about unstructured data (i.e. files and objects)?

Affirmative Side: CDM should Include Unstructured Data

Copy Data Management should include unstructured data because there is more commonality than difference.

First, the core use cases are common. Customers need to protect their unstructured data in accordance with SLAs and compliance regulations, just as they do their applications. While customers may not run test & development with most of their unstructured data (except for file-based applications), many are interested in running analytics against that data. With that much overlap in function, CDM should aggressively incorporate unstructured data.

Second, unstructured data is part of applications. While some applications are built with only containerized processes and a centralized database, many apps leverage unstructured data (file and object) to store information. Thus, the “application” vs. “unstructured data” dichotomy is an illusion.

Third, the data path will be common. Customers use files for their applications already, like running Oracle and VMware over NFS. Since CDM products are already managing files for their application data, why not extend to unstructured data?

Finally, today most customers use one tool to manage all of their copies – their backup application. CDM is an upstart compared to backup software. In a world where everybody is attempting to streamline operations and become more agile, why are we splitting one tool into two?

The use cases and data path are similar, CDM needs to support files no matter what, and customers don’t want multiple products. CDM must support unstructured data – case closed.

Negative Side: CDM should Not Include Unstructured Data

My adversary lives in a legacy, backup-centric world. (Yes, I often resort to ad hominem attacks when debating myself.) Copy Data Management and Unstructured Data Management are evolving into two very different things, and need to be handled separately. The use cases are already very distinct and the divergence is increasing. The underlying technical flows are also from different worlds.

First, the requirements for application vs. unstructured protection are as different as their target audiences. Application owners recover application data; end-users recover unstructured data. Application owners, a small and technically savvy group, want an application-integrated interface from which to browse and recover (usually by rolling back in time) their application. In contrast, end-users need a simple, secure interface to enable them to search for and recover lost or damaged files. Protection means very different things to these two very different audiences.

Second, the requirements for compliance also vary because of the audience. Since there are relatively few applications (even with today’s application sprawl), application compliance focuses on securing access, encrypting the data, and ensuring the application does not store/distribute private information. Unstructured data truly is the Wild West. Since users create it, there is little ability to oversee what is created, where it is shared, and what happens to it. As a result, companies use brute-force mechanisms (e.g. backup) to copy all the data, store those copies for years, and then employ tools (or contractors) to try to find information in response to an event. When you have no control over what’s happening, it’s hard not to be reactive. With applications, you can be proactive.

Third, test and development is becoming as important as protection. The application world is moving to a DevOps model. Teams automate their testing and deployment, constantly update their applications, and roll forward (and back) through application versions faster than backup and recovery ever dreamed of. As a result, the test and development use cases will become more common and more critical than protection. Over time, they may even absorb much of what was considered “protection” in the past.

Finally, the data flows are very different. To support the application flows, the data needs to stay in its original, native format. You cannot run test and development against a tar image of an application. Fortunately, the application infrastructure has built data movement mechanisms (generally based on snapshots or filters) to enable that movement. Even better, since the application has already encapsulated the data, it becomes possible to just copy the LUN, VM, or container without needing to know what’s inside. In contrast, protecting unstructured data is messy. Backup software agents run on Windows, Linux, and Unix file servers to generate proprietary backup images. NAS servers generate their own proprietary NDMP backup streams, or replicate only to compatible NAS servers and generate no catalog information. There are few high-performance data movement mechanisms, and since each file system is unique, there is no elegant encapsulation. The data flows between application and unstructured data could not be more different.

Due to the differences in use cases, users, and underlying technology, it is unrealistic to design a single CDM product that effectively covers both application data and unstructured data.

Verdict: Confusion

I don’t see an obvious answer. The use cases, workflows, and technology demonstrate that application data CDM is not the same as unstructured data CDM. At the same time, the overlap in general needs (protection policies, support for file/object), combined with the preference/expectation for centralized support, demonstrates that integrated CDM has significant value.

The question comes down to: Is the value of an integrated product greater than the compromises required to build an “all-in-one”? The market is moving toward infrastructure convergence, but is the same going to happen with data management software?

I don’t have the answer, yet. But just as Tyler Stone is sharing “how products are built”, I’ll take you behind the scenes on how strategic decisions are made. Just wait until you see the great coin flip algorithm we employ…

Stephen Manley @makitadremel
