Managing Your Computing Ecosystem Pt. 3

Overview

The prospect of universal and interoperable management interfaces is closer to reality than ever. Not only is infrastructure converging, but so is the control and management plane. Last time, we discussed Redfish for managing hardware platforms. This time we will talk about Swordfish for managing storage.

Swordfish

The goal of Swordfish is to provide scalable storage management interfaces. The interfaces are designed to provide efficient, low-footprint management for simple direct-attached storage, with the ability to scale up to easy-to-use management across cooperating enterprise-class storage servers in a storage network.

The Swordfish Scalable Storage Management API specification defines extensions to the Redfish API. Thus a Swordfish service is at the same time a Redfish service. These extensions enable simple, scalable, and interoperable management of storage resources, ranging from direct attached to complex enterprise class storage servers. These extensions are collectively named Swordfish and are defined by the Storage Networking Industry Association (SNIA) as open industry standards.

Swordfish extends Redfish in two principal areas. The first is the introduction of management and configuration based on service levels. The other is the addition of management interfaces for higher-level storage resources. The following sections provide more detail on each.

Service based management

Swordfish interfaces allow clients to get what they want without having to know how the implementation produces the results. As an example, a client might want storage protected so that no more than 5 seconds of data is lost in the event of some failure. Instead of specifying implementation details like mirroring, clones, snapshots, or journaling, the interface allows the client to request storage with a recovery point objective of 5 seconds. The implementation then chooses how to satisfy this requirement.
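To make that concrete, here is a minimal sketch of how a client might ask a Swordfish service for such a class of service. The host, credentials, resource paths, and property names below are illustrative assumptions for the example rather than a verbatim copy of the Swordfish schema.

```python
import requests

# Hypothetical management endpoint; paths and property names are placeholders
# for illustration, not normative Swordfish schema.
BASE = "https://mgmt.example.com/redfish/v1"

class_of_service = {
    "Name": "Gold-RPO-5s",
    # Request data protection with at most 5 seconds of data loss,
    # expressed as an ISO 8601 duration.
    "DataProtectionLinesOfService": [
        {"RecoveryPointObjectiveTime": "PT5S"}
    ],
}

resp = requests.post(
    f"{BASE}/StorageServices/1/ClassesOfService",
    json=class_of_service,
    auth=("admin", "password"),
    verify=False,  # example only; verify certificates in real deployments
)
resp.raise_for_status()
print("Class of service created at:", resp.headers.get("Location"))
```

The client states the recovery point objective; whether the service satisfies it with snapshots, journaling, or replication is left to the implementation.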

The basic ideas are borrowed from ITIL (a set of practices for IT service management that focuses on aligning IT services with the needs of business) and are consistent with ISO/IEC 20000.

A Swordfish line of service describes a category of requirements. Each instance of a line of service describes a service requirement within that category. The management service will typically be configured with a small number of supported choices for each line of service. The service may allow an administrator to create new choices if it is able to implement and enforce that choice. To take an example from airlines, you have seating as one line of service with choices of first, business, and steerage. Another line of service could be meals, with choices like regular, vegetarian, and gluten free. Lines of service are meant to be independent from each other. So, in our airline example, we can mix any meal choice with any seating choice.

Swordfish provides three lines of service covering requirements for data storage (protection, security, and storage) and two lines of service covering requirements for access to data storage (connectivity and performance). Swordfish leaves the specification of specific choices within each of these lines of service to management service implementations.

A Swordfish class of service resource describes a service level agreement (SLA). If an SLA is specified for a resource, the service implementation is responsible for assuring that level of service is provided. For that reason, the management service will typically advertise only a small number of SLAs. The service may allow an administrator to create new SLAs if it is able to implement and enforce that agreement.   The requirements of an SLA represented by a class of service resource are defined by a small set of line of service choices.

Swordfish storage

Swordfish starts with Redfish definitions and then extends them. Redfish specifies drive and memory resources from a hardware-centric point of view. Redfish also specifies volumes as block-addressable storage composed from drives. Redfish volumes may be encrypted. Swordfish then extends volumes and adds filesystems, file shares, storage pools, storage groups, and a storage service. (Object stores are intended to be added in the future.)

A storage service provides a focus for management and discovery of the storage resources of a system.  Two principal resources of the storage service are storage pools and storage groups.

A storage pool is a container of data storage capable of providing capacity that conforms to a specified class of service. A storage pool does not support IO to its data storage. The storage pool acts as a factory to provide storage resources (volumes, file systems, and other storage pools) that have a specified class of service. The capacity of a storage pool may come from multiple sources, which are not required to be of the same type. The storage pool tracks allocated capacity and may provide alerts when space is low.
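As a sketch of that factory behavior (again with assumed, illustrative paths and property names), allocating a volume with a particular class of service from a pool might look like this:

```python
import requests

BASE = "https://mgmt.example.com/redfish/v1"

# Ask the pool for a 100 GiB volume that conforms to a named class of
# service. Resource paths and property names are assumptions for the example.
volume_request = {
    "Name": "app01-data",
    "CapacityBytes": 100 * 1024**3,
    "Links": {
        "ClassOfService": {
            "@odata.id": f"{BASE}/StorageServices/1/ClassesOfService/Gold-RPO-5s"
        }
    },
}

resp = requests.post(
    f"{BASE}/StorageServices/1/StoragePools/Pool1/AllocatedVolumes",
    json=volume_request,
    auth=("admin", "password"),
    verify=False,
)
resp.raise_for_status()
print("Volume created at:", resp.headers.get("Location"))
```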

A storage group is an administrative collection of storage resources (volumes or file shares) that are managed as a group. Typically, the storage group would be associated with one or more client applications. The storage group can be used to specify that all of its resources share the characteristics of a specified class of service. For example, a class of service specifying data protection requirements might be applied to all of the resources in the storage group.

One primary purpose of a storage group is to support exposing or hiding all of the volumes associated with a particular application. When exposed, all clients can access the storage in the group via the specified server and client endpoints. The storage group also supports storage (or crash) consistency across the resources in the storage group.

Swordfish extends volumes and adds support for file systems and file shares, including support for both local and remote replication. Each type supports provisioning and monitoring by class of service. The resulting SLA based interface is a significant improvement for clients over the current practice where the client must know the individual configuration requirements of each product in the client’s ecosystem. Each storage service lists the filesystems, endpoints, storage pools, storage groups, drives and volumes that are managed by the storage service.

Recommendations

These three specifications should form the basis for any Restful system management solution.

As a starting point, OData provides a uniform interface suitable for any data service. It is agnostic to the functions of the service, but it supports inspection of an entity data model via an OData conformant metadata document provided by the service. Because of the generic functionality of the Restful style and with the help of inspection of the metadata document, any OData client can have both syntactic and semantic access to most of the functionality of an OData service implementation. OData is recommended as the basis for any Restful service.

Redfish defines an OData data service that provides a number of basic utility functions as well as hardware discovery and basic system management functions. A Redfish implementation can be very lightweight. All computing systems should implement a Redfish management service. This recommendation runs the gamut from very simple devices in the IoT space up to enterprise-class systems.

Finally, Swordfish extends the Redfish service to provide service based storage management. A Swordfish management service is recommended for all systems that provide advanced storage services, whether host based or network based.

Universal, interoperable management based on well-defined, supported standards. It may still seem like an impossible hope to some. Every day, however, we move closer to a more standard, more manageable infrastructure environment.

~George Ericson @GEricson

Security vs Protection – The Same, but Different

Though the words “security” and “protection” are mostly interchangeable in regular use of the English language, when talking about data, it’s a different story.

When we talk about data security, we are referring to securing data from becoming compromised due to an external, premeditated attack. The most well-known examples are malware and ransomware attacks.

Data protection, however, refers to protecting data against corruption usually caused by an internal factor such as human error or hardware failures. We generally address data protection by way of backup or replication – creating accessible versions of the data that may be stored on different media and in various locations.

Of course, these backups can be used for data recovery in either scenario.

 

Under attack

We have seen a dramatic rise in ransomware attacks in recent years, with startling results. According to the FBI, in Q1 of 2016, victims paid $209M to ransomware criminals. Intermedia reported that 72% of companies infected with ransomware cannot access their data for at least 2 days, and 32% lose access for 5 days or more. According to a July 2016 Osterman Research Survey, nearly 80 percent of organizations breached have had high-value data held for ransom.

 

So what is ransomware?

Ransomware is a form of malware that is covertly installed on a victim’s computer and adversely affects it, often by encrypting the data and making it unavailable until a ransom is paid to receive the decryption key or prevent the information from being published.

Most infamously, Sony fell victim two years ago to a crippling attack that shut down its computers and email systems and sensitive information was published on the web. The Sony breach was a watershed moment in the history of cyber attacks. It is believed that the attackers were inside Sony’s network for over 6 months, giving them plenty of time to map the network and identify where the most critical data was stored.

The attack unfolded over a 48-hour period and began by destroying Sony’s recovery capability. Backup media targets and the associated master and media servers were destroyed first; then the attack moved to the DR environment. Only after it had crippled the recovery capabilities did the attack target the production environment. When Sony recognized the attack, it turned to its data protection infrastructure to restore the damaged systems, but the ability to recover was gone. Sony was down for over 28 days and never recovered much of its data.

In Israel, the Nazareth Illit municipality was recently paralyzed by ransomware. Its critical data was locked, and would remain so until the municipality paid the attackers the ransom price.

 


What do we propose?

While Dell EMC offers a range of products and solutions for backup and recovery on traditional media such as tape and disk, data increasingly sits in publicly accessible domains such as networks, heightening the threat to data security. To address this shift, in particular the growing trend toward application development and storage in the cloud, Dell EMC is drawing on its decades of experience securing data under the most stringent requirements, and on the most robust and secure technology set in the market, to architect and implement solutions that lock hackers out of critical data sets and secure a path to quick business recovery. One such solution is the Isolated Recovery Solution (IRS).

IRS 101

Essentially, IRS creates an isolated environment to protect data from deletion and corruption while allowing for a quick recovery time. It comprises the following concepts:

  • Isolated systems so that the environment is disconnected from the network and restricted from users other than those with proper clearance.
  • Periodic data copying whereby software automates data copies to secondary storage and backup targets. Procedures are put in place to schedule the copy over an air gap* between the production environment and the isolated recovery area.
  • Workflows to stage copied data in an isolated recovery zone and periodic integrity checks to rule out malware attacks.
  • Mechanisms to trigger alerts in the event of a security breach.
  • Procedures to perform recovery or remediation after an incident.

*What is an air gap?

An air gap is a security measure that isolates a computer or network and prevents it from establishing an external connection. An air-gapped computer is neither connected to the Internet nor any systems that are connected to the Internet. Generally, air gaps are implemented where the system or network requires extra security, such as classified military networks, payment networks, and so on.

Let’s compare an air gap to a water lock used for raising and lowering boats between stretches of water of different levels on a waterway. A boat traveling upstream enters the lock; the lower gates are closed; the lock is filled with water from upstream, raising the boat; the upper gates are opened; and the boat exits the lock.

In order to transfer data securely, air gaps are opened for scheduled periods of time during actual copy operations to allow data to move from the primary storage to the isolated storage location. Once the replication is completed, the air gap is closed.

 

Dell EMC’s Data Domain product currently offers a retention lock feature preventing the deletion of files until a predefined date. IRS takes such capabilities further. The solution will continue to evolve to simplify deployment and provide security against an even broader range of attacks (rogue IT admins, for example). IRS solutions will make life more difficult for hackers and data more secure. In IT, “security” and “protection” have been treated as two independent, orthogonal concepts. The new, destructive style of attacks changes that relationship. The two teams must partner to make a coherent solution.

 

~Assaf Natanzon @ANatanzon

Managing Your Computing Ecosystem Pt. 2

Overview

We are making strides toward universal and interoperable management interfaces. These are not only interfaces that will interoperate across one vendor or one part of the stack, but management interfaces that can truly integrate your infrastructure management. Last time, we discussed OData, the Rest standardization. This time we will talk about Redfish for managing hardware platforms.

Redfish

Redfish defines a simple and secure, OData conformant data service for managing scalable hardware platforms. Redfish is defined by a set of open industry standard specifications that are developed by the Distributed Management Task Force, Inc. (DMTF).

The initial development was from the point of view of a Baseboard Management Controller (BMC) or equivalent. Redfish management currently covers bare-metal discovery, configuration, monitoring, and management of all common hardware components. It is capable of managing and updating installed software, including for the operating system and for device drivers.

Redfish is not limited to low-level hardware/firmware management. It is also expected to be deployed to manage higher level functionality, including configuration and management of containers and virtual systems.   In collaboration with the IETF, Redfish is also being extended to include management of networks.

The Redfish Scalable Platforms Management API Specification specifies functionality that can be divided into three areas: OData extensions, utility interfaces, and platform management interfaces. These are described briefly in the following sections.

Redfish OData extensions

Redfish requires at least OData v4 and specifies some additional constraints:

  • Use of HTTP v1.1 is required, with support for POST, GET, PATCH, and DELETE operations, including requirements on many HTTP headers
  • JSON representations are required within payloads
  • Several well-known URIs are specified (see the example after this list)
    • /redfish/v1/ returns the ServiceRoot resource for locating resources
    • /redfish/v1/OData/ returns the OData service document for locating resources
    • /redfish/v1/$metadata returns the OData metadata document for locating the entity data model declarations.
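As an illustration of these well-known URIs, a minimal Python sketch of a first contact with a Redfish service might look like the following; the host name is a placeholder, while the URIs are the ones defined above.

```python
import requests

BASE = "https://bmc.example.com"  # placeholder management endpoint

# Fetch the Redfish service root at the well-known URI.
root = requests.get(f"{BASE}/redfish/v1/", verify=False).json()
print("Redfish version:", root.get("RedfishVersion"))

# The service root links to the top-level resource collections.
for name in ("Systems", "Chassis", "Managers"):
    link = root.get(name, {}).get("@odata.id")
    print(name, "->", link)

# The OData metadata document describes the entity data model.
metadata_xml = requests.get(f"{BASE}/redfish/v1/$metadata", verify=False).text
```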

Redfish also extends the OData metamodel with an additional vocabulary for annotating model declarations. The annotations specify information about, or behaviors of, the modeled resources.

Redfish utility interfaces

The utility interfaces provide functionality that is useful for any management domain (for example, these interfaces are used by Swordfish for storage management). These interfaces include account, event, log, session, task, and update management.

The account service manages access to a Redfish service via manager accounts and roles.

The event service provides the means to specify events and to subscribe to indications when a defined event occurs on a specified set of resources. Each subscription specifies where indications are sent; this can be a listening service or an internal resource (e.g. a log service).
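For example, a subscription that sends indications to an external listener might be created roughly as follows. The destination URL is a placeholder, and the exact property set varies with the Redfish schema version, so treat the payload as an assumption.

```python
import requests

BASE = "https://bmc.example.com"

# Illustrative subscription payload; property names follow the Redfish
# EventDestination resource, but check the schema version of your service.
subscription = {
    "Destination": "https://listener.example.com/redfish-events",
    "Protocol": "Redfish",
    "EventTypes": ["Alert"],
}

resp = requests.post(
    f"{BASE}/redfish/v1/EventService/Subscriptions",
    json=subscription,
    auth=("admin", "password"),
    verify=False,
)
resp.raise_for_status()
print("Subscription created at:", resp.headers.get("Location"))
```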

Each log service manages a collection of event records, including size and replacement policies. Resources may have multiple log services for different purposes.

The session service manages sessions and enables creation of an X-Auth-Token representing a session used to access the Redfish service.
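A minimal sketch of that flow, assuming placeholder credentials and host name:

```python
import requests

BASE = "https://bmc.example.com"

# Create a session; the service returns an X-Auth-Token header that
# authenticates subsequent requests.
resp = requests.post(
    f"{BASE}/redfish/v1/SessionService/Sessions",
    json={"UserName": "admin", "Password": "password"},
    verify=False,
)
resp.raise_for_status()
token = resp.headers["X-Auth-Token"]
session_uri = resp.headers["Location"]
if session_uri.startswith("/"):
    session_uri = BASE + session_uri

# Use the token instead of basic authentication on later calls.
systems = requests.get(
    f"{BASE}/redfish/v1/Systems",
    headers={"X-Auth-Token": token},
    verify=False,
).json()

# Log out by deleting the session resource.
requests.delete(session_uri, headers={"X-Auth-Token": token}, verify=False)
```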

The task service manages tasks that represent independent threads of execution known to the Redfish service. Typically, tasks are spawned as a result of a long-running operation.

The update service provides management of firmware and software resources, including the ability to update those resources.

Redfish platform management interfaces

The principal resources managed by a Redfish service are chassis, computer systems and fabrics. Each resource has its current status. Additionally, each type of resource may have references to other resources, properties defining the current state of the resource, and additional actions as necessary.

Each chassis represents a physical or logical container. It may represent a sheet-metal confined space like a rack, sled, shelf, or module. Or, it may represent a logical space like a row, pod, or computer room zone.

Each computer system represents a computing system and its software-visible resources such as memory, processors and other devices that can be accessed from that system. The computer system can be a general-purpose system or a specialized system like a storage server or a switch.

Each fabric represents a collection of zones, switches and related endpoints. A zone is a collection of involved switches and contained endpoints. A switch provides connectivity between a set of endpoints.

All other subsystems are represented as resources that are linked via one or more of these principal resources. These subsystems include: BIOS, drives, endpoints, fans, memory, PCIe devices, ports, power, sensors, processors, and various types of network interfaces.
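As a small example of navigating these principal resources, the sketch below walks the computer systems collection and reports each system’s status; the host and token are placeholders.

```python
import requests

BASE = "https://bmc.example.com"
HEADERS = {"X-Auth-Token": "<token from a session, as above>"}

def get(path):
    # Fetch a Redfish resource and return its JSON representation.
    return requests.get(BASE + path, headers=HEADERS, verify=False).json()

# Walk the computer systems collection and report health status.
for member in get("/redfish/v1/Systems").get("Members", []):
    system = get(member["@odata.id"])
    status = system.get("Status", {})
    print(system.get("Name"), status.get("State"), status.get("Health"))
```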

Conclusion

Redfish delivers a standardized management interface for hardware resources. While it is beginning with basic functionality like discovery, configuration, and monitoring, it will deliver much more. It will extend into richer services and cover more than physical resources, e.g. virtual systems, containers, and networks. Redfish is built as an OData conformant service, which makes it the second connected part of an integrated management API stack. Next up – Swordfish.

~George Ericson @GEricson

Managing Your Computing Ecosystem

Overview

There is a very real opportunity to take a giant step towards universal and interoperable management interfaces that are defined in terms of what your clients want to achieve. In the process, the industry can evolve away from the current complex, proprietary and product specific interfaces.

You’ve heard this promise before, but it’s never come to pass. What’s different this time? Major players are converging storage and servers. Functionality is commoditizing. Customers are demanding it more than ever.

Three industry-led open standards efforts have come together to collectively provide an easy to use and comprehensive API for managing all of the elements in your computing ecosystem, ranging from simple laptops to geographically distributed data centers.

This API is specified by:

  • the Open Data Protocol (OData) from OASIS
  • the Redfish Scalable Platforms Management API from the DMTF
  • the Swordfish Scalable Storage Management API from the SNIA

One can build a management service that is conformant to the Redfish or Swordfish specifications that provides a comprehensive interface for the discovery of the managed physical infrastructure, as well as for the provisioning, monitoring, and management of the environmental, compute, networking, and storage resources provided by that infrastructure. That management service is an OData conformant data service.

These specifications are evolving and certainly are not complete in all aspects. Nevertheless, they are already sufficient to provide comprehensive management of most features of products in the computing ecosystem.

This post and the following two will provide a short overview of each.

OData

The first effort is the definition of the Open Data Protocol (OData). OData v4 specifications are OASIS standards that have also begun the international standardization process with ISO.

Simply asserting that a data service has a Restful API does nothing to assure that it is interoperable with any other data service. More importantly, Rest by itself makes no guarantees that a client of one Restful data service will be able to discover or know how to even navigate around the Restful API presented by some other data service.

OData enables interoperable utilization of Restful data services. Such services allow resources, identified using Uniform Resource Locators (URLs) and defined in an Entity Data Model (EDM), to be published and edited by Web clients using simple HTTP messages.  In addition to Redfish and Swordfish described below, a growing number of applications support OData data services, e.g. Microsoft Azure, SAP NetWeaver, IBM WebSphere, and Salesforce.

The OData Common Schema Definition Language (CSDL) specifies a standard metamodel used to define an Entity Data Model over which an OData service acts. The metamodel defined by CSDL is consistent with common elements of the UML v2.5 metamodel. This enables reliable translation to the programming language of your choice.

OData standardizes the construction of Restful APIs. OData provides standards for navigation between resources, for request and response payloads, and for operation syntax. It specifies the discovery of the entity data model for the accessed data service. It also specifies how resources defined by the entity data model can be discovered. While it does not standardize the APIs themselves, OData does standardize how payloads are constructed, a set of query options, and many other items that often differ across current Restful data services. OData specifications utilize standard HTTP, AtomPub, and JSON. Also, standard URIs are used to address and access resources.
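For instance, the standard system query options let any OData client filter, project, and page a collection in a uniform way. The sketch below queries a hypothetical entity set; the service URL and property names are invented for the example, but $filter, $select, $orderby, and $top are standard OData query options.

```python
import requests

# Hypothetical OData service and entity set (placeholders for illustration).
SERVICE = "https://example.com/odata/v4/InventoryService"

resp = requests.get(
    f"{SERVICE}/Disks",
    params={
        "$filter": "CapacityBytes gt 1000000000000",  # disks larger than 1 TB
        "$select": "Id,Model,CapacityBytes",
        "$orderby": "CapacityBytes desc",
        "$top": "10",
    },
)
# OData returns collections in a JSON object with a "value" array.
for disk in resp.json().get("value", []):
    print(disk["Id"], disk["Model"], disk["CapacityBytes"])
```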

The use of the OData protocol enables a client to access information from a variety of sources including relational databases, servers, storage systems, file systems, content management systems, traditional Web sites, and more.

Ubiquitous use will break down information silos and will enable interoperability between producers and consumers. This will significantly increase the ability to provide new and richer functionality on top of the OData services.

The OData specifications define the protocol itself, URL conventions for addressing resources, the Common Schema Definition Language (CSDL) for describing entity data models, and the JSON payload format.

Conclusion

While Rest is a useful architectural style, it is not a “standard” and the variances in Restful APIs to express similar functions means that there is no standard way to interact with different systems. OData is laying the groundwork for interoperable management by standardizing the construction of Restful APIs. Next up – Redfish.

 

~George Ericson @GEricson

 

Data Protection for Public Cloud Environments

In late 2015 I was researching the options available to protect application workloads running in public cloud environments. In this post I will discuss my findings, and what we are doing at Dell EMC to bring Enterprise grade Data Protection Solutions to workloads running in public cloud environments.

 

To understand how Data Protection applies to public cloud environments, we need to recognize that Data Protection can occur at different layers in the infrastructure. These include the server, storage, hypervisor (if virtualized), application and platform layer. When we implement Data Protection for on premises environments, our ability to exercise Data Protection functions at any one of these layers depends upon the technologies in use.

 

At the server layer, we typically deploy an agent-based solution that manages the creation of Data Protection copies of the running environment. This method can be used for virtualized, bare metal and even containerized environments that persist data.

 

At the application layer we typically rely on the applications’ native data protection functions to generate copies (usually to file system or pseudo media targets). Examples of this can include database dumps to local or remote disk storage. We can go a step further and embed control-and-data path plugins into the application layer to enable the application’s native data protection methods to interface with Data Protection storage for efficient data transfer and Data Protection management software for policy, scheduling, reporting and audit purposes.

 

Like the server approach, the application native approach is agnostic to the platform the application is running on, be it virtualized, bare metal or containerized, in public or private cloud environments. Where things get interesting is when we start leveraging infrastructure layers to support Data Protection requirements. The most common infrastructure layers used are the hypervisor and storage-centric Data Protection methods.

 

A by-product of infrastructure methods is they require privileged access to the infrastructure’s interfaces to create protection copies. In private cloud environments this requires coordination and trust between the Backup or Protection Administrator and the Storage or Virtualization Administrator. This access is often negotiated when the service is first established. In Public Cloud environments there is no Storage or Virtualization Administrator we can talk with to negotiate access. These layers are off limits to consumers of the Public Cloud. If we want to exercise Data Protection at these layers, we have to rely on the services that the Public Cloud provider makes available. These services are often referred to as Cloud-based Data Protection.

 

For example, Amazon Web Services (AWS) offers snapshots of Elastic Block Storage (EBS) volumes to S3 storage. This provides protection of volumes at the block level. Microsoft Azure offers snapshots of VMs to Azure Blob Storage and the Azure Backup service for VM instances running Windows operating systems.
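As a small illustration of the tightly coupled, provider-native approach, creating an EBS snapshot with the AWS SDK for Python takes only a few lines; the region, volume ID, and description are placeholders.

```python
import boto3

# Snapshot an EBS volume; the snapshot data is stored by the service in S3.
ec2 = boto3.client("ec2", region_name="us-east-1")

snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",   # placeholder volume ID
    Description="nightly protection copy",
)
print("Snapshot started:", snapshot["SnapshotId"], snapshot["State"])
```

Convenient as this is, the resulting copies live in AWS’s own format and service, which is exactly the coupling discussed below.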

 

A common property of Cloud-based Data Protection services (and infrastructure-centric protection methods, for that matter) is that they are tightly coupled. Tight coupling means the technologies and methods are highly dependent on one another to function, which allows the method to perform at peak efficiency. For example, the method is able to track the underlying data that is changing in the virtual machine instance and, when appropriate, take copies of the data that has changed between copies.

 

Tightly coupled methods have gained popularity in recent years simply because data volumes continue to grow to the extent that traditional methods are struggling to keep up. However, there are some important trade-offs being made when we bet the business solely on tightly coupled Data Protection methods.

 

Tight coupling trades flexibility for efficiency. In other words, we can have a very efficient capability, but it is highly inflexible. In the case of Data Protection, a solution focused on flexibility allows one to free the data copies from the underlying infrastructure. For example, in the case of AWS snapshot copies to S3, the copies are forever tied to the public cloud platform. This is a critical point that requires careful attention when devising a Public Cloud Data Protection strategy.

 

The best way I can describe the implications is to compare the situation to traditional on premises Data Protection methods. With on premises solutions, you are in full control of the creation, storage and recovery processes. For example, let us assume you have implemented a protection solution using a vendor’s product. This product would normally implement and manage the process of creating copies and storing these copies on media in the vendor’s data format (which in modern times is native to the application being protected). The property we usually take for granted here is that we can move these copies from one media format to another, or from one location to another. We can also recover them to different systems and platforms. This heterogeneity offers flexibility, which enables choice: the choice to change our mind or adjust our approach to managing copies as conditions change. For example, with loosely coupled copies, we can migrate them from one public cloud provider’s object storage (e.g. AWS S3) to another public cloud provider’s object storage (Azure Blob Storage), or even back to private cloud object storage (Elastic Cloud Storage), if we decide to bring certain workloads on premises.

 

Despite these trade-offs, there are very good reasons to use a public cloud provider’s native Data Protection functions. For example, if we want fast full VM recovery back to the source, we would be hard pressed to find a faster solution. However, cloud-native solutions do not address all recovery scenarios and lack flexibility. To mitigate these risks, a dual approach is often pursued that addresses the efficiency, speed and flexibility required by Enterprise applications, in public, private or hybrid cloud models.

 

My general advice to customers is to leverage tightly coupled Data Protection methods for short-lived Data Protection requirements, along with loosely coupled methods. In the case of Public Cloud models, this requires the deployment of software technologies (or hardware, via services like Direct Connect and ExpressRoute) that are not tied to the Public Cloud provider’s platform or data formats. As a consumer of Public Cloud services, this will afford you the flexibility to free your data copies if need be, in future.

 

Our Strategy

 

At Dell EMC we recognize that customers will deploy workloads across a variety of cloud delivery models. These workloads will require multiple forms of Data Protection, based on the value of the data and the desire to maintain independence from the underlying infrastructure or platform hosting the workloads.

 

Our strategy is to provide customers Data Protection everywhere. This protection will be delivered via multiple avenues, including orchestrating the control path of Public Cloud providers’ native solutions, and allowing the Public Cloud to host and manage the data path and storage. For workloads that require ultimate flexibility and independent Data Protection copies, we will also manage the data path and storage, to enable copies to remain agnostic to the cloud vendor. Furthermore, for customers who choose to consume SaaS-based solutions, we will continue to work with SaaS vendors to expand our existing SaaS Data Protection offering to export and manage data copies using the vendors’ available APIs, to the extent possible.

 

Ultimately, customers will choose which path they take. Our strategy is to ensure our Data Protection solutions allow customers to take any path available to them.

 

~Peter Marelas @pmarelas

Data is the currency of the 21st century

For the last 20 years I’ve helped clients design and build Enterprise IT applications and infrastructure to support their businesses. During this time, my focus was on ensuring that data and the business processes IT supports are always available, under any conditions, planned or unplanned.

 

Recently, my focus changed. Last month I joined the Chief Technology Office inside the Infrastructure Solutions Group. For those not familiar, the Infrastructure Solutions Group is responsible for Dell EMC’s Primary Storage and Data Protection Solutions. If you have heard of VMAX, VNX, XtremIO, Unity, Data Domain, NetWorker or Avamar, you are in good company.

 

I’ve shifted from data infrastructure to exploring the value one can derive from data. In my quest to find answers, I have buried my head in research on many topics: conventional business intelligence, data visualization, statistics, machine learning, probabilistic programming and artificial neural networks. Suffice to say it is a lot to take on in a few months (and my head still hurts). What I have come to learn is that data generated by machines, in the course of doing business and interacting with each other, contains value that is seemingly invisible to the human eye. That value needs to be explored and unlocked.

 

Since the beginning of time, the human race has collected data to build a plan and support decisions. Data would be collected from surveys, forms, experiments, tests, optical devices and analogue to digital converters. This data was used to understand the environment around us, to find meaning, and to support objective decision making. However, the amount and diverse nature of the data we could generate was limited by the instruments and methods available to describe the environment in terms computers could process.

 

What changed? Thanks to mobile phones and other portable devices, the world switched from analogue to digital communications.

 

According to IDC, the digital universe we now live in is doubling in size every two years, and by 2020 the data we create and copy annually will reach 44 trillion gigabytes.

 

This digital revolution is having another profound effect. It is fueling a resurgence in Artificial Intelligence.

 

With data streaming into Artificial Intelligence systems from sight (images), sound (microphone) and touch (screen) sensors, we can now understand what people are doing, feeling, saying and seeing. And by viewing the environment from multiple perspectives, Artificial Intelligence can build a greater and more accurate understanding of the environment. Consider a little experiment. Next time you speak to someone, close your eyes and try to imagine how they feel. Are they happy, are they sad, are they calm, are they angry? The human brain can associate emotion from speech, the same way we associate objects through the visual cortex. Now, perform the same experiment, but this time with your eyes open. Are you more confident in your answer now?

 

What may not be obvious is that one of the main ingredients driving the resurgence in Artificial Intelligence is the abundance of rich and diverse data. This form of Artificial Intelligence, better known as Deep Learning, requires access to large amounts of data and computational power to identify patterns and form opinions. Humans then take these opinions to support their motivations and actions. The motivation behind the learning can be anything. For example, a business may want to identify growth opportunities, optimize a system or process, or anticipate and prevent a bad situation.

 

Major organizations including IBM, Google, Facebook, Amazon, Baidu and Microsoft recognize the potential of Artificial Intelligence, and each are accumulating data at massive scale. Much of this data comes in the form of speech, images and text from digital sources such as social media platforms, mobile devices and IoT sensors. What was once seemingly invisible to a human being can now be observed from the data.

 

What does the future hold?

 

Data rich organizations and governments will be able to do a lot of good for society. They will be able to solve crimes, identify disease early, prevent the spread of disease, anticipate social unrest and reduce our dependency on fossil fuels. However, they will also be able to anticipate and take advantage of events before everyone else, such as economic conditions, domestic trends and worldwide movements.

 

If you are in the business of delivering products and services, consider the type of data you need and questions you could ask to obtain a complete understanding of your products and users.

 

If you are engaging with customers and partners, consider the type of data you need and questions you could ask to improve customer satisfaction, simplify interactions, increase efficiency and anticipate next steps.

 

To summarize, data is the currency of the 21st century brought on by the digital revolution. This is fueling an Artificial Intelligence arms race. Those that ask questions of the data and listen will thrive. Those that ignore the data will fall further behind. To remain relevant in this digital era we all need to produce, nurture and listen to the data.

~Peter Marelas @pmarelas

The Evolution of Cybercrime: Ransomware and the need for an Isolated Recovery Solution

Hope is not a strategy.

 

It takes some fortitude to begin a blog with such a well-worn cliché, but in this case it is more than fitting. With the emergence of ransomware and hacktivism as a rapidly growing new category of threats, all too often the response is hope.

 

Hope the hackers don’t attack us. Hope we can detect and shut down an attack before it does too much damage. Hope that by upgrading our perimeter defense and educating our employees we will be a harder target.

 

Good luck with that.

 

The 2015 Data Breach Investigations Report revealed that over 60% of companies could be compromised in six minutes or less. EMC has conducted two global data protection surveys in the past two years, and the resulting data is equally worrisome. Approximately one third of all customers have experienced data loss due to a security breach, and the average cost of a single incident is $914,000, which in Dr. Evil terms is really, really close to “One Million Dollars!”

 

The statistics are out there for you to find if you want a greater dose of shock and horror. (Sometimes I wonder given all the travel I do why I watch the television show “Why Airplanes Crash”, but we are all drawn to the macabre and disturbing, right?)

 

But the reason I’m writing this isn’t just to put forth the disturbing data again. Instead, it’s to point out that many groups are looking in the wrong place for solutions.

 

While the threats are evolving, the Information Security community, vendors and IT professionals are all doubling down on more of the same. All of the emphasis is on incident prevention and very little attention is given to data protection and recovery.  The facts however lean very heavily toward the probability that any given business will be the subject of a successful cyber-attack in the foreseeable future.  In fact, most security experts agree that there are two kinds of companies:  those who have experienced a successful cyber-attack, and those who have experienced a successful cyber-attack and simply don’t know it yet.

 

As these threats have emerged, guidance in the form of Cyber Security Frameworks (CSFs) has been established. Most of these CSFs reference or borrow heavily from the NIST CSF, which has some very important fundamental information. “Protect” and “Recover” are both pillars of the NIST CSF, as well as of every other highly regarded framework. In this context, “Protect” explicitly means to make secure, isolated protection copies of data. “Recover” means that after detecting a threat, a business needs a plan to suspend production, eliminate the threat, and recover the data and systems necessary to resume operations.

 

EMC has a solution, called the “Isolated Recovery Solution”, for customers who want to protect themselves from these modern and sophisticated threats. Logically, this provides an “air gap” between the production environment and the data that is critical for the survival of any business. The key elements of the EMC Isolated Recovery Solution are described below.

 


 

Isolated Recovery is NOT Disaster Recovery. IR represents only the most critical data that a business needs to survive.  Most customers do not stand up an IR solution at the same size and scale as their primary backup or DR infrastructure.  Furthermore, IR usually lives in the same location as the production data.  This is essential for rapid recovery from an incident.  The mechanisms IR puts in place are both physical and logical.  For many enterprises this means creating a small locked cage within the data center and restricting access to the very few people who are responsible for the system.

 

There is a lot more to talk about when it comes to the IR solution, so I encourage you to reach out to one or more trusted advisors, prioritize what matters to you, and then dive into all the details.

 

But what about my earlier statements about not hanging all our hopes on prevention? Prevention is useful, but a successful defense requires a layered approach.  All of the parts of the solution need to fulfill a purpose within the CSF.  But how are you organized as a company?  When was the last time the Information Security team and the Backup team got together and built a joint solution?  My guess is probably never.  Traditional organizational structures simply aren’t aligned very well in order to solve this problem.  In order to defend against a threat that evolves rapidly, everyone needs to adapt.  Whether it is the backup guru talking to the firewall guru, or the CIO talking to the Board of Directors, someone needs to start having this conversation.  You are that someone.  You can’t just hope that somebody else will do it. Hope is not a strategy.

 

Rich Colbert

Cloud Native for The Enterprise

In Part I of this series, we explored how the Heroku architecture wires middleware software to their platform, allowing developers to focus on writing applications to drive their business. In this part, focusing on the enterprise use case, we’ll analyze the Heroku-inspired PaaS system – Cloud Foundry.

 

Part II – Cloud Foundry – Cloud Native for the Masses

 

Cloud Foundry – The PaaS for the Enterprise

 

Cloud Foundry (CF) is considered by many to be the PaaS for the enterprise – and for good reasons. With mature APIs, built-in multi-tenancy, authentication and monitoring, it’s no wonder vendors like IBM, HP and GE built their platforms as certified CF distributions. Cloud Foundry can be thought of as a customizable implementation of the Heroku architecture. The main difference between Heroku and CF is the flexibility that allows CF to be installed anywhere. Like Heroku, CF adopted the strict distinction between the stateless 12-factor part of the app (“the application space”) and the stateful part of the app (“Services”). While the application space is very similar to the equivalent on Heroku, CF can’t depend on Heroku engineers to manage the services (e.g. databases/messaging). Instead, CF delegates this role to DevOps. As a result, enterprises can configure the platform to expose custom services that meet their unique needs. This sort of adaptability plays in favor of CF in the enterprise world.

 

Some of the major Cloud Foundry qualities that make it attractive in this space:

 

Avoid lock-in – With years of experience of being locked in by software and infrastructure stacks, enterprises seek freedom of choice. With the CPI abstraction (Cloud Provider Interface), Cloud Foundry (using BOSH) can be deployed on practically any infrastructure.

 

Mature on-premises use cases – The cloud is great! Enterprises are not passing up on that trend. However, reality has its own role in decision making, and many workloads are staying on premises. While security, regulations and IT culture are often cited, what keeps a large portion of the workloads on premises is the years and years of legacy systems holding the most important asset of any organization – its data. In order to move mission critical workloads to new architectures on a remote datacenter (e.g. cloud), organizations have to port all the proprietary, non-portable data too. Translating for readers with cloud native mindsets: DB2 on mainframe is not yet “Dockerized”; one can’t simply push it around in the blink of an eye.

 

Good Transition Story – CF can do more than run legacy workflows on premises. The big difference is that it provides a well-defined architectural transition story. The ability to move parts of the app or new modules to run as CF apps, while easily connecting to the legacy data systems as services (via service brokers), is powerful. This allows developers to experiment with the modern technology while accessing the core systems, giving them a real opportunity to experience the cloud native productivity boosts while keeping their feet firmly on the ground.

 

Compliance, Compliance and again Compliance – Many cloud native discussions leave out where data is being stored. We often hear that cloud native apps only use NoSQL or Big Data databases. While using modern databases makes sense in some use cases, in order to deliver compliant applications, organizations find it easy and safe to use mature database back-ends (e.g. Oracle/MS SQL) for serving their modern cloud native apps. With Cloud Foundry’s Service Brokers model, they are able to leverage the tools and processes they are already proficient with to protect their data, while modernizing their apps and workflows.

 

Vendor behind the platform – Although Cloud Foundry is open source, enterprises like having a throat to choke. Proprietary distributions like Pivotal Cloud Foundry support the transformation and can be engaged when customers encounter challenges.

 

While the motivation to modernize exists in almost every organization these days, not seeing a clear path for transformation can hold companies back. Many of the modern platforms (e.g. Kubernetes or Docker Datacenter) have an appealing vision for where the world of software needs to be, but it is not clear how to make a gradual transformation. By adopting CF, enterprises see a steady path that they can pursue and start getting results relatively quickly.

 

Cloud Foundry – Trouble in Paradise

 

Cloud Foundry is an exciting platform that can bring many benefits to organizations adopting it. However, there is always room to improve. CF’s flexibility in handling stateful services via the “Service Brokers” abstraction and DevOps management is what made the “transition story” possible. However, as always, when making something more generic, there are tradeoffs. Since Heroku manages both stateful and stateless services, it gains full control over the platform. Without that control, some pain points emerge.


Cloud Native New Silos

 

On one hand, the application space is modern and fully automated, while on the other hand the services space and its integration points with the app space have quite a manual feel. It’s no surprise that the platform’s shortcomings revolve around lack of control and lack of visibility caused by the new “DevOps Silo” the CF architecture enforces. For example:

 

Scale out – Cloud Foundry can do an impressive job scaling out the stateless services according to the load on the application. But sometimes, once the compute bottleneck has been resolved, the next bottleneck becomes the database. If the platform could control stateful services as well, it could also scale the database resources. Today developers have to pick up the phone and call DevOps, asking for specific service tuning.

 

Multi-Site HA – Production systems often run on more than a single site due to HA requirements and/or regulations. In order to support such deployment topologies, someone has to make the stateful services available on multiple sites when required. As the CF runtime ignores stateful services, there has to be an out-of-platform process of orchestrating stateful services. From the CF perspective, on every site you run a different application, and the coupling of those sites is not visible to CF. Seeing such deployment topologies in real life, I can testify that it is painful and error prone.

 

Portability – Great! CF is portable! Is it? Let’s say I run Facebook on my datacenter using CF. It would be effortless to point my CF client at AWS and push Facebook to a CF instance running on AWS. Job done. Or is it? If you are following so far you might have noticed that in my new Facebook production site on AWS, I’m going to be very lonely. In fact – I’m not going to have any friends! While the stateless services are portable, the stateful ones, which more than anything else are the core value of any application, are certainly not portable.

 

Test & Dev – When new architectures emerge, they often focus on solving the primary use case of running in production. Only later, as the technology matures, do the toolsets and patterns for application maintenance arrive. To be fair, Cloud Foundry does have nice patterns for the application lifecycle, like blue-green deployments. However, this is not enough, especially in the enterprise. When maintaining an application, developers often need access to instances with fresh data from production. In the past, they had to speak with a DBA to extract and scrub the data for them, or even had a self-service way of doing this. In the modern world of micro-services, applications tend to use multiple smaller databases rather than one monolith. In addition, the new DevOps silo means that there is one additional hop in the chain for doing this already complex operation. I’ll elaborate on the developer use case in the next post.

 

Data protection – While I have heard more than once that data protection is not required in cloud native environments, where the data services replicate data across nodes, I’ve also witnessed organizations lose data due to human error (expiring full S3 buckets without backup) or malicious attacks (e.g. dropping an entire schema). When a crisis happens, organizations need the ability to respond quickly and restore their data. The CF platform hides the data services from us and creates yet another silo in the already disconnected world of data protection: storage admins, backup admins, DBAs and now – DevOps.

 

Cost Visibility – When adopting modern platforms, more and more processes become automatic. This empowers developers to do more. However, with great power (should) come great responsibility. Companies complain that in large systems they lose visibility into the resource cost per application and use case. While with CF you can limit an application to consume some amount of compute resources, you get no control or visibility over stateful services cost (which in many cases is the most significant). For example, developers can run test-dev workloads attached to expensive stateful services when they could have used cheaper infrastructure for their test workloads. With Big Data Analytics services there is also a lack of visibility in production. There is no ability to distinguish which app is utilizing what percentage of the shared resource, and therefore no ability to prioritize and optimize.


Native Application Data Distribution example

 

As the technology matures and more organizations take it to production, the community will start catching up with solutions. Some initiatives are already being thrown into the air (e.g. BOSH releases for stateful services) and offer some enhanced features (although always with trade-offs). I’ve recently seen a demo by Azure architects that runs Pivotal Cloud Foundry in Azure. Since they have full control of the stateful services, they could preconfigure the Cloud Foundry marketplace to work seamlessly. Even more impressive, they stated they are working on supporting this configuration on premises using Azure virtualization software on customer hardware. With the tradeoff of being locked in with Microsoft, having the ability to control and get visibility into the full spectrum of the application is certainly an attractive offering.

 

Conclusion

Cloud Foundry is bringing the Cloud Native environment to the enterprise ecosystem. By creating a more flexible model that depends on DevOps, it solves some of the challenges with bringing the Heroku model to the enterprise. Of course, there are new complexities that arise with the Cloud Foundry model. In particular, adding DevOps makes the stateful (i.e. persistent data) operations more challenging – especially for developers.

 

In the next article we’ll discuss the role of developers in the Cloud Native Enterprise. While cloud native applications empower developers to do more, in the enterprise world there are implications that create an interesting dissonance.

 

Amit Lieberman @shpandrak, Assaf Natanzon @ANatanzon, Udi Shemer @UdiShemer

Deduplicated Storage, Years after the Salesperson Leaves

So you are thinking about purchasing a deduplication storage system for your backups. Everyone tells you that, by removing redundancies, it will save storage space, and reduce costs, power, rack space, and network bandwidth requirements. Perhaps a vendor even did a quick analysis of your current data to estimate those savings. But, many months from now, what will your experience really be? We decided to investigate that question.

I was fortunate enough to collaborate with researchers studying long-term deduplication patterns, and our work was recently published at the Conference on Massive Storage Systems and Technology (MSST 2016)[1].

The Study

How do you perform a long-term study of deduplication patterns?

  • Step one: Create tools to gather data.
  • Step two: Gather data for years. And years. And years.
  • Step three: Analyze, analyze, analyze. Then upgrade your tools to analyze some more.

Our data collection tools were designed to process dozens of user home directories in a similar fashion as an actual deduplicated storage system. The tools ran daily from 2011 through today, with infrequent gaps due to weather-related power outages, hardware failures, etc. Every day, the tools would read all the files, break each file into chunks of various sizes (2KB – 128KB and full file), calculate a hash for each chunk, and record the file’s metadata.
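A much-simplified sketch of that kind of collection pass is shown below, using fixed-size chunks and SHA-1 hashes; the real tools also performed content-defined (variable-size) and whole-file chunking and recorded per-file metadata.

```python
import hashlib
import os

CHUNK_SIZE = 8 * 1024  # one of several chunk sizes used in the study

def chunk_hashes(path, chunk_size=CHUNK_SIZE):
    """Yield (offset, digest) for each fixed-size chunk of a file."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield offset, hashlib.sha1(chunk).hexdigest()
            offset += len(chunk)

def daily_snapshot(root):
    """Walk a directory tree, as a daily collection run would, and index it."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index[path] = list(chunk_hashes(path))
    return index
```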

In this paper, we analyzed a 21-month subset of the data covering 33 users, 4,100 daily snapshots, and a total of 456TB of data. If this isn’t the largest and longest-running deduplication-oriented data set ever gathered, it is certainly among the top few. Analyzing the collected data was a gargantuan task involving sorting and merging algorithms that ran in phases, writing incremental progress to hard disk since the data was larger than memory. For anyone who wants to perform their own data collection or their own analysis of this immense data set, the tools and data set are publicly available.

 

The Results

What did we discover? Starting with general deduplication characteristics, we find that whole file deduplication works well for small files (<1 MB) but poorly for large files. Since large files consume most of the bytes in the storage system, this result supports the industry trend towards deduplicating chunks of files. We then looked at optimal deduplication chunk sizes based on the type of backup schedules.

  • Weekly full and daily incremental backups: optimal chunk size is 4-8KB
  • Daily full backups: optimal chunk size is 32KB

Deduplication ratios also varied by file type. Unsurprisingly, compressed files such as .gz, .zip, and .mp3 get little deduplication no matter the chunk size, while source code had the highest space savings because only small portions are modified between snapshots.  Interestingly, VMDK files took up the bulk of the space, highlighting that storage systems need to be VM-aware.

Looking at the snapshots for our 33 users, we found significant differences that are hidden when all the users are grouped together. Per-user deduplication ratios varied widely, from a low of 40X (meaning the data fits in 1/40th of its original space) to a high of 2,400X! That outlier user had large, redundant data files that deduplication could compact dramatically.

While much of the redundancy comes from the large number of snapshots per user, increasing the number of preserved snapshots (i.e. the retention window) does not increase the deduplication ratio linearly, because of metadata. For example, we saw deduplication as high as 5,000X for the outlier user before considering the impact of file recipes. A file recipe is the internal metadata needed to represent and reconstruct a file in its deduplicated form. A recipe consumes a tiny fraction of the space compared to the actual file data, but even when the data isn’t changing, we still need a recipe for every file in every snapshot. This led to an interesting conclusion: deduplication ratios tend to grow rapidly for the first few dozen snapshots and then grow more slowly, because file recipes become a larger fraction of what is stored. There are even cases where deduplication ratios dropped because a user added novel content.
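
A toy calculation (the numbers below are invented, not taken from the study) shows why recipes cap the ratio: with S snapshots of L logical bytes that never change, U unique bytes, and R bytes of recipe per snapshot, the ratio S·L / (U + S·R) approaches L / R as S grows.

```python
# Invented numbers to illustrate the plateau, not measurements from the paper.
def dedup_ratio(snapshots, logical=10**9, unique=10**9, recipe=10**6):
    stored = unique + snapshots * recipe  # data is stored once; recipes are not deduplicated
    return snapshots * logical / stored

for s in (1, 100, 1000, 10000):
    print(s, round(dedup_ratio(s), 1))
# 1 -> 1.0, 100 -> 90.9, 1000 -> 500.0, 10000 -> 909.1: the ratio creeps
# toward logical/recipe = 1000x even though the data itself never changes.
```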

Finally, we looked at grouping users to see whether there are advantages to deduplicating similar users together. We found pairs of users whose data overlapped by as much as 44%, which translates directly into space savings when those users are placed on the same deduplication system. Intelligent grouping and data placement can improve customers’ deduplication results.
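
As a sketch of what such grouping could look like (the overlap metric here, shared unique bytes relative to the smaller user, is an assumption for illustration, not necessarily the paper’s definition):

```python
# Estimate pairwise overlap between users' unique chunk sets so that highly
# overlapping users can be placed on the same deduplication system.
# `user_chunks` maps user name -> set of (chunk_hash, chunk_size) pairs.
from itertools import combinations

def overlap_percent(a, b):
    """Shared unique bytes as a percentage of the smaller user's unique bytes."""
    shared = sum(size for _, size in a & b)
    smaller = min(sum(size for _, size in s) for s in (a, b))
    return 100.0 * shared / smaller if smaller else 0.0

def best_pairs(user_chunks, top=5):
    """Return the most similar user pairs: the best candidates for co-location."""
    scored = [
        (overlap_percent(set_a, set_b), user_a, user_b)
        for (user_a, set_a), (user_b, set_b) in combinations(user_chunks.items(), 2)
    ]
    return sorted(scored, reverse=True)[:top]
```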

These findings can provide guidance to storage designers as well as customers wondering how their system will respond to long-term deduplication patterns. Customers and designers should think about matching chunk sizes to backup policies, metadata management at scale, and grouping together similar users’ data. There are many more details in the paper, and researchers are welcome to use our tools and data set to perform their own analysis.

[1] Zhen Sun, Geoff Kuenning, Sonam Mandal, Philip Shilane, Vasily Tarasov, Nong Xiao, and Erez Zadok. “A Long-Term User-Centric Analysis of Deduplication Patterns.” In Proceedings of the 32nd International Conference on Massive Storage Systems and Technology (MSST 2016).


-Philip Shilane @philipshilane

Cloud Native for The Enterprise

The “Cloud Native” application architecture is disrupting the way organizations build software because it delivers massive productivity gains. The two major benefits are: 1) getting from idea to production faster and 2) scaling more easily. Cloud native helps remedy the biggest challenges of modern software. Cloud native patterns and practices, originally crafted by the likes of Google and Netflix, have found their way into the industry via popular open source projects. Developers from small startups all the way to huge corporations are adopting these technologies and practices and weaving them into the software fabric of their businesses.


In this series of articles, we’ll explore how cloud native practices are affecting companies, especially enterprises. We’ll look at this transformation through the eyes of a “typical enterprise,” focusing on its unique needs. We’ll make some observations on the challenges it faces, how it is addressing them, and what vendor opportunities lie ahead in this market.

“If you want to understand today, you have to search yesterday”

~ Pearl Buck

Part I – The Heroku Use Case

To understand the current state, we’ll first briefly review the history of what we call “Cloud Native for the Masses”. Heroku was the first successful initiative to popularize cloud native for developers outside the software-giant club. Until Heroku, organizations had to build an entire stack of middleware to deliver cloud native microservices in production. The software giants did exactly that: they built or patched together a variety of tools to maintain their microservices. Netflix, Twitter, and Facebook all had similar, but proprietary, stacks for service discovery, configuration, monitoring, container orchestration, and so on. Heroku’s engineers distilled a key ingredient of cloud native as we know it today: the “platform” component, aka PaaS.


[Image: Heroku – Cloud Native for the masses]

Heroku began by hosting small-to-medium scale web applications, cleanly separating the platform’s responsibilities from the application’s. Coining the term “12-factor apps,” Heroku could run, configure, and scale many types of web applications. This let its customers focus on their business and applications while delegating maintenance of the platform to Heroku. We won’t go into a detailed description of Heroku’s architecture and workflows, but we will describe one key aspect that matters for this analysis: how Heroku handles stateful services.


In order to orchestrate the application, Heroku had to draw a clear line between the stateful and stateless parts of the application; in Heroku terminology, it separated the application from its Add-ons. The reason for the split is physics. Scaling, porting, and reconfiguring stateless processes is done seamlessly by modern container orchestration (Dynos, in Heroku). Doing the same for stateful processes requires moving physical bits between processes while maintaining their consistency. To scale a stateless process, you run another instance. To scale a stateful service, you need network-intensive distributed state protocols that differ from system to system. Heroku addressed the stateful needs by focusing on popular stateful services and offering them as dependencies of the application: Postgres, Redis, RabbitMQ, email, SMS, all available in Heroku’s add-ons marketplace and maintained by Heroku engineers (DBAs, IT).
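
From the application’s point of view, the split looks roughly like the sketch below: the app itself stays stateless and reaches its stateful add-on through a connection string the platform injects. The variable name DATABASE_URL and the psycopg2 driver are illustrative assumptions, not a prescription.

```python
# Sketch of the stateless/stateful split as seen by a 12-factor-style app:
# no state lives in the process; the stateful service is an attached resource.
import os
import psycopg2  # any client for the attached backing service would do

def get_db():
    # The platform provisions the add-on and supplies its URL; scaling the
    # stateless web process to N copies never copies or moves this data.
    return psycopg2.connect(os.environ["DATABASE_URL"])
```

Porting the application elsewhere is then mostly a matter of supplying a different connection string; porting the data behind it is the hard part, which is exactly the stickiness discussed below.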


[Image: App space vs. Stateful Marketplace]

For “the Heroku use case,” this separation was elegant: it not only solved real-world problems but was also relatively easy to monetize. Developers could use a variety of programming languages and stacks for their applications without requiring Heroku to support each one (to learn more, search for “Heroku buildpacks”). By focusing on web applications, Heroku could offer unique capabilities out of the box, like powerful HTTP caches and developer workflow tools. Another advantage of this approach is that it creates stickiness with Heroku: the data is managed by the service, and porting it elsewhere is not a straightforward task. Over time, more vendors offered Add-ons on the Heroku marketplace, enabling third parties to leverage the Heroku platform.


The potential of the Heroku architecture led to various initiatives that try to imitate it in a more generic and portable manner. In the next article we’ll focus on one of those initiatives, Cloud Foundry. We’ll observe the implications and effects of applying the Heroku architecture to use cases other than the “Heroku hosting use case” it was designed for.


To summarize, we’ve seen that in order to deliver cloud native applications, developers need a platform: either they build it themselves, or they buy or lease one. In most organizations, keeping engineers with this skill set on staff doesn’t make sense, hence there is room for consumable PaaS platforms. A common metaphor I heard from Pivotal puts this point nicely: “Some people are Superman so they can fly around, but most of us need to board an airplane.”


Amit Lieberman @shpandrak, Assaf Natanzon @ANatanzon, Udi Shemer @UdiShemer