How I Learned to Stop Worrying and Love New Storage Media: The Promises and Pitfalls of Flash and New Non-volatile Memory Part III

Moving Beyond Flash: Non-volatile Memory

I previously discussed my work with current flash products as well as technology trends driving the future of flash. Next, I will look into a fuzzy crystal ball at new storage media technologies coming to market.

The traditional storage hierarchy had DRAM at the top for fast accesses and hard disk drives at the bottom for persistent storage. Then we added flash to the hierarchy, either as a cache between the two layers or as a replacement for hard disk drives. But the storage vendors do not stand still; they are innovating on new storage media. In this blog, I want to focus on two non-volatile memory technologies: Non-volatile Memory Express (NVMe) and non-volatile DIMMs (NVDIMMs). For a primer on these topics, check out the proceedings of the 2016 Flash Memory Summit.

NVMe

When performance engineers started measuring the time each step takes when accessing flash, they found a surprising result. While flash is very fast, a sizable fraction of the access time is devoted to communication overheads just to reach flash. NVMe was developed to address this problem. A read that takes 80 microseconds over a traditional flash interface may take 7 microseconds over NVMe. Any time you can improve access latency by an order of magnitude, it gets attention.

NVMe is really a protocol, not a media itself, and Dell EMC is part of the specification committee considering better parallelization, deeper queues, and fine-grained user control. NVMe could be built with traditional flash or a newer technology. While flash-based NVMe products are currently coming to market, a more speculative technology may increase the lifespan of NVMe devices and up-end the storage hierarchy.

Upgrading the D in DRAM

There is research into new storage media with improved characteristics relative to flash, such as nanosecond access times (vs. microseconds), millions of erasures (vs. thousands), and access with memory pointers (vs. block access with read/writes). These technologies are a new alphabet soup of acronyms to learn: Phase Change Memory (PCM), Spin Transfer Torque (STT), Resistive RAM (RRAM), Carbon Nanotube RAM (NRAM), and others.

I won’t go through all of the details of these techniques, but it is enough to know that these are non-volatile storage media that fall somewhere between DRAM and flash for most characteristics. There has been work on these media for decades, but perhaps we are reaching the point where a mature product is ready for customers.

Intel and Micron leaped into the middle of the NVM space when they announced the 3D XPoint product for both NVMe and NVDIMMs. Unlike previous university and corporate research projects, the companies described roadmaps for the next few years. There have been enough internal meetings and vendor demonstrations that 3D XPoint looks real.

The implementation details are under wraps, but the best guess is that it is a variant of PCM. Accessed via NVMe, 3D XPoint largely solves the lifespan issues with flash by increasing the number of erasure cycles by several orders of magnitude. Used within NVDIMMs, it has the potential to extend server memory from gigabytes to terabytes. Imagine running multi-terabyte databases in memory. Imagine a storage system where flash is the slow media for colder content. Imagine compute where all data is available with a pointer instead of a read/write call that waits millions of CPU cycles to complete. Turning these imaginations into reality is the hard part.

I need to provide a quick disclaimer: the terminology in this community can be misleading. In my opinion, any product labeled “non-volatile memory (NVM)” should (1) be non-volatile (i.e., hold its value across power events), and (2) be memory (i.e., accessed by a software pointer, not a read/write call at block granularity). Sadly, there are products with the NVM label that are both volatile and read/written in large blocks. I am (not) sorry to be pedantic, but that makes me angry and confuses customers.

Not the End of Storage

Do NVMe and NVDIMM products eliminate the need for clever storage design? Thankfully no, or I would be out of a job. While a new media dramatically decreases the latency of data access, there are whole new storage architectures to figure out and programming models to design. The programming complexities of NVDIMMs are an area of active research because data persistence is not immediate. A flush command is needed to move data from processor caches to NVDIMMs, which introduces its own delay.
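
To make the flush requirement concrete, here is a minimal sketch in C. It assumes an x86 Linux machine, a DAX-mounted persistent-memory filesystem at the hypothetical path /mnt/pmem, and a CPU with CLFLUSHOPT support (compile with -mclflushopt); on real hardware, platform-level power-fail protection is also needed before the flushed data is truly durable. The point is simply that a store made through an ordinary pointer only becomes persistent after the affected cache lines are explicitly flushed.

```c
#include <immintrin.h>   /* _mm_clflushopt, _mm_sfence (needs -mclflushopt) */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CACHE_LINE 64

/* Flush every cache line covering [addr, addr+len), then fence, so the
 * stores have left the CPU caches before execution continues. */
static void persist(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHE_LINE)
        _mm_clflushopt((void *)p);
    _mm_sfence();
}

int main(void)
{
    /* Hypothetical file on a DAX-mounted NVDIMM filesystem. */
    int fd = open("/mnt/pmem/record.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, 4096) != 0)
        return 1;

    char *pmem = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pmem == MAP_FAILED)
        return 1;

    /* The data is updated through an ordinary pointer... */
    strcpy(pmem, "hello, persistent world");
    /* ...but it is not durable until the cache lines are flushed. */
    persist(pmem, strlen(pmem) + 1);
    printf("stored and flushed: %s\n", pmem);

    munmap(pmem, 4096);
    close(fd);
    return 0;
}
```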

Also, we are used to the idea of rebooting a computer if it is acting odd. That has traditionally cleared the memory and reloaded a consistent state from storage. Now, the inconsistent state in memory will still be there after a reboot. We’ll need to figure out how to make systems behave with these new characteristics.

Besides these persistence issues, we will need to explore the right caching/tiering model in combination with existing media. While more data can stay in persistent memory, not all of it will. Furthermore, we’ll still need to protect and recover the data, regardless of where the primary copy resides.

I don’t know the right answers to these problems, but we will find out. Considering how flash has upended the market, I can’t wait to see the impact of new non-volatile memory technologies under development. I know that, regardless of how we answer the questions, the storage world will look different in a few years.

~Philip Shilane @philipshilane

The Hashtag Cortex

Escaping the deadly radiation of the tech industry pulsar, this time Inside The Data Cortex.

  • This year has been “The Year of all Flash” and Mark didn’t notice.
  • Weeks after day one, Stephen and Mark discuss day one. It was kind of like day zero and not much different than day two. But day two had the world’s largest donut at Dell EMC World.
  • Weight gain and not much weight loss at tradeshows.
  • Stephen on the Goldilocks approach to embracing the public cloud and the tyranny of selection bias.
  • Do Google consider themselves an enterprise supplier?
  • This time of year there’s no sunshine anywhere outside of California. Says man living in California.
  • Software Defined Storage is kind of interesting. Says customer who thinks the installation packages will do everything.
  • Scale out is still a hard problem.
  • Mark has looked at home grown storage solutions and sees a lot of ugly babies. (Sorry! He’s not sorry.)
  • The Botnet of Things is real and your dishwasher is hitting someone with a denial of service attack right now.
  • This episode in reading things: Alcatraz Versus the Evil Librarians, Benjamin Franklin: An American Life, Steinbeck’s The Winter of Our Discontent, Ken Clarke’s Kind of Blue, and Stalin: Paradoxes of Power.

No one likes to give up power. Go before you are pushed. Because it will be people like us doing the pushing.


Download this episode (right click and save)

Subscribe to this on iTunes

Get it from Podbean

Follow us on Pocket Casts
Stephen Manley @makitadremel Mark Twomey @Storagezilla

How I Learned to Stop Worrying and Love New Storage Media: The Promises and Pitfalls of Flash and New Non-volatile Memory

I tried to avoid learning about flash. I really did. I’ve never been one of those hardware types who constantly chase the next hardware technology. I’d rather work at the software layer, focusing on data structures and algorithms. My attitude was that improving hardware performance raises all boats, so I did not have to worry about the properties of devices hiding under the waves. Switching from hard drives to flash as the common storage media would just make everything faster, right?

Working for a large storage company broke me out of that mindset, though I still fought it for a few years. Even though I mostly worked on backup storage systems–one of the last hold-outs against flash–backup storage began to see a need for flash acceleration.   I figured we could toss a flash cache in front of hard drive arrays, and the system would be faster.  I was in for a rude awakening.   This multi-part blog post outlines what I have learned about working with flash in recent years as well as my view on the direction flash is heading.  I’ve even gotten so excited about the potential of media advances that I am pushing myself to learn about new non-volatile memory devices.

Flash Today

For those unfamiliar with the properties of flash, here is a quick primer. While a hard drive can supply 100-200 read/write operations per second (commonly referred to as input/output operations per second, or IOPS), a flash device can provide 1,000s to 100,000s of IOPS. Performing a read or write to a hard drive can take 4-12 milliseconds, while a flash device can typically respond in 40-200 microseconds (10-300X faster). Flash handles more read/writes per second and responds more quickly than hard drives. These are the main reasons flash has become widespread in the storage industry, as it dramatically speeds up applications that previously waited on hard drives.
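
A quick back-of-the-envelope calculation shows why those numbers matter. The workload size and the specific IOPS figures below are illustrative values picked from the ranges above, not measurements of any particular device.

```c
#include <stdio.h>

int main(void)
{
    const double requests   = 1e6;      /* one million random reads       */
    const double disk_iops  = 150.0;    /* typical hard drive             */
    const double flash_iops = 50000.0;  /* mid-range flash device         */

    /* Time to service the workload from a single device. */
    printf("hard drive: %.1f hours\n",   requests / disk_iops / 3600.0);
    printf("flash:      %.1f seconds\n", requests / flash_iops);
    return 0;
}
```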

If flash is so much faster, why do many storage products still use hard drives? The answer: price. Flash devices cost somewhere in the range of $0.20 to $2 per gigabyte, while hard drives are as inexpensive as $0.03 per gigabyte. For a given budget, you can buy an order of magnitude more hard drive capacity than flash capacity. For applications that demand performance, though, flash is required. On the other hand, we find that the majority of storage scenarios follow an 80/20 rule, where 80% of the storage is cold and rarely accessed, while 20% is actively accessed. For cost-conscious customers (and what customer isn’t cost conscious?), a mixture of flash and hard drives often seems like the best configuration. This leads to a fun system design problem: how do we combine flash devices and hard drives to meet the IOPS, latency, capacity, and price requirements of varied customers? The initial solution is to add a small flash cache to accelerate some data accesses while using hard drives to provide a large capacity for colder data.
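
As a toy illustration of that design trade-off, the sketch below compares the raw media cost of an all-flash configuration against a hybrid configuration sized with the 80/20 rule. The capacity and per-gigabyte prices are placeholders drawn from the ranges quoted above, and it ignores everything beyond media cost (performance, enclosures, power, and so on).

```c
#include <stdio.h>

int main(void)
{
    const double capacity_gb  = 100000.0; /* 100 TB of user data            */
    const double hot_fraction = 0.20;     /* 20% of data actively accessed  */
    const double flash_per_gb = 0.50;     /* $/GB, placeholder              */
    const double disk_per_gb  = 0.03;     /* $/GB, placeholder              */

    double all_flash = capacity_gb * flash_per_gb;
    double hybrid    = capacity_gb * hot_fraction * flash_per_gb  /* flash cache/tier  */
                     + capacity_gb * disk_per_gb;                 /* full copy on disk */

    printf("all-flash media cost: $%.0f\n", all_flash);
    printf("hybrid media cost:    $%.0f\n", hybrid);
    return 0;
}
```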

A customer requirement that gets less attention, unfortunately, is lifespan, meaning that a storage system should last a certain number of years, typically 4-5, without maintenance problems. While disk drives fail in a somewhat random manner each year, the lifespan of flash is more closely related to how many times it has been written. It is a hardware property of flash that storage cells have to be erased before being written, and flash can only be erased a limited number of times. Early flash devices supported 100,000 erasures, but that number is steadily decreasing to reduce the cost of the device. For a storage system to last 4-5 years, the flash erasures have to be used judiciously over that time. Most of my own architecture work around flash has focused on maximizing the useful data available in flash while controlling flash erasures to maintain its lifespan.
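
To see how an erasure budget turns into a write budget, here is a rough sketch of the arithmetic for a device that must survive five years. The capacity, endurance, and write-amplification figures are assumptions for illustration only.

```c
#include <stdio.h>

int main(void)
{
    const double capacity_gb  = 1000.0;  /* 1 TB flash device (assumed)       */
    const double erase_cycles = 3000.0;  /* program/erase cycles per cell     */
    const double write_amp    = 2.0;     /* internal write amplification      */
    const double years        = 5.0;

    /* Total host data the device can absorb before wearing out. */
    double lifetime_writes_gb = capacity_gb * erase_cycles / write_amp;
    double per_day_gb         = lifetime_writes_gb / (years * 365.0);

    printf("lifetime host writes: %.0f TB\n", lifetime_writes_gb / 1000.0);
    printf("daily write budget:   %.0f GB/day (%.1f drive writes per day)\n",
           per_day_gb, per_day_gb / capacity_gb);
    return 0;
}
```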

The team I have been a part of pursued several approaches to best utilize flash. First, we tried to optimize the data written to flash. We cached the most frequently accessed portions of the file system, such as index structures and metadata that are read frequently.   For data that changes frequently, we tried to buffer it in DRAM as much as possible to prevent unnecessary writes (and erasures) to flash.  Second, we removed as much redundancy as possible.  This can mean deduplication (replacing identical regions with references), compression and hand-designing data structures to be as compact as possible.  Enormous engineering effort goes into changing data structures to be flash-optimized.  Third, we sized our writes to flash to balance performance requirements and erasure limits.  As writes get larger, they tend to become a bottleneck for both writes and reads.  Also, erasures decrease because the write size aligns with the internal erasure size (e.g. multiple megabytes).  Depending on the flash internals, the best write size may be tens of kilobytes to tens of megabytes.  Fourth, we created cache eviction algorithms specialized for internal flash erasure concerns.  We throttled writes to flash and limited internal rearrangements of data (that also cause erasures) to extend flash lifespan.
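
As one example of the write-sizing point above, here is a simplified sketch of buffering small cache inserts in DRAM and flushing them to flash in large, erase-aligned segments. The 2 MB segment size and the flash_write callback are hypothetical; this is not a description of the actual Data Domain implementation.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical segment size chosen to align with the flash erase unit. */
#define SEGMENT_SIZE (2 * 1024 * 1024)

struct write_buffer {
    uint8_t data[SEGMENT_SIZE];
    size_t  used;
    void  (*flash_write)(const void *buf, size_t len);  /* device-specific */
};

/* Buffer small inserts in DRAM and only issue a flash write when a full,
 * erase-aligned segment has accumulated, avoiding partial-block erasures. */
static void buffered_insert(struct write_buffer *wb, const void *item, size_t len)
{
    if (wb->used + len > SEGMENT_SIZE) {
        wb->flash_write(wb->data, wb->used);   /* one large, aligned write */
        wb->used = 0;
    }
    memcpy(wb->data + wb->used, item, len);
    wb->used += len;
}

/* Stand-in for the real device write path. */
static void fake_flash_write(const void *buf, size_t len)
{
    (void)buf;
    printf("flash write of %zu bytes\n", len);
}

int main(void)
{
    static struct write_buffer wb = { .flash_write = fake_flash_write };
    uint8_t item[4096] = { 0 };

    for (int i = 0; i < 1500; i++)    /* ~6 MB of 4 KB items */
        buffered_insert(&wb, item, sizeof item);
    return 0;
}
```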

Working with a strong engineering team to solve these flash-related problems is a recent highlight of my career, and flash acceleration is a major component of the 6.0 release of Data Domain OS. Besides working with engineering, I have also been fortunate to work with graduate students researching flash topics, which culminated in three publications. First, we created Nitro, a deduplicated and compressed flash cache. Next, we created Pannier, a specially designed flash caching algorithm that handles data with varying access patterns. Finally, we wanted to compare our techniques to an offline-optimal algorithm that maximized cache reads while minimizing erasures. Such an algorithm did not exist, so we created it ourselves.

In my next blog post, I will present technology trends for flash. For those that can’t wait, the summary is “bigger, cheaper, slower.”

~Philip Shilane @philipshilane

In 2016, Flash Changes Everything!*

*If by ‘everything’, you mean the media that sits inside of enterprise storage systems.

At an event in Paris, a customer asked, “Do you know what I like best about all-flash storage?” Since I had been warned that the French are sensitive, I resisted saying, “It doesn’t go on strike?” (At the time there was both a petrol and an air traffic controller strike – in other words, a normal week in Paris.) His answer was disarmingly honest: “Everything else – cloud, hyperconverged infrastructure, containers – confuses me. But all-flash storage? It’s different, but I can understand it.”

While flash doesn’t disrupt the storage systems market, it is driving the evolution of storage systems. The evolution spans system design, vendor business models, and customer behavior. This time, let’s talk about basic storage system design.

The Evolution – System Design

Flash doesn’t change what storage systems do, but it does change how they do it. Flash storage systems enable applications and users to write and read data via a variety of protocols and networks – file, block, object, FICON, etc. They attempt to ensure that what a user stores is what the user reads. If that sounds like the functionality of disk and hybrid arrays, it is. Underneath, however, storage systems have changed how they do space optimization and how they make the media reliable.

Space Optimization

Disk storage systems make trade-offs between performance and space optimization. Space efficiency features like compression, deduplication, and clones incur costs: increased response time, management complexity, or unpredictable system performance. For decades, storage systems have optimized performance by laying out data in optimal locations on the disk. Space optimizations disrupt those carefully tuned algorithms. They fragment data, which increases the number of disk seeks, which degrades performance. As a result, disk systems implement space efficiency features for specific workloads (e.g. backup, archive, VDI, etc.) or as best-effort background tasks, but not as inline operations for general purpose usage.

All-flash storage systems both require and enable ubiquitous space efficiency. Flash delivers much greater I/O density than disk, but to make it cost effective, systems need to increase flash’s capacity density. While not all space efficiency techniques apply to all workloads, every flash array must make space efficiency features part of its toolkit. Conversely, flash storage makes it possible to deliver inline, ubiquitous space optimization. While the data may fragment, the random I/O performance of flash doesn’t depend on disk seeks; therefore, you can have space optimization and performance!
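
As a minimal illustration of one inline space-efficiency technique, the sketch below shows content-based deduplication: each incoming block is fingerprinted, and a block whose fingerprint is already in the index is stored as a reference to the existing copy rather than consuming new space. The 64-bit FNV-1a fingerprint and in-memory index are toy stand-ins; a production system would use a collision-resistant fingerprint and a persistent index.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define TABLE_SIZE 4096            /* toy fingerprint index */

/* Toy 64-bit FNV-1a fingerprint; real systems use stronger hashes. */
static uint64_t fingerprint(const uint8_t *block)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        h ^= block[i];
        h *= 1099511628211ULL;
    }
    return h;
}

struct index_entry { uint64_t fp; int block_id; int in_use; };
static struct index_entry index_table[TABLE_SIZE];
static int next_block_id = 0;

/* Return the id of the stored block, consuming new space only if the
 * block's fingerprint has not been seen before. */
static int dedup_write(const uint8_t *block)
{
    uint64_t fp = fingerprint(block);
    size_t slot = fp % TABLE_SIZE;
    while (index_table[slot].in_use) {
        if (index_table[slot].fp == fp)
            return index_table[slot].block_id;   /* duplicate: reference only */
        slot = (slot + 1) % TABLE_SIZE;          /* linear probing */
    }
    index_table[slot] = (struct index_entry){ fp, next_block_id, 1 };
    return next_block_id++;                      /* unique: new block written */
}

int main(void)
{
    uint8_t a[BLOCK_SIZE], b[BLOCK_SIZE];
    memset(a, 'A', sizeof a);
    memset(b, 'B', sizeof b);
    printf("a -> block %d\n", dedup_write(a));   /* new block 0          */
    printf("b -> block %d\n", dedup_write(b));   /* new block 1          */
    printf("a -> block %d\n", dedup_write(a));   /* duplicate: block 0   */
    return 0;
}
```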

Note: Flash drives are growing much larger. The speed of reading data from the drive will not keep pace with the amount of data it stores. As a result, we’ll have a potential data access bottleneck. Flash storage systems will need to optimize data layout on a drive, intelligently spread data across drives, and cache efficiently. Storage media – the more things change, the more they stay the same.

Making Media Reliable

Storage systems work hard to return the same data that was written. All hardware fails. Storage media fails in multiple ways. The device completely fails. The device incorrectly writes data. The device returns wrong data. Regardless of the type of hardware failure, storage systems work to ensure that the users never know. While the mission remains the same, flash has different failure behaviors than disk drives.

Computer scientists have built companies, careers, and research groups on disk drive resiliency. Decades later, customers still debate over their preferred RAID algorithms. As we move into larger drives, we’ve resurrected the mirror vs. RAID vs. ECC debates. Meanwhile, the industry has increased the focus on predicting and handling drive failures, to reduce the impact of failed drives. Additionally, some research shows that media errors (on a healthy drive) and firmware bugs pose a more insidious threat to your data than full drive failures. Such events are both more common and less visible than failed drives. Thus, approaches like Data Domain’s Invulnerability Architecture have become a key market differentiator. Even in the year of “all-flash”, disk storage systems are evolving in the wake of their changing media.

Flash fails, but it fails differently than disk. The most obvious contrast is in “wear”. The mechanical components of disk drives wear out. That breakdown, however, is largely independent of the number of times the system writes to the disk. Conversely, flash media is built of cells that can only be written a certain number of times before they wear out and cannot store data anymore. As a result, storage systems have changed their write behaviors to minimize and distribute the wear on the media. These modifications include: log-structured file systems to evenly distribute writes across the cells, space efficiency to reduce how many cells need to be written, and caching to eliminate frequent overwrites of data.
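
A minimal sketch of the wear-leveling idea behind those write-path changes: when the system needs a fresh erase block, it erases and hands out the free block with the fewest erasures, so wear spreads evenly across cells. The structures and block count are illustrative, not any particular vendor’s flash translation layer.

```c
#include <stdio.h>

#define NUM_BLOCKS 8

struct erase_block {
    int erase_count;   /* how many times this block has been erased     */
    int free;          /* 1 if the block is available for new writes    */
};

static struct erase_block blocks[NUM_BLOCKS];

/* Wear-leveling allocation: erase and hand out the least-worn free block. */
static int allocate_block(void)
{
    int best = -1;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (blocks[i].free &&
            (best < 0 || blocks[i].erase_count < blocks[best].erase_count))
            best = i;
    }
    if (best >= 0) {
        blocks[best].erase_count++;   /* erasing consumes one cycle of lifespan */
        blocks[best].free = 0;
    }
    return best;
}

int main(void)
{
    for (int i = 0; i < NUM_BLOCKS; i++)
        blocks[i] = (struct erase_block){ .erase_count = i, .free = 1 };

    /* Blocks are handed out in order of least wear: 0, 1, 2, 3. */
    for (int i = 0; i < 4; i++)
        printf("allocated block %d\n", allocate_block());
    return 0;
}
```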

Meanwhile, all-flash arrays must respond to unique failure patterns of flash drives. First, we’re still learning how SSDs will fail. For example, how well will flash drives age? Unlike disk, where we have decades of experience in tracking drive failures over time, we’re still learning with flash. (I know vendors are trying to simulate accelerated aging, but I’m skeptical. The only proven way to accelerate aging is to have children.) Fortunately, we have more analytic tools available than ever before. Meanwhile, all-flash arrays are evolving traditional RAID approaches to better fit the new media. With a preference toward larger strip sizes (to minimize space consumption), resiliency across all components (e.g. across power zones in a disk array enclosure), and multi-drive resiliency (N+2), flash has forced an evolution of media failure analytics and protection.
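
To make the capacity argument behind wider layouts concrete, here is a short sketch of how the parity overhead of a double-parity (N+2) scheme shrinks as stripes get wider; the widths are illustrative.

```c
#include <stdio.h>

int main(void)
{
    /* Parity overhead for an N+2 (double-parity) layout at different widths. */
    const int widths[] = { 6, 12, 24 };   /* data drives per stripe, illustrative */
    for (int i = 0; i < 3; i++) {
        int n = widths[i];
        printf("%2d+2: %.1f%% of raw capacity spent on parity\n",
               n, 100.0 * 2.0 / (n + 2));
    }
    return 0;
}
```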

Hardware fails – whether it is disk drives, flash drives, or memory. Storage systems will evolve to combat those failures. Regardless of the media and the failure characteristics, storage systems will continue to deliver value by transforming inherently unreliable hardware into reliable data storage systems.

Conclusion

The disruption of storage media is driving the evolution of the storage system market. The basic needs haven’t changed. Customers want reliable storage that delivers the performance they need at the best possible cost. Flash storage changes many underlying assumptions, and storage systems are responding to the new media base. As a result, we’re all headed in the same direction.

The first question customers ask is: can a new system more quickly and efficiently add all of the expected resiliency and functionality to its “all-flash” base… or can established systems more quickly and efficiently modify their battle-tested resiliency and functionality to leverage the “all-flash” media? The second question is whether any of these systems can deliver more value than customers have come to expect from traditional storage systems.

Before sharing my answer to those questions (giving time for each camp to bribe me – I do take t-shirts as payment), I will first discuss how business models and customer behaviors are changing in the next post.

Stephen Manley @makitadremel