How I Learned to Stop Worrying and Love New Storage Media: The Promises and Pitfalls of Flash and New Non-volatile Memory Part II

Flash Tomorrow

While my early work in flash caching focused on the flash devices we currently have, the next stage of my research looks at where flash is going.

A combination of customer needs and vendor competition is driving flash towards higher capacity and lower cost, through a variety of technical advancements. When I recently purchased a 128GB SD card for a digital camera, I was reminded of the now-useless 64MB cards I had purchased a decade ago.

A similar trend exists at the enterprise and cloud storage level. While a 100GB flash device was standard years ago, a prototype 60TB flash device was announced at the Flash Memory Summit in August. No price was announced, but I would guess it will cost over $10,000, which sounds high, but at potentially $0.20 per gigabyte it would be among the cheapest flash devices per gigabyte. I don’t need this product in my home, but it will find a role in enterprise or cloud storage. How did flash technology evolve to the point where a 60TB flash device is a plausible product?

Flash Trends

There are three main ongoing technical trends driving flash technology, all focused on increasing the density of flash storage. First, manufacturing processes continue to squeeze more bits into the same chip area, analogous to the advances that have delivered denser hard drives and more CPU cores.

Until recently, advancements in density have been within the two-dimensional plane of flash chips. Now, however, the industry is pursuing a second trend: stacking layers of flash cells vertically within a chip, which is conveniently called 3D stacking. Vendors are already stacking 48 or more layers, and research is pushing toward 100 or more layers in the coming years.

The third trend is to increase the information content of individual cells. At a simplistic level, a flash cell is a storage unit for electrons.  Based on the charge of the cell, the value of the cell is either a 0 or a 1.  This was the original design for flash cells, called a single-level cell (SLC).  Instead of simply splitting the range of cell charges into two sub-ranges (representing 0 and 1), it is possible to divide the range further to produce four values (0, 1, 2 and 3), called a multi-level cell (MLC), which stores two bits per cell. This effectively doubled the capacity of a flash device.  The process of dividing charge ranges continued to triple-level cells (TLC), which store three bits per cell (values 0 through 7), and we are even beginning to see quad-level cells (QLC), which store four bits per cell (values 0 through 15).  QLC flash effectively has 4X the capacity of SLC flash, and flash vendors sell these denser products at cheaper prices per gigabyte.
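
To make the arithmetic concrete, here is a minimal Python sketch (my own illustration, not a vendor specification) mapping bits per cell to the number of charge levels and the capacity multiplier relative to SLC:

```python
# Relationship between bits per cell, charge levels, and relative capacity.
# Illustrative only; real devices also differ in endurance and latency.

CELL_TYPES = {
    "SLC": 1,  # single-level cell: 1 bit, 2 charge levels
    "MLC": 2,  # multi-level cell: 2 bits, 4 charge levels
    "TLC": 3,  # triple-level cell: 3 bits, 8 charge levels
    "QLC": 4,  # quad-level cell: 4 bits, 16 charge levels
}

for name, bits in CELL_TYPES.items():
    levels = 2 ** bits      # distinct charge ranges the cell must resolve
    capacity_vs_slc = bits  # capacity scales with bits stored per cell
    print(f"{name}: {bits} bit(s)/cell, {levels} levels, {capacity_vs_slc}x SLC capacity")
```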

Impacts of Flash Technology Trends

Unfortunately, there is a downside to packing flash cells more tightly and slicing the charge ranges more finely. Recording a value in a cell is a slightly destructive process that wears on the components of the cell. This wear causes slight fluctuations in a cell’s charge, which increases the chance of incorrectly identifying the cell’s true value.

Flash manufacturers have addressed the risk of storing/reading data incorrectly by adding error correcting technology, but the end result is that flash lifespans are shrinking. Once a cell has been damaged from numerous writes, it can no longer be reliably written again, so the flash device marks it as invalid.

How should you assess the reliability of flash devices? First, you want to think in terms of writes, since that’s what wears out the flash device. You should be able to read unmodified data for as long as you want. Flash devices are designed to spread writes evenly across the cells, so we typically think about the write cycles of the entire device instead of individual cells. Therefore, think of a write cycle as a period in which the entire flash capacity is written. Whereas SLC flash could support 100,000 write cycles, MLC flash supports a few thousand, TLC flash around 1,000 and QLC flash perhaps a few hundred.  This leads to the industry describing flash devices in terms of writes-per-day (often called drive writes per day, or DWPD). Writes-per-day is equal to the number of supported write cycles divided by the number of days in the expected lifetime of the device (e.g. 365 days times 5 years). A higher writes-per-day rating means the device can sustain a heavier write workload before its cells wear out and fail.
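
As a back-of-the-envelope illustration (the cycle counts are rough figures from the paragraph above, not a vendor datasheet), here is how write cycles translate into writes-per-day over a five-year lifetime:

```python
# Convert rated write cycles into writes-per-day over an assumed lifetime.
# The cycle counts below are rough, illustrative figures for each cell type.

LIFETIME_DAYS = 365 * 5  # assume a five-year expected lifetime

write_cycles = {"SLC": 100_000, "MLC": 3_000, "TLC": 1_000, "QLC": 300}

for cell_type, cycles in write_cycles.items():
    writes_per_day = cycles / LIFETIME_DAYS
    print(f"{cell_type}: {cycles} write cycles -> {writes_per_day:.1f} full-device writes per day")
```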

Performance characteristics are also changing with denser devices. Reading, writing and erasing are all getting slower. The fastest SLC flash devices could read in just 25 microseconds, write in 300 microseconds and erase in 1.5 milliseconds.  Each time the charge range has been divided to cram more logical values into a cell, these operations have gotten slower. In some cases, doubling the bit density has doubled the latency of each operation. Reads have slowed for a number of reasons, including interference from concurrent writes and background erasures.

A related issue is the throughput to read and write, which is the number of megabytes (or gigabytes) that can be read and written per second. While capacities are doubling, throughput has increased slowly with each generation of products. Flash devices are getting bigger but we can’t access the full space as quickly as before.
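
A rough sketch of that gap, using illustrative capacities and throughputs of my own choosing (not actual product specifications), shows how the time to read an entire device grows as capacity outpaces throughput:

```python
# Time to read a full device end-to-end at a given sequential throughput.
# Capacities and throughputs are illustrative assumptions, not product specs.

def minutes_to_read(capacity_gb: float, throughput_gb_per_s: float) -> float:
    return capacity_gb / throughput_gb_per_s / 60

for capacity_gb, throughput in [(100, 0.5), (4_000, 2.0), (60_000, 3.0)]:
    minutes = minutes_to_read(capacity_gb, throughput)
    print(f"{capacity_gb} GB at {throughput} GB/s: {minutes:.0f} minutes to read the whole device")
```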

We tend to think of flash as blazingly fast, and that is true relative to the latency of reading and writing to hard drives, but flash is slowing down each year. It is an interesting trend of cheaper, denser, slower flash devices coming to market with shorter lifespans.

Storage System Design Changes

The evolution of flash has led to unexpected choices in system design. Sometimes we select smaller, more expensive (per gigabyte) flash devices because they have the lifespan needed for a storage system. In other cases we may select multiple, smaller flash devices over a single, large flash device to get the overall IOPS and throughput needed by a system. Simply picking the newest, biggest flash device isn’t always the right answer.
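
A minimal sketch of that trade-off, with hypothetical device specs of my own invention: at equal total capacity, several smaller devices can deliver far more aggregate IOPS and throughput than one large device.

```python
# Compare one large device against several smaller ones at equal total capacity.
# All device specs are hypothetical, for illustration only.

def aggregate(count: int, capacity_tb: float, iops: int, throughput_gbps: float) -> dict:
    return {
        "capacity_tb": count * capacity_tb,
        "iops": count * iops,
        "throughput_gbps": count * throughput_gbps,
    }

one_big = aggregate(count=1, capacity_tb=16, iops=100_000, throughput_gbps=3.0)
eight_small = aggregate(count=8, capacity_tb=2, iops=80_000, throughput_gbps=2.0)

print("1 x 16TB:", one_big)
print("8 x 2TB :", eight_small)  # same capacity, far more IOPS and throughput
```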

The evolution of flash also affects how to design smaller storage systems. As flash vendors have emphasized manufacturing larger and larger devices, we have run into situations where they no longer stock low capacity flash devices needed for storage systems targeted at small businesses. Just as some customers had to buy excess hard disk capacity in their storage system (for performance or reliability), the same may be coming for flash storage systems, too.

As a result, we are starting to consider tiers of flash, where a high-performance cache uses SLC flash with the highest endurance and lowest latency, while colder data is stored on denser, low-endurance flash. Of course, that assumes flash won’t itself be supplanted by newer media.
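
Setting that caveat aside, here is a toy sketch of the tiering idea above. The write-rate threshold and tier names are purely hypothetical:

```python
# Toy two-tier placement policy: frequently written data on high-endurance SLC,
# cold data on dense, low-endurance QLC. Threshold and names are illustrative.

HOT_WRITE_THRESHOLD = 10.0  # writes per day, per block (made-up cutoff)

def choose_tier(writes_per_day: float) -> str:
    return "SLC cache tier" if writes_per_day >= HOT_WRITE_THRESHOLD else "QLC capacity tier"

for block, rate in {"db-log": 500.0, "vm-image": 12.0, "archive": 0.01}.items():
    print(f"{block}: {rate} writes/day -> {choose_tier(rate)}")
```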

While flash has been the hot media of the last decade and will continue to be heavily leveraged for years to come, there are new storage media coming to market and in the prototype phase. My next blog post will discuss NVMe and new non-volatile DRAM options and the potential to upend not only storage but servers as well.

 

~Philip Shilane @philipshilane

In 2016, Flash Changes Everything!*

*If by ‘everything’, you mean the media that sits inside of enterprise storage systems.

At an event in Paris, a customer asked, “Do you know what I like best about all-flash storage?” Since I had been warned that the French are sensitive, I resisted saying, “It doesn’t go on strike?” (At the time there was both a petrol strike and an air traffic controller strike – in other words, a normal week in Paris.) His answer was disarmingly honest, “Everything else – cloud, hyperconverged infrastructure, containers – confuses me. But all-flash storage? It’s different, but I can understand it.”

While flash doesn’t disrupt the storage systems market, it is driving the evolution of storage systems. The evolution spans system design, vendor business models, and customer behavior. This time, let’s talk about basic storage system design.

The Evolution – System Design

Flash doesn’t change what storage systems do, but it does change how they do it. Flash storage systems enable applications and users to write and read data via a variety of protocols and networks – file, block, object, FICON, etc. They attempt to ensure that what a user stores is what the user reads. If that sounds like the functionality of disk and hybrid arrays, it is. Underneath, however, storage systems have changed how they do space optimization and how they make the media reliable.

Space Optimization

Disk storage systems make trade-offs between performance and space optimization. Space efficiency features like compression, deduplication, and clones incur costs: increased response time, management complexity, or unpredictable system performance. For decades, storage systems have optimized performance by laying out data in optimal locations on the disk. Space optimizations disrupt those carefully tuned algorithms. They fragment data, which increases the number of disk seeks, which degrades performance. As a result, disk systems implement space efficiency features for specific workloads (e.g. backup, archive, VDI, etc.) or as best-effort background tasks, but not as inline operations for general purpose usage.

All-flash storage systems both require and enable ubiquitous space efficiency. Flash delivers much greater I/O density than disk, but to make it cost effective, systems need to increase flash’s capacity density. While not all space efficiency techniques apply to all workloads, every flash array must make space efficiency features part of its toolkit. Conversely, flash storage makes it possible to deliver inline, ubiquitous space optimization. While the data may fragment, the random I/O performance of flash doesn’t depend on disk seeks; therefore, you can have space optimization and performance!
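
As a minimal illustration of one such space efficiency technique (a generic sketch, not any particular array's implementation), here is fixed-size-chunk deduplication keyed by content hashes; on flash, the resulting fragmentation costs far less than it would on disk:

```python
# Minimal inline deduplication sketch: fixed-size chunks keyed by content hash.
# Real arrays add variable chunking, compression, and metadata structures;
# this only shows the core idea.

import hashlib

CHUNK_SIZE = 4096
store = {}  # fingerprint -> chunk data (stands in for flash-resident storage)

def write(data: bytes) -> list[str]:
    """Store data, returning the list of chunk fingerprints (the 'recipe')."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fp, chunk)  # duplicate chunks are stored only once
        recipe.append(fp)
    return recipe

def read(recipe: list[str]) -> bytes:
    return b"".join(store[fp] for fp in recipe)

payload = b"A" * 8192 + b"B" * 4096 + b"A" * 4096  # contains repeated content
recipe = write(payload)
assert read(recipe) == payload
print(f"logical chunks: {len(recipe)}, unique chunks stored: {len(store)}")
```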

Note: Flash drives are growing much larger. The speed of reading data from the drive will not keep pace with the amount of data it stores. As a result, we’ll have a potential data access bottleneck. Flash storage systems will need to optimize data layout on a drive, intelligently spread data across drives, and cache efficiently. Storage media – the more things change, the more they stay the same.

Making Media Reliable

Storage systems work hard to return the same data that was written. All hardware fails, and storage media fails in multiple ways: a device can fail completely, write data incorrectly, or return the wrong data. Regardless of the type of hardware failure, storage systems work to ensure that users never know. While the mission remains the same, flash has different failure behaviors than disk drives.

Computer scientists have built companies, careers, and research groups on disk drive resiliency. Decades later, customers still debate over their preferred RAID algorithms. As we move into larger drives, we’ve resurrected the mirror vs. RAID vs. ECC debates. Meanwhile, the industry has increased the focus on predicting and handling drive failures, to reduce the impact of failed drives. Additionally, some research shows that media errors (on a healthy drive) and firmware bugs pose a more insidious threat to your data than full drive failures. Such events are both more common and less visible than failed drives. Thus, approaches like Data Domain’s Invulnerability Architecture have become a key market differentiator. Even in the year of “all-flash”, disk storage systems are evolving in the wake of their changing media.

Flash fails, but it fails differently than disk. The most obvious contrast is in “wear”. The mechanical components of disk drives wear out. That breakdown, however, is largely independent of the number of times the system writes to the disk. Conversely, flash media is built of cells that can only be written a certain number of times before they wear out and can no longer store data. As a result, storage systems have changed their write behaviors to minimize and distribute the wear on the media. These modifications include log-structured file systems to evenly distribute writes across the cells, space efficiency to reduce how many cells need to be written, and caching to eliminate frequent overwrites of data.
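
A stripped-down sketch of the log-structured idea (not any vendor's implementation): logical overwrites append to the end of a log rather than rewriting cells in place, and a map tracks where each block's latest version lives.

```python
# Minimal log-structured write sketch: overwrites append to a log, spreading
# wear across cells, while a map tracks each block's latest location.
# Garbage collection of stale entries is omitted for brevity.

log = []        # append-only sequence of (logical_block, data) entries
block_map = {}  # logical block -> index of its latest entry in the log

def write_block(logical_block: int, data: bytes) -> None:
    log.append((logical_block, data))
    block_map[logical_block] = len(log) - 1  # older entries become stale

def read_block(logical_block: int) -> bytes:
    return log[block_map[logical_block]][1]

write_block(7, b"v1")
write_block(7, b"v2")  # overwrite appends instead of rewriting in place
print(read_block(7), "log length:", len(log))  # b'v2' log length: 2
```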

Meanwhile, all-flash arrays must respond to the unique failure patterns of flash drives. First, we’re still learning how SSDs will fail. For example, how well will flash drives age? Unlike disk, where we have decades of experience in tracking drive failures over time, we’re still learning with flash. (I know vendors are trying to simulate accelerated aging, but I’m skeptical. The only proven way to accelerate aging is to have children.) Fortunately, we have more analytic tools available than ever before. Meanwhile, all-flash arrays are evolving traditional RAID approaches to better fit the new media. With a preference toward wider stripes (to minimize the capacity consumed by parity), resiliency across all components (e.g. across power zones in a disk array enclosure), and multi-drive resiliency (N+2), flash has forced an evolution of media failure analytics and protection.
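
To illustrate the stripe-width point with hypothetical layouts: under N+2 (double-parity) protection, wider stripes spend a smaller fraction of raw capacity on parity.

```python
# Parity overhead for N+2 (double-parity) protection at different stripe widths.
# Layouts are illustrative; real arrays also weigh rebuild time and fault domains.

def parity_overhead(data_drives: int, parity_drives: int = 2) -> float:
    return parity_drives / (data_drives + parity_drives)

for data_drives in (4, 8, 14, 22):
    pct = parity_overhead(data_drives) * 100
    print(f"{data_drives}+2 stripe: {pct:.1f}% of raw capacity spent on parity")
```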

Hardware fails – whether it is disk drives, flash drives, or memory. Storage systems will evolve to combat those failures. Regardless of the media and its failure characteristics, storage systems will continue to deliver value by transforming inherently unreliable hardware into reliable data storage systems.

Conclusion

The disruption of storage media is driving the evolution of the storage system market. The basic needs haven’t changed. Customers want reliable storage that delivers the performance they need at the best possible cost. Flash storage changes many underlying assumptions, and storage systems are responding to the new media base. As a result, we’re all headed in the same direction.

The first question customers ask is: can a new system more quickly and efficiently add all of the expected resiliency and functionality to its “all-flash” base… or can established systems more quickly and efficiently modify their battle-tested resiliency and functionality to leverage the “all-flash” media? The second question is whether any of these systems can deliver more value than customers have come to expect from traditional storage systems.

Before sharing my answer to those questions (giving time for each camp to bribe me – I do take t-shirts as payment), I will first discuss how business models and customer behaviors are changing in the next post.

Stephen Manley @makitadremel

The Gallic Cortex.

This time, Inside The Data Cortex. Paris is burning but it takes us less than five minutes to begin arguing about Disney movies.

  • EMC holds an All Flash marketing event and asks Stephen to keynote. He opens by telling them Flash doesn’t matter anymore.
  • Media changes do not kill companies who know how to deal with media changes. Mark thinks the most dangerous competitors are the large companies doing business in different ways.
  • Mark laments the EMC product family slide when it comes to Management & Orchestration. Stephen believes that every year in the latter half of this decade is the EMC year of M&O. Customers say that if EMC didn’t have ViPR it would have to go and create something like ViPR. Enterprise Copy Data Management is the new protection orchestration layer.
  • Stephen has an interest in standards since SMI-S is now looking weather beaten.
  • Case sensitivity worked for Stephen when he added so many options to the NDMP dump command that he had to move to capital letters.
  • Archiving changes, but stubbing files always sucks.
  • Terrible 80s TV shows. Where are the members of The Cosby Show now?
  • Moby Dick has fallen. Listener recommendation, Wool, starts well but finishes poorly. Mark is reading about why Ireland should leave the Euro. The Big Short, highly recommended for long haul flight viewing.
  • Star Trek, Hugh Jackman is now Old Man Wolverine, 16 Super Hero movies a year and the worst movie Mark paid to see in 2016.


Stephen Manley @makitadremel Mark Twomey @Storagezilla

2016 is the Year of All Flash!*

*For primary storage that serves workloads appropriate for all-flash, assuming you care about primary data storage systems at all.

In other words, we’re almost done talking about flash. Wait, what? When something becomes ubiquitous, it becomes uninteresting to talk about. Over the past couple of years, there was tension between “all-flash” and “not all-flash” arrays. Customers have said, “I can only buy an ‘all-flash’ array because everything else is ‘obsolete’.” Storage systems were purchased almost solely because of the media they contain. When every array becomes “all flash”, however, the conversation can expand beyond media. Simply being “all flash” isn’t a compelling argument anymore.

Before we shift the conversation away from flash-centric topics like V-NAND, TLC, floating gate transistors, internal charge pumps, and hot electron injection, however, let’s understand how important flash has been to storage and IT infrastructure. (Besides – hot electron injection? Every conversation should start with hot electron injection. I feel like The Rock may use hot electron injections…)

First, we’ll talk about how flash disrupts storage media but not storage systems. Second, we’ll cover the evolutionary impact of flash on storage systems. Finally, we’ll talk about how flash can be a catalyst for the disruption of IT infrastructure. (And just for fun – hot electron injection!)

Storage Media – The Disruption

Flash is a disruptive technology… to disk drives. With space efficiency techniques, flash is already more cost effective than disk for primary workloads. While deduplication and compression also work on disk, the latency and performance impact can be significant. Flash’s dramatically better random-access performance makes it a better fit for combining space efficiency and primary workloads. As SSD capacity rapidly expands, the gap is widening quickly. Flash is already a better media technology for traditional primary workloads, and the advantage is only expanding.
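
The cost comparison boils down to simple arithmetic. The prices and data reduction ratios below are illustrative assumptions, not market figures: raw cost per gigabyte divided by the achievable reduction ratio gives the effective cost per gigabyte stored.

```python
# Effective cost per GB = raw cost per GB / data reduction ratio.
# All prices and ratios below are made-up illustrations, not market data.

def effective_cost(raw_cost_per_gb: float, reduction_ratio: float) -> float:
    return raw_cost_per_gb / reduction_ratio

# Flash with inline dedupe and compression for a primary workload.
flash = effective_cost(raw_cost_per_gb=0.30, reduction_ratio=5.0)
# Disk for the same workload, where inline reduction is limited by seek latency.
disk = effective_cost(raw_cost_per_gb=0.08, reduction_ratio=1.0)

print(f"flash effective: ${flash:.3f}/GB vs disk effective: ${disk:.3f}/GB")
```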

This shift is reminiscent of what high-capacity disks have done to tape over the last decade. With space efficiency techniques, disk became more cost effective than tape for traditional backup. The space efficiency techniques were possible because of disk’s dramatically better random-access performance vs. tape. As HDD capacity rapidly expanded, the gap widened quickly. Disk became a better media technology for traditional backup, and the obvious advantages are revealed in the revenue numbers.

Over time, media will shift again. Flash will displace disk as protection storage (especially as we move to copy data management). New memory technologies will displace flash for primary I/O performance. Each time, we’ll marvel at the performance, capacity, and cost shifts. Each time, the hardware market will shift. Software, in response, will shift with it.

For now, though, flash displacing disk drives is enough to deal with.

Storage Systems – The Non-Disruption

Flash is not disrupting the primary storage array market. Disruption happens when there are new markets with new value networks. Customers change technology, people, and processes. They solve existing problems in different ways, with completely different approaches and different people. Existing vendors find themselves blindsided by the disruption. They don’t invest in the disruptive technology, miss the market, and find themselves out of business.

There is no sign of a flash-induced storage industry disruption. All-flash arrays are not changing customer behavior. Customers buy and deploy all-flash storage arrays for the same reasons and workflows for which they previously bought disk and hybrid arrays. They’re bought and run by the same IT groups. The broad trends demonstrate that the leading vendors remain strong. EMC has, by a wide margin, invested the most in flash storage (in excess of $3 billion). EMC, the overall storage leader, is the All-Flash Array leader. The vendors who led in disk and hybrid storage continue to invest and address their customer needs in the All-Flash space, as well.

This lack of secondary industry disruption (beyond the media itself) is not unique to the flash conversion. Following the “disk backup” analogy, backup appliances delivered huge evolutionary value, but didn’t revolutionize the backup market. The leading software products in 2001 – Veritas NetBackup, EMC NetWorker, and IBM TSM – are the major products today. Like the storage array vendors, they each invested significant effort to better leverage disk as target storage. Disk has replaced tape as the backup media, but customers continue to buy the same backup software and run it in the same way. The tape market was disrupted, not the backup market.

Thus, when vendors say – “Media X changes EVERYTHING”, they may be exaggerating. Sometimes “Media X” changes … just the media.

Conclusion

Flash is disrupting the storage media market, which has led to significant evolutionary investment and innovation in the storage system market. Like disk backup, the disruption doesn’t extend beyond the media, but the media change is driving considerable investment in innovation.

Flash is enabling storage system companies and customers to evolve and solve problems we’ve struggled with for years. In the next post, we’ll talk about how flash has changed how we design storage arrays, build business models, and enable customers to do more with their storage.

Welcome to the Year of All Flash. It should enable us to talk practically about how important flash is to our industry, without needing to resort to hyperbole. (And we get to talk about hot electron injection!)

Stephen Manley @makitadremel