I tried to avoid learning about flash. I really did. I’ve never been one of those hardware types who constantly chase the next device technology. I’d rather work at the software layer, focusing on data structures and algorithms. My attitude was that improving hardware performance raises all boats, so I did not have to worry about the properties of devices hiding under the waves. Switching from hard drives to flash as the common storage medium would just make everything faster, right?
Working for a large storage company broke me out of that mindset, though I still fought it for a few years. Even though I mostly worked on backup storage systems, one of the last hold-outs against flash, backup storage began to see a need for flash acceleration. I figured we could toss a flash cache in front of hard drive arrays, and the system would be faster. I was in for a rude awakening. This multi-part blog post outlines what I have learned about working with flash in recent years, as well as my view on the direction flash is heading. I’ve even gotten so excited about the potential of media advances that I am pushing myself to learn about new non-volatile memory devices.
For those unfamiliar with the properties of flash, here is a quick primer. While a hard drive can supply 100-200 read/write operations per second (commonly referred to as input/output operations per second, or IOPS), a flash device can provide thousands to hundreds of thousands of IOPS. Performing a read or write to a hard drive can take 4-12 milliseconds, while a flash device can typically respond in 40-200 microseconds (20-300X faster). Flash handles more reads/writes per second and responds more quickly than hard drives. These are the main reasons flash has become widespread in the storage industry, as it dramatically speeds up applications that previously waited on hard drives.
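The latency gap can be turned into a quick back-of-the-envelope calculation. This is a minimal sketch using the illustrative numbers from this primer, not measurements of any particular device:

```python
# Rough HDD vs. flash response-time comparison using the primer's figures.
# These numbers are illustrative ranges, not benchmarks of a specific device.
HDD_LATENCY_MS = (4, 12)        # hard drive: 4-12 milliseconds per operation
FLASH_LATENCY_US = (40, 200)    # flash: 40-200 microseconds per operation

# Convert to a common unit (microseconds) and compute the speedup range:
# best-case HDD vs. worst-case flash, and worst-case HDD vs. best-case flash.
slowest_speedup = (HDD_LATENCY_MS[0] * 1000) / FLASH_LATENCY_US[1]  # 4000/200
fastest_speedup = (HDD_LATENCY_MS[1] * 1000) / FLASH_LATENCY_US[0]  # 12000/40
print(f"Flash responds roughly {slowest_speedup:.0f}-{fastest_speedup:.0f}X faster")
```

Comparing the extremes of both ranges is what yields the 20-300X spread.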
If flash is so much faster, why do many storage products still use hard drives? The answer: price. Flash devices cost somewhere in the range of $0.20 to $2 per gigabyte, while hard drives are as inexpensive as $0.03 per gigabyte. For a given budget, you can buy an order of magnitude more hard drive capacity than flash capacity. For applications that demand performance, though, flash is required. On the other hand, we find that the majority of storage scenarios follow an 80/20 rule, where 80% of the storage is cold and rarely accessed, while 20% is actively accessed. For cost-conscious customers (and what customer isn’t cost conscious?), a mixture of flash and hard drives often seems like the best configuration. This leads to a fun system design problem: how do we combine flash devices and hard drives to meet the IOPS, latency, capacity, and price requirements of varied customers? The initial solution is to add a small flash cache to accelerate some data accesses while using hard drives to provide a large capacity for colder data.
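To see why the hybrid configuration is attractive, here is a rough cost sketch using the per-gigabyte prices and the 80/20 split above. The total capacity and the specific flash price are assumptions chosen for illustration:

```python
# Back-of-the-envelope cost of all-flash vs. all-HDD vs. hybrid configurations.
# Prices are illustrative assumptions within the ranges quoted in the text,
# not quotes for any real product.
CAPACITY_GB = 100_000           # assume 100 TB of usable capacity
FLASH_PER_GB = 0.50             # dollars per GB of flash (within $0.20-$2)
HDD_PER_GB = 0.03               # dollars per GB of hard drive

all_flash = CAPACITY_GB * FLASH_PER_GB
all_hdd = CAPACITY_GB * HDD_PER_GB
# Hybrid: flash cache sized for the 20% of data that is hot, plus hard drives
# large enough to hold everything.
hybrid = 0.20 * CAPACITY_GB * FLASH_PER_GB + CAPACITY_GB * HDD_PER_GB

print(f"all flash: ${all_flash:,.0f}  all HDD: ${all_hdd:,.0f}  hybrid: ${hybrid:,.0f}")
```

Under these assumptions the hybrid system costs a fraction of all-flash while keeping the hot 20% of data on fast media.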
A customer requirement that gets less attention, unfortunately, is lifespan: a storage system should last a certain number of years, such as 4-5, without maintenance problems. While disk drives fail in a somewhat random manner each year, the lifespan of flash is more closely related to how many times it has been written. It is a hardware property of flash that storage cells have to be erased before being rewritten, and flash can only be erased a limited number of times. Early flash devices supported 100,000 erasures, but that number has steadily decreased to reduce the cost of the device. For a storage system to last 4-5 years, the flash erasures have to be used judiciously over that time. Most of my own architecture work around flash has focused on maximizing the useful data available in flash while controlling erasures to maintain its lifespan.
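One way to think about the lifespan requirement is as an erasure budget. The sketch below assumes a hypothetical endurance rating of 3,000 program/erase cycles per cell (well below the early 100,000 figure, reflecting the downward trend) and asks how many full-device overwrites per day a 5-year lifespan allows:

```python
# Sketch of a flash erasure budget: given a rated number of erase cycles and a
# target system lifetime, how many full-device overwrites per day can we afford?
# The 3,000-cycle endurance figure is an assumption for illustration.
ERASE_CYCLES = 3000             # assumed program/erase cycles per cell
LIFETIME_YEARS = 5
DAYS = LIFETIME_YEARS * 365     # 1825 days

# Ignoring write amplification, each full overwrite of the device consumes
# roughly one erase cycle per cell, so the daily budget is simply:
full_writes_per_day = ERASE_CYCLES / DAYS
print(f"Budget: about {full_writes_per_day:.2f} full-device writes per day")
```

Real devices also suffer write amplification from internal data movement, which shrinks this budget further; the techniques below all aim to spend it wisely.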
The team I have been a part of pursued several approaches to best utilize flash. First, we optimized the data written to flash. We cached the most frequently accessed portions of the file system, such as index structures and metadata that are read frequently. For data that changes frequently, we buffered it in DRAM as much as possible to prevent unnecessary writes (and erasures) to flash. Second, we removed as much redundancy as possible. This can mean deduplication (replacing identical regions with references), compression, and hand-designing data structures to be as compact as possible. Enormous engineering effort goes into making data structures flash-optimized. Third, we sized our writes to flash to balance performance requirements and erasure limits. As writes get larger, they can monopolize the device and delay other reads and writes, but erasures decrease because the write size aligns with the internal erasure unit (e.g. multiple megabytes). Depending on the flash internals, the best write size may be tens of kilobytes to tens of megabytes. Fourth, we created cache eviction algorithms specialized for internal flash erasure concerns. We throttled writes to flash and limited internal rearrangements of data (which also cause erasures) to extend flash lifespan.
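The third idea, sizing writes to align with the internal erasure unit, can be sketched as a small write buffer. This is a hypothetical, simplified illustration (a real system must also handle crash consistency, metadata, and partial flushes), not the actual implementation from any product:

```python
# Minimal sketch: accumulate small writes in memory and flush to flash only in
# chunks aligned to the device's erase-unit size, so each flush maps cleanly
# onto whole erase blocks. Class and field names are hypothetical.
class AlignedWriteBuffer:
    def __init__(self, erase_unit_bytes=2 * 1024 * 1024):  # e.g. a 2 MB erase unit
        self.erase_unit = erase_unit_bytes
        self.pending = bytearray()   # small writes buffered in DRAM
        self.flushed_chunks = []     # stands in for actual writes to flash

    def write(self, data: bytes):
        self.pending.extend(data)
        # Flush every complete erase-unit-sized chunk; keep the remainder buffered
        # so the device never receives a write smaller than its erase unit.
        while len(self.pending) >= self.erase_unit:
            chunk = bytes(self.pending[:self.erase_unit])
            del self.pending[:self.erase_unit]
            self.flushed_chunks.append(chunk)

buf = AlignedWriteBuffer(erase_unit_bytes=1024)
buf.write(b"x" * 1500)  # flushes one 1024-byte chunk; 476 bytes stay buffered
```

Buffering in DRAM this way serves the first idea too: data that is overwritten before a flush never costs a flash erasure at all.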
Working with a strong engineering team to solve these flash-related problems is a recent highlight of my career, and flash acceleration is a major component of the 6.0 release of Data Domain OS. Besides working with engineering, I have also been fortunate to work with graduate students researching flash topics, which culminated in three publications. First, we created Nitro, a deduplicated and compressed flash cache. Next, we designed Pannier, a specialized flash caching algorithm that handles data with varying access patterns. Finally, we wanted to compare our techniques to an offline-optimal algorithm that maximized cache reads while minimizing erasures. Such an algorithm did not exist, so we created it ourselves.
In my next blog post, I will present technology trends for flash. For those that can’t wait, the summary is “bigger, cheaper, slower.”
~Philip Shilane @philipshilane