
Fundamentals of Storage Systems, Solid State Storage Basics

Solid state storage is the new kid on the block. We see new press releases every day about just how awesome this new technology is. As with any technology, you need a solid foundation in how it works before you can decide if it is right for you. Let's review what solid state storage is and how it differs from traditional hard disks. I will cover solid state storage in a general manner, not favoring any specific manufacturer or any specific type of Flash.

Outline:
Types of Flash
NAND Flash Structure
NAND Flash Read Properties
NAND Flash Write Properties
Wear-Leveling
Garbage Collection
Write Amplification
TRIM
Error Detection and Correction

Flash Memory

Flash is a type of memory, like the RAM in your computer. There are several key differences, though. First, Flash is non-volatile, meaning it doesn't require electricity to maintain the data stored in it. It also has very fast access times: not quite RAM's, but well in between RAM and a spinning hard disk. It does wear out as you write to it over time. There are several types of Flash memory; the two most common are NAND and NOR, and each has its benefits. NOR has the ability to write in place, with consistent and very fast read access times but very slow write access times. NAND has a slower read access time but is much faster to write to. This makes NAND more attractive for mass storage devices.

The Structure of NAND Flash

NAND stores data in a large serial array of transistors, each of which can store data. NAND Flash arrays are grouped first into pages. A page consists of data space and spare space. Spare space is physically the same as data space but is used for things like ECC and wear-leveling, which we will cover shortly. Usually, a page is 4,096 bytes of data space, plus spare space sized for things like 1 to 4 bits of ECC for each 512 bytes of data space. Pages are grouped in turn into blocks of 64 to 128 pages, and the block is the smallest erasable unit. There can be quite a few blocks per chip: as many as 16 thousand blocks, or 8 gigabytes worth, on a single chip. From there, manufacturers group chips together, usually in a parallel arrangement with controllers, to make them look like one large solid state disk. Form factors vary; most common are a standard 2.5” or 3.5” drive or a PCIe device.
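To put those numbers in perspective, here is a quick back-of-the-envelope calculation in Python using the example figures above (4KB data pages, 128 pages per block, 16 thousand blocks per chip); actual geometry varies by manufacturer and part:

    # Rough NAND geometry math using the example numbers from this section.
    PAGE_DATA_BYTES = 4096          # data space per page (spare space not counted)
    PAGES_PER_BLOCK = 128           # blocks hold 64-128 pages; the block is the erase unit
    BLOCKS_PER_CHIP = 16 * 1024     # "as many as 16 thousand blocks"

    block_bytes = PAGE_DATA_BYTES * PAGES_PER_BLOCK   # 524,288 bytes per erase unit
    chip_bytes = block_bytes * BLOCKS_PER_CHIP        # 8,589,934,592 bytes
    print(block_bytes // 1024, "KB per block")        # 512 KB per block
    print(chip_bytes // 2**30, "GB per chip")         # 8 GB per chip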

NAND Read Properties

NAND Flash operates differently than RAM or a hard disk. Even though NAND is structured in pages like a hard disk, that is where the similarities end. NAND is structured to be accessed serially; as a type of memory, it is a poor choice for random write access patterns. A 15,000 RPM hard disk may have a random access seek time of 5.5 milliseconds: it has to spin a disk and position the read/write head. NAND, on the other hand, doesn't actually seek; it does a lookup and reads the memory area, which takes on the order of 25 to 50 microseconds. The read time is the same no matter the type of operation, random or sequential. A single NAND chip may be able to read between 25 and 40 megabytes a second. So, even though it is considered a poor performer for random IO, it is still orders of magnitude faster than a hard disk.
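The gap is easier to appreciate as arithmetic; a minimal comparison using the figures above:

    # Random access: one 15k RPM disk seek vs. one NAND page read.
    hdd_seek_s = 5.5e-3       # 5.5 milliseconds per random seek
    nand_read_s = 50e-6       # upper end of the 25-50 microsecond range
    print(hdd_seek_s / nand_read_s)   # => 110.0, roughly two orders of magnitude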

NAND Write Properties

NAND Flash has a much faster read speed than write speed. The same NAND chip that reads at 40 megabytes a second may only sustain 7 megabytes a second in write speed. A typical page program takes roughly 250 microseconds, and that figure only covers programming a page; writing to Flash can be much more complicated if there is already data in the page.

Program Erase Cycle

NAND does writes based on a program/erase (P/E) cycle. When a NAND block is erased, all bits are set to 1; programming sets bits to 0. A program operation writes a page at a time and can be pretty quick. NAND doesn't support an overwrite mode where a bit, page, or even block can be overwritten without first being reset to the erased state. The P/E cycle is very different from what happens on a hard disk, which can overwrite data without first having to clear a sector. Erasing a block takes between 500 microseconds and 2 milliseconds. Each P/E cycle wears on the NAND block, and after enough cycles the block becomes unreliable and will fail to program or erase.
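A toy model makes the rules concrete. In this illustrative Python sketch (a pretend block of four one-byte pages), programming can only clear bits from 1 to 0, so rewriting a page means erasing the whole block first:

    # Toy program/erase model: program clears bits (1 -> 0); only erase sets them back.
    ERASED = 0xFF                       # an erased byte is all 1s

    def program(block, page, value):
        # Programming cannot set a 0 back to 1, so the page must be erased first.
        if block[page] != ERASED:
            raise RuntimeError("page not erased; erase the whole block first")
        block[page] = value

    def erase(block):
        # Erase operates on the entire block, never on a single page.
        for i in range(len(block)):
            block[i] = ERASED

    block = [ERASED] * 4
    program(block, 0, 0xA5)             # fine: the page was erased
    erase(block)                        # required before page 0 can be rewritten
    program(block, 0, 0x5A)             # now the rewrite succeeds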

Wear-Leveling

To mitigate the finite number of P/E cycles a NAND chip has, we use two different techniques to keep blocks alive and to make sure we don't reuse a likely-bad block. Take a single MLC NAND chip: it may have 16 thousand blocks on it, each rated for between 3,000 and 10,000 P/E cycles. If you executed one P/E cycle per second, spread evenly across the blocks, it would take over five years to reach the wear-out rating of 10,000 cycles. If, on the other hand, you executed a P/E cycle on the same single block every second, you could hit the 10,000 rating in about 3 hours! This is why wear-leveling is so important. In the early days of NAND Flash, wearing out a block was a legitimate concern, as applications would just rewrite the same block over and over. Modern devices spread the wear not just across a single chip but across every available chip in the system, extending the life of your solid state disk for a very, very long time. Ideally, you want to write to each block once before writing to any block a second time; that isn't always possible due to data access patterns.
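The arithmetic behind those two figures, for anyone who wants to check it:

    # Wear-out math from the paragraph above: 16,000 blocks at 10,000 P/E cycles each.
    blocks, rated_cycles = 16_000, 10_000
    total_cycles = blocks * rated_cycles       # one erase per second, spread evenly
    print(total_cycles / (3600 * 24 * 365))    # => ~5.1 years to wear the whole chip
    print(rated_cycles / 3600)                 # => ~2.8 hours hammering a single block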

It sounds simple enough to cycle through all available blocks before triggering a P/E cycle, but in the real world it just isn't that easy. Data on the drive generally breaks into two categories: static and dynamic. Static data is written once, or infrequently, and read many times; a music file is a good example. Dynamic data covers things like log files that are written to frequently, or in our case, database files. If you wear-level only the dynamic data, you shorten the life of the Flash significantly. Alternatively, if you also include the static data, you incur extra read and write IO in the background that can affect the performance of the device.

Background Garbage Collection
To defer the P/E cycle and mitigate the penalty of a block erase, we rely on garbage collection running in the background of the device. When a file is altered, it may be moved entirely to clean pages and blocks, and the old blocks marked as dirty. This tells the garbage collector that it can perform a block erasure on them at any time. This works just fine as long as the drive has enough spare area allocated and the number of write requests is low enough for the garbage collector to keep up. Keep in mind, this spare area isn't visible to the operating system or the file system and is independent of them. If you run out of free pages to program, every write starts forcing a P/E cycle, slowing writes dramatically. Some manufacturers offset this with a large DRAM buffer, and some also let you change the size of the over-provisioned space.
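A highly simplified sketch of the idea; this is not any vendor's actual algorithm, just the shape of it. The collector copies still-valid pages out of a dirty block into a clean one, then erases the dirty block in the background so free pages are ready when writes arrive:

    # Simplified background garbage collection. A block is a list of pages; each
    # page is None (erased), ("valid", data) or ("dirty", data).
    def collect(dirty_block, clean_block):
        dst = 0
        for page in dirty_block:
            if page is not None and page[0] == "valid":
                clean_block[dst] = page        # relocation costs extra back-end writes
                dst += 1
        for i in range(len(dirty_block)):      # the expensive block erase, paid now,
            dirty_block[i] = None              # in the background, not at write time

    old = [("valid", "a"), ("dirty", "b"), ("valid", "c"), ("dirty", "d")]
    new = [None] * 4
    collect(old, new)
    print(new)    # [('valid', 'a'), ('valid', 'c'), None, None]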

TRIM
Another technology that has started to gain momentum is the TRIM command. Fundamentally, TRIM allows the operating system to tell the storage device which file-system space is free, so the device can use that space like the reserved, over-provisioned space for garbage collection. The downsides: it is really only available in Windows 7 and Windows Server 2008 R2, though some manufacturers include a separate TRIM service for OSes that don't support it natively. Also, TRIM can only be effective if there is enough free space on the file system; if you fill the drive to capacity, TRIM is completely useless.

Another thing to consider is that an erasable block may be 256KB, while we generally format our file systems for SQL Server with 64KB allocation units, several times smaller than the erasable block. Last thing to remember, and it is good advice for any device, not just solid state storage: grow your files in large chunks to keep file fragmentation to a minimum. Heavy file fragmentation also cuts into TRIM's effectiveness and can't easily be fixed, since running a defragmenter may actually make the problem worse; it forces wholesale garbage collection and wears out the Flash that much faster.
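As an aside, on Windows 7 and Windows Server 2008 R2 you can check whether the OS is issuing TRIM (delete notifications) with the built-in fsutil utility:

    fsutil behavior query DisableDeleteNotify

A result of DisableDeleteNotify = 0 means TRIM is enabled; 1 means it is disabled.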

Write Amplification
Another pitfall of wear-leveling and garbage collection is the phenomenon of write amplification. As the device tries to keep up with write requests and garbage collection, it can effectively bring everything to a standstill. Again, writing and deleting serially in large blocks can mitigate some of this. Unfortunately, SQL Server access patterns for OLTP-style databases mean lots of little inserts, updates, and deletes, which adds to the problem: there may be enough free space to accommodate a write, but it is severely fragmented by the write pattern, and a large amount of garbage collection is needed. TRIM can help with this if you leave enough free space available, which means factoring free space into your capacity planning ahead of time. A full solid state device is a poor-performing one when it comes to writes.
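Write amplification is usually expressed as a simple ratio: bytes physically written to Flash divided by bytes the host asked to write. A worked example with illustrative numbers (a 256KB erase block absorbing a lone 4KB update):

    # Write amplification = bytes the flash writes / bytes the host requested.
    host_write = 4 * 1024               # a single 4KB page update
    block = 256 * 1024                  # erasable block size from the TRIM section
    relocated = block - host_write      # live data copied out before the erase
    print((host_write + relocated) / host_write)   # => 64.0 in this worst-ish case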

Error Detection and Correction

The nature of NAND Flash makes it susceptible to several types of data corruption. Just as hard drives and floppy disks are at risk near magnetic sources, NAND has several vulnerabilities of its own, and some of them occur even when just reading data.

Write Disturb
Data in cells that aren't being written to can be corrupted by writes to adjacent cells or even pages; this is called write disturb (also known as program disturb). Cells that aren't being programmed receive elevated voltage, causing them to appear weakly programmed. There isn't any damage to the physical structure, and the condition is cleared by a normal erase.

Read Disturb
Reading repeatedly from the same block can have a similar effect, called read disturb. Cells not being read collect a charge that causes them to appear weakly programmed. The main difference from write disturb is that it always occurs in the block being read, on pages other than the ones being read. Again, the physical cells aren't damaged, and an erase of the affected block clears the issue.

Charge Loss/Gain
Lastly, there is an issue with data retention on cells over time. A floating gate may gain or lose charge over time, making cells appear weakly programmed or in another invalid state. The block is undamaged and can still be reliably erased and written to.

All of this sounds catastrophic. Fortunately, error correcting code (ECC) techniques deal with these issues effectively.
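To make "ECC" concrete, here is a toy single-error-correcting Hamming(7,4) code in Python. Real controllers use far stronger codes over whole pages, but the principle is the same: check bits stored in the spare area let the controller spot a weakly programmed bit and flip it back on read.

    # Toy Hamming(7,4): 4 data bits protected by 3 parity bits, fixes any 1-bit error.
    def encode(d1, d2, d3, d4):
        p1 = d1 ^ d2 ^ d4               # parity over code positions 3, 5, 7
        p2 = d1 ^ d3 ^ d4               # parity over code positions 3, 6, 7
        p3 = d2 ^ d3 ^ d4               # parity over code positions 5, 6, 7
        return [p1, p2, d1, p3, d2, d3, d4]

    def correct(c):
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # recompute each parity group;
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # the mismatches spell out the
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # 1-based position of the bad bit
        pos = s1 + 2 * s2 + 4 * s3
        if pos:
            c[pos - 1] ^= 1             # flip the disturbed bit back
        return [c[2], c[4], c[5], c[6]]

    code = encode(1, 0, 1, 1)
    code[4] ^= 1                        # simulate a read-disturbed cell
    print(correct(code))                # => [1, 0, 1, 1], the original data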

Bad Block Management

NAND chips aren't perfect. Every chip may have defects of some sort and will ship from the factory with bad blocks already on the device, so bad block management is integrated into the NAND device. When a block fails a P/E cycle, the data is written to a different block and the failed block is marked bad, making it unavailable to the system.

Summary

As you can see, Flash is in some respects much more complicated than your traditional hard disk. There are many things you must consider when implementing solid state storage.

Takeaways
Flash read performance is great, sequential or random.
Flash write performance is complicated, and can be a problem if you don’t manage it.
Flash wears out over time. It's not nearly the issue it used to be, but you must understand your write patterns.
Plan for over-provisioning and TRIM support; it can have a huge impact on how much storage you actually buy.
Flash can be error prone. Be aware that writes and reads can cause data corruption.

Up next
We will talk about MLC vs. SLC, what makes a device enterprise-ready, and how to effectively benchmark your solid state storage so you aren't caught off guard when you move into production.

SSD, The Game Changer

I’ve often described SQL Server to people new to databases as a data pump.

Just like a water pump, you have limited capacity to move water in or out of a system, usually measured in gallons per hour.
If you want to upgrade your pumping system, it can be a twofold process: the physical pump and the size of the pipes.

Our database servers also have several pumps and pipes, and in general you are only as fast as your slowest pump or narrowest pipe: hard drives.

To feed other parts of the system we have resorted to adding lots and lots of hard drives to get the desired IO read/writes and MB/sec throughput that a single server can consume.
Everyone is familiar with Moore's law (often quoted, rarely understood), which loosely applied says CPU transistor counts double roughly every 24 months. Hard disks haven't come close to keeping up with that pace, performance-wise.

Up until recently, hard drive capacity was growing at almost the same rate, doubling roughly every 18 months (Kryder's law). The problem isn't size, it's speed.

Let's compare the technology from what may have been some folks' first computer to the cutting edge of today.

                 Circa 1981                 Today                         Improvement
Capacity         10MB                       1470MB                        147x
HDD seeks        85ms/seek                  3.3ms/seek                    20x
IO/sec           11.4 IO/sec                303 IO/sec                    26x
HDD throughput   5mbit/sec                  1000mbit/sec                  200x
CPU speed        8088 4.77MHz (.33 MIPS)    Core i7 965 (18,322 MIPS)     5521x

*These are theoretical maximums; in the real world your mileage may vary.


I think you can see where this is going. I won't go any further down memory lane; let's just say that some things haven't advanced as fast as others. As capacity has increased, speed has been constrained by the fact that hard disks are just that: spinning disks.

So, what does this little chart have to do with SSD? I wanted you to get a feel for where the real problem lies. It isn't the capacity of hard drives, it's the ability to get to the data quickly. Seeks are the key. SSDs have finally crossed a boundary where they are cheap enough and fast enough to make it into the enterprise space at all levels.

SSD compared to today’s best 15k.2 HDD from above.

             HDD             SSD             Improvement
Seek times   3.3ms/seek      85μs/seek       ~39x
IO/sec       303 IO/sec      35000 IO/sec    115x
Throughput   1000mbit/sec    2500mbit/sec    2.5x


So, in the last few years SSD has caught up to and passed HDD on the performance front by a large margin, and that's comparing a 2.5” HDD to a 2.5” SSD. The gap is even wider if you look at the new generation of SSDs that plug directly into the PCIe bus and bypass the drive cage and RAID controller altogether. HOT DOG! Now we are on track. SSD has allowed us to scale much closer to the CPU than anything storage-wise we have seen in a very long time.

Since this is a fairly new, emerging technology, I often see a lot of confused faces when talking about SSD. What is the technology, and why has it now become cost effective to deploy it instead of large RAID arrays?

Once you take out the spinning disks, the memory and IO controller march much more to the tune of Moore's law than Kryder's, meaning cost goes down while capacity AND speed go up. Eventually there will be an intersection where some kind of solid state memory, maybe NAND, maybe not, reaches price parity with spinning hard drives.

But, as with hard drives, not all SSDs are on the same playing field; just because it has SSD printed on it doesn't make it a slam dunk to buy.

Let's take a look at two implementations of SSD based on MLC NAND. I know some of you will be saying, why not SLC? I'm doing this to get a better apples-to-apples comparison and to put this, budget-wise, squarely in the realm of possibility.


The Intel X25-M, priced at $750.00 for 160GB in a 2.5” SATA 3.0 form factor, and the Fusion-io IoDrive Duo 640GB model, priced at $9,849.99 on a PCIe 8x single card.

                                IoDrive Duo    X25-M        Improvement
Capacity (GB)                   640            160          4x
Write bandwidth                 1000MB/sec     70MB/sec     14x
Read bandwidth                  1400MB/sec     250MB/sec    5x
Reads/sec                       126,601        35,000       4x
Writes/sec                      180,530        3,300        55x
Access latency (seek time)      80μs           85μs         ~
Wear leveling (write-erase/day) 5TB            100GB *      10x
Cost per unit                   $9,849.99      $750.00      13x
Cost per GB                     $15.39         $4.60        4x
Cost per IO, reads              $0.08          $0.02        4x
Cost per IO, writes             $0.06          $0.22        -4x

* This is an estimate based on this article: http://techreport.com/articles.x/15433. Intel has stated the drive should be good for at least 1 petabyte of write operations, or 10,000 cycles.
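For what it's worth, the cost-per-IO columns are just the unit price divided by the IOs per second each device can sustain; the small differences from the table are rounding:

    # Cost per IO = unit price / IOs per second (numbers from the table above).
    print(9849.99 / 126601)   # IoDrive Duo, reads:  ~$0.078
    print(9849.99 / 180530)   # IoDrive Duo, writes: ~$0.055
    print(750.00 / 35000)     # X25-M, reads:        ~$0.021
    print(750.00 / 3300)      # X25-M, writes:       ~$0.227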


Both of these drives use similar approaches to achieve their speed and IO numbers: they break the NAND up into multiple channels, like a very small RAID array. This is an oversimplification, but it gives you an idea of how things are changing. It is almost like having a bunch of small drives crammed into a single physical drive shell with its own controller; a mini-array, if you will.

So, not all drives are created equal. In Intel's defense, they don't position the X25-M as an enterprise drive; they would push you to their X25-E, an SLC-based NAND device that is more robust in every way. But keeping things equal is what I am after today.

To match the IoDrive Duo's performance it could take as few as 4 X25-M drives (reads/sec) or as many as 55 (writes/sec), depending on which IO numbers you are trying to match.

Wear leveling is my biggest concern with NAND-based SSDs. We are charting new waters and really won't know what the reliability numbers are until the market ages another 24 to 36 months. You can measure how much writing your current system actually does to disk and get a rough estimate of the longevity of an SSD. Almost all of them are geared for 3 to 5 years of usability until they croak.
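A rough way to run that estimate yourself, given a measured daily write volume; the endurance figure is Intel's "at least 1 petabyte of writes" claim from the footnote above, and the write rate is purely illustrative:

    # Rough SSD longevity estimate from a measured daily write volume.
    endurance_gb = 1_000_000            # ~1PB of rated write endurance
    writes_gb_per_day = 100             # measure your own system; illustrative only
    days = endurance_gb / writes_gb_per_day
    print(days / 365)                   # => ~27 years; at modest write rates,
                                        #    write endurance is rarely the limit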

At a minimum it would take 10 X25-M drives to equal the stated longevity of a single IoDrive Duo.

Things also start to level out once you factor in RAID controllers and external enclosures if you are going to overflow the internal bays on the server; that can easily add another $3,000.00 to $5,000.00 to the price. All of a sudden the IoDrive Duo starts looking more appealing by the minute.


What does all this mean?

Not all SSDs are created equal. Being constrained to the SATA/SAS bus and standard drive form factors can be a real limiting factor; if you break that mold, the benefits are dramatic.

Even with Fusion-io's cost per unit, it is still pretty cost effective, compared to other solutions out there, in some situations like write-heavy OLTP systems.

I didn't even bother to touch on something like Texas Memory Systems' RamSan devices: at $275,000.00 for 512GB of usable space in a 4U rack mount device, the cost per GB or IO is just through the roof and hard to justify for 99% of SQL Server users.

You need to look closely at the numbers, do in-house testing, and make sure you understand your current IO needs before you jump off and buy something like this. It may also be worth leveraging SSD in conjunction with your current storage, moving only the data that requires this level of performance, to keep cost down.

If this article has shown you anything, it's that technology marches on. In the next 6 to 12 months there will be a few more choices on the market for large SSDs in the 512GB to 2TB range from different manufacturers at a range of prices, making the choice to move to SSD even easier.

In early April, Microsoft Research published a paper examining SSDs and enterprise workloads. They don't cover SQL Server explicitly, but they do talk about Exchange. The conclusion is pretty much that SSD is too expensive to bother with right now. I both agree and disagree: that was true several months ago; today, not so much.

The landscape has changed significantly since that paper was published and will continue to do so. I think we are on the verge of asking "why not use SSD?" instead of "do we really need it?"

With that said, please do your homework before settling on a vendor or SSD solution; it will pay dividends in not having to explain to your boss that the money invested was wasted.

A little light reading for you:

SSD Primer
http://en.wikipedia.org/wiki/Solid-state_drive

James Hamilton’s Blog
http://perspectives.mvdirona.com/2009/04/12/WhereSSDsDontMakeSenseInServerApplications.aspx


-Wes