Category Archives: SSD

Fusion-io, Flash NAND All You Can Eat

Fusion-io has announced general availability of the new Octal. This card is the largest single flash based device I’ve ever seen. The SLC version has 2.56 terabytes of raw storage and the MLC has a whopping 5.12 terabytes of raw storage. This thing is a behemoth. The throughput numbers are also impressive, both read at 6.2 Gigabytes a second using a 64KB block, you know the same size as an extent in SQL Server. They also put up impressive write numbers the SLC version doing 6 Gigabytes a second and the MCL clocks in at 4.4 Gigabytes a second.

There is a market for these drives but you really need to do your homework first. This is basically four ioDrive Duos or eight ioDrive’s using a single PCIe 2.0 16x slot. It requires a lot of power, more than the PCIe slot can provide. It needs an additional three power connectors two 6 pin and one 8 pin. EDIT: According to John C. You only need to use ether the two 6 pin OR the single 8 pin. These are pretty standard on ATX power supplies in your high end desk top machines but very rarely available in your HP, Dell or IBM server so check to see if you have any extra power leads in your box first.

Also, remember you have to have a certain amount of free memory for the ioDrive to work. They have done a lot of work in the latest driver to reduce the memory foot print but it can still be significant. I would highly recommend setting the drive up to use a 4K page instead of a 512 byte page. After that, you will still need a minimum of 284 megabytes of RAM per 80 gigabytes of storage. On the MLC Octal that comes to 18 gigabytes of RAM that you need to have available per card. To be honest with you, if you are slotting one of these bad boys into a server it won’t be a little dual processor pizza box. On the latest HP DL580G7’s you can have as much as 512 gigabytes of RAM so carving off 18 gigabytes of that isn’t such a huge deal.

Lastly, you will actually see several drives on your system each one will be a 640 gigabyte drive. If you want one monster drive you will have to stripe them at the OS level. The down side of that is loosing TRIM support which is a boon to the overall performance of the drive, but not a complete deal breaker. EDIT: John C is correct. You don’t loose TRIM for striping with the default Windows RAID stripe on Windows Server 2008 R2. I’m waiting for confirmation from Symantec if that is the case with Veritas Storage Foundation as well since that is what I am using to get software RAID 10 on my servers.

I don’t have pricing information on these yet, but I’m guessing its like a Ferrari, if you have to ask you probably can’t afford it.

SATA, SAS or Neither? SSD’s Get A Third Option

I recently wrote about solid state storage and its different form factor. Well, several major manufacturers have realized that solid state needs all the bandwidth it can get. Dell, IBM, EMC, Fujitsu and Intel have formed the SSD Form Factor Working Group bringing PCIe 3 to the same form factor that SATA and SAS use. Focusing on the same connector types and a 2.5” dive housing. I’m not sure how quickly it will make it’s way into the enterprise space but that is clearly it’s target. Reusing the physical form factor cuts down on manufacturing and R&D costs for all involved. They have an aggressive time scale for something like this. The specification hasn’t been published yet and I’ll take a deeper look into it when it becomes available. There are some key players missing though. HP and Seagate being the two in the enterprise space that give me pause. Both control a large segment of the storage space. On the controller side LSI is also absent. This could be a direct threat to their current market domination of the RAID controller chipset space if they aren’t on the ball.

Fusion-io got that early on and took a different route sticking with just PCIe to bypass the limitations of SAS/SATA and intermediate controllers. By going that route they opened up a whole other level of performance.

I asked David Flynn what he thought about the new standard. Fusion-io is a contributor to the working group.

It is quite validating that folks would be routing PCIe to the drive bays. For us it’s just another form factor that

we can easily support it.

Two things, though… First is that I believe it’s a hangover from the mechanical drive era to put such emphasis on form factors that allow easy servicing access. Solid state should not need to be serviced. It should be much more reliable than HDD’s. But, outside of Fusion-io failure rates for solid state is actually much worse than for mechanical disk drives.

The second point is that form-factor and even PCIe attachment isn’t really the key thing to higher performing, more reliable solid state. What makes the real difference is eliminating the embedded CPU bottleneck in the access path to the flash.

Fusion-io uses a memory controller approach to integrating flash. You don’t find CPU’s on DRAM modules. SSD’s (SATA or PCIe) from everyone else use embedded CPU’s and attach using storage controller methodologies.

In an upcoming post on my solid state storage series I will explore failure rates in detail. I do find it interesting that Fusion-io is one of the very few companies that have significantly higher error detection rates than a standard hard drive or other SSD’s, even enterprise branded SSD’s. Fusion-io claims 10^20 uncorrectable detectable error rate and 10^30 uncorrectable undetectable error rate. I have yet to see any hard disk or SSD with a rate better than 10^17. So, I agree with David about actually needing a form factor for ease of service if you build the device with enough error correction, which clearly you can with solid state.

Fundamentals of Storage Systems, Understanding Reliability and Performance of Solid State Storage

Solid state storage has come on strong in the last year. With that explosion of new products it can be hard to look at all the vendor information and decide which device is best for you. Between the different manufacturers using different methods to benchmark their products showing two different numbers for reads and writes using different methodologies it can be extremely confusing. If you haven’t read Solid State Storage Basics you may not understand all the terms used in this article.

SLC and MLC Characteristics and Differences

Right now there are two main flavors of NAND Flash that are in use. Single Level Cell(SLC) and Multi Level Cell(MLC). SLC stores a single bit cell while MLC can store two bits. There are flavors of MLC that can store three and four bits but are unsuitable at this time for mass storage like hard drives. They have very low endurance and wear out quickly.

SLC has several desirable characteristics that have made it the choice for enterprise applications for quite a while. It is more durable in every way over MLC. Where it loses out is on capacity and price.

Measure	SLC	MLC
Read Speed	25~ nanoseconds	50~ nanoseconds
Write Speed	220~ nanoseconds	900~ nanoseconds
P/E Cycles	100k to 300k	3k to 30k
Minimum ECC Bits required	1 bit per 512 bytes	12 bits per 512 bytes
Block Size	64KB	128KB

SLC can cost as much as five times as MLC. This alone is enough for many manufacturers to look at MLC over SLC. Couple that with the increased capacity makes MLC a compelling alternative for mass storage. The problem has been how to make MLC reliable in the enterprise.

Enterprise Reliability

As you can see, SLC is more robust requiring less error correcting code to fix data issues. Just a few years ago, MLC wasn’t considered good enough to be in even consumer grade drives. Over the last three years several manufacturers have focused on building NAND Flash controllers that could compensate for this using large amounts of error correction. In some cases several times the 12 bits per 512 bytes. This combined with better garbage collection and wear-leveling algorithms have finally extended MLC into the enterprise. This comes with a price though. ECC has to be stored somewhere, usually sacrificing storage space, and you need a much more powerful controller to handle the calculations without hurting performance. Another one of the techniques to extend the performance and endurance used is to put as many chips in a parallel arrangement with multiple channels. Think of it as RAID on a chip level instead of a hard disk level. This allows them to spread the IO load as wide as possible. The larger the capacity of the storage device the more area it has to use things like TRIM and it’s own internal garbage collection across multiple NAND chips keeping IO from stalling out due to write amplification. It also increases the life of the device as well since you can spread the wear-leveling out. There are standards bodies like JEDEC that help define endurance and longevity but you must still read the fine print. A good example is the Intel product manual for the X-25M SSD. If you look at page 6 you see the minimum useful life rated at 3 years. But, if you look at the write endurance you see that the 80 gigabyte drive is rated at 7.5 terabytes. That is 7.5 terabytes period, for the life of the drive. That means you shouldn’t write more than 21 gigabytes a day in changed data to the drive. For SQL Server that can be quite a low number. I’ve seen data warehousing processes load multiple terabytes over a 8 hour load window. Again, capacity equals endurance the 160 gigabyte drive can sustain 15 terabytes worth of data change. Intel will tell you that the X25-M is meant for enterprise workloads, they are wrong. In contrast, the X-25E SSD has a much longer life due to the SLC it uses instead of MLC. the 32 gigabyte version supports 1 petabyte of random writes and the 64 gigabyte drive supports 2 petabytes of random writes over the life of the drive. This makes the X-25E a better candidate for server work loads. Fusion-io rates their MLC based ioDrive at 5 terabytes a day. They also claim a life expectancy of 16 years. That is 28 petabytes of P/E cycles. This is to just show you that with enough engineering you can have an MLC based device still be very reliable.

SATA, SAS or Neither?

The interface for your solid state disk is also critical to the performance of the drive. We are quickly hitting a wall with SATA II and solid state where a single SSD can saturate a single SATA channel. SAS and SATA both have released the new third generation standard allowing up to 600 megabytes a second of through put but even that doesn’t offer much head room for growth. Several manufacturers are calling their SSD offerings enterprise even though they are on a SATA interface. If you are building a high performance IO subsystem SATA isn’t the best option. With SATA II and the addition of Native Command Queuing it did get a lot better but still falls short of SAS in several areas.

SATA Vs. SAS

Feature	SAS	SATA
Command Queuing	TCQ supports queue depths up to 216 usually capped at 64	NCQ supports queue depths up to 32
Error recovery and detection	Uses the SCSI command is more robust	SMART Proven to be in adequate. see Google Paper
Duplex	Full Duplex dual port per drive	Half Duplex single port
Multi-path IO	fully supported at drive level	supported in SATA II via expanders

Some of these features were nice but if you were choosing between a 7200 RPM SATA drive and a 7200 RPM SAS drive there wasn’t a huge difference. Add in flash though and SATA very quickly shows its short comings. I cannot stress how important command queuing is to flash storage. If the drive you have picked supports NCQ make sure your HBA supports NCQ and ACHI mode to get the most out of it, PC Perspective has a nice write up on this. Lastly, most SATA drives don’t honor the OS request to disable write caching on the drive. This is a big deal for SQL Server where protecting the data is very important. That alone usually keeps me from putting critical databases on SATA based storage. Most RAID HBA’s may let you toggle the drives write cache on or off on a per drive basis but there is still no guarantee that the drive will honor that request ether.

PCIe add in cards
If you aren’t limited to the standard 3.5” or 2.5” form factor and can choose a PCIe based flash device I would recommend starting with Fusion-io. I haven’t had any experience with the Texas Memory System PCIe card though. OCZ, Super Talent and others like them use a combination of bridge chips, RAID controller chips and flash controller chips to build up their SATA PCIe offerings. The form factor may be more convenient but they are ultimately the same as multiple SATA drives plugged into a RAID HBA.

The last thing to remember is TRIM doesn’t work through RAID HBAs SAS or SATA doesn’t matter.

Performance Characteristics

By the numbers
I see people quote performance numbers from different manufactures about just how fast their particular solid state storage is. The problem is, there is no real standard for measuring performance and it can be almost impossible to do an apples to apples comparison between two different devices. If you start at the product specification for the X25-M you see the what you expect. 4K read IOPS 35,000 at 100 percent span(using the entire drive). Write IOPS however are a little different. Using 100 percent span the IO/Sec drop to 350. If you only use one tenth of the drive it shoots up to 3300. The difference is startling. Using an old technique called short stroking, they are able to show the drive in a better light. Using this technique on hard disks yields higher IO’s per second at the cost of capacity and throughput. Applying this technique to a solid state disk limits the amount of data space used for writes and gives the maximum amount of free space for wear-leveling and garbage collection greatly reducing the write amplification effect. Rarely do you see the lower number quoted. On the X-25E all numbers are quoted at full span, showing again the higher performance of SLC. Also, if you look at the footnotes all write tests were done with drive caches enabled. For SQL Server this is a bad idea, if you have a power outage any data in the drive cache is lost. They perform these tests at the maximum queue depth for Native Command Queuing (NCQ) can handle. Again, this pushes the device to its peak throughput. This isn’t a bad thing for SSD’s, but most SQL Server setups have been engineered to keep queue depths low to decrease latencies from the IO system which is usually made up of spinning disks. If you don’t have latency issues now, you may not see a huge improvement by replacing your spinning disks with solid state ones. Size of the IO request is also very important Usually for number of IO’s they will use a sector sized request. On SSD’s that is normally 4 kilobytes. For throughput megabytes per second they use a 128 kilobyte request to get higher numbers. So, when you read the specifications you get the impression that a drive will do say 260 MB/sec at 35,000 IOs/Sec which just isn’t true. This isn’t a new game, hard drive benchmarks also do something similar. As you look at the 4k numbers you can effectively cut them in half since SQL Server works on an 8k page request size. SSDs also perform differently on random and sequential IO loads just like hard disks do. When you look at the specification make sure and note the IO mix, if they don’t give those numbers assume that you will have to do your own testing!

Previous Writes Effect Future Writes
Another issue with the performance numbers quoted has to do with the state of the drive. When a solid state disk is new, i.e. never been written to, it is at it’s peak. Performance will be the best it is ever going to be. When you test your solid state devices doing short duration tests can be very misleading. As I have already pointed out, if you only use a small section of the drives for writes you get inflated numbers. If you only do a short test on the entire drive you are effectively doing the same thing. You must test the entire drive. You must also understand your workload. If you don’t know what the workload will be don’t be afraid to test a wide range of IO sizes and types. Sequential writes tend to leave large contiguous blocks of free space making garbage collection faster. In contrast random writes typically leave lots of small blocks of free space forcing garbage collection to work overtime slowing writes down. As you move from one IO type to another you should add in extra time for the drive to settle into a new steady state before resuming valid samples. Your goal is to get the drive to perform in a predictable manor for your IO load. Realize you may need to discard a range of samples that cover the transition from one steady state to the other. It can lower or inflate your averages and cause you to under or over provision your storage to meet your IO requirements.

Performance over Time
Unlike a hard drive, as you use a solid state disks performance degrades over time for several reasons. In the case of the X-25M the first firmware suffered from poor garbage collection and IO pattern recognition on large volumes of small IO’s causing the drive to suffer as much as a five fold decrease in write performance. We aren’t just talking small files but small changes to large files, like SQL Server data files. This particular problem was partially fixed with a firmware update. In general, all solid state devices suffer As you use your drive over a longer period it will lose performance as part of the normal wear on the NAND Flash chips themselves. They develop more errors cause more write retries. These issues are corrected using ECC and bad block management, but it still leads to poorer performance. SLC has an advantage over MLC again due to it’s much higher endurance but isn’t 100 percent immune to this. If you replace your hardware on a three or five year cycle this may not be a huge issue for you, but it still pays to monitor the performance over time.

Summary

There is a lot to learn when it comes to solid state storage. Making sure you do your own testing and research can keep you from suffering from premature failure and poor performance down the road. Remember, NAND Flash has been around for a while but this new wave of solid state storage is only a few years old. Not having a large pool of these devices in the field for longer than their rated life span makes it hard to predict if they are truly as reliable as we all hope they are.

Fundamentals of Storage Systems, Solid State Storage Basics

Solid state storage is the new kid on the block. We see new press releases every day about just how awesome this new technology is. Like with any technology, you need a solid foundation in how it works before you can decide if it is right for you. Lets review what solid state storage is and where it differs from traditional hard disks. I will cover solid state storage in a general manor not favoring any specific flash manufacturer or specific type of Flash.

Outline:
Types of Flash
NAND Flash Structure
NAND Flash Read Properties
NAND Flash Write Properties
Wear-Leveling
Garbage Collection
Write Amplification
TRIM
Error Detection and Correction

Flash Memory

Flash is a type of memory like the RAM in your computer. There are several key differences though. First NAND is non-volatile, meaning it doesn’t require electricity to maintain the data stored in it. It also has very fast access times, not quite RAM’s access times but in between RAM and a spinning hard disk. It does wear out as you write to it over time. There are several types of Flash memory. The two most common type are NAND and NOR. Each has its benefits. NOR has the ability to write in place and has consistent and very fast read access times but very slow write access times. NAND has a slower read access time but is much faster to write to. This makes NAND more attractive for mass storage devices.

The Structure of NAND Flash

NAND stores data in a large serial array of transistors. Each transistor can store data. NAND Flash arrays are grouped first into pages. A page consists of data space and spare space. Spare space is physically the same as data space but is used for things like ECC and wear-leveling, which we will cover shortly. Usually, a page is 4096 bytes for data and 1 to 4 bits of spare for each 512 bytes of data space. Pages are grouped again into blocks of 64 to 128 pages, which is the smallest erasable unit. There can be quite a few blocks per actual chip, as many as 16 thousand blocks or 8 gigabytes worth, on a single chip. From there manufacturers group chips together usually in a parallel arrangement using controllers to make them look like one large solid state disk. They can vary in form factor. Most common are a standard 2.5” or 3.5” drive or a PCIe device.

NAND Read Properties

NAND Flash operates differently than RAM or a hard disk. Even though NAND is structured in pages like a hard disks, that is where the similarities end. NAND is structured to be access serially. As a type of memory, NAND flash is a poor choice for random write access patterns. A 15,000 RPM hard disk my have a random access seek time of 5.5 milliseconds. It has to spin a disk and position the read/write head. NAND on the other hand doesn’t actually seek. It does a look up and reads the memory area. It takes between 25 to 50 nanoseconds. It has the same read time no matter the type of operation random or sequential. A single NAND chip may be able to read between 25 and 40 megabytes a second. So, even though it is considered a poor performer for random IO, it is still orders of magnitude faster than a hard disk.

NAND Write Properties

NAND Flash has a much faster read speed than write speed. The same NAND chip that reads at 40 megabytes a second may only sustain 7 megabytes a second in write speed. Average write speed is 250~ nanoseconds. This figure only includes programming a page. Writing to flash can be much more complicated if there is already data in the page.

Program Erase Cycle

NAND does writes based on a program erase(P/E) cycle. When a NAND block is considered erased all bits are set to 1. As you program the bit you set it to 0. Program cycle writes page at a time and can be pretty quick. NAND doesn’t support a overwrite mode where a bit, page or even block can be overwritten without first being reset to a cleared state. The P/E cycle is very different from what happens on a hard disk where it can overwrite data without first having to clear a sector. Erasing a block takes between 500 nanoseconds to 2 milliseconds. Each P/E cycle wears on the NAND block. After so many cycles the block becomes unreliable and will fail to program or erase.

Wear-Leveling

To mitigate the finite number of P/E cycles a NAND chip has we use two different techniques to keep them alive or make sure we don’t use a possible bad block again. Lets take a single NAND MLC chip. It may have 16 thousand blocks on it. Each block may be rated between 3,000 to 10,000 P/E Cycles. If you execute a P/E cycle on one block per second it would take you over five years to reach the wear out rating of 10,000 cycles. If on the other hand you executed a P/E cycle on a single block you could hit the 10,000 rating in about 3 hours! This is why wear-leveling is so important. In the early days of NAND flash wearing out a block was a legitimate concern as applications would just rewrite the same block over and over. Modern devices spread that over not just a single chip but every available chip in the system. Extending the life of your solid state disk for a very, very long time. Ideally, you want to write to each block once before writing the second block. That isn’t always possible due to data access patterns.

It sounds simple enough to cycle through all available blocks before triggering a P/E cycle but in the real world it just isn’t that easy. As you fill the drive with data it is generally broken into two different categories, static data and dynamic data. Static data is something that is written once, or infrequently, and read multiple times. Something like a music file is a good example of this. Dynamic data covers things like log files that are written to frequently, or in our case database files. If you only wear-level the dynamic data you shorten the life of the flash significantly. Alternatively if you also include the static data you are now incurring extra write and read IO in the back ground that can effect performance of the device.

Background Garbage Collection
To defer the P/E cycle and mitigate the penalty of a block erase we rely on garbage collection running in the background of the device. When a file is altered it may be completely moved to clean pages and blocks, the old blocks are now marked as dirty. This tells the garbage collector that it can perform a block erasure on it at any time. This works just fine as long as the drive has enough spare area allocated and the number of write request is low enough for the garbage collector to keep up. Keep in mind, this spare area isn’t visible to the operating system or the file system and is independent of them. If you run out of free pages to program you start forcing a P/E cycle for each write slowing down writes dramatically. Some manufacturers off set this with a large DRAM buffer and also may allow you to change the size of the over provisioned space.

TRIM
Another technology that has started to gain momentum is the TRIM command. Fundamentally, this allows the operating system and the storage device to communicate about how much free space the file system has and allows the device to use that space like the reserve space or the over provisioned space used for garbage collection. The down sides are it is really only available in Windows 7 and Windows Server 2008 R2. Some manufacturers are including a separate TRIM service on those OS’es that don’t support it natively. Also, TRIM can only be effective if there is enough free space on the file system. If you fill the drive to capacity then TRIM is completely useless. Another thing to consider is an erasable block may be 256 KB and we generally format our file system for SQL Server at 64KB several times smaller than the erasable block. Last thing to remember, and it is good advice for any device not just solid state storage, is grow your files in large chunks to keep file fragmentation down to a minimum. Heavy file fragmentation also cuts down on TRIM’s performance and can’t be easily fixed since running a defragment may actually make the problem worse as it forces whole sale garbage collection and wears out the flash that much faster.

Write Amplification
Another pit fall of wear-leveling and garbage collection is the phenomenon of write amplification. As the device tries to keep up with write request and garbage collection it can effectively bring everything to a stand still. Again, writing serially and deleting serially in large blocks can mitigate some of this. Unfortunately, SQL Server access patterns for OLTP style databases means lots of little inserts, updates and deletes. This adds to the problem. There may be enough free space to accommodate the write but it is severely fragmented by the write pattern and a large amount of garbage collection is needed. TRIM can help with this if you leave enough free space available. This also means factoring free space into your capacity planning ahead of time. A full solid state device is a poor performing one when it comes to writes.

Error Detection and Correction

The nature of NAND Flash makes it susceptible to several types of data corruption. Like hard drives and floppy disks are at risk near magnetic sources. NAND has several vulnerabilities, some of them occur even when just reading data.

Write Disturb
Data in cells that aren’t being written to can be corrupted by writing to adjacent cells or even pages, this is called Program Disturb. Cells not being programmed receive elevated voltage causing them to appear weakly programmed. There isn’t any damage to the physical structure and can be cleared with a normal erase.

Read Disturb
Reading repeatedly from the same block can also have a similar effect call Read Disturb. Cells not being read collects a charge that causes it to appear to be weakly programmed. The main difference from Write Disturb is it is always on the block being read and always on pages not being read. Again, the physical cell isn’t damaged and an erase on the effected block clears the issue.

Charge Loss/Gain
Lastly, there is an issue with data data retention on cells over time. The charge on a floating gate over time may gain or lose charge, making them appear to be weakly programmed or in another invalid state. The block is undamaged and can still be reliably erased and written to.

All of this sounds just as catastrophic as it gets. Fortunately, error correcting code (ECC) techniques effectively deal with these issues.

Bad Block Management

NAND chips aren’t always perfect. Every chip may have defects of some sort and will ship from the factory with bad blocks already on the device. Bad block management is integrated into the NAND chip. When a cell fails a P/E cycle the data is written to a different block and that block is marked bad making it unavailable to the system.

Summary

As you can see, Flash in some respects is much more complicated that your traditional hard disk. There are many things you must consider when implementing solid state storage.

Takeaways
Flash read performance is great, sequential or random.
Flash write performance is complicated, and can be a problem if you don’t manage it.
Flash wears out over time. Not nearly the issue it use to be, but you must understand your write patterns.
Plan for over provisioning and TRIM support it can have a huge impact on how much storage you actually buy. Flash can be error prone. Be aware that writes and reads can cause data corruption.

Up next
We will talk about MLC vs SLC. What makes a device enterprise ready. How to effectively benchmark your solid state storage and not be caught off guard when you move into production.