Category Archives: IO

Fundamentals of Storage Systems – The Basics of Spinning Disks

Your servers are only as fast as their slowest part: hard drives. To feed the other parts of the system, we have to add lots of drives to get the desired IO a single server can consume.

The basics of how hard drives work have been fundamentally static since the 70's; only the techniques and core technologies have been refined. You have a shaft, or "spindle," attached to a motor. Disks, or "platters," are attached to the spindle. The motor spins the spindle and the platters. Read/write heads, controlled by actuator motors, move across the surface with very precise motion and access the information stored on the platters. Generally, there is one read/write head per usable platter surface.

Simple.

This configuration has worked so well for the last 45 years that every claim to date that some new technology would unseat it just hasn't panned out. That's not to say it won't happen, just that hard drives have been "good enough" for the bulk of our storage needs for a very, very long time. Since this is the core of permanent storage in our database world, it is important to have a basic understanding of them.

[Image: Six hard disk drives with cases opened showing platters and heads; 8, 5.25, 3.5, 2.5, 1.8, and 1 inch disk diameters are represented. Photo by Paul R. Potts, 1 March 2008. http://commons.wikimedia.org/wiki/File:SixHardDriveFormFactors.jpg]

I love this picture. Smaller and faster yet still the same.

To give you an idea of what you are up against, let's compare the growth rate of your hard drive vs. your CPU.

Our 1981 machine has the veteran Seagate ST-412 and an Intel 8088.
Our new computer has a Seagate Cheetah 15k.6 ST3146356SS and a Core i7 965, from Intel of course.

Metric      Circa 1981                  Today                         Improvement
Capacity    10 MB                       147 GB                        14,700x (209,715x for a 2TB drive)
Seek time   85 ms                       3.4 ms                        25x (6x for a 2TB drive)
IO/sec      11.4                        303                           26x
Throughput  5 Mbit/sec (0.625 MB/sec)   1,000 Mbit/sec (125 MB/sec)   200x
CPU         4.77 MHz (0.33 MIPS)        3,200 MHz (18,322 MIPS)       55,521x

At first glance we can say, WOW, what an improvement! Right up until you see how far the processors have come. Everyone is familiar with Moore's law (often quoted, rarely understood), which loosely applied says CPU transistor counts double roughly every 18 to 24 months. Up until recently, hard drive capacity had been growing at almost the same rate, doubling in size around every 18 months (Kryder's Law). Hard disks haven't come close to keeping up with that pace, performance-wise. Again, the problem isn't size, it's speed.

The Makeup of A Modern Hard Drive

"You cannot change the laws of physics" – Scotty

As I stated in the previous section, hard drives have remained relatively unchanged since the IBM Winchester drive. Let's take a closer look at the physical structure.

Heads, Sectors and Cylinders

So, we have a spindle, one or more platters, and one or more read/write heads, all of that spinning and jittering about at a pretty good clip. So, just how does the computer know where your data is? The platter is broken up into a map of sorts.

Simplistic view:

[Image: a platter divided into concentric tracks and pie-slice sectors]

The platter is broken up into concentric rings and pie slices that allow the drive controller to find the region where the data is.

[Image: heads moving in unison through the platters, forming a cylinder]

The heads all move in unison and present a view through the platters that makes up a cylinder. I won't go into great detail on how sector and track layouts have advanced, or on the advent of Logical Block Addressing; there are plenty of articles on the web that get into those nuts and bolts. What I'm after is to show you physically what has to happen to read the data from the disk, and why that is the limiting factor. With the disk spinning at 15,000 RPM the sectors are flying by pretty quickly, so the head has to be positioned above the sector and then read or write to it as the platter moves underneath. The spinning disk, moving the heads, and waiting for the data to be read all add up to latency.

Rotational latency is how long it takes the sector we are after to move under the head to be read or written to. Average rotational latency is expressed as half the time it takes for the platter to make one revolution. For our 15k hard drive that number is 2 milliseconds: 60 seconds, divided by 15,000 RPM, divided by 2.

Seek Time is how quickly the disk head can be positioned over a sector to start reading data.

There are two kinds of seek we are interested in: average random seek time and sequential, or track-to-track, seek time.

In our top-of-the-line Seagate Cheetah, the random read seek time is 3.4ms; that is the time it takes to get from any one sector to any other sector, usually half the distance from the inner track to the outermost track. Random write seek time is 3.9ms. It is longer due to the process of actually writing to the sector the head is at before moving on to the next random sector.

Sequential seeks are much, much faster. If the head only has to move to the next track, it can usually do so in under a millisecond.

All this adds up to an average access time. Basically, you take the rotational latency plus the average random seek time, plus any command processing overhead; I usually throw in an additional millisecond. Our Cheetah has a random access time of 6.4ms. Sometimes it may be much faster, sometimes much slower, but this is a good number to work with as far as planning our storage needs.

The flip side of operations per second is throughput, usually expressed in megabytes per second.

This correlates directly to the amount of data that can be squeezed onto each track. As drive densities go up, so does the average megabytes per second. There is something you should know: the inner tracks are slower on throughput but higher on IOs, and the outer tracks are higher on throughput and lower on IOs. This is just a function of the diameter of the platter getting larger the farther out you go.

It isn’t unusual to see sequential throughput average of around 110 MB/sec and that is only getting better.

Random throughput is not so rosy a picture. I haven't seen any drive manufacturers advertise these numbers; from my own testing it can be as little as 15MB/sec, up to 40MB/sec. You should test your system to get more accurate numbers.

What It All Means to Us

This boils down to how many I/O operations a single disk can give us. In SQL Server land, random IO is king and generally one of the biggest bottlenecks on our data files. For log files, things are a little better. Since logs are written to sequentially, you can effectively double the available I/Os a drive can provide, since you have cut out the random access and are much closer to the sequential, or track-to-track, access time.

To calculate the maximum number of random operations we use: 1000ms / (seek time [ms] + rotational latency [ms] + overhead [ms]) = input/output operations per second.

or

1000 / (3.4 + 2 + 1) ≈ 156 IOPS

Sequential reads get much better, since seek times drop from 3.4ms to around 0.7ms.

1000 / (0.7 + 2 + 1) ≈ 270 IOPS

Almost twice as much! Now you know why we keep our database log files separate from each other and from the data: the number of disks needed to get the same performance is about half. We do the same math for writes, and they will be a little less.

Hard drives suffer from what is known as the "hockey stick" effect: the closer they get to 100% utilization, the more dramatically performance falls off.

[Image: the "hockey stick" curve of latency vs. disk utilization]

Running a disk at 100% of its IO capacity introduces the maximum possible latency. The knee of the curve is around 80%; we back that off a little more, to 75%, and that gives us the number of IOs we have available per hard drive in the storage system in general. This reduces queuing and keeps latency low, at the cost of the maximum number of IOs. Now our available read IOs are down to about 117 IOPS for random access and 203 IOPS for sequential. These numbers will get better as seek times and command overhead improve. But remember, it will never, ever be better than the 2.0ms of rotational latency. Physics can be a real bummer sometimes. Along with physical spindle speed, there have been large improvements in how the drive handles incoming and outgoing requests. Through IO prioritization and advanced command queuing algorithms (Native Command Queuing on SAS/SATA), access times and latencies are kept predictable and as fast as possible.
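If you want to run these numbers against your own drives, here is a minimal sketch of the arithmetic above in C#. The defaults are the Cheetah's spec-sheet figures plus my one-millisecond command overhead fudge; treat them as assumptions and plug in your own measurements.

```csharp
// Back-of-the-envelope disk IOPS calculator for the formulas above.
using System;

class DiskIops
{
    static void Main()
    {
        double rpm         = 15000; // spindle speed
        double seekMs      = 3.4;   // average random read seek time
        double trackSeekMs = 0.7;   // track-to-track (sequential) seek time
        double overheadMs  = 1.0;   // command processing overhead (my usual fudge)
        double derate      = 0.75;  // stay below the knee of the latency curve

        // Average rotational latency: half of one revolution, in milliseconds.
        double rotationalMs = (60.0 / rpm) * 1000.0 / 2.0; // 2.0 ms at 15k RPM

        double randomIops     = 1000.0 / (seekMs + rotationalMs + overheadMs);
        double sequentialIops = 1000.0 / (trackSeekMs + rotationalMs + overheadMs);

        Console.WriteLine($"Rotational latency: {rotationalMs:F1} ms");
        Console.WriteLine($"Random:     {randomIops:F0} IOPS ({randomIops * derate:F0} at 75%)");
        Console.WriteLine($"Sequential: {sequentialIops:F0} IOPS ({sequentialIops * derate:F0} at 75%)");
    }
}
```

With the defaults it prints 2.0ms of rotational latency, 156 and 270 raw IOPS, and 117 and 203 IOPS after the 75% derate, the same numbers worked out above.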

Series To Date:

  1. Introduction
  2. The Basics of Spinning Disks – You are here!
  3. The System Bus
  4. Disk Controllers, Host Bus Adapters and Interfaces
  5. RAID, An Introduction
  6. RAID and Hard Disk Reliability, Under The Covers
  7. Stripe Size, Block Size, and IO Patterns
  8. Capturing IO Patterns
  9. Testing IO Systems

The Fundamentals of Storage Systems – Introduction

At least once a year I give a large talk on disk subsystems, IO and SQL Server. It's a ground-up tour, from the nuts and bolts of how a hard drive works through SANs and solid state disks. The reason I give this presentation so often is that it is one of the most requested topics and one of the most misunderstood. The problem often lies in the fact that the DBA may not know that much about different storage systems, but they do know it is very important to their jobs. With the rise of SAN, iSCSI and other storage solutions, DBAs have less and less control over the disk system their SQL Server relies on. It's my goal to give them, or you, the tools needed to effectively present their needs to the storage teams, hopefully without a major amount of fuss and argument. If you know how and why it works the way it works, you can make logical requests in the language that your storage folks understand.

The presentation is meant to lay a foundation that can then be built upon to expand your knowledge of all things I/O.

This article series will be slightly expanded over what my presentation normally covers, since I'm only restricted by your willingness to read what I write. It will still be a condensed version of storage systems, but I'll put up as many reference links as I can.

Series To Date:

  1. Introduction
  2. The Basics of Spinning Disks
  3. The System Bus
  4. Disk Controllers, Host Bus Adapters and Interfaces
  5. RAID, An Introduction
  6. RAID and Hard Disk Reliability, Under The Covers
  7. Stripe Size, Block Size, and IO Patterns
  8. Capturing IO Patterns
  9. Testing IO Systems
  10. Latency
  11. Solid State Storage Basics
  12. Understanding Reliability and Performance of Solid State Storage
  13. Shared Consolidated Storage Systems

Upcoming Posts :

Storage Area Networks
Network Attached Storage
iSCSI
SQL Server and The File System
Understanding Mean Time to Failure and Other Failure Metrics
Tools and Techniques To Monitor SQL Server and I/O

Some topics may be a single post, some may span several; I won't know for sure until I get done writing them. As requests come in I may try to post on specific questions, or at a minimum point you in the right direction.

Stay Tuned….

-Wes

It’s Beginning to Look A Lot Like Christmas……

 

We got something good in the mail last week!

 

[Image: Fusion-io ioDrive Duo 640]

 

Some quick observations:

The build quality is outstanding. Nothing cheap at all about this card. The engineering that has gone into this shows in every way.

It is made up of modules that are screwed down. I can see where they really thought this through, so each rev of the card doesn't require all-new PCBs to be manufactured.

It does require an external source of power via a 4-pin Molex or SATA power connector, period. Make sure your server has one available; even though these are sold by HP, not all HP servers have the required connectors.

PCIe expander bays are few and far between. The issue is that most of these are used to expand desktops or laptops, or are used in non-critical applications, mostly AV or render farms.

http://www.magma.com/products/pciexpress/expressbox4-1u/index.html

This is a nice chassis, but they are currently being retooled and won't be available for a month or so. It is the only 1U option, and it has redundant power.

It exposes two drives to the OS per card. We will initially configure them two per machine in a RAID 10 array for redundancy.

 

More to come!

 

Wes

Understanding File System IO, Lessons Learned From SQL Server

I do more than just SQL Server. I enjoy programming. In my former life I worked with C/C++ and Assembler. As I spent more and more time with SQL Server, my programming took a back seat career-wise. Having that background, though, really helps me day in and day out in understanding why SQL Server does some of the things it does at the system level.

Fast forward several years and I’ve moved away from C/C++ and spent the last few years learning C#.

Now that I work mostly in C#, I do look up solutions for my C# dilemmas on sites like http://www.codeplex.com and http://www.codeproject.com. I love the internet for this very reason: hit a roadblock, do a search, and let the collective knowledge of others speed you on your way. But it can be a trap if you don't do your own homework.

I write mostly command-line or service-based tools these days, not having any real talent for GUIs to speak of. Being a person obsessed with performance, I build these things to be multi-threaded; especially with today's computers having multiple cores and hyper-threading, it just makes sense to take advantage of the processing power. This is all fine and dandy until you want to have multiple threads access a single file and all your threads hang out waiting for access.

So, I do what I always do: ask my best friend Google what the heck is going on. As usual, he gave me several quality links, and everything pointed to the underlying file not being opened in asynchronous mode. Having done a lot of C++, I knew about asynchronous IO, buffered and un-buffered. I could have made unmanaged code calls to open or create the file and pass the safe handle back, but just like it sounds, that is kind of a pain to set up, and if you are going down that path you might as well code it all up in C++ anyway.

Doing a little reading on MSDN, I found all the little bits I needed to set everything to rights. I set up everything to do asynchronous IO and started my test run again. It ran just like it had before: slow and painful. Again, I had Mr. Google go out and look for a solution for me (sometimes being lazy is a bad thing), and he came back with several hits where people had had similar issues. I knew I wasn't the only one! The general solution? Something I consider very, very .Net: use a background thread and a delegate to keep the file access from halting your main thread, so your app "feels" responsive. It is still doing synchronous IO. Your main thread goes along, but all file access is still bottlenecked on a single reader/writer thread. Sure, it solves the issue of the program "freezing" up on file access, but it doesn't really solve the problem of slow file access that I was trying to fix.

I know that SQL Server uses asynchronous, un-buffered IO to get performance from the file system. I did some refresher reading on the MSDN site and struck gold. Writes to the file system may or may not be asynchronous, depending on several factors. One of which is: if the file must be extended, everything goes back to synchronous IO while it extends the file. Well, since I was working with a FileStream on a newly created file every time, I was pretty much guaranteed to be synchronous no matter what. At this point I dropped back to C++. I started to code it up when I realized I was doing things differently in my C++ version.

I was manually creating the file, doing an initial allocation to grow it out to the size of the file buffer, and fixing up the file length on close if need be.

I started up my C++ version of the code and watched all the IO calls using Sysinternals' Process Monitor. I watched my C++ version, and lo, it was doing asynchronous IO in the very beginning, then switching to synchronous IO as the file started growing. I fired up my instance of SQL Server and watched as the asynchronous IO trucked right along… until a file growth happened and everything went synchronous for the duration of the growth.

AH HA!

So, taking that little extra knowledge, I manually created my file in C# and set an initial default size, and wouldn't you know it, asynchronous IO kicked right in, right up until it had to grow the file. I had to do a little extra coding to watch how much free space was left in the file; when I get close, I now pause any IO, manually grow the file by some amount, and then start up the writes again, keeping things from dropping into synchronous mode without me knowing.
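Here is a minimal C# sketch of that pattern, written with today's async/await API for brevity (back then it would have been BeginWrite/EndWrite, but the idea is identical). The file name, preallocation size, and growth increment are made-up illustration values, not recommendations.

```csharp
// Open with FileOptions.Asynchronous, preallocate with SetLength so normal
// writes never trigger synchronous file extension, and grow manually in
// large chunks when free space runs low.
using System.IO;
using System.Threading.Tasks;

class PreallocatedWriter
{
    const long InitialSize = 64L * 1024 * 1024; // illustrative: preallocate 64 MB up front
    const long GrowBy      = 64L * 1024 * 1024; // grow in big chunks, not per write

    static async Task Main()
    {
        using var fs = new FileStream("test.dat", FileMode.Create,
            FileAccess.Write, FileShare.None, 4096, FileOptions.Asynchronous);

        fs.SetLength(InitialSize); // one synchronous growth, before the writes start

        var buffer = new byte[8192];
        long written = 0;

        for (int i = 0; i < 10_000; i++)
        {
            // About to run past the preallocated region? Pause and extend
            // the file first so the write itself stays asynchronous.
            if (written + buffer.Length > fs.Length)
                fs.SetLength(fs.Length + GrowBy);

            await fs.WriteAsync(buffer, 0, buffer.Length);
            written += buffer.Length;
        }

        fs.SetLength(written); // trim the unused preallocation on close
    }
}
```

Run it under Process Monitor and you should see the same pattern described above: asynchronous writes, with synchronous IO only around each SetLength call.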

So, there you go, my little adventure and how my old skills, combined with knowing how SQL Server works, helped me solve this problem. Never assume that your new skills and old skills won't overlap.

SSD, The Game Changer

I’ve often described SQL Server to people new to databases as a data pump.

Just like a water pump, you have limited capacity to move water in or out of a system, usually measured in gallons per hour.
If you want to upgrade your pumping system, it can be a two-fold process: the physical pump and the size of the pipes.

Our database servers also have several pumps and pipes, and in general you are only as fast as your slowest pump or narrowest pipe: hard drives.

To feed other parts of the system, we have resorted to adding lots and lots of hard drives to get the desired IO reads/writes and MB/sec throughput that a single server can consume.
Everyone is familiar with Moore's law (often quoted, rarely understood), which loosely applied says CPU transistor counts double roughly every 24 months. Hard disks haven't come close to keeping up with that pace, performance-wise.

Up until recently, hard drive capacity had been growing at almost the same rate, doubling in size around every 18 months (Kryder's Law). The problem isn't size, it's speed.

Let's compare the technology from what may have been some folks' first computer to the cutting edge of today.

Metric          Circa 1981                   Today                        Improvement
Capacity        10 MB                        147 GB                       14,700x
HDD seeks       85 ms/seek                   3.3 ms/seek                  26x
IO/sec          11.4 IO/sec                  303 IO/sec                   26x
HDD throughput  5 Mbit/sec                   1,000 Mbit/sec               200x
CPU speed       8088, 4.77 MHz (0.33 MIPS)   Core i7 965 (18,322 MIPS)    55,521x

*These are theoretical maximums; in the real world your mileage may vary.

 

I think you can see where this is going. I won't go any further down memory lane; let's just say that some things haven't advanced as fast as others. As capacity has increased, speed has been constrained by the fact that hard disks are just that, spinning disks.

So, what does this little chart have to do with SSD? I wanted you to get a feel for where the real problem lies. It isn't capacity of hard drives, it's the ability to get to the data quickly. Seeks are the key. SSDs have finally crossed a boundary where they are cheap enough and fast enough to make it into the enterprise space at all levels.

SSD compared to today's best 15k.2 HDD from above:

Metric      HDD              SSD              Improvement
Seek time   3.3 ms/seek      85 μs/seek       39x
IO/sec      303 IO/sec       35,000 IO/sec    115x
Throughput  1,000 Mbit/sec   2,500 Mbit/sec   2.5x

 

So, in the last few years SSD has caught up to and passed HDD on the performance front by a large margin. This is comparing a 2.5" HDD to a 2.5" SSD. The gap is even wider if you look at the new generation of SSDs that plug directly into the PCIe bus and bypass the drive cage and RAID controller altogether. HOT DOG! Now we are on track. SSD has allowed us to scale much closer to the CPU than anything storage-wise we have seen in a very long time.

Since this is a fairly new, emerging technology, I often see a lot of confused faces when talking about SSD. What is the technology, and why has it now become cost effective to deploy it instead of large RAID arrays?

Once you take out the spinning disks, the memory and IO controller march much more to the tune of Moore's law than Kryder's, meaning cost goes down while capacity AND speed go up. Eventually there will be an intersection where some kind of solid state memory, maybe NAND, maybe not, will reach parity with spinning hard drives.

But, like hard drives, not all SSDs are on the same playing field; just because it has SSD printed on it doesn't make it a slam dunk to buy.

Let's take a look at two implementations of SSD based on MLC NAND. I know some of you will be saying, why not SLC? I'm doing this to get a better apples-to-apples comparison and to put this, budget-wise, squarely in the realm of possibility.

 

Intel X25-M, priced at $750.00 for 160GB in a 2.5" SATA 3.0 form factor, and the Fusion-io ioDrive Duo 640GB model, priced at $9,849.99, on a PCIe 8x single card.

Metric                             ioDrive Duo    X25-M        Improvement
Capacity (GB)                      640            160          4x
Write bandwidth                    1,000 MB/sec   70 MB/sec    14x
Read bandwidth                     1,400 MB/sec   250 MB/sec   5x
Reads/sec                          126,601        35,000       4x
Writes/sec                         180,530        3,300        55x
Access latency (seek time)         80 μs          85 μs        ~
Wear leveling (writes-erase/day)   5 TB           100 GB *     10x
Cost per unit                      $9,849.99      $750.00      13x
Cost per GB                        $15.39         $4.60        4x
Cost per IO, reads                 $0.08          $0.02        4x
Cost per IO, writes                $0.06          $0.22        -4x

* This is an estimate based on this article: http://techreport.com/articles.x/15433. Intel has stated the drive should be good for at least 1 petabyte in write operations, or 10,000 cycles.
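If you are wondering where the derived columns come from, here is a quick C# sketch of the math using the list prices and spec-sheet numbers quoted above. Rounding means it won't land exactly on every figure in the table, but it shows the method.

```csharp
// Derive cost-per-GB and cost-per-IO from price, capacity, and rated IOPS.
using System;

class SsdCost
{
    static void Report(string name, double price, double gb,
                       double readsPerSec, double writesPerSec)
    {
        Console.WriteLine($"{name}: ${price / gb:F2}/GB, " +
                          $"${price / readsPerSec:F2} per read IO/sec, " +
                          $"${price / writesPerSec:F2} per write IO/sec");
    }

    static void Main()
    {
        Report("ioDrive Duo", 9849.99, 640, 126601, 180530);
        Report("X25-M",       750.00,  160,  35000,   3300);
    }
}
```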

 

Both of these drives use similar approaches to achieve their speed and IO numbers. They break the NAND up into multiple channels, like a very small RAID array. This is an oversimplification, but it gives you an idea of how things are changing. It is almost like having a bunch of small drives crammed into a single physical drive shell with its own controller, a mini-array if you will.

So, not all drives are created equal. In Intel's defense, they don't position the X25-M as an enterprise drive; they would push you to their X25-E, which is an SLC-based NAND device that is more robust in every way. But keeping things equal is what I am after today.

To get the X25-M to the same performance levels, it could take as few as 4 drives and as many as 55, depending on which of the ioDrive Duo's IO numbers you are trying to match.
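That drive-count range is just the ioDrive Duo's rated IOPS divided by the X25-M's, rounded up; a rough sketch using the same spec-sheet numbers from the table above:

```csharp
using System;

class DriveCount
{
    static void Main()
    {
        // X25-Ms needed to match the ioDrive Duo, metric by metric.
        Console.WriteLine(Math.Ceiling(126601.0 / 35000)); // reads/sec  -> 4
        Console.WriteLine(Math.Ceiling(180530.0 / 3300));  // writes/sec -> 55
    }
}
```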

Wear leveling is my biggest concern with NAND-based SSDs. We are charting new water and really won't know what the reliability numbers are until the market has aged another 24 to 36 months. You can measure your current system to see how much writing you actually do to disk and get a rough estimate of the longevity of the SSD. Almost all of them are geared for 3 to 5 years of usability until they croak.

At a minimum it would take 10 X25-M drives to equal the stated longevity of a single IoDrive Duo.

Things also start to level out once you factor in RAID controllers and external enclosures if you are going to overflow the internal bays on the server. That can easily add another $3,000.00 to $5,000.00 to the price. All of a sudden the ioDrive Duo starts looking more appealing by the minute.

 

What does all this mean?

Not all SSD’s are created equal. Being constrained to SATA/SAS bus and drive form factors can also be a real limiting factor. If you break that mold the benefits are dramatic.

Even with Fusion-io’s cost per unit it, is still pretty cost effective in some situations like write heavy OLTP systems, over other solutions out there.

I didn’t even bother to touch on something like Texas Memory System’s RamSan devices at $275000.00 for 512GB of usable space in a 4U rack mount device the cost per GB or IO is just through the roof and hard to justify for 99% of most SQL Server users.

You need to look closely at the numbers, do in-house testing, and make sure you understand your current IO needs before you jump off and buy something like this. It may also be good to look at leveraging SSD in conjunction with your current storage, only moving data that requires this level of performance, to keep cost down.

If this article has shown you anything, it's that technology marches on. In the next 6 to 12 months there will be a few more choices on the market for large SSDs in the 512GB to 2TB range from different manufacturers at a range of prices, making the choice to move to SSD even easier.

Recently, in early April, Microsoft Research published a paper examining SSDs and enterprise workloads. They don't cover SQL Server explicitly, but they do talk about Exchange. The conclusion is pretty much that SSD is too expensive to bother with right now. To agree and disagree with them: that was true several months ago; today, not so much.

Given that the landscape has changed significantly since that was published, and will continue to do so, I think we are on the verge of "why not use SSD?" instead of "do we really need it?"

With that said, please do your homework before settling on a vendor or SSD solution; it will pay dividends in not having to explain to your boss that the money invested was wasted dollars.

A little light reading for you:

SSD Primer
http://en.wikipedia.org/wiki/Solid-state_drive

James Hamilton’s Blog
http://perspectives.mvdirona.com/2009/04/12/WhereSSDsDontMakeSenseInServerApplications.aspx

 

-Wes