Fundamentals of Storage Systems – RAID and Hard Disk Reliability, Under the Covers

In the last RAID article we covered the basics. This is a little deeper dive into the underlying mechanics of RAID. Exactly what it does, how it does it and what it doesn’t do that people assume it does. I sited David Patterson, Garth Gibson, and Randy Kats and their work at UC Berkley on RAID. They show something I’ve talked about before the “Pending I/O Crises”. Of course it isn’t pending anymore, its here. One of the concerns has to do with Amdah’s Law and speeding up execution with parallel operations. As processors and memory speed up hard disks are still an order of magnitude slower. Another aspect is Kryder’s Law, which like Moore’s Law, is a estimation of capacity growth of hard disks over time. Kryder’s Law is starting to slow down just as Moore’s law is. The problem with hard drives has never really been capacity, its speed. As areal density increases you do get an increase in data throughput, there is simply more data per square inch on the disk. You also get an improvement in I/O’s, tracks are closer together. We haven’t broken past the 15k barrier yet. I’ve still got Seagate Cheetah 15k.3 drive from 2002. It has a max sequential throughput around 80 MB/sec. I doubt we will see spinning disks faster than 15k. This is a real problem for scaling I/O up. Enter RAID. It’s simple get a bunch of disks and then stripe data across them. One little problem creeps up. Reliability goes down for each drive you add to the array. Using RAID 0 pretty much guarantees you will have an array failure. To overcome this We start adding some way to make the data more redundant.

Hard Disk Reliability

People make a lot of assumptions about hard drives and their reliability. Hard disks break down into two classes consumer grade, the drive you have in your desktop and enterprise, the kind usually in your servers. There are misconceptions around both. Recently, Google and others have written papers based on long term large batch sample failure rates and found the enterprise class drives don’t last any longer than consumer class. This study is perfectly valid from a physical reliability point of view. Most drives are manufactured the same way in the same plants. Not like the poor misunderstood lemming, hard disks do all jump off a cliff together. Studies have shown that there is a strong corollary to disk failure and a shared manufacturing batch. Simply put, if they are made around the same time if one has a failure there is a likelihood, around 30%, other drives in that batch will also suffer failures. So, what are we paying for with an enterprise drive besides speed? Data reliability. Enterprise level drives have more robust error correction than their consumer counterparts. On a normal hard drive the smallest piece of data that can be written is 512 bytes. This is the size of a sector. Enterprise drives usually have 520 byte sector 8 bytes are used to verify the data in that sector, this is the Data Integrity Field. DIF isn’t 100% ether. It is more reliable than a consumer drive without it. You can still have write corruption for several reasons. Misdirected writes occur when data is written to the wrong location on disk and reported as a successful write. When the system goes to access again you get a read fault. Torn pages, which we are familiar with, is when an 8k page write is requested but only part of the 8k is actually reported. Corruption outside the drive where the controller makes a bad request to write but it is a perfectly legitimate I/O request at the hard drive level. With larger drives the odds of hitting one of these errors becomes a real possibility. Enterprise drives add this extra layer of protection. Your RAID HBA may also have additional error correction. The last thing I would like to touch on is write catching. Without a battery backup, or if the cache non-volatile in nature, you will loose data on a power failure if a write is in progress.

RAID Host Bus Adapter Reliability

The adapter is as reliable as any other component in your system. Normally, the cache on the controller is ECC based. Also, you usually have the option of a battery module to supply the cache with power incase of an outage so the data in cache can be written to the array when everything comes back up. Most of the issues I have seen with RAID HBAs is almost always driver or firmware related. You may also see inconsistent performance due to write catching and the battery backup unit. The unit has to be taken off line and conditioned to keep it in top condition. The side effect is a temporary disabling of the write cache on the controller. You can override this setting on some controllers but it is dangerous proposition. I personal anecdote from my days at a large computer manufacturer, we started getting a larger volume of failed drive calls into support. We started doing failure analysis. It all pointed back to a particular batch of hard drives. That was when the drive manufacturer made a change in its drives removing very small component. It shaved a few cents off the cost but had a dramatic effect. All the drives were technically good and would pass validation. Under a enough load and attached to a particular RAID HBA they would randomly fall off line. It came down to the little component. It provided a little bit of electrical noise suppression on the SCSI bus. Some cards were effected and others chugged along just fine. This is also confirmed by the Google paper, they observed the same behavior. They also point out that 20% to 30% of all returned drives have no detectible problems. The point is validate your entire I/O stack. Any single component may be within specification but may not play well with others.

RAID Parity, Mirroring, and Recoverability

Not to belabor the point, RAID isn’t bullet proof. People rap RAID round themselves like Superman’s cape. There are several issues that all the RAID schemes in the world don’t protect against. With current hard disks in the two terabyte range it is possible to build even a small RAID 5 array and have potential for complete failure. The problem is the amount of data that has to be read for the rebuild process. Having a hot spare available reduces the time to replace a failed drive to zero but that is only part of the equation. The much larger part is rebuild time. Lets say you have a 14 drive RAID 5 array with the new two terabyte drives installed and suffer a failure. If you have no activity on the array and all the IO is detected to the rebuilt it could still take two or three days to rebuild the array. During that time you are effectively running on a RAID 0 array that is now under load. Your chance of total array failure is near 100%. RAID by its very nature assumes a failure is a hard failure. A drive goes off line and the redundant part of the system takes over. It also makes the assumption that if a write succeeds then, barring a hardware failure, the read will also be valid. Data is only validated on writes not on reads. If it was RAID 5 would be twice as slow on reads and four times as slow on writes as a single drive or RAID 0. With all the potential hidden write failures it is completely possible to have hidden corruption and not know it until it is way to late. RAID levels with striped parity are most susceptible to this kind of silent creeping corruption. It is possible that the corrupted data is in the parity stripe making it completely unusable for data reconstruction. If that particular piece of data doesn’t change you can go a very long time with a RAID 5 array with polluted parity. You know how to recover from a polluted parity stripe? Simple, copy all the data off the array, figure out which files are now corrupt and restore them. RAID 6 with its dual stripes makes it more likely to recover your data from a single parity stripe becoming corrupt. You do pay a price in write speed for that extra level of protection. RAID 1 and RAID 10 aren’t perfect ether. On a mirrored pair if the write is assumed good there is no way to validate that on read. Without a third piece of information, like a checksum, it would be a coin toss. If the read is successful there is no way to tell which drive has the bad data. It is possible to have a mirrored pair run just fine with one giving you corrupted data on reads all day long. It would manifest itself as file corruption or some other anomaly that could be difficult to track down. We are back to relying on the disk to tell us all is well. We often recommend RAID 10 over everything else for speed and reliability, and I still hold to that. RAID 10 can still suffer from a catastrophic failure due to a single mirrored pair failing at the same time. With the probability of correlated disk failures it can’t be ignored.

What Can We Do?

There are a few tools available to us that can help predict the failure of a drive or that something is wrong with the array. All modern drives support the SMART protocol. Even though Google found it wasn’t as useful and wasn’t 100% reliable, closer to 30%, some warning is better than none in my opinion. All modern RAID HBA’s also come with tools to detect parity errors. You do take a hit when you run these internal consistency checks. Just like you run maintenance on your databases via DBCC your RAID arrays need checkups too. They are a necessary evil if you don’t want any surprises one day when you have a failed drive in your RAID 5 array and can’t rebuild it. If you have intermittent problems with a drive, don’t mess around, replace it. The HBA almost always has the ability to send SNMP messages to something like nagios or HP Openview, Use it. If you aren’t running something like that usually you can configure email alerts on error to go out. Proactive is the name of the game.