Using RAID-5 is the primary error here. RAID-5 (or single-parity RAID of any kind) is obsolete, period. The story doesn't ring true to me, to be honest; I'm currently herding several hundred multi-terabyte servers, and multiple drive failures appear in only one case: when using Seagate Barracuda ES.2 1 TB or WD desktop-class drives. Those are the two genuinely problematic setups. In all other cases, use RAID-6 and all will be well.
I'd add that current "advanced format" drives are tremendously better than most older drives. If your drive is an older 1 or 2 TB (512-byte-sector) drive, use it only for backup or some other menial duty with unimportant data.
To add some extra emphasis to wazoox's point, RAID-6 is always a better choice than RAID-5.
If you're willing to take a capacity hit for improved write performance, RAID-1+0 is great. Though you can only survive two disks failing if they are in different mirror pairs.
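To put a number on that (a small sketch, assuming independent failures and two-disk mirror pairs):

```python
# Probability that a second drive failure destroys a RAID-1+0 array:
# the array only dies if the second failure hits the surviving half
# of the already-degraded mirror pair. With n two-disk mirror pairs,
# one drive has failed and 2n - 1 drives remain; exactly one of them
# is the degraded pair's partner.
def p_second_failure_fatal(pairs: int) -> float:
    """Chance a random second failure lands on the degraded pair's mate."""
    remaining = 2 * pairs - 1
    return 1 / remaining

# An 8-drive RAID-1+0 (4 mirror pairs): only 1 in 7 second failures is fatal.
print(p_second_failure_fatal(4))  # → 0.14285714285714285
```

So the bigger the array, the better the odds that a double failure lands in different pairs.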
You should also not treat RAID as infallible; if the data is important, it should be replicated in multiple locations.
Yes. Choosing RAID-5 over RAID 1, 10, or 0+1 was a decision to cut costs. With mirroring, the mirror drives would have taken over when a disk failed.
Also, only one hot spare is in the set. Another cost saving.
Yet another decision: RAID-5 across 7 disks plus a hot spare, instead of, say, across 5 or 6 disks. With more disks in the set, you have that many more chances of a disk going bust and forcing a rebuild from parity.
What if the disks are OK but the server host adapter card gets fried? Or the cable between the server and the array? Some disk arrays allow redundant paths to the array, and some OSes can handle that failover.
Before I read the article, I thought it might discuss heat. Excessive heat is usually the cause when disk arrays start melting down one after another. Usually the meltdown happens in an on-site server closet or room which was never properly set up for running servers 24/7. The straw that breaks the camel's back is usually a combination of added equipment and hot summer days. Then portable ACs are purchased to mitigate things temporarily, but if their condensation reservoirs aren't dumped regularly, they stop helping. This situation occurs more often than you'd imagine; luckily I haven't been the one who had to deal with it every time I've seen it (although sometimes I have). Usually the servers involved are non-production ones which didn't make the cut to go into the data center.
The heat problem happens in data centers as well, believe it or not. A cheap thermometer is worth buying if you sense too much heat around your servers. Usually the problem is less severe: the overall data center temperature is just a few degrees higher than it should be, but even that leads to more equipment failure.
Hard drives are pretty resilient to high temperatures. Google did a reliability analysis of thousands of hard drives and found:
"Overall our experiments can confirm previously reported temperature effects only for the high end of our temperature range and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates. This is a fairly surprising result, which could indicate that datacenter or server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives. We can conclude that at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do."
Personally I've not had much luck with hot spares; I'd prefer to have the spare active in the array (in the case of RAID-6) so I can find out whether there's a problem with that drive before it's the only thing standing between me and total failure.
About a month ago, I had 4 drives scattered around the house, each in its own enclosure, and I wanted to consolidate them into one unit. Money was an issue, so I wanted to recycle as many of them as possible instead of buying new ones. A Synology NAS along with a single extra drive allowed me near-optimal use of space with 1-drive redundancy. Of course, I have weekly backups to an external drive, so even if the array fails during a drive swap, I'll still have all my important files.
Any other solution would either require me to buy more drives (a significant expense at $100+ a pop), sacrifice redundancy, or build my own NAS with ZFS (which would have significant administration overhead, cost more, and be larger than my Synology unit).
Backing up to an external drive isn't enough if you're really worried about the data. If your house burns down, or the single backup drive fails, you're out of luck.
Synology's devices support automatic backup to S3, use it.
EMC uses RAID5 as the default for storage arrays, and then has some number of global hot spares. NetApp uses RAID6 by default, and then also has some number of hot spares. I've never had data loss from either system as a result of multi-drive failure. RAID5 is perfectly fine in most instances.
Desktop drives will drop out of RAID arrays frequently, so you have to use RAID6 if you choose to go that route. If a disk drops into deep error recovery for physical errors, it won't respond to the RAID controller fast enough and will be considered a dead drive. It will subsequently be re-detected, and then the array has to be rebuilt.
With low-capacity, enterprise-class SAS drives, mostly, yes. With large-capacity SATA drives, it most definitely isn't.
SATA drives (even "enterprise" SATA drives) have an official unrecoverable read error rate of 1 in 10^14 bits. In my experience, the truth is more like 1 in 10^13.
10^13 bits is roughly 1.25 terabytes, so at that rate you can expect an unrecoverable read error about once per 1.25 TB read (even the official 10^14 figure works out to one error per ~12.5 TB). When rebuilding a 10 TB RAID-5 array (only a few 3 or 4 TB drives), you're reading the entire remaining set with no parity left to correct anything, so you're almost certain to hit an unrecoverable error that prevents a clean rebuild.
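A back-of-the-envelope check (a rough sketch assuming independent per-bit errors, which understates real-world error clustering):

```python
# Probability of hitting at least one unrecoverable read error (URE)
# while reading an entire array's worth of data during a RAID-5 rebuild.
def p_at_least_one_ure(tb_read: float, ure_rate: float) -> float:
    """ure_rate is errors per bit read, e.g. 1e-14 for a typical SATA spec."""
    bits = tb_read * 1e12 * 8  # terabytes -> bits
    return 1 - (1 - ure_rate) ** bits

# Rebuilding 10 TB at the official 1-in-10^14 spec:
print(p_at_least_one_ure(10, 1e-14))  # ~0.55
# At the pessimistic 1-in-10^13 figure from the comment above:
print(p_at_least_one_ure(10, 1e-13))  # ~0.9997
```

Even at the vendor's own spec sheet number, a large single-parity rebuild is roughly a coin flip.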
I have an 8-bay Drobo Pro with eight 2 TB drives, and I have a drive go out every few months. Of course this is because we went cheap with WD Green drives. However, the Drobo Pro offers the option to use two drives "as protection", so two drives can go out and it keeps going.
You could be hitting the TLER problem. WD Green drives can take a long time to do error correction, which causes the RAID to drop the drive as failed. TLER's solution is to return an error quickly and let the RAID fix it from parity... at least that way the drive isn't dropped.
Or buy more expensive drives. Wait, isn't that the idea behind RAID: Redundant Array of Inexpensive Disks...
It may very well be the issue; if a disk takes a long time to recover from errors, even a software RAID will throw the disk out rather than let it stall the whole array. TLER limits the damage.
Of course, if the software were smart enough to handle this properly and recover automatically, it wouldn't need to drop the disk from the array completely.
> RAID-5 (or single parity RAID of any kind) is obsolete, period.
RAID-6 offers different compromises relative to RAID-5 (for one, twice the parity space), so it isn't quite like one is the successor of the other. And once you're talking about multiple disk failures, you're at the existential point where you should probably be talking about whole array failures (e.g. your controller has quietly been writing junk for the last hour), and how to deal with that scenario.
> it isn't quite like one is the successor of the other.
Given the current price of hard drives, I don't get how "twice the parity space" can even matter. Furthermore, modern RAID controllers perform almost exactly the same using RAID-5 or RAID-6 (verified on most 3Ware, LSI, Adaptec and Areca controllers).
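For scale, the parity overhead works out like this (a trivial sketch, assuming a hypothetical 8-drive set of equal 2 TB disks):

```python
# Usable capacity of an n-drive set of equal disks:
# RAID-5 sacrifices one drive's worth of space to parity, RAID-6 two.
def usable_tb(drives: int, size_tb: float, parity_drives: int) -> float:
    return (drives - parity_drives) * size_tb

print(usable_tb(8, 2.0, 1))  # RAID-5 → 14.0 TB usable
print(usable_tb(8, 2.0, 2))  # RAID-6 → 12.0 TB usable
```

One extra 2 TB drive's worth of space, against the cost of a failed rebuild.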
So yes, RAID-6 definitely is RAID-5's successor.
> how to deal with that scenario.
RAID is not an alternative to backup and never was. You deal with that scenario through proper backup or replication.
RAID-6 is in no universe a RAID-5 successor. Brushing aside enterprise needs, risk tolerances, and accepted compromises is sophistry. It's telling that, despite the bluster of some on Hacker News, major storage vendors (i.e., people who know much more than you) still make RAID-5 the default. Maybe they just haven't read the news.
Regarding the backup -- yeah, no kidding. That was the point. If the argument is "this is better because it survives one more of countless possible failure modes", then "better" can continue indefinitely (why not 10 parity copies?). In the real world of compromises, there is a cost-benefit assessment that draws the line at some probability point.
It also sounds like many on here think you buy a box of disks and then make one universal logical volume on it (e.g. "if you have a spare why not just make it RAID-6?"). In practice the spare(s) are usually global, and you have many logical volumes spanning RAID-10, 0, 5, 6, whatever each situation calls for.
Apparently you're new to web forums and unaware that different posts from different people present only partial views of varying opinions. You've happily conflated a couple of my answers to different questions into one, and thrown in someone else's answers for good measure. I, for instance, made no comment in this thread about proper hot-spare policy.
About RAID-5: some major storage vendors still use it for smaller arrays. Some others don't use it anymore (NetApp and DDN come to mind). The notion that RAID-5 isn't fit for arrays of large-capacity drives doesn't come from me and is hardly new. You don't need any links, as obviously you know all about this already.
I've set up my first terabyte SAN in the 90s, back when a terabyte filled a whole rack, 9 GB Micropolis drives were hot, and SSA was the new interconnect, but I probably know less about storage than you.
Ah, the grizzled vet angle. The bit about me being new to the web is particularly adorable, especially given that I prefaced my comment by referring to other people (so there was no confusion). When you have many logical volumes, you can suddenly make choices like "does this necessitate the extra protection of RAID-6, given the compromises?" People are making that choice to this day, and no one is saying "oh look, RAID-6 is the newer version of RAID-5, so it's my default choice".