Using RAID-5 is the primary error here. RAID-5 (or single-parity RAID of any kind) is obsolete, period. The story doesn't ring true to me, to be honest; I'm currently herding several hundred multi-terabyte servers, and multiple drive failures appear in only one case: when using Seagate Barracuda ES.2 1 TB or WD desktop-class drives. Those are the two genuinely problematic setups. In all other cases, use RAID-6 and all will be well.
I'd add that current "advanced format" drives are tremendously better than most older drives. If your drive is an older 1 or 2 TB (512-byte-sector) drive, use it only for backup or some other menial duty with unimportant data.
To add some extra emphasis to wazoox's point, RAID-6 is always a better choice than RAID-5.
If you're willing to take a capacity hit for improved write performance, RAID-1+0 is great. Though you can only survive two disks failing if they are in different mirror pairs.
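To put a number on that (a small sketch, assuming independent failures and two-disk mirror pairs):

```python
# Probability that a second drive failure destroys a RAID-1+0 array:
# the array only dies if the second failure hits the surviving half
# of the already-degraded mirror pair. With n two-disk mirror pairs,
# one drive has failed and 2n - 1 drives remain; exactly one of them
# is the degraded pair's partner.
def p_second_failure_fatal(pairs: int) -> float:
    """Chance a random second failure lands on the degraded pair's mate."""
    remaining = 2 * pairs - 1
    return 1 / remaining

# An 8-drive RAID-1+0 (4 mirror pairs): only 1 in 7 second failures is fatal.
print(p_second_failure_fatal(4))  # → 0.14285714285714285
```

So the bigger the array, the better the odds that a double failure lands in different pairs.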
You should also not treat RAID as infallible; if the data is important, it should be replicated in multiple locations.
Yes. Choosing RAID-5 over RAID 1, 10, or 0+1 was a decision to cut costs. With mirroring, the mirror drives would have taken over when a disk failed.
Also, only one hot spare is in the set. Another cost saving.
Yet another decision: RAID-5 across 7 disks plus a hot spare, instead of, say, across 5 or 6 disks. With more disks in the set, you have that many more chances of a disk going bust and forcing a rebuild from parity.
What if the disks are OK but the server host adapter card gets fried? Or the cable between the server and the array? Some disk arrays allow redundant paths to the array, and some OSes can handle that failover.
Before I read the article, I thought it might discuss heat. Excessive heat is usually the cause when disk arrays start melting down one after another. Usually the meltdown happens in an on-site server closet or room which was never properly set up for running servers 24/7. The straw that breaks the camel's back is usually a combination of added equipment and hot summer days. Then portable ACs are purchased to mitigate things temporarily, but if their condensation reservoirs aren't dumped regularly, they stop helping. This situation occurs more often than you'd imagine; luckily I haven't been the one who had to deal with it every time I've seen it (although sometimes I have). Usually the servers involved are non-production ones which didn't make the cut to go into the data center.
The heat problem happens in data centers as well, believe it or not. A cheap thermometer is worth buying if you sense too much heat around your servers. Usually the problem is less severe: the overall data center temperature is just a few degrees higher than it should be, but even that leads to more equipment failure.
Hard drives are pretty resilient to high temperatures. Google did a reliability analysis of thousands of hard drives and found:
"Overall our experiments can confirm previously reported temperature effects only for the high end of our temperature range and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates. This is a fairly surprising result, which could indicate that datacenter or server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives. We can conclude that at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do."
Personally I've not had much luck with hot spares; I'd prefer to have the spare active in the array (in the case of RAID-6) so I can find out whether there's a problem with that drive before it's the only thing standing between me and total failure.
About a month ago, I had 4 drives scattered around the house, each in its own enclosure, and I wanted to consolidate them into one unit. Money was an issue, so I wanted to recycle as many of them as possible instead of buying new ones. A Synology NAS along with a single extra drive allowed me near-optimal use of space with 1-drive redundancy. Of course, I have weekly backups to an external drive, so even if the array fails during a drive swap, I'll still have all my important files.
Any other solution would either require me to buy more drives (a significant expense at $100+ a pop), sacrifice redundancy, or build my own NAS with ZFS (which would have significant administration overhead, cost more, and be larger than my Synology unit).
Backing up to an external drive isn't enough if you're really worried about the data. If your house burns down, or the single backup drive fails, you're out of luck.
Synology's devices support automatic backup to S3, use it.
EMC uses RAID5 as the default for storage arrays, and then has some number of global hot spares. NetApp uses RAID6 by default, and then also has some number of hot spares. I've never had data loss from either system as a result of multi-drive failure. RAID5 is perfectly fine in most instances.
Desktop drives will drop out of RAID arrays frequently, so you have to use RAID6 if you choose to go that route. If a disk drops into deep error recovery for physical errors, it won't respond to the RAID controller fast enough and will be considered a dead drive. It will subsequently be re-detected, and then the array has to be rebuilt.
With low-capacity, enterprise-class SAS drives, mostly, yes. With large-capacity SATA drives, it most definitely isn't.
SATA drives (even "enterprise" SATA drives) have an official unrecoverable read error rate of 1 in 10^14 bits. In my experience, the truth is more like 1 in 10^13.
10^13 bits is roughly 1.25 terabytes, so at that rate you can expect an unrecoverable read error about once per 1.25 TB read (even the official 10^14 figure works out to one error per ~12.5 TB). When rebuilding a 10 TB RAID-5 array (only a few 3 or 4 TB drives), you're reading the entire remaining set with no parity left to correct anything, so you're almost certain to hit an unrecoverable error that prevents a clean rebuild.
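A back-of-the-envelope check (a rough sketch assuming independent per-bit errors, which understates real-world error clustering):

```python
# Probability of hitting at least one unrecoverable read error (URE)
# while reading an entire array's worth of data during a RAID-5 rebuild.
def p_at_least_one_ure(tb_read: float, ure_rate: float) -> float:
    """ure_rate is errors per bit read, e.g. 1e-14 for a typical SATA spec."""
    bits = tb_read * 1e12 * 8  # terabytes -> bits
    return 1 - (1 - ure_rate) ** bits

# Rebuilding 10 TB at the official 1-in-10^14 spec:
print(p_at_least_one_ure(10, 1e-14))  # ~0.55
# At the pessimistic 1-in-10^13 figure from the comment above:
print(p_at_least_one_ure(10, 1e-13))  # ~0.9997
```

Even at the vendor's own spec sheet number, a large single-parity rebuild is roughly a coin flip.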
I have an 8-bay Drobo Pro with eight 2 TB drives, and I have a drive go out every few months. Of course this is because we went cheap with WD Green drives. However, the Drobo Pro offers the option to use two drives "as protection", so two drives can go out and it keeps going.
You could be hitting the TLER problem. WD Green drives can take a long time to do error correction, which causes the RAID to drop the drive as failed. TLER's solution is to return an error quickly and let the RAID fix it from parity... at least that way the drive isn't dropped.
Or buy more expensive drives. Wait, isn't that the idea behind RAID: Redundant Array of Inexpensive Disks...
It may very well be the issue; if a disk takes a long time to recover from errors, even a software RAID will throw the disk out rather than let it stall the whole array. TLER limits the damage.
Of course, if the software were smart enough to handle this properly and recover automatically, it wouldn't need to drop the disk from the array completely.
> RAID-5 (or single parity RAID of any kind) is obsolete, period.
RAID-6 offers different compromises relative to RAID-5 (for one, twice the parity space), so it isn't quite like one is the successor of the other. And once you're talking about multiple disk failures, you're at the existential point where you should probably be talking about whole array failures (e.g. your controller has quietly been writing junk for the last hour), and how to deal with that scenario.
> it isn't quite like one is the successor of the other.
Given the current price of hard drives, I don't get how "twice the parity space" can even matter. Furthermore, modern RAID controllers perform almost exactly the same using RAID-5 or RAID-6 (verified on most 3Ware, LSI, Adaptec and Areca controllers).
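For scale, the parity overhead works out like this (a trivial sketch, assuming a hypothetical 8-drive set of equal 2 TB disks):

```python
# Usable capacity of an n-drive set of equal disks:
# RAID-5 sacrifices one drive's worth of space to parity, RAID-6 two.
def usable_tb(drives: int, size_tb: float, parity_drives: int) -> float:
    return (drives - parity_drives) * size_tb

print(usable_tb(8, 2.0, 1))  # RAID-5 → 14.0 TB usable
print(usable_tb(8, 2.0, 2))  # RAID-6 → 12.0 TB usable
```

One extra 2 TB drive's worth of space, against the cost of a failed rebuild.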
So yes, RAID-6 definitely is RAID-5's successor.
> how to deal with that scenario.
RAID is not an alternative to backup and never was. You deal with that scenario through proper backup or replication.
RAID-6 is in no universe a RAID-5 successor. Brushing aside enterprise needs, risk tolerances, and accepted compromises is sophistry. It's telling that, despite the bluster of some on Hacker News, major storage vendors (i.e., people who know much more than you) still make RAID-5 the default. Maybe they just haven't read the news.
Regarding the backup -- yeah, no kidding. That was the point. If the argument is "this is better because it survives one more of countless possible failure modes", then "better" can continue indefinitely (why not 10 parity copies?). In the real world of compromises, there is a cost-benefit assessment that draws the line at some probability point.
It also sounds like many on here think you buy a box of disks and then make one universal logical volume on it (e.g. "if you have a spare why not just make it RAID-6?"). In practice the spare(s) are usually global, and you have many logical volumes spanning RAID-10, 0, 5, 6, whatever each situation calls for.
Apparently you're new to web forums and unaware that different posts from different people present only partial views of varying opinions. You've happily conflated a couple of my answers to different questions into one, and thrown in someone else's answers for good measure. I, for instance, made no comment in this thread about proper hot-spare policy.
About RAID-5: some major storage vendors still use it for smaller arrays. Some others don't use it anymore (NetApp and DDN come to mind). The notion that RAID-5 isn't fit for arrays of large-capacity drives doesn't come from me and is hardly new. You don't need any links, as obviously you know all about this already.
I've set up my first terabyte SAN in the 90s, back when a terabyte filled a whole rack, 9 GB Micropolis drives were hot, and SSA was the new interconnect, but I probably know less about storage than you.
Ah, the grizzled vet angle. The bit about me being new to the web is particularly adorable, especially given that I prefaced my comment by referring to other people (so there was no confusion). When you have many logical volumes, you can suddenly make choices like "does this necessitate the extra protection of RAID-6, given the compromises?" People are making that choice to this day, and no one is saying "oh look, RAID-6 is the newer version of RAID-5, so it's my default choice".