How reliable is a RAID system? Possibility of RAID system failure

aTun@lemm.ee · 2 months ago

How reliable is a RAID system? Possibility of RAID system failure

Björn Tantau@swg-empire.de · 2 months ago

I’d stay away from hardware RAID controllers. If they fail you’re gonna have a hard time. Learned that the hard way. With a software RAID you can do what you proposed. Just put the disk in another system and use it there.

ddh@lemmy.sdf.org · 2 months ago

Seconded. Software RAID is much easier to recover from.

psycotica0@lemmy.ca · 2 months ago

I’m going to give you the benefit of the doubt and assume what you said was simply confusing, but not wrong.

So just to be clear if your raid array fails, and you’re using software raid, you can plug all of the disks into a new machine and use it there. But you can’t just take a single disk out of a raid 5 array, for example, and plug it in and use it as a normal USB hard drive that just had some of the files on it, or something. Even if you built the array using soft-raid.

catloaf@lemm.ee · 2 months ago

No, they mean that if the controller fails, you have to get a compatible controller, not just any controller. And that usually means getting another of the exact same controller. Hopefully they’re still available to buy somewhere. And hopefully it’s got a matching firmware version.

But if you’re using mdraid? Yeah just slap those drives on any disk controller and bring it up in the OS, no problem.

Possibly linux@lemmy.zip · 2 months ago

You technically can with software raid

Possibly linux@lemmy.zip · 2 months ago

Not to mention you can get important features like checksums and data validation.

anamethatisnt@lemmy.world · 2 months ago

RAID is never a replacement for backups.
Never work directly with a surviving disk, clone it and work with the cloned drive.
Are you sure you can’t rebuild the RAID? That really is the best solution in many cases.
If a RAID failure is within tolerance (1 drive in a RAID5 array) then it should still be operational. Make a backup before rebuilding if you don’t have one already.
If more disks are gone than that then don’t count on recovering all data even with data recovery tools.
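To illustrate why one missing drive in a RAID 5 is within tolerance, here is a minimal sketch of XOR parity in Python. The block contents and the four-disk layout are made up for illustration; a real array also rotates parity across disks and works on much larger chunks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, as if striped across three disks of a 4-disk RAID 5.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)  # would be stored on the fourth disk

# The disk holding data[1] dies: XOR of the survivors plus parity rebuilds it.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

With two blocks missing there is only one parity equation for two unknowns, which is why a second failed drive takes the data with it.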

r00ty@kbin.life · 2 months ago

The OP made clear it was a controller failure or entire system (I read hardware here) failure. Which does complicate things somewhat.

anamethatisnt@lemmy.world · 2 months ago

Yeah, if there’s a full system failure without any backups and no option to get the system operational again, then I would clone the drives before trying to restore data from them.

Moonrise2473@feddit.it · 2 months ago

RAID wasn’t designed for data safety but to minimize downtime: just swap the drive and continue operating the server seamlessly. Full backups are still required, as the chance of complete failure isn’t zero.

Possibly linux@lemmy.zip · 2 months ago

You really do not want hardware raid. You want software raid like LVM or better yet, ZFS.

Do your own research. Keep in mind raid isn’t a backup. It is only for convenience.

renzev@lemmy.world · 2 months ago

A lot of “hardware raid” is just a separate controller doing software raid. I thought I lost access to a bunch of data when my raid controller died, before I realized that I could just plug the disks directly into the computer and mount them with mdadm. But yes, hardware raid seems a bit pointless nowadays.

NotSpez@lemmy.world · 2 months ago

If op is considering ZFS: Do. Not. Use. RAIDz. (learned the hard way)

Possibly linux@lemmy.zip · 2 months ago

Why wouldn’t you? It is the most flexible out of all of them.

NotSpez@lemmy.world · 2 months ago

In my experience, using spinning disks, the performance is very poor, and times for scrub and resilver are very long. For example, in a raidz1 with 4x8TB, scrubbing takes 2-3 weeks and resilvering takes almost 2 months. I must also add very poor performance in degraded state. This is a very old post, but things are still the same.

Possibly linux@lemmy.zip · 2 months ago

I’ve not had that experience

Zeoic@lemmy.world · 2 months ago

You must have shingled/SMR drives. They do not work well with any type of raid array.

My array of 7x12TB drives resilvers in a few hours, as I made sure I got CMR drives
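A quick back-of-the-envelope check in Python shows how far off those numbers are. The 60 days (for "almost 2 months") and the 150 MB/s sequential rate are assumptions for illustration, not measurements from either setup.

```python
def implied_throughput_mb_s(capacity_tb, days):
    """Average MB/s needed to process capacity_tb terabytes in the given days."""
    total_bytes = capacity_tb * 1e12
    seconds = days * 86_400
    return total_bytes / seconds / 1e6

def hours_to_copy(capacity_tb, mb_per_s):
    """Hours needed to stream capacity_tb terabytes at a steady mb_per_s."""
    return capacity_tb * 1e12 / (mb_per_s * 1e6) / 3600

# Resilvering an 8 TB disk over roughly 60 days works out to about 1.5 MB/s.
print(f"{implied_throughput_mb_s(8, 60):.1f} MB/s")

# Streaming a full 12 TB disk at an assumed 150 MB/s takes about 22 hours.
print(f"{hours_to_copy(12, 150):.1f} hours")
```

An average of around 1.5 MB/s is far below what a healthy spinning disk sustains, so something like SMR is a plausible culprit. ZFS also only resilvers allocated data, so a partly filled pool finishes faster than the full-disk estimate.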

NotSpez@lemmy.world · 2 months ago

That makes total sense. Thanks!

computergeek125@lemmy.world · 2 months ago

For recovering hardware RAID: your best chance of success is a compatible controller with a similar enough firmware version. You might be able to find software that can stitch images back together, but that’s a long shot and requires a ton of disk space (which you might not have if it’s your biggest server).

I’ve used dozens of LSI-based RAID controllers in Dell servers (of both PERC and LSI name brand) for both work and homelab, and they usually recover the old array to the new controller pretty well, and also generally have a much lower failure rate than the drives themselves (I find myself replacing the cache battery more often than the controller itself)

Only twice out of the handful of times did I move to a RAID controller from a different generation:

- The first time was from a mobo-failed R815 (PERC H700), physically moving the disks to an R820 (PERC H710, might’ve been an H710P), and they were able to foreign import easily.
- The second time, on homelab, I went from an H710 mini mono to an H730P full size in the same chassis (don’t do that, it was a bad idea), but aside from iDRAC being very pissed off, the card ran for years with the same RAID-1 array imported.

As others have pointed out, this is where backups come into play. If you have to replace the server with one from a different generation, you run the risk that the drives won’t import. At that point, you’d have to sanitize the super block of the array and re-initialize it as a new array, then restore from backup. Now, the array might be just fine and you never notice a difference (like my users that had to replace a failed R815 with an R820), but the outcome really tends to the extremes: it either works or it faults, with nothing in between.

Standalone RAID controllers are usually pretty resilient and fail less often than disks, but they are very much NOT infallible, as you are correct to assess. The advantage of software systems like mdadm, ZFS, and Ceph is that they remove the precise hardware compatibility requirements, but by no means do they remove the software compatibility requirements: you’ll still have to do your research and make sure the new version is compatible with the old format, or make sure it’s the same version.

All that said, I don’t trust embedded motherboard RAIDs to the same degree that I trust standalone controllers. A friend of mine about 8-10 years ago ran a RAID-0 on a laptop that got its super block borked when we tried to firmware update the SSDs, and it stopped detecting the array at all. We did manage to recover data, but it needed multiple times the raw amount of storage to do so.

- We made byte images of both disks in ddrescue to a server that had enough spare disk space.
- We found a software package that could stitch together images with broken super blocks if we knew the order the disks were in (we did), which wrote a new byte image back to the server.
- We copied the result again and turned it into a KVM VM to network attach and copy the data off (we could have loop mounted the disk to an SMB share and been done, but it was more fun and rewarding to boot the recovered OS afterwards as kind of a TAKE THAT LENOVO… we were younger).
- It took in total a bit over 3TB to recover the 2x500GB disks to a usable state, and about a week of combined machine and human time to engineer and cook, during which my friend opted to rebuild his laptop clean after we had images captured: one disk Windows, one disk Linux, not RAID-0 this time :P
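For anyone wondering why the disk order matters for the stitching step, here is a minimal Python sketch of RAID-0 interleaving. It is not the software package mentioned above; the function names, the 4-byte chunk size, and the toy payload are all hypothetical, and a real member image also has metadata and a data offset that this ignores.

```python
def split_raid0(payload, n_disks, chunk_size):
    """Stripe payload across n_disks in round-robin chunks (toy RAID-0)."""
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, len(payload), chunk_size):
        disks[(i // chunk_size) % n_disks] += payload[i:i + chunk_size]
    return [bytes(d) for d in disks]

def stitch_raid0(images, chunk_size):
    """Interleave chunks from member images, which must be in array order."""
    out = bytearray()
    longest = max(len(img) for img in images)
    offset = 0
    while offset < longest:
        for img in images:
            out += img[offset:offset + chunk_size]
        offset += chunk_size
    return bytes(out)

payload = b"0123456789abcdef"
disks = split_raid0(payload, 2, 4)

# Correct order gives the original data back; swapped order scrambles it.
print(stitch_raid0(disks, 4) == payload)
print(stitch_raid0(disks[::-1], 4) == payload)
```

The same logic is why guessing the chunk size or disk order wrong produces an image that looks like garbage even though every byte is present.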

Possibly linux@lemmy.zip · 2 months ago

Why wouldn’t you just use software raid? It is way more robust and if you are using ZFS you get all the nice features that come as a part of it.

computergeek125@lemmy.world · 2 months ago

I never said I didn’t use software RAID, I just wanted to add information about hardware RAID controllers. Maybe I’m blind, but I’ve never seen a good implementation of software RAID for the EFI partition or boot sector. During boot, most systems I’ve seen will try to always access one partition directly and a second in order, which is bypassing the concept of a RAID, so the two would need to be kept manually in sync during updates.

Because of that, there’s one notable place where I won’t - I always use hardware RAID for at minimum the boot disk because Dell firmware natively understands everything about it from a detect/boot/replace perspective. Or doesn’t see anything at all in a good way. All four of my primary servers have a boot disk on either a Startech RAID card similar to a Dell BOSS or have an array to boot off of directly on the PERC. It’s only enough space to store the core OS.

Other than that, at home all my other physical devices are hypervisors (VMware ESXi for now until I can plot a migration), dedicated appliance devices (Synology DSM uses mdadm), or don’t have redundant disks (my firewall, backed up to git, and my NUC Proxmox box; both firewalls and the PVE are all running ZFS for features).

Three of my four ESXi servers run vSAN, which is like Ceph and replaces RAID. Like Ceph and ZFS, it requires using an HBA or passthrough disks for full performance. The last one is my standalone server. Notably, ESXi does not support any software RAID natively that isn’t vSAN, so both of the standalone server’s arrays are hardware RAID.

When it comes time to replace that Synology it’s going to be on TrueNAS

Possibly linux@lemmy.zip · 2 months ago

Your information is about 10 years out of date. It is trivial to do boot with raid as in EFI you just set both drives as bootable.

I think hardware raid is only for the last resort, as Windows has Storage Spaces and Linux has ZFS, LVM and mdadm. I’ve never heard of a hardware raid system that has the features a lot of these systems have, like data integrity checking and RAM caching.

Essentially I don’t really see a need for hardware raid in a home environment and there isn’t a huge need in the business.

computergeek125@lemmy.world · 2 months ago

I never said anything about EFI not supporting multi boot. I said that they had to be kept in lockstep during updates. I recognize the term “manual” might have been a bit of a misnomer there, since I included systems where the admin has to take action to enable replication. ESXi (my main hardware OS for now) doesn’t even have software RAID for single-server datastores (only vSAN). Windows and Linux both can do it, but it’s a non-default manual process of splicing the drives together with no apparent automatic replacement mechanism: full manual admin intervention. With a hardware RAID, you just have to plop the new disk in and it splices the drive back into the array automatically (if the drive matches).
- “EFI doesn’t understand (normal) MD RAID” - https://unix.stackexchange.com/a/742072/34724 (2023)
- (untested) “Using metadata 1.0 (end of disk) to splice EFI partitions together” - https://std.rocks/gnulinux_mdadm_uefi.html
- (untested) “splicing windows dynamic disks together” - https://learn.microsoft.com/en-us/troubleshoot/windows-server/backup-and-storage/set-up-dynamic-boot-partition-mirroring
Dell and HPE have both had RAM caching for reads and writes since at least 2011. That’s why the controllers have batteries :)
- also, I said it only had to handle the boot disk. Plus you’re ignoring the fact that all modern filesystems will do page caching in the background regardless of the presence of hardware cache. That’s not unique to ZFS, Windows and Linux both do it.
mdadm and hardware RAID offer the same level of block consistency validation to my current understanding- you’d need filesystem-level checksumming no matter what, and as both mdadm and hardware RAID are both filesystem agnostic, they will almost equally support the same filesystem-level features (Synology implements BTRFS on top of mdadm - I saw a small note somewhere that they had their implementation request block rebuild from mdadm if btrfs detected issues, but I have been unable to verify this claim so I do not consider it (yet) as part of my hardware vs md comparison)

Hardware RAID just works, and for many, that’s good enough. In more advanced systems, all it’s got to handle is a boot partition, and if you’re doing your job as a sysadmin there’s zero important data in there that can’t be easily rebuilt or restored.

SayCyberOnceMore@feddit.uk · 2 months ago

I can confirm that moving the disks to a very similar device will work.

We recovered “enough” data from what disks remained of a Dell server that was dropped (PSU side down) from a crane. The server was destroyed, most of the disks had moved further inside the disk caddy which protected them a little more.

It was fun to struggle with that one for ~1 week

And the noise from the drives…

Possibly linux@lemmy.zip · 2 months ago

At some point you need a clean room

r00ty@kbin.life · edited 2 months ago

I’m sure I’ve seen paid software that will detect and read data from several popular hardware controllers. Maybe there’s something free that can do the same.

For the future, I’d say that with modern copy on write filesystems, so long as you don’t mind the long rebuild on power failures, software raid is fine for most people.

I found this, which seems to be someone trying to do something similar with a drive array built with an Intel raid controller

https://blog.bramp.net/post/2021/09/12/recovering-a-raid-5-intel-storage-matrix-on-linux-without-the-hardware/

Note, they are using drive images, you should be too.

Possibly linux@lemmy.zip · 2 months ago

Frankly hardware raid is dead and was never great. Software raid is significantly better.

r00ty@kbin.life · 2 months ago

I think it had its uses in the past, specifically if it had the memory backup to prevent full array rebuilds and cached data loss on power failure.

Also at the height of raid controller use (I would say 90s and 2000s) there probably was some compute savings by shifting the work to a dedicated controller.

In modern day, completely agree.

MangoPenguin@lemmy.blahaj.zone · 2 months ago

Keep multiple reliable (and tested) backups, if something fails restore a backup.

Don’t rely on any storage, RAID or anything else to be recoverable when something goes wrong.

tburkhol@lemmy.world · 2 months ago

RAID is more likely to fail than a single disk. You have the chance of single-disk failure, multiplied by the number of disks, plus the chance of controller failure.

RAID 1 and RAID 5 protect against that by sharing data across multiple disks, so you can re-create a failed drive, but failure of the controller may be unrecoverable, depending on availability of new, exact-same controller. With failure of 1 disk in RAID 1, you should be able to use the array ‘degraded,’ as long as your controller still works. Depending on how the controller works, that disk may or may not be recognizable to another system without the controller.

RAID 1 disks are not just 2 copies of normal disks. Example: I use software RAID 1, and if I take one of the drives to another system, that system recognizes it as a RAID disk and creates a single-disk, degraded RAID array with it. I can mount the array, but if I try to mount the single disk directly, I get filesystem errors.
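As a rough illustration of how another system can tell that a drive is a RAID member, here is a Python sketch that looks for the md superblock magic number. It assumes the metadata 1.2 layout as I understand it (superblock 4 KiB from the start of the member, magic stored little-endian); other metadata versions put the superblock elsewhere, so treat the offset as an assumption and use `mdadm --examine` on real disks.

```python
import struct

MD_MAGIC = 0xA92B4EFC   # md superblock magic number
V1_2_OFFSET = 4096      # assumed: metadata 1.2 sits 4 KiB into the member

def has_md_v1_2_superblock(first_bytes):
    """Check a buffer holding the start of a device for an md 1.2 superblock."""
    if len(first_bytes) < V1_2_OFFSET + 4:
        return False
    (magic,) = struct.unpack_from("<I", first_bytes, V1_2_OFFSET)
    return magic == MD_MAGIC

# Synthetic example: a zeroed 8 KiB buffer with the magic written at 4 KiB.
fake_member = bytearray(8192)
struct.pack_into("<I", fake_member, V1_2_OFFSET, MD_MAGIC)
print(has_md_v1_2_superblock(bytes(fake_member)))
```

With metadata at the front of the member, the filesystem begins after it, which fits the errors you get when mounting the raw disk directly instead of the assembled array.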

Nicht BurningTurtle@feddit.org · 2 months ago

Are there differences in the context of failure, when using a controller vs software raid with mdadm?

catloaf@lemm.ee · 2 months ago

With software raid, there is no controller to fail.

Well, that’s not strictly true, because you still have a SATA/SAS controller, HBA, backplane, or whatever, but they’re more easily replaceable. (Unless it’s integrated in the motherboard, but then it’s not a separate component to fail.)

MangoPenguin@lemmy.blahaj.zone · 2 months ago

Software RAID is generally better in every way, also no hardware to fail.

Big_Boss_77@lemmynsfw.com · 2 months ago

I’ve never been a big fan of RAID for this reason… but I’ve also never had enough mission critical data that I couldn’t just store hard copy backups.

That being said… let me ask you this:

Is there a better way than RAID for data preservation/redundancy?

Björn Tantau@swg-empire.de · 2 months ago

Just for drive redundancy it’s awesome. One drive fails you just pull it out, put in a new one and let the array rebuild. I guess the upside of hardware RAID is that some even allow you to swap a disk without powering down. Either way, you have minimal downtime.

I guess a better way would be to have multiple servers. Though with features like checksums in BTRFS I guess a RAID is still better because it can protect against bitrot. And with directly connected systems in a RAID it is generally easier to ensure consistency.

Big_Boss_77@lemmynsfw.com · 2 months ago

Yeah, that’s generally my consensus as well. Just curious if someone had a better way that maybe I didn’t know about.

schizo@forum.uncomfortable.business · 2 months ago

A tool I’ve actually found way more useful than actual raid is snapraid.

It just makes a giant parity file which can be used to validate, repair, and/or restore your data in the array without needing to rely on any hardware or filesystem magic. The validation bit being a big deal, because I can scrub all the data in the array and it’ll happily tell me if something funky has happened.

It’s been super useful on my NAS, where it’s the only thing standing between my pile of random drives and data loss.

There’s a very long list of caveats as to why this may not be the right choice for any particular use case, but for someone wanting to keep their picture and Linux ISO collection somewhat protected (use a 3-2-1 backup strategy, for the love of god), it’s a fairly viable option.
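The validation idea can be sketched in a few lines of Python. This is not how snapraid stores its data; the in-memory dict of files and the choice of SHA-256 are assumptions for illustration only.

```python
import hashlib

def build_index(files):
    """Map each file name to the SHA-256 digest of its content."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}

def scrub(files, index):
    """Return the names whose content is missing or no longer matches the index."""
    return sorted(
        name
        for name, digest in index.items()
        if name not in files or hashlib.sha256(files[name]).hexdigest() != digest
    )

# Hypothetical file contents standing in for what is on the data disks.
files = {"holiday.jpg": b"pixels", "distro.iso": b"bootable bits"}
index = build_index(files)

files["distro.iso"] = b"bootable bitz"  # simulate silent corruption
print(scrub(files, index))
```

Detection is only half of it: the parity data is what lets the damaged file be repaired once the scrub has flagged it.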

Big_Boss_77@lemmynsfw.com · 2 months ago

Very cool, this is actually the sort of thing I was interested in. I’m looking at building a fairly heavy NAS box before long and I’d love to not have to deal with the expense of a full raid setup.

For stuff like shows/movies, how do they perform after recovery?

OneCardboardBox@lemmy.sdf.org · 2 months ago

If you’re doing it from scratch, I’d recommend starting with a filesystem that has parity checks and filesystem scrubs built in: eg BTRFS or ZFS.

The benefit of something like BTRFS is that you can always add disks down the line and turn it into a RAID cluster with a couple of commands.

Big_Boss_77@lemmynsfw.com · 2 months ago

Yeah, it’s been a long time since I’ve looked at any kind of RAID/storage/data preservation stuff… like 256GB spinning platters were the “hot new thing” last time I did.

I’m starting from scratch…in more ways than one lol

schizo@forum.uncomfortable.business · 2 months ago

I mean, recovery from parity data is how all of this works, this just doesn’t require you to have a controller, use a specific filesystem, have matching sized drives or anything else. Recovery is mostly like any other raid option I’ve ever used.

The only drawback is that the parity data is mostly equivalent in size to the actual data you’re making parity data of, and you need to keep a couple copies of indexes since if you lose the index or the parity data, no recovery for you.

In my case, I didn’t care: I’m using the oldest drives I’ve got as the parity drives, and the newer, larger drives for the data.

If I were doing the build now and not 5 years ago, I might pick a different solution, but there’s something to be said for an option that’s dead simple (looking at you, zfs) and likely to be reliable because it’s not doing anything fancy (looking at you, btrfs).

From a usage (not technical) standpoint, the most equivalent commercial/prefabbed solution would probably be something like unraid.

hendrik@palaver.p3x.de · edited 2 months ago

Btw: With the regular Linux software mdraid, you can also swap drives without powering down. That all works fine while running, unless your motherboard SATA controller craps out. But the mdraid itself will handle it just fine.

catloaf@lemm.ee · 2 months ago

RAID is more likely to fail than a single disk. You have the chance of single-disk failure, multiplied by the number of disks, plus the chance of controller failure.

This is poorly phrased. A raid with a bad disk is not failed, it is degraded. The entire array is not more likely to fail than a single disk.

Yes, you are more likely to experience a disk failure, but like you said, only because you have more disks in the first place. (However, there is also the phenomenon where, after replacing a failed disk, the additional load during the rebuild might cause a second disk to fail, which is why you should replace failed disks as soon as possible. And have backups.)
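The difference between "a disk failed" and "the array failed" is easy to put numbers on. A small Python sketch, assuming independent failures and a made-up 2% per-disk annual failure rate (real failures are correlated, as the rebuild-load point above suggests, so read this as an optimistic bound):

```python
def p_any_disk_fails(p, n):
    """Chance that at least one of n independent disks fails in the period."""
    return 1 - (1 - p) ** n

def p_mirror_lost(p, n=2):
    """Chance that all n mirror copies fail in the same period, with no
    replacement in between (a deliberately pessimistic simplification)."""
    return p ** n

p = 0.02  # assumed per-disk annual failure probability

print(f"single disk:                {p:.4f}")
print(f"any of 4 disks fails:       {p_any_disk_fails(p, 4):.4f}")
print(f"2-way mirror loses data:    {p_mirror_lost(p):.4f}")
```

More disks means more disk replacements, but the mirror as a whole is still far less likely to lose data than a lone disk.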

B0rax@feddit.org · 2 months ago

Lots of people have moved away from raid entirely because of some of these issues. There are alternatives these days. For example mergerfs or the ZFS file system.

Theoriginalthon@lemmy.world · 2 months ago

Can confirm that moving a zfs array to a new system after a failure is as simple as connecting the disks and running zpool import -f <pool_name>

Every raid card I use now is put in HBA mode; it’s just simpler to deal with.

Possibly linux@lemmy.zip · 2 months ago

Not to mention hardware raid is dependent on a functional card. You are in trouble if the card has problems. Also hardware raid doesn’t do data validation

Theoriginalthon@lemmy.world · 2 months ago

I feel like hardware raid is a relic from the pre-multi-core CPU days. Given that was less than 20 years ago, it makes me feel old.

Possibly linux@lemmy.zip · 2 months ago

You are describing software raid which is the only thing really used these days. There are so many issues with hardware raid.

NeoNachtwaechter@lemmy.world · 2 months ago

recover data from unfunctioned remaining RAID disks due to RAID controller failure

In this case, you need a new RAID controller of similar type.

Can I even simply attach one of the RAID 1 disk to the desktop system

No. One disk out of a RAID array is different from a normal disk.

Recovery becomes easy if you do not use a hardware RAID controller, but a ZFS software RAID instead. It does nearly everything automatically. But you will need to read a few more tutorials for the first setup.