Hello there! I’ve just found this wonderful forum with a dizzying amount of good, expert information, including all the up-to-date reference posts. (Thanks, Fernando!) However, in the interest of saving myself about 3 years of reading I thought I’d try to see if I could get pointed in the proper direction before getting into too many deep details. So I have a simple opinion question to start with…
I’m running a two-year-old ASUS X99-A with Intel RST v13.1.0.1058 drivers & software (haven’t checked the BIOS version yet). I’m using it to run RAID5 on a set of four 2TB WD hard drives. But it seems like whenever I run a “verify and fix” I almost always get a random number (dozens to hundreds) of verification errors (that it says it fixes). So, for you guys that have seen everything under the sun here, am I looking at real parity errors that keep cropping up for no apparent reason, or am I perhaps seeing some sort of “ghost” errors that aren’t really there (perhaps because the drive is in use)?
One of the reasons that I suspect “ghost” problems is that I ran for a while with the RAID not being actively used, but occasionally copying many large files to it just to test its normal writing procedures. I didn’t find any verification errors then, but as soon as I put it back “in use” (this weekend), I started getting error reports again.
If I’m seeing real errors that keep happening (and I’m not currently suspicious of the hard drives themselves) then it sounds like I need to upgrade (or perhaps downgrade) my firmware, drivers, and RST application software, and that’s going to require a lot of my time and effort that I can’t easily afford.
So, do you think that I’ve got a real problem and I need to dive into the deep end and figure out my best course of action?
@Davin :
Welcome to the Win-RAID Forum!
Since I have never created or used a RAID5 array, I am not absolutely sure about the reason for this behaviour. Synchronizing the data of a RAID5 array is much more complicated than doing the same with a simple RAID0 or RAID1 array and may take more time.
Questions:
1. Why do you run the “Verify and fix” option of the Intel RAID software so often? Have you noticed any problems while working?
2. Which version of the Intel RAID ROM/EFI “RaidDriver” module and of the Intel RAID Utility are you using?
As long as you don’t have real problems (system instability, a degraded RAID array or data loss), I don’t see any reason to change anything.
As far as I know, the “Verify and fix” option should only be used when there is a real problem, not routinely.
Regards
Dieter (alias Fernando)
True, but only the details are more complicated. The basic idea is the same: When there’s a read error, go reconstruct the proper data from other areas of the RAID.
I’m running it frequently now because it’s finding and fixing errors every time I do. I’m trying to keep it as clean as possible because RAID5 can only fix one error at a time on any given block and I don’t want to actually lose data.
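To illustrate the point about reconstruction, here is a minimal sketch of generic RAID5-style XOR parity in Python. It is not Intel RST's actual implementation; the 4-byte blocks, the three-data-plus-one-parity layout and all function names are assumptions made up for this example. It shows that one unreadable member can be rebuilt from the others, and that a verify pass can see that a stripe is inconsistent without being able to tell from parity alone which member holds the bad data.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Return the data blocks plus one parity block (hypothetical 4-drive layout)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def stripe_is_consistent(stripe):
    """A stripe verifies when the XOR over all members (data + parity) is all zeros."""
    return not any(xor_blocks(stripe))

def rebuild_member(stripe, missing_index):
    """Reconstruct one unreadable member from the remaining members."""
    survivors = [block for i, block in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)

# Three data blocks plus parity, as on a four-drive array.
stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
assert stripe_is_consistent(stripe)

# A drive that reports a read error can be rebuilt from the other three.
assert rebuild_member(stripe, 1) == b"BBBB"

# Silent corruption (no read error) only shows up as a parity mismatch;
# the XOR check says the stripe is wrong, not which member is wrong.
corrupted = list(stripe)
corrupted[0] = b"AAAB"
assert not stripe_is_consistent(corrupted)
```

The last case is the one that matters for verification errors: when no drive reports a read error, parity can only flag the stripe, which is why a mismatch by itself does not identify a bad disk.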
The idea of using the RAID is to keep me from actually seeing any problems - they are fixed before reporting to me. So I never see any lost data, which is what I’d expect even if there are many errors.
I haven’t had an opportunity lately to reboot and check the ROM version, but I will do that eventually. I expect it to be aligned with v13 of the software that was installed at the same time.
I’m not willing to wait until the RAID can’t repair something before I take action. These verification errors are supposed to be positive indications of data going bad.
It’s my understanding that it should be run occasionally just to make sure that no problems have occurred. But when it gives me an error report, it’s saying that those ARE problems that need fixing. I’m only running it frequently now because problems are being detected frequently.
Thanks for the input. I was hoping you’d be able to tell me from experience that the verification errors aren’t really there. I guess I’m going to have to spend a lot of time to update everything, unless you have some additional information for me.
BTW, your notes had mentioned that many versions of these driver sets contained known serious problems. Do you have a list anywhere of which versions are problematic and/or what kind of problems they present?
No, I don’t have such a list. On the other hand, I do not offer download links to drivers that are generally problematic.
By the way: every system is different, and it is impossible to determine for each system which driver is or may be problematic for it.
Hello Davin,
I worked for a regional PC manufacturer for many years and also took care of RAIDs. Verification errors sometimes occurred during PC production, and such a PC had to be repaired. I have encountered only two causes of these errors: a faulty HDD or, much more frequently, faulty RAM. Such faulty RAM often showed no other problems (no Memtest errors, no instability, etc.). My recommendation is to test your system (RAID verification) with only half of the DIMMs installed and see the results. Then test the other half, and so on. If my tip is correct, you should be able to find which DIMM causes the problem.
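The halving procedure above can be sketched as a simple search. This is only an illustration under strong assumptions: exactly one DIMM is faulty, and a RAID verification run with that DIMM installed always reports errors. The function name, the DIMM labels and the `verify_is_clean` callback are hypothetical; in practice each call stands for a manual step (power down, swap modules, boot, run the verification and read the result).

```python
def isolate_bad_dimm(dimms, verify_is_clean):
    """Narrow down a single faulty DIMM by testing half of the candidates at a time.

    verify_is_clean(installed) is a hypothetical callback that returns True
    when a RAID verification run with only `installed` DIMMs reports no errors.
    Assumes exactly one faulty DIMM and reproducible errors.
    """
    candidates = list(dimms)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if verify_is_clean(half):
            # The tested half looks good, so the fault is in the other half.
            candidates = candidates[len(half):]
        else:
            # Errors appeared with this half installed; keep narrowing it.
            candidates = half
    return candidates[0]

# Simulated example: pretend the module labelled "B1" is the faulty one.
slots = ["A1", "A2", "B1", "B2"]
suspect = isolate_bad_dimm(slots, lambda installed: "B1" not in installed)
assert suspect == "B1"
```

Because the reported errors are intermittent, a single clean run does not prove that a half is good; repeating the verification several times per configuration before ruling a group out would be the cautious approach.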
@RenierX :
Welcome to the Win-RAID Forum and thanks for your contribution!
Regards
Dieter (alias Fernando)
@Davin :
Since I also believe that your RAID problems may not be driver-related, I recommend following RenierX’s tip.
Another tip: check your PSU. Especially while booting, your RAID5 configuration needs a very good power supply unit that delivers a constant voltage to all the disk drives.
@RenierX :
Ouch - a memory problem would be bad, and time-consuming to diagnose, but if that’s all it was it sure would solve my headaches neatly. I’ll have to check that out when I can get my machine free enough to spend time on it.
Thanks for the tip.
PSU problems seem less likely since I’ve had these issues since the machine (and PSU) was new.
Does anyone know of a good disk-testing utility that’s read-only (won’t disturb the RAID data) and can run on individual drives that are part of an active RAID? I’d like to be able to detect what disk these errors are occurring on, and if it’s consistent. That would tell me if I have a bad disk.
On the contrary, this fact makes it even more likely that the PSU is responsible for your RAID5 problems (not enough power, or no constant simultaneous voltage output to all the drives while booting).
What about the WD Data Lifeguard Diagnostic tool for Windows (look >here<)?
As long as you do not choose the option “WRITE ZEROS” (= secure erasure), your data should stay untouched.
I would have to replace the whole PSU to test that out and it’s not an easy option at present. Also, I have a 500w PSU but only use about 150w (running) and not much more than that (~200w) while booting, so it certainly isn’t straining its capacity. It’s also being supplied regenerated power from a nice UPS, so the incoming power is nice and clean.
I’ve checked this out, but it will not talk to individual drives while they’re in a RAID configuration, so most of its facilities are useless without that, and I can’t feasibly shut down the RAID.
Got any better suggestions?
@Davin , create a DOS-bootable flash drive and use Data Lifeguard Diagnostic for DOS.
> link <
Then you must switch the SATA controller to AHCI (non-RAID) mode in the BIOS setup. Boot from the USB flash drive and be sure to run only read-only tests on all HDDs, because any write to them may destroy the RAID. After the tests, switch the SATA controller back to RAID mode. Nothing should happen to the RAID if you are careful (but backing up your data is always a good idea).
@RenierX
That sounds good, but scary. I’ve got 2TB of critical data on that drive, important parts of which can’t be backed up while it’s on-line, and it’s changing faster than I can keep up with it. Making a good backup of everything at once is really difficult, and restoring it all would be even more difficult. But I will give that a thought if we don’t have any other options.
Thanks