RAID1 Mirror Corruption on a 2008 R2 Server with an Intel RSTe Controller and Intel SSD Drives

We are experiencing intermittent but catastrophic array corruption on two servers that are under test before being moved to a hosting centre.

SERVER CONFIGURATION
SuperMicro SuperServer 6017R-TDLRF 1U Server
Incorporating a SuperMicro X9DRD-LF Motherboard with the Intel C602 Chipset, running the latest BIOS.
64GB ECC RAM
1x Xeon E5-2630v2 CPU
2x Intel DC S3700 800GB SSD Drives in RAID 1 (Mirror) on RSTe Hardware RAID.
Windows Server Enterprise 2008 R2, fully updated.

CORRUPTION PROBLEM
Under heavy load, after a random period of time (often when doing a Windows backup), the array corrupts and the following event log messages are generated. There are varying quantities of each message…

Event ID: 55
Description:
The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume VMs.

Event ID: 12289
Description:
Volume Shadow Copy Service error: Unexpected error CreateFileW(\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy25,0x80000000,0x00000003,…). hr = 0x800703ed, The volume does not contain a recognized file system.
Please make sure that all required file system drivers are loaded and that the volume is not corrupted.

Event ID: 136
The default transaction resource manager on volume E: encountered an error while starting and its metadata was reset. The data contains the error code.

A chkdsk on a corrupted volume shows hundreds of lines of errors. I can post these too, but I do not think the exact errors are relevant, as they vary each time. They include:

The object id index entry in file 0x19 points to file 0x174c
but the file has no object id in it.

The multi-sector header signature for VCN 0x0 of index $I30
in file 0x3e is incorrect.
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 …
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 …
Error detected in index $I30 for file 62.
The index bitmap $I30 in file 0x3e is incorrect.
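
For anyone wanting to check their own machines for the same signatures, the relevant events can be pulled straight out of the Windows logs. This is only a rough Python sketch (it assumes Python is installed and uses the built-in wevtutil tool); note that the Ntfs events 55 and 136 are in the System log, while the VSS 12289 errors land in the Application log:

import subprocess

# Rough sketch only. Uses the built-in wevtutil.exe to pull the most recent
# corruption-related events. Ntfs 55/136 live in the System log; the VSS
# 12289 errors are written to the Application log.
CHECKS = [("System", 55), ("System", 136), ("Application", 12289)]

def recent_events(log, event_id, count=10):
    query = "*[System[(EventID={0})]]".format(event_id)
    cmd = ["wevtutil", "qe", log, "/q:" + query,
           "/f:text", "/rd:true", "/c:" + str(count)]
    return subprocess.check_output(cmd).decode("utf-8", "replace")

if __name__ == "__main__":
    for log, event_id in CHECKS:
        print("=== {0} / Event ID {1} ===".format(log, event_id))
        print(recent_events(log, event_id))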


TESTS PERFORMED
# We have two complete servers with identical hardware. We can repeat the fault on either server, so we know the fault is not with a specific item of hardware.
# We have tested with 800GB Intel SSDs with HP firmware. We have also tested with 200GB Intel SSDs with Intel firmware. Configurations with both drive versions exhibit the fault.
# We have tested with Windows-based software RAID and the fault does not occur. Unfortunately this halves the array read performance, as we have confirmed with drive benchmarking software. Having spent £2000 per server on drives, halving the disk performance is not something we want to do. Since the software RAID works, this suggests that the drives and connectivity are not at fault, as they are used in both hardware and software RAID. Switching to software RAID uses the standard Microsoft AHCI driver instead of the Intel RSTe driver.
# Configurations with the following Intel RSTe driver versions exhibit corruption: Version 3.8.0.1113, version 4.0.0.1045 and version 4.1.0.1047.
# Configurations with the Intel ‘C600+/C200+ series chipset SATA RAID’ RSTe driver version 3.6.0.1093 do not exhibit corruption.
# Power consumption is around 130 Watts and is well within the limits of each server’s dual 500 Watt power supplies.

OBSERVATIONS
# Once the array is corrupted, running an array verify from the Windows Intel RAID utility often results in a blue screen.
# Once an array has corrupted, if we break the array and inspect each disk of the mirror, we find that one drive is intact and the other drive is corrupt. But this is not a fault with a drive or a cable because we have run tests on two different servers, six drives and four SATA cables.

CONCLUSION
Based on six weeks of exhaustive testing, we have concluded that there is a fault in the Intel RSTe driver.
We are trying to find a way to get this bug fixed.
If others have had the same issue, this adds more weight to the case.

Has anybody else experienced this behaviour? If so, have you been able to fix the problem by downgrading to RSTe driver version 3.6.0.1093?

Does anyone have any good suggestions or good contacts at Intel, so we can get this information to the right people so it gets fixed?

We don’t really want the servers to go live with a very old driver version, as once live they will effectively remain stuck at that driver version, since it will be too risky to update them.

Any help or suggestions are appreciated.

Best regards

Stephen Done

@ StephenDone:
Hello Stephen,
welcome to the Win-RAID Forum, and thanks for your detailed and very interesting report.

Although I don’t have any experience of my own with an Intel C600 Series Chipset, I know a lot of X79 RAID users who switched to an Intel RST RAID OROM and driver of the v11/12/13 series because of severe problems with the original "Enterprise Edition" RAID OROM modules and drivers.

It may be a good idea to post your report >here<, because some Intel employees are watching the threads of that Forum.

Have you ever tried to use an Intel RST RAID OROM/driver combo? You can do it even with mainboards that do not offer the BIOS option to switch from RSTe to RST usage.
Within the start post of >this< thread I offer specially customized Intel RST RAID ROM modules of the v12.x.x.xxxx series, which can be inserted into the BIOS instead of the RSTe v3.x.x.xxxx/v4.x.x.xxxx ones. Then you can run one of the modded RST RAID drivers I am offering >here<. The DeviceID of the Intel C600 Series Chipsets will stay untouched at DEV_2826.

Regards
Fernando

Hi Fernando,

Thanks for the suggestions. I will post my report to the location you suggest.

This is a production server, so we need to stick with official drivers and firmware. So whilst a non-enterprise ROM and driver might cure the problem, we would no longer have support for any further issues with the servers, as the hardware/software would not be an officially supported configuration. At the moment, all our drivers are certified and the hardware is on a SuperMicro ‘approved device’ list.

Best regards

Steve

Hi Steve, what is the version of the Intel RSTe Option ROM in the UEFI?

What’s the best place to read that info back?

The easiest way is to run the Intel RST/RSTe Control Software (provided you have installed the complete RST/RSTe package).
You will find the currently running Intel RAID ROM version under "System Report" after hitting the "Help" tab.
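
If you only need the driver version (rather than the option ROM), it can also be read straight off the driver binary without the GUI. A small sketch, assuming pywin32 is installed and that the RSTe driver file is iaStorA.sys - check which .sys file your package actually installs:

import win32api  # pywin32

# Sketch only: report the file version of the installed storage driver.
# The path is an assumption; substitute whichever .sys your RSTe package uses.
DRIVER = r"C:\Windows\System32\drivers\iaStorA.sys"

info = win32api.GetFileVersionInfo(DRIVER, "\\")
ms, ls = info["FileVersionMS"], info["FileVersionLS"]
print("Driver file version: {0}.{1}.{2}.{3}".format(
    ms >> 16, ms & 0xFFFF, ls >> 16, ls & 0xFFFF))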

Here’s all the version info from the RSTe tool…

This is currently showing a non-corrupting configuration with the 3.6 driver.

Intel® Rapid Storage Technology enterprise Information
User interface version: 3.6.0.1094
Language: English (United Kingdom)
Intel controller: SATA (AHCI)
Number of SATA ports: 6
Intel controller: SAS
Number of phys: 4
RAID option ROM version: 3.8.0.1029
Driver version: 3.6.0.1086
ISDI version: 3.6.0.1094

Best regards

Steve

Hello Steve,
thanks for posting the details of your non-corrupting Intel RSTe RAID configuration.
As I have seen at the Intel Communities Forum (>LINK<), nobody has yet replied to your bug report. Since you have reported a severe problem, which may affect all of Intel’s "high-end" mainboard chipsets, the Intel Support staff may need a few more days to find an appropriate reply.

As you probably know, this is not the best RAID ROM/driver version combination. The RAID driver version can be from a later development branch than the OROM/SataDriver version, but the reverse combination may provoke incompatibility problems. Reason: the drivers are usually fully backwards compatible, but they have only limited forward compatibility with a newer Controller "Firmware".

Hi Fernando,

Thanks for mentioning this. However, this is the only combination we have that actually works! I will give this info to my IT guy though, so he can perform further tests with the firmware ‘in sync’.

The strangest thing for me is that there are no other reports of this fault. But for us, it is perfectly repeatable.
I wonder what part of our system makes it unusual - the SuperMicro server has been out for a while, and the SSDs are the de facto enterprise SSD. But maybe few people would choose to pair the best SSDs with a chipset RAID controller - maybe that is what makes it unusual. We have to do this because the server is 1U (to minimise colocation hosting fees), and so we do not have a free slot in which to put a 3rd-party RAID controller.

By the way, just as a matter of interest, using two Intel DC S3700 drives on the mainboard RAID gives 800 MB/s read performance, as opposed to around 400 MB/s with a single drive (benchmarked using Crystal Disk Info). So whilst people may class the chipset RAID as ‘fake RAID’, it does give surprisingly good performance. It is just unfortunate that it also completely trashes your array when you try to back up the machine!
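
If anyone wants a rough cross-check of sequential read throughput without a dedicated benchmark tool, something along these lines will do. Purely illustrative: point it at an existing file that is larger than the installed RAM, otherwise the figure mostly reflects the Windows file cache rather than the drives.

import os, sys, time

CHUNK = 8 * 1024 * 1024  # 8 MiB reads

def sequential_read_mb_s(path):
    # Read the whole file front to back and report average throughput.
    size = os.path.getsize(path)
    start = time.time()
    with open(path, "rb") as f:
        while f.read(CHUNK):
            pass
    return (size / (1024.0 * 1024.0)) / (time.time() - start)

if __name__ == "__main__":
    print("%.0f MB/s" % sequential_read_mb_s(sys.argv[1]))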

Boston (who we purchased the servers through) have been very helpful. We have been given 3 more RSTe driver versions to try that sit between the one that works and the one that fails. Whilst this doesn’t help us directly, it might help Intel pinpoint exactly which code change caused the driver problem, so eventually something might get fixed. However, we have yet to get any response from Intel that suggests they are taking any interest. We have logged a call directly with them, but there are several layers of support between us and the people who can actually do anything about the problem or give any useful advice. And I must say that, looking at Intel’s forum, people with RST/RSTe problems don’t seem to have their issues directly addressed or acknowledged. This is why we are adopting the Boston->SuperMicro->Intel approach, even though the drives, chipset, processor and drivers are all Intel.

Best regards

Steve

This is true, but only for reading/writing larger files. The 4K scores are quite similar to a single HDD/SSD running in AHCI mode.

Maybe it would be a good idea to additionally contact Intel Support directly. >Here< is the related site.

Yes, we contacted them directly a few weeks ago.

Hi Stephen, Fernando,

I am amazed that I came across this thread. I have the very same thing: 2 identical Supermicro servers, both with Intel RSTe and Intel SSDs in RAID 1. Everything goes smoothly for a while and then bam!!! My RAID array falls apart and I get a BSOD. For now I am swapping the SSDs out and putting in WD Enterprise HDDs, but I really am upset, as I really wanted the performance of SSDs for the OS. Please notify me directly if an answer is found.
Thanks in Advance
[email protected]

@ billydv:
Welcome to the Win-RAID Forum!
I hope that you and StephenDone will find a solution.

Regards
Fernando


@ StephenDone and billydv:
The problems with Intel’s RSTe drivers are well known, especially if they are running in RAID mode. That is why many RAID users with an Intel C600/C600+ Series Chipset system have switched the onboard Intel RSTe SATA RAID Controller (DEV_2826) to an RST one (DEV_2822) via a BIOS feature.
Just for your information: even if your mainboard doesn’t have this DEV_2826>DEV_2822 switch option, you can use the Intel RST OROMs and drivers of the v11/v12 series. All you have to do is replace the original Intel RSTe RAID ROM v3.x.x.xxxx in the BIOS with a specially modded X79 Intel RST ROM module. You can find them within the start post of >this< thread.

Hi Billy,

We still have not got to the bottom of the problem.

We have not yet moved our servers to the hosting centre, as we do not trust them.

Our first fix was to keep the existing SSDs, but in a Windows software RAID configuration, without using the RAID features of the chipset. This has proved considerably more reliable than using the Intel chipset drivers. We were given updated Intel chipset drivers by SuperMicro/Intel, but these did not fix the problem and were not accompanied by any information that would tell us what bug they had found or how they had fixed it. It was pretty much ‘try this and go away’.

Unfortunately, we have experienced a further data corruption incident since moving to software RAID too. We are not sure if this is a related problem or not. However, with our previous servers (which ran for years) we never once had any corruption issue.

We are in the process of trying some other SSD drives. If this does not work, then our next change will be to scrap the servers entirely and buy something else that has as few components as possible in common with the previous servers.

Please let us know how your tests with the other drives go.

Best regards

Steve

My setup is as follows:
2x 240GB Intel SSDs for the OS in RAID 1
4x 1TB WD Enterprise HDDs in RAID 10 for data
1x 1TB WD Enterprise HDD for nightly backups

What I have always noticed is that the RAID 10 volume never has any issues, not with corruption nor with parity. For now I have swapped out the 2 SSDs for the OS with another 2 1TB WD Enterprise HDDs, initialized them and did a parity check (after restoring from an image) and voila!! For now it seems okay. I will stress test the backup server (I have 2 identical) to see if I get any corruption with the HDDs in RAID 1.
Thanks

Stephen,
Just a thought, and I believe I will attempt some testing in the near future to see if I am right. If you are using a Supermicro server, you most likely have some sort of SAS backplane (SATA cables from the motherboard probably attach to the backplane). Could the issue be with the backplane? Although the 2nd chipset of your motherboard has a single cable that plugs into the backplane, the first should have regular SATA connections. Have you tried bypassing the backplane? With my setup I am going to attempt to bypass the backplane on the 2 SATA 6Gb/s ports (0 and 1) and plug them directly into the 2 SSDs that I have for the OS. I am able to reliably corrupt the RAID 1 mirror by running BurnInTest (Passmark) at 60-70 percent load for a period of 3 days.

The two drive bays on the front of your server are behind this backplane
BPN-SAS-813LT-O-P
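
By the way, for anyone without a BurnInTest licence, a crude write-and-verify loop along the following lines generates a similar sustained load and will flag silent corruption. This is purely illustrative (the path and sizes are made up); keep the total data set well above the installed RAM, otherwise the read-back may be served from the file cache rather than the disks.

import hashlib, os, time

# Illustrative stress/verify loop, not the tooling used in this thread.
# Writes pseudo-random files, then reads them back and compares SHA-256
# digests, looping until a mismatch (or a crash) appears.
TARGET_DIR = r"E:\stress"      # made-up path: point at the volume under test
FILE_COUNT = 8
FILE_SIZE = 2 * 1024**3        # 2 GiB per file; scale up to exceed RAM
CHUNK = 4 * 1024**2            # 4 MiB writes/reads

def write_file(path, size, seed):
    h = hashlib.sha256()
    rnd = hashlib.sha256(seed).digest()
    with open(path, "wb") as f:
        written = 0
        while written < size:
            # repeatable pseudo-random block, chained from the seed
            block = (rnd * (CHUNK // len(rnd) + 1))[:min(CHUNK, size - written)]
            f.write(block)
            h.update(block)
            written += len(block)
            rnd = hashlib.sha256(rnd).digest()
    return h.hexdigest()

def read_digest(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()

if __name__ == "__main__":
    if not os.path.isdir(TARGET_DIR):
        os.makedirs(TARGET_DIR)
    pass_no = 0
    while True:
        pass_no += 1
        expected = {}
        for i in range(FILE_COUNT):
            path = os.path.join(TARGET_DIR, "stress_%02d.bin" % i)
            seed = ("%d-%d" % (pass_no, i)).encode()
            expected[path] = write_file(path, FILE_SIZE, seed)
        for path, digest in expected.items():
            if read_digest(path) != digest:
                print("MISMATCH on pass %d: %s" % (pass_no, path))
                raise SystemExit(1)
        print("pass %d clean at %s" % (pass_no, time.ctime()))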

Oh wow,
So you have got corruption using normal HDDs now, not even SSDs?
Let me know the results of your backplane test, and if you do not get a failure without the backplane then I will repeat your test as proof.

Hi Billy,
Could you also detail the server model and all RSTe driver and firmware versions, so we can see what is common between our systems?
Best regards
Steve

No, no, no,
Sorry for any confusion - HDDs seem to be okay. Only RAID 1 using SSDs seems to have a problem. The 4-HDD volume that has been in the system since I built it has never had a problem. On my spare server I am currently testing 2 HDDs in place of the 2 SSDs in a RAID 1 volume for the OS. I currently have it running a 72-hour stress test. Upon completion I will run a parity check and report the results back.