RAID1 Mirror Corruption on 2008 R2 Server with Intel RSTe Controller and Intel SSDs

The Real Bibo
See this thread: https://communities.intel.com/message/273293
Someone else confirms that TRIM is not enabled in RAID 1 with IRST.

Just wanted to mention that I was looking at Intel’s download center for IRSTe, and they now list the 3.5 series driver as recommended for Server 2008 R2. The 3.6 series is only recommended for Web Server 2008 R2. I’m not sure whether it’s a mix-up, but I haven’t seen a single reason not to use the 3.6 series on Server 2008 R2. I now have that version installed on our primary server with HDDs and also on our backup server with SSDs. The backup server is currently running a 30-day stress test similar to the one I ran in my previous 4-5-day tests with the various driver versions. Once I see that not even a 30-day stress test causes corruption, I will switch to SSDs on our primary server.
BTW, here is the email I received from Supermicro. They are finally admitting that others are having similar issues with SSDs in RAID 1:
Hi Billy,
Thanks for your feedback.
Incidentally, we also received similar feedback from another customer today, reporting that TRIM somehow seems to cause data corruption in the RAID set with SSDs.
We will forward this information to the PM for further investigation. If Supermicro can do something to solve this, we will of course come out with a new BIOS.

@billydv


I finished my tests with 2008 R2: using the 12.8 driver results in a stable system without verification faults (I have not tested the 12.9 driver yet).
With 2012 R2 and the 12.9 driver I got verification faults again, so I have now downgraded to 12.8 and am repeating the tests…
Thanks again!

We know that TRIM in RAID 1 just doesn’t work. I would use the recommended driver for your OS and SATA chipset and simply disable TRIM from the OS:
fsutil behavior set disabledeletenotify 1
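To confirm what the command above actually set, `fsutil behavior query disabledeletenotify` reports the current value. 0 means Windows sends TRIM (delete notifications) and 1 means it does not. Below is a minimal Python sketch for checking this from a script. The function names are hypothetical. The exact output format is an assumption to verify against your own server: a plain `DisableDeleteNotify = 1` line on 2008 R2-era systems, or lines prefixed with `NTFS`/`ReFS` on newer builds.

```python
import re
import subprocess
import sys

# Assumed output shapes: "DisableDeleteNotify = 1" or, on newer Windows builds,
# "NTFS DisableDeleteNotify = 0  (Disabled)". Check against your own output.
_PATTERN = re.compile(r"^(?:(\w+)\s+)?DisableDeleteNotify\s*=\s*([01])\b", re.MULTILINE)


def trim_status(fsutil_output: str) -> dict:
    """Map each reported file system (or 'default') to True if TRIM is enabled."""
    status = {}
    for fs_name, value in _PATTERN.findall(fsutil_output):
        # 0 = delete notifications (TRIM) are sent; 1 = suppressed, which is
        # what "fsutil behavior set disabledeletenotify 1" configures.
        status[fs_name or "default"] = value == "0"
    return status


def query_trim_status() -> dict:
    """Run fsutil (Windows only, normally from an elevated prompt) and parse it."""
    result = subprocess.run(
        ["fsutil", "behavior", "query", "disabledeletenotify"],
        capture_output=True, text=True, check=True,
    )
    return trim_status(result.stdout)


if __name__ == "__main__":
    if sys.platform == "win32":
        print(query_trim_status())
    else:
        # Off Windows, just demonstrate the parser on a sample line.
        print(trim_status("DisableDeleteNotify = 1"))
```

After running the `set` command above, you would expect the result to report TRIM as disabled (`False`) for the volume's file system.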

Okay,
Now I have begun testing the very same server with a new OS, Windows Server 2012 R2. I am using the 3.8 Intel drivers directly from Intel’s website. TRIM is enabled and working according to TrimCheck 0.7, and I have begun the same stress test with PassMark BurnInTest. I will report back next weekend, as the test will run for 5 days.

Just a thought:
Thinking back to my original install of this particular server with 2008 R2, I got parity errors immediately, even on the first parity check. I did not get any yesterday. The inbox RAID driver is the IRST 12.9 series. It allowed the install in RAID mode but did not install a device driver for the SAS controller. Maybe, just maybe, the problem lies only with Server 2008 R2 (or Windows 7), and TRIM works with RAID 1 on Server 2012 R2. We will see!

A new Intel OROM v4.2.0.1036 has been released; you can try it if you still get verification errors.

Unfortunately, I had to restart my test today because an extended power outage shut the server down, but even after 24 hours I did not have a single parity error. I have restarted the test and will leave it running until this weekend, but I’m starting to think that the issue is simply with Server 2008 R2 or Windows 7.

Well, our live system ran for exactly 6 days before it crashed on Thursday evening:

SATA SSD on Controller 0, Port 0: Failed.
Volume OS_Volume: Degraded.

System Report

System Information
OS name: Microsoft Windows Small Business Server 2011 Essentials
OS version: 6.1.7601 Service Pack 1 7601
System name: *********
System manufacturer: Supermicro
System model: X9DAi
Processor: GenuineIntel Intel64 Family 6 Model 62 Stepping 4 2.601 GHz
Processor: GenuineIntel Intel64 Family 6 Model 62 Stepping 4 2.601 GHz
BIOS: American Megatrends Inc., 3.0a

Intel® Rapid Storage Technology enterprise Information
User interface version: 3.6.0.1094
Language: English (United States)
Intel controller: SATA (AHCI)
Number of SATA ports: 6
Intel controller: SAS
Number of phys: 4
RAID option ROM version: 3.8.0.1029
Driver version: 3.6.0.1086
ISDI version: 3.6.0.1094

Storage System Information
RAID Configuration

Array Name: SATA_Array_0000
Size: 381,562 MB
Available space: 19,065 MB
Number of volumes: 1
Volume member: OS_Volume
Number of array disks: 2
Array disk: BTHV5040025H200MGN
Array disk: BTHV5040029N200MGN
Disk data cache: Enabled


I don’t know what is going on, but I am back on HDDs for now. I have also updated my service request with Intel and will wait to see what they say.

It looks as though Intel has just released new versions of the ROM, drivers and console:
- ROM v4.3.0.1018
- drivers v4.3.0.1198
- console v4.3.0.1542

I couldn’t find the changelog. Has anyone tried these yet?

I have another C600/X79 motherboard from another maker (Asus), the Rampage IV Extreme. Unfortunately, the BIOS release only has SATA OROM 3.5, but with the help of the UBU tool I updated the SATA OROM to 3.8 and have started testing. With any driver, SSDs in RAID 1 start to fail on Server 2008 R2 systems. I am now trying Server 2012 R2 on the same board, and it seems different. It will take some days of testing, but I’m starting to think that the problem lies in the OS.

Okay,
Here is my final update on this matter. Unfortunately, on my systems I was unable to get this working regardless of the OS or driver version, so I have reverted to regular HDDs. I have updated Supermicro on the issue and am waiting for a reply. What I can tell you is the following:
1- RAID 1 with SSDs is problematic on Server 2008 R2 regardless of the driver version. If you stick to the older IRSTe 3.6, which does not support TRIM in RAID 1, you must rely on garbage collection, which may or may not be able to keep up on a busy server. 3.6 seems to work for some people, but I would not be confident running without TRIM support long term.

2- RAID 1 with SSDs on Server 2012 R2 with either IRSTe 3.8 or 4.1 works flawlessly, as long as your motherboard does not have compatibility issues like mine. For weeks on end I have been stress testing an Asus Rampage IV Extreme in my office, both with IRSTe OROM 3.8 plus IRSTe 3.8 drivers and with IRSTe OROM 4.1 plus IRSTe 4.1 drivers. I ran PassMark BurnInTest together with Folding@home, and pulled out one disk at a time and put it back in to let it rebuild… No issues whatsoever. I would be very wary of setting this up through anything other than direct SATA cables to the motherboard or a backplane that is a simple passthrough. I did notice different behaviour on the Supermicro 7047A-T when the drives were in the backplane versus connected directly.

The TRIM issues in IRSTe 3.7 and later are probably the reason Intel pulled the newer drivers for Server 2008 R2. I am very disappointed that Supermicro either does not know about this or hasn’t bothered to fix it in a BIOS update. I hope they do soon.
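The helper below lets you check a system report like the one posted earlier against the pattern in these reports. It is a small hypothetical Python sketch and encodes only what posters in this thread observed, not anything Intel documents. According to those reports, IRSTe 3.7 and later pass TRIM through to RAID 1, and the mirror mismatches appeared on NT 6.1. That covers Server 2008 R2 and Windows 7, plus SBS 2011 Essentials, which reports 6.1.7601 above. It applies to IRSTe version numbers only, not to the consumer IRST 12.x series. It also says nothing about the separate S3710 firmware crash.

```python
from typing import Tuple

# Threshold taken from reports in this thread, not from Intel documentation:
# IRSTe 3.6 reportedly does not pass TRIM to RAID 1 volumes; 3.7 and later do.
TRIM_IN_RAID1_FROM = (3, 7)
# NT kernel version 6.1 = Windows 7 / Server 2008 R2 (and SBS 2011 Essentials).
NT_6_1 = (6, 1)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '3.6.0.1086' into (3, 6, 0, 1086)."""
    return tuple(int(part) for part in text.strip().split("."))


def matches_reported_failure(irste_driver_version: str, os_version: str) -> bool:
    """True if this IRSTe driver / OS pair is the combination reported as failing here.

    os_version should be the dotted part only, e.g. '6.1.7601'.
    """
    driver = parse_version(irste_driver_version)
    nt = parse_version(os_version)
    return driver[:2] >= TRIM_IN_RAID1_FROM and nt[:2] == NT_6_1


if __name__ == "__main__":
    # Values from the system report above: driver 3.6.0.1086 on 6.1.7601.
    print(matches_reported_failure("3.6.0.1086", "6.1.7601"))
```

Comparing tuples of the first two fields keeps the check independent of the build number, e.g. 3.8.x and 4.3.x both count as "3.7 or later".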

And just one more point:
I figured I would check whether my Rampage IV Extreme, which works perfectly with RAID 1 SSDs on Server 2012 R2 with either the 3.8 or 4.1 OROM and drivers, also works with the newer 4.1 SATA OROM and drivers on Server 2008 R2. The result is the same: errors appear almost immediately under high stress. I got hundreds of parity errors in just a few hours of stressing.

Just wanted to give everyone watching this thread an update. The last crash we had with Intel S3710 SSDs on our live system was caused by Intel’s firmware for that SSD series. The new Intel SSD Toolbox 3.3.1 corrects the issue. I have not tried this on our live system yet, but on our backup system (Server 2012 R2, RAID OROM 3.8 in the BIOS, IRSTe 3.8, Intel S3710 200 GB SSDs in RAID 1) everything appears to work as it should.

I think this is what we have really learned from all this:

1- Server 2008 R2 will work with single SSDs or SSDs in RAID 1, as long as you use an IRSTe driver that does not enable TRIM (IRSTe 3.6).
2- Server 2012 R2, which is much better suited to SSDs, works well with SSDs in RAID 1. IRSTe 3.8 and 4.1 seem very stable on my test systems.
3- Check for firmware updates for your SSDs.

Thank you for this update!
Has anyone tested whether actually allocated data blocks were affected by this corruption?
I ask because of one of my theories about what is happening here (the SSDs reacting differently when they receive unmap/free commands), and because I never saw any real misbehavior on such systems…
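The theory above can be illustrated with a toy model. The ATA command set lets a drive advertise whether reads after TRIM are deterministic (DRAT) and whether they return zeroes (RZAT). A drive that guarantees neither may keep returning stale data for a trimmed LBA. Suppose TRIM reaches both mirror members and they answer such reads differently. A verify pass that compares raw blocks would then flag mismatches that lie entirely in free space. The Python sketch below is hypothetical: the class and function names are made up, and it does not model how IRSTe performs its verify.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set

BLOCK = 4  # tiny block size so the example stays readable


@dataclass
class SimulatedSSD:
    """Toy mirror member; zero_after_trim is an assumed per-drive behaviour."""
    zero_after_trim: bool
    blocks: Dict[int, bytes] = field(default_factory=dict)

    def write(self, lba: int, data: bytes) -> None:
        self.blocks[lba] = data

    def trim(self, lba: int) -> None:
        if self.zero_after_trim:
            self.blocks[lba] = bytes(BLOCK)  # trimmed LBA reads back as zeroes
        # otherwise the stale contents stay readable (non-deterministic TRIM)

    def read(self, lba: int) -> bytes:
        return self.blocks.get(lba, bytes(BLOCK))


def verify(a: SimulatedSSD, b: SimulatedSSD, lbas: Iterable[int],
           allocated: Set[int]) -> Dict[str, List[int]]:
    """Compare members block by block and split mismatches by allocation state."""
    result: Dict[str, List[int]] = {"allocated": [], "unallocated": []}
    for lba in lbas:
        if a.read(lba) != b.read(lba):
            key = "allocated" if lba in allocated else "unallocated"
            result[key].append(lba)
    return result


member0 = SimulatedSSD(zero_after_trim=True)
member1 = SimulatedSSD(zero_after_trim=False)
for lba in range(8):
    payload = bytes([lba + 1]) * BLOCK  # RAID 1 writes identical data to both
    member0.write(lba, payload)
    member1.write(lba, payload)

# The file system frees blocks 5-7 and the RAID driver forwards TRIM to both members.
allocated = set(range(5))
for lba in range(5, 8):
    member0.trim(lba)
    member1.trim(lba)

print(verify(member0, member1, range(8), allocated))
# {'allocated': [], 'unallocated': [5, 6, 7]}
```

Answering the question on a real NTFS volume would need the allocated-cluster map from the volume bitmap (for example via the FSCTL_GET_VOLUME_BITMAP control code), mapped to the LBAs the verify reports as mismatched. That mapping step is not attempted here.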
Best regards.