RAID1 Mirror Corruption on 2008 R2 Server with Intel RSTe Controller with Intel SSD Drives

Hi Stephen,
What your system and mine have in common is that both are Supermicro boards that use the Intel C602 chipset for SATA. My system is a bit higher end than yours: mine is a 4U and yours is a 1U, and mine has twice the memory slots (supporting up to 1 TB). What is common (and where I believe the problem probably lies) is that both are Intel-based motherboards using the C602 chipset that connect to the drives through a backplane of some sort.
I have reported the issue to Supermicro, and their recommendation was to bypass the backplane and connect directly to the SATA drives to rule out a backplane incompatibility. On your system (hopefully it is not installed in a rack yet), remove the top cover, connect 2 SATA cables directly to ports 0 and 1 (the white SATA ports), and run them to the 2 SSDs without using the front drive bays. Once you have done that, test the system again. If the mirror still becomes corrupt, we know that the problem lies in the chipset itself or the drivers; if the problem is resolved, it’s an incompatibility with the backplane.
I can reliably corrupt the mirror by running BurnInTest from PassMark. If I set it to 65-70% for 72 hours (remember to set the test configuration to include all the partitions on the SSD RAID volume), a verify and repair from Intel RSTe will then report thousands of parity errors.
I will also be trying this sometime in the next few weeks. My server is in production and this is the business’s busiest time of the year, so I will probably need to try it after the first.
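For anyone who wants an independent check outside the RSTe GUI: as far as I understand it, a "parity error" on a RAID 1 volume simply means the two mirror members hold different data at the same offset. The sketch below illustrates that idea by comparing two disk images chunk by chunk. It is only an illustration, not how RSTe itself does it; the function name and the image files are hypothetical, and real member disks also carry RAID metadata that would have to be skipped.

```python
BLOCK_SIZE = 64 * 1024  # assumption: 64 KB chunks, chosen only for illustration


def count_mismatched_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Compare two images chunk by chunk.

    Returns (blocks_compared, blocks_that_differ). If one image is
    shorter, the chunks past its end count as mismatches.
    """
    compared = 0
    mismatched = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(block_size)
            chunk_b = b.read(block_size)
            if not chunk_a and not chunk_b:
                break  # both images fully read
            compared += 1
            if chunk_a != chunk_b:
                mismatched += 1
    return compared, mismatched
```

Called as count_mismatched_blocks("member0.img", "member1.img") on images taken from both members while the volume is offline (the file names are placeholders), a healthy mirror should report zero differing blocks in the data area.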

Just wanted to report back,
First testing results back today.

2 WD HDDs in RAID 1, stress tested for 72 hours with PassMark BurnInTest. Testing showed no disk errors after literally millions of writes.

The HDDs are installed in backplane cartridges.

Follow-up with the Intel Rapid Storage Technology enterprise software, driver version 4.1.0.1047: parity verification and repair showed no errors.



Volume OS_Volume: Verification and repair complete.

System Report

System Information
OS name: Microsoft Windows Small Business Server 2011 Essentials
OS version: 6.1.7601 Service Pack 1 7601
System name: BOXOFFICE
System manufacturer: Supermicro
System model: X9DAi
Processor: GenuineIntel Intel64 Family 6 Model 62 Stepping 4 2.601 GHz
Processor: GenuineIntel Intel64 Family 6 Model 62 Stepping 4 2.601 GHz
BIOS: American Megatrends Inc., 3.0a

Intel® Rapid Storage Technology enterprise Information
User interface version: 4.1.0.1046
Language: English (United States)
Intel controller: SATA (AHCI)
Number of SATA ports: 6
Intel controller: SAS
Number of phys: 4
RAID option ROM version: 3.8.0.1029
Driver version: 4.1.0.1046
ISDI version: 4.1.0.1046

Storage System Information
RAID Configuration

Array Name: SATA_Array_0000
Size: 1,907,737 MB
Available space: 95,383 MB
Number of volumes: 1
Volume member: OS_Volume
Number of array disks: 2
Array disk: WD-WCAW37390407
Array disk: WD-WCAW37357488
Disk data cache: Enabled

Array Name: SAS_Array_0000
Size: 3,815,474 MB
Available space: 190,725 MB
Number of volumes: 1
Volume member: PVM_Volume
Number of array disks: 4
Array disk: WD-WCAW37330401
Array disk: WD-WCAW37377253
Array disk: WD-WCAW37351879
Array disk: WD-WCAW37354503
Disk data cache: Enabled

Volume name: OS_Volume
Status: Normal
Type: RAID 1
Size: 906,176 MB
System volume: Yes
Data stripe size: 64 KB
Write-back cache: Disabled
Initialized: Yes
Parity errors: 0
Blocks with media errors: 0
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Volume name: PVM_Volume
Status: Normal
Type: RAID 10
Size: 1,812,374 MB
System volume: No
Data stripe size: 64 KB
Write-back cache: Disabled
Initialized: Yes
Parity errors: 0
Blocks with media errors: 0
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Hardware Information

Controller name: Intel(R) C600+/C220+ series chipset SATA RAID Controller
Type: SATA
Mode: RAID
Number of volumes: 1
Volume: OS_Volume
Number of spares: 0
Number of available disks: 1
Rebuild on Hot Insert: Enabled
Manufacturer: 8086
Model number: 2826
Product revision: 6
Direct attached disk: WD-WCAW37390407
Direct attached disk: WD-WCAW37357488
Direct attached disk: WD-WMC1P0306600

Controller name: Intel(R) C600 series chipset SAS RAID (SATA mode) Controller
Type: SAS
Mode: RAID
Number of enclosures: 0
Number of volumes: 1
Volume: PVM_Volume
Number of spares: 0
Number of available disks: 0
Read patrol: Disabled
Rebuild on Hot Insert: Disabled
Manufacturer: 8086
Model number: 1D6B
Product revision: 6
Direct attached disk: WD-WCAW37330401
Direct attached disk: WD-WCAW37377253
Direct attached disk: WD-WCAW37351879
Direct attached disk: WD-WCAW37354503

Disk on Controller 0, Port 0
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Array disk
Size: 932 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD1003FBYZ-010FB0
Serial number: WD-WCAW37390407
SCSI device ID: 0
Firmware: 01.01V03
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Disk on Controller 0, Port 1
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Array disk
Size: 932 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD1003FBYZ-010FB0
Serial number: WD-WCAW37357488
SCSI device ID: 1
Firmware: 01.01V03
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Disk on Controller 0, Port 2
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Available
Size: 1,863 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD2000FYYZ-01UL1B1
Serial number: WD-WMC1P0306600
SCSI device ID: 2
Firmware: 01.01K02
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Disk on Controller 1, Phy 0
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Array disk
Size: 932 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD1003FBYZ-0
Serial number: WD-WCAW37330401
SCSI device ID: 0
Firmware: 1V03
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Disk on Controller 1, Phy 1
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Array disk
Size: 932 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD1003FBYZ-0
Serial number: WD-WCAW37377253
SCSI device ID: 1
Firmware: 1V03
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Disk on Controller 1, Phy 2
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Array disk
Size: 932 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD1003FBYZ-0
Serial number: WD-WCAW37351879
SCSI device ID: 2
Firmware: 1V03
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Disk on Controller 1, Phy 3
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Array disk
Size: 932 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD1003FBYZ-0
Serial number: WD-WCAW37354503
SCSI device ID: 3
Firmware: 1V03
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Empty port
Port: 3
Controller SATA (AHCI)
Port location: Internal

Empty port
Port: 5
Controller SATA (AHCI)
Port location: Internal

I will run a second test just to confirm, but it seems that HDDs do not suffer from this bug.
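If you end up collecting several of these System Report dumps, the per-volume counters can be pulled out mechanically instead of being read by eye. A minimal sketch, assuming the report is saved as plain text in the same "Key: value" layout as the dump above, with sections separated by blank lines; the function names are my own and not part of any Intel tool.

```python
def parse_volume_blocks(report_text):
    """Collect the 'Key: value' lines of every 'Volume name:' section."""
    volumes = {}
    current = None
    for raw in report_text.splitlines():
        line = raw.strip()
        if not line:
            current = None  # a blank line ends the current section
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # headings such as "Empty port" carry no value
        key, value = key.strip(), value.strip()
        if key == "Volume name":
            volumes[value] = {}
            current = volumes[value]
        elif current is not None:
            current[key] = value
    return volumes


def volumes_with_errors(volumes):
    """Names of volumes that are not 'Normal' or report any errors."""
    flagged = []
    for name, fields in volumes.items():
        parity = int(fields.get("Parity errors", "0").replace(",", ""))
        media = int(fields.get("Blocks with media errors", "0").replace(",", ""))
        if parity or media or fields.get("Status") != "Normal":
            flagged.append(name)
    return flagged
```

Run against the report above, volumes_with_errors(parse_volume_blocks(text)) should come back empty, since both OS_Volume and PVM_Volume show Status: Normal with zero parity and media errors.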

We are testing with a different type of SSD. The tests are running now.
Will keep you posted.
Best regards
Steve

@ StephenDone and billydv:

Your tests are very interesting to me, but I hope you know that Intel has withdrawn the Intel RSTe drivers v4.1.0.1046 from its Download Center because of some severe bugs. Look >here<.

As of today, Supermicro is still offering the 4 series drivers for use with their C602 chipset motherboards:

Description: Intel PCH Driver(SATA)
Version: 4.1.0.1047
Link: Download
Description: Intel PCH Driver(SCU)
Version: 4.1.0.1047
Link: Download

As I was at the server earlier tonight, I ran a verification on the main production server, which has been running under heavy load since early Saturday morning. Here are the results of the check:




Volume OS_Volume: Verification and repair complete.

System Report

System Information
OS name: Microsoft Windows Small Business Server 2011 Essentials
OS version: 6.1.7601 Service Pack 1 7601
System name: BOXOFFICE
System manufacturer: Supermicro
System model: X9DAi
Processor: GenuineIntel Intel64 Family 6 Model 62 Stepping 4 2.601 GHz
Processor: GenuineIntel Intel64 Family 6 Model 62 Stepping 4 2.601 GHz
BIOS: American Megatrends Inc., 3.0

Intel® Rapid Storage Technology enterprise Information
User interface version: 4.1.0.1046
Language: English (United States)
Intel controller: SATA (AHCI)
Number of SATA ports: 6
Intel controller: SAS
Number of phys: 4
RAID option ROM version: 3.7.0.1049
Driver version: 4.1.0.1046
ISDI version: 4.1.0.1046

Storage System Information
RAID Configuration

Array Name: SATA_Array_0000
Size: 1,907,737 MB
Available space: 95,468 MB
Number of volumes: 1
Volume member: OS_Volume
Number of array disks: 2
Array disk: WD-WCAW37392705
Array disk: WD-WCAW37390907
Disk data cache: Enabled

Array Name: SAS_Array_0000
Size: 3,815,474 MB
Available space: 190,725 MB
Number of volumes: 1
Volume member: PVM_Volume
Number of array disks: 4
Array disk: WD-WCAW37333570
Array disk: WD-WCAW37352992
Array disk: WD-WCAW37362600
Array disk: WD-WCAW37341244
Disk data cache: Enabled

Volume name: OS_Volume
Status: Normal
Type: RAID 1
Size: 906,134 MB
System volume: Yes
Data stripe size: 64 KB
Write-back cache: Disabled
Initialized: Yes
Parity errors: 0
Blocks with media errors: 0
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Volume name: PVM_Volume
Status: Normal
Type: RAID 10
Size: 1,812,374 MB
System volume: No
Data stripe size: 64 KB
Write-back cache: Disabled
Initialized: Yes
Parity errors: 0
Blocks with media errors: 0
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Hardware Information

Controller name: Intel(R) C600+/C220+ series chipset SATA RAID Controller
Type: SATA
Mode: RAID
Number of volumes: 1
Volume: OS_Volume
Number of spares: 0
Number of available disks: 1
Rebuild on Hot Insert: Enabled
Manufacturer: 8086
Model number: 2826
Product revision: 6
Direct attached disk: WD-WCAW37392705
Direct attached disk: WD-WCAW37390907
Direct attached disk: WD-WMC1P0312819

Controller name: Intel(R) C600 series chipset SAS RAID (SATA mode) Controller
Type: SAS
Mode: RAID
Number of enclosures: 0
Number of volumes: 1
Volume: PVM_Volume
Number of spares: 0
Number of available disks: 0
Read patrol: Disabled
Rebuild on Hot Insert: Disabled
Manufacturer: 8086
Model number: 1D6B
Product revision: 6
Direct attached disk: WD-WCAW37333570
Direct attached disk: WD-WCAW37352992
Direct attached disk: WD-WCAW37362600
Direct attached disk: WD-WCAW37341244

Disk on Controller 0, Port 0
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Array disk
Size: 932 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD1003FBYZ-010FB0
Serial number: WD-WCAW37392705
SCSI device ID: 0
Firmware: 01.01V03
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Disk on Controller 0, Port 1
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Array disk
Size: 932 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD1003FBYZ-010FB0
Serial number: WD-WCAW37390907
SCSI device ID: 1
Firmware: 01.01V03
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Disk on Controller 0, Port 2
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Available
Size: 1,863 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD2000FYYZ-01UL1B1
Serial number: WD-WMC1P0312819
SCSI device ID: 2
Firmware: 01.01K02
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Disk on Controller 1, Phy 0
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Array disk
Size: 932 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD1003FBYZ-0
Serial number: WD-WCAW37333570
SCSI device ID: 0
Firmware: 1V03
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Disk on Controller 1, Phy 1
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Array disk
Size: 932 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD1003FBYZ-0
Serial number: WD-WCAW37352992
SCSI device ID: 1
Firmware: 1V03
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Disk on Controller 1, Phy 2
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Array disk
Size: 932 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD1003FBYZ-0
Serial number: WD-WCAW37362600
SCSI device ID: 2
Firmware: 1V03
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Disk on Controller 1, Phy 3
Status: Normal
Type: SATA disk
Location type: Internal
Usage: Array disk
Size: 932 GB
System disk: No
Disk data cache: Enabled
Command queuing: NCQ
Model: WDC WD1003FBYZ-0
Serial number: WD-WCAW37341244
SCSI device ID: 3
Firmware: 1V03
Physical sector size: 512 Bytes
Logical sector size: 512 Bytes

Empty port
Port: 3
Controller SATA (AHCI)
Port location: Internal

Empty port
Port: 5
Controller SATA (AHCI)
Port location: Internal
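
A side note on the sizes in these reports, since a 906 GB volume on a 1.9 TB array can look odd at first glance: RAID 1 and RAID 10 both keep two copies of every block, so the raw space a volume consumes is twice its reported size. The sketch below checks that the reported array size, available space and volume size are consistent under that reading. Treating "Available space" as raw (unmirrored) space is my assumption, and the function name is made up for the example.

```python
def unexplained_mb(array_mb, available_mb, volume_mb, copies=2):
    """Raw array space not accounted for by the volume's copies plus free space."""
    return array_mb - available_mb - volume_mb * copies


# Figures copied from the two System Reports in this thread (MB).
checks = {
    "first report, OS_Volume (RAID 1)": (1907737, 95383, 906176),
    "second report, OS_Volume (RAID 1)": (1907737, 95468, 906134),
    "PVM_Volume (RAID 10), both reports": (3815474, 190725, 1812374),
}

for label, (array_mb, available_mb, volume_mb) in checks.items():
    print(label, unexplained_mb(array_mb, available_mb, volume_mb), "MB left over")
```

The leftover is only 1 to 2 MB in each case, which looks like rounding, so the sizes in both reports appear to add up.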


Although I believe there are issues with the 4 series drivers on X99 chipsets (and the newer C600 series that are pure SATA 6 Gb/s), they seem to be rock solid so far on X79, as long as you aren’t using SSDs in RAID 1.

Just wanted to report on the 4.1.0.1046 drivers. I am running two Samsung 1TB EVOs in RAID 0 on my Intel SCU controller with the 4.1.0.1046 driver: no issues at all, and TRIM seems to be working just fine. This is on a Supermicro X9DAi motherboard (C602 chipset).

I will report any issues, but none so far.

I have a couple of machines (X79 chipset) running SSDs in RAID 0 and they seem just fine. I think the issues are with mirroring.

Hi Stephen,
I’m just wondering if you ever got anywhere with this issue. I see that the Intel DC S3500 series drives are listed as compatible on Supermicro’s SSD list, but that doesn’t seem to mean anything. I was going to try different SSDs, but I think it might be a waste of time.

Hi Billy,

We tested with HP/Intel 3700 drives and Intel 3500 drives.

All Intel drivers we were supplied that were newer than the 3.6 version are screwed if you run a two drive SSD mirror. The mirror will corrupt within hours or days. We never had an array that lasted a week.

Our servers are now live and running perfectly with these very old drivers.

We wasted thousands in testing, wasted time and wore down SSDs. Not to mention replacing all the drives with another version, just to prove a point.

We unequivocally proved the presence of the bugs with exhaustive testing. We never once got more of a reply than ‘try this version - it might work’. No version we were supplied fixed the fault. No details were supplied with the drivers. No contact with Intel was ever forthcoming.

Frankly, I think the situation is shameful. Bugs are to be expected - that’s life. But for Intel to hide behind the OEMs, not talk to the customer and not supply any release notes with their drivers to acknowledge known existing and fixed bugs with their drivers is unacceptable and in my mind criminal. These are server products, not home PC parts that people fiddle with at home for fun. Known bugs should be published, not hidden to maximise both sales and consequently hacked off customers.

I write this in the hope that someone at Intel who actually cares about the quality of their products and upholding a reputation of integrity reads this and thinks that a policy change is in order.

What is the firmware version of your Intel RAID controller? If I remember correctly, after flashing the BIOS on my server it was at 3.7-something. I’m not sure what effect a firmware version newer than the driver in use has. It’s usually not recommended.

EDIT: Actually, the firmware version is 3.8 on my motherboard after the last BIOS update.

We went with the 3.6 driver and the 3.8 RAID Option ROM and, though not recommended, that behaves itself. We are no longer paying much attention to what is recommended, as the recommended drivers don’t work!
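
For anyone who wants to flag this kind of mismatch (RAID Option ROM newer than the Windows driver) across several machines, the two version strings from the System Report can be compared directly. A small sketch, assuming four-part dotted versions like the ones quoted in this thread and comparing only major.minor; the function names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version such as '3.8.0.1029' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def rom_is_newer_than_driver(rom_version, driver_version, fields=2):
    """True if the option ROM's leading fields (major.minor by default)
    are higher than the driver's."""
    return parse_version(rom_version)[:fields] > parse_version(driver_version)[:fields]
```

With version strings that appear elsewhere in this thread, rom_is_newer_than_driver("3.8.0.1029", "3.6.0.1086") flags a mismatch, while a 3.8.0.1029 ROM with a 4.1.0.1046 driver does not. Whether such a mismatch actually matters is a separate question; the code only detects it.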

I will begin testing with the 3.6 driver tonight
Thanks

The 3.6 driver will not install on the 2nd (SCU) RAID controller. Do you have 1 or 2 RAID firmwares (controllers) on your board?

Just one RAID controller.
The server has two drive bays only.
It’s a 1U energy efficient server, designed for hosting centres, hence the reason it will only ever be run in RAID 1 with the inbuilt RAID… which is buggy.

When I ran the installer for RSTe 3.6, it installed the 3.6 driver for the SATA controller, and somehow I ended up with a 4.0 driver for the SCU controller. I’m not sure how stable that could be: two different drivers, one for SATA and one for SCU, both being managed from the same GUI. I noticed there is a new 4.2 version. I installed it and will test with both HDDs and SSDs. I am currently using BurnInTest with HDDs; if that goes well, I already have a RAID 1 volume of SSDs that I will swap in and test again next weekend.

Hi Stephen,
Just wanted to report back. I ran PassMark BurnInTest on the 4.2 drivers for 6 days. Upon stopping the test there were no errors in the test itself, but when I ran a verification, the mirror showed as unrepairable, with several blocks with media errors. I swapped out the disk and began a rebuild to another disk. I put the bad disk in another machine to check its SMART status: no SMART errors, and after deleting the RAID volume the disk shows as normal and healthy.

I think 4.1 was the newest driver we ever tested with.
So it looks like 4.2 is also broken then.
Thanks for letting me know.

Hi Stephen and Billy,
first of all, thanks for sharing your tests (which I have been following for some days). I should point out that I’m not an expert.

I have a similar problem (a lot of parity errors) with the workstation that I am building for my office.
My configuration is:
- PC: HP Z440
- Chipset: C612
- HDD not in RAID (OS = Windows 7, installed by HP)
- Two SSDs (Samsung 850 PRO) in RAID 1 (OS = Windows 7, installed by me)
- RAID ROM 4.1.0.1026


I found problems with:

1_ driver 4.1.0.1026 (preinstalled by HP on HDD)
GUI 4.1.0.1026
parity errors with RSTe verification

2_ driver 3.6.0.1086 (installed by me on SSD RAID 1)
GUI 3.6.0.1094
parity errors with RSTe verification

3_ driver 4.2.0.1136 (installed by me on SSD RAID 1)
GUI 4.2.0.1142
the “parity errors” report disappeared from the GUI
(but I don’t think this solved the problem, because if I run a verification after booting from the HDD (GUI 4.1.0.1026), parity errors come out)

I don’t really need RAID 1 on my machine, so I’ll probably go for more frequent bare-metal backups instead.

P.S.: If I ignore the parity errors found by RSTe verification, everything looks OK (RAID status is normal and Windows works). If I unplug one drive or the other, everything still looks OK.
Could this be a false positive error report? Or will I run into problems in the future if I underestimate this problem?

Thanks and have a nice weekend
Fabio

@ fabiobonfa:

Hello Fabio,
welcome to the Win-RAID Forum and thanks for your interesting contribution!

Regards
Dieter (alias Fernando)