[Problem] Dell R720xd iDRAC BIOS Recovery

Hello,

So I purchased a used Dell R720xd server off ebay last week and I’ve run into some problems with it.
When I first booted it up, I found out that Dell’s hardware management system (called iDRAC) is dead. It is essentially a separate computer with its own CPU, memory, storage and BIOS on the same system board that enables hardware management features over the network like fan speed, KVM over IP, temperature sensors, power management, remote console, etc. It has constant power as long as at least one of the PSUs is powered.

When the whole system first gets power, iDRAC starts first and then is supposed to talk with the BIOS to provide various values like temperatures and fan speed. However, if the iDRAC doesn’t start, you basically lose most hardware management features and fans are stuck at jet speed all the time. This obviously won’t work in a homelab, and the seller agreed to take it back for a refund, but I got a killer deal so I’d rather fix it and keep it if I can. The iDRAC failures are usually due to bad iDRAC firmware/BIOS flashes, and Dell’s only recovery process for this is to drain residual power from the system to kick the iDRAC into recovery mode to flash back the previous firmware from its own storage. But if this fails, the motherboard is essentially bricked as the iDRAC will no longer boot with corrupted firmware and you can’t communicate with it.




I was poking around on the motherboard near where its CPU/memory/NIC are located and saw an SOP8 chip that looked like it might hold the boot firmware for the iDRAC, so I googled the chip model and board marking for it. The chip itself is an MXIC MX25L3206E NOR serial flash IC and in Dell’s Statement of Volatility, they list this chip’s purpose as containing the bootloader, MAC address, boot variables and other specifics related to iDRAC. The chip’s datasheet lists its capacity as 32Mb (megabits), which works out to the 4MB that Dell’s SOV states. The SOV also states that only the bootloader is stored on the U_IDRAC_SPI chip, while the operational firmware is stored on BMC_EMMC.
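As a quick check on the two size figures for the MX25L3206E: serial NOR flash is normally rated in megabits, so, assuming the datasheet's "32M" figure is megabits rather than megabytes, the conversion to the SOV's megabytes can be sketched like this (the function name is just for illustration):

```python
def megabits_to_bytes(megabits: int) -> int:
    """Convert a flash density in Mbit (1 Mbit = 2**20 bits) to bytes."""
    return megabits * 2**20 // 8


size = megabits_to_bytes(32)
# Print the size in bytes and in MiB so it can be compared with the SOV figure.
print(size, "bytes =", size // 2**20, "MiB")
```

Under that assumption the two documents agree, and a full dump of the chip should be exactly 4,194,304 bytes.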




My idea was to purchase an SOP8 clip and try to re-flash what is on the U_IDRAC_SPI chip with a new ROM image from Dell’s website. However, the firmware image from here is called firmimg.d7 and is 61MB. I’ve tried viewing it in HxD but the only discernible text is right at the beginning and way at the end of the file. I’m assuming the SOP8 chip’s ROM file is somewhere in this image but I have no clue how to find it. Does anyone have any ideas??

Nevermind, found a solution. Thanks!

@kwaleeb - Please provide the solution you found for the next guy looking, thanks!

Welp, back to square one.

So originally when I posted this, the iDRAC subsystem wasn’t booting properly because of a failed flash. However, the part of the firmware that was a bad flash was the Operational iDRAC FW that is located in the U_EMMC and not the iDRAC bootloader, located on the U_IDRAC_SPI chip. This was evident because when I would boot the system, the SystemID light (controlled by the iDRAC subsystem) was blinking amber in a recognizable pattern. The blinking amber SystemID light signals that the iDRAC bootloader (U_IDRAC_SPI) has failed to load the operational firmware from the EMMC (U_EMMC). When this happens, you can follow this process and insert an SD card with Dell’s iDRAC firmware update file called firmimg.d7 into the SD card reader that’s directly connected to the iDRAC’s subsystem and it will boot from the SD card, reinstall the operational firmware to the U_EMMC and reboot back to iDRAC running properly. I’m assuming this means the U_IDRAC_SPI chip must still be operating properly under this condition (flashing amber) or it would have no way of re-flashing itself from the SD card.

Well after I finally got everything fixed and running, I decided to update the iDRAC firmware. However, I made the stupid decision of jumping from version 1.57.57 to 2.60.60, and missing all the urgent updates between, which seems to have really bricked the iDRAC subsystem. Now the rear SystemID light stays off completely and the front SystemID light stays blue from boot, which shouldn’t be on at all during boot/until pressed. Ignoring the status lights, I’m assuming that now I’ve bricked the U_IDRAC_SPI too since the same SD recovery card process no longer works. I’ve tried draining system power to ensure iDRAC is fully reset, but it still won’t boot to the SD card or give the amber flashing light.

Now my idea is the same as in the original post: try to figure out how to repair/reflash the U_IDRAC_SPI chip. If I’m able to at least get the iDRAC U-Boot bootloader (on U_IDRAC_SPI) in an operational state/restored, I’m hoping it will give me the amber flashing light and the ability to re-flash the original 1.57.57 version back onto the SPI/EMMC chips through the normal recovery process. Now this is harder than it seems since Dell doesn’t provide a ROM file specifically for this chip. The firmware for this chip is located within the firmimg.d7 file, which is 59MB and obviously way bigger than the 4MB on the chip. I used a SOP8 clip, dumped what is currently on the U_IDRAC_SPI chip, compared it to the firmimg.d7 and found areas of hex values that were exactly the same in both files, so I know that the full firmware for this chip is somewhere in the firmimg.d7 file, but I don’t know where. I’ve also found references to the firmimg.d7 file in my ROM dump, so this chip is definitely responsible for going through the recovery process. You can also see boot variables and error messages as text but this hasn’t helped me much.
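The manual comparison described above can be automated. This is only a rough sketch, assuming the matching regions are stored uncompressed and that a 4096-byte block is a sensible granularity (both are guesses); `find_shared_blocks` is a made-up helper name, not a Dell tool:

```python
def find_shared_blocks(dump: bytes, image: bytes, block_size: int = 4096):
    """Return (dump_offset, image_offset) for each dump block found verbatim in image."""
    hits = []
    for off in range(0, len(dump) - block_size + 1, block_size):
        block = dump[off:off + block_size]
        # Skip erased (0xFF) or zeroed blocks: they match almost anywhere
        # and say nothing about where the real content sits.
        if block[0] in (0x00, 0xFF) and block.count(block[0]) == block_size:
            continue
        pos = image.find(block)
        if pos != -1:
            hits.append((off, pos))
    return hits
```

Reading both files with `open(path, "rb").read()` and passing them in gives a list of offset pairs. If consecutive dump blocks map to consecutive image offsets, that run is a good candidate for the region of firmimg.d7 that corresponds to the chip contents.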

It seems like I have all the pieces to the puzzle, but I have never dealt with hex or BIOS editing and really have no clue where to go from here. I’ve attached both the firmimg.d7 (Dell iDRAC firmware update file) and my iDRAC ROM dump. If someone has an R720 or R720xd and could hook up a SOP8 clip and dump their U_IDRAC_SPI chip, that would be ideal: I could either just change the MAC address in that file, re-flash it and hope that it works, or at least compare a working U_IDRAC_SPI dump to my broken one and maybe get some hints. Either way, any ideas are better than nothing since the only other option is buying a whole new motherboard, which is expensive.

firmimg.d7 (Google Drive)
U_IDRAC_ROM dump (Google Drive)

I will look into this for you, but I do not have any of that hardware, nor have I ever used it. Can you find a working dump to compare with your current bricked one? If you can, then I can possibly, maybe, fix it by comparing those two and the update file you posted.

https://mega.nz/#F!Noh0CSII!YJp6qHlHyUPQ0r2Uw04vfA Here I have the dump of a working R720XD and of a working R420. I have the same issue as kwaleeb for the R620, whose dump is included too

Thank you for the R720XD dump, I will check and see if I can figure anything out. Please explain, what is the R420 and R620 in regards to this thread? If you need similar help with 620, I’ll probably need a working dump from that too.

I may not be able to do anything with any of the files, as I have never looked at or used these devices, but I can take a look.

I provided those two extra dumps in case they are of any use.

I also meant to note that there are multiple revisions for the motherboard in pretty much all of the Dell servers, with different part numbers assigned to each; I’m not certain if dumps are revision-specific.

Regarding the R620, I don’t have, at least for the time being, a working machine from which to take the dump. I just have a faint hope that if a process is found to restore one machine model, it may be generic enough to operate across different models.

To add more info to this, iDRAC v1.x had split packages for iDRAC proper and Lifecycle Controller. Starting with version 2.x (2.10.10 I believe is the first available) these have been integrated into a single package. I’m fairly sure that to update machines it’s required to step through v1.66.65 -> v2.10 -> latest. It is also possible that the iDRAC requires a compatible BIOS version and that updates have to be performed in cross-steps until both are sufficiently recent. I had to step through them for both BIOS and iDRAC at work when an R420’s BIOS bricked and the replacement motherboard arrived never having been updated on either side.

The firmimg.d7 can be obtained via support.dell.com by selecting the affected machine and then downloading the exe update package; older packages are also provided. The exe file bundles the firmimg.d7 file in the payload subfolder, and it can be further expanded into a compressed Linux kernel and two squashfs images (~the bigger one for the eMMC and the smaller one for the SPI chip?~ definitely not the case: one goes to the eMMC, the other seems to be a bundle of default settings and binary blobs).
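For anyone wanting to locate those pieces inside firmimg.d7, one common approach is to scan for well-known magic numbers. A minimal sketch follows; the three signatures are the standard ones for little-endian squashfs, a U-Boot uImage header and a gzip stream, but whether firmimg.d7 uses exactly these containers is an assumption, and `scan_magics` is a made-up helper name:

```python
MAGICS = {
    "squashfs (little-endian)": b"hsqs",
    "u-boot uImage header": bytes.fromhex("27051956"),
    "gzip stream": bytes.fromhex("1f8b08"),
}


def scan_magics(data: bytes, magics=MAGICS):
    """Return a sorted list of (offset, name) for every magic-number hit."""
    hits = []
    for name, magic in magics.items():
        start = 0
        # bytes.find returns -1 once there are no further occurrences.
        while (pos := data.find(magic, start)) != -1:
            hits.append((pos, name))
            start = pos + 1
    return sorted(hits)
```

Short signatures such as the 3-byte gzip one will produce false positives inside compressed data, so treat the hits as candidates to inspect in a hex editor rather than as confirmed boundaries.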

The iDRAC processor seems to be a SH4A (http://blog.ignoranthack.me/?p=86) in case some form of disassembly is required. I have not yet tried to attach cables to the J_IDRAC_UART.

OK, thanks @Nannerkins. I wasn’t sure, which is why I asked.

@kwaleeb - can you use the dump above to fix yours? I think you should be able to, it’s the 4MB dump you were looking for, just put in your MAC and maybe serial if that’s required.
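Before putting a MAC into someone else's dump, it helps to know where and in what form the donor's MAC is stored. A rough sketch, assuming the MAC appears either as 6 raw bytes or as colon-separated ASCII text (how Dell actually stores it is not confirmed here, and the helper names are made up):

```python
def mac_forms(mac: str):
    """Map each byte pattern a MAC might take in a dump to a short label."""
    digits = mac.replace(":", "").replace("-", "").lower()
    if len(digits) != 12:
        raise ValueError("expected a 6-byte MAC address")
    colon = ":".join(digits[i:i + 2] for i in range(0, 12, 2))
    # Keyed by pattern so identical upper/lower ASCII forms are not searched twice.
    return {
        bytes.fromhex(digits): "raw",
        colon.encode(): "ascii",
        colon.upper().encode(): "ascii",
    }


def find_mac(dump: bytes, mac: str):
    """Return sorted (offset, form) pairs for each occurrence of the MAC."""
    hits = []
    for needle, form in mac_forms(mac).items():
        start = 0
        while (pos := dump.find(needle, start)) != -1:
            hits.append((pos, form))
            start = pos + 1
    return sorted(hits)
```

One caveat: a standard U-Boot environment block is protected by a CRC32, so if the MAC turns out to live in such a block, changing the bytes alone may not be enough and the checksum would need recomputing too. Whether this board follows the standard layout is something to verify against the dump.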

I have tried to attach a UART-to-USB converter to the pins marked J_IDRAC_UART. From the front, counting left to right, pin 3 is TX and pin 4 is GND. I have been unable to find whether RX is pin 1 or 2 as neither seems to make the system accept messages and I lack a multimeter currently to determine which provides 3.3V.

It seems the iDRAC is stuck in a boot loop and unable to recover. The SD card is appropriately sensed when inserted, but it’s not used to recover the system.

Here is the output I get:
https://pastebin.com/yFDN23jj

Have you asked Dell for any guidance or pinout examples etc? I would ask in forum and via email, maybe even facebook too since that is all everyone’s rage these days.

That output looks OK, as in it appears to all be working properly, until the very end. Does that mean the driver stack on the source image is messed up, or the on-board SPI? Is that from a read or a write function? If it’s from a read, can you write a new image and then check whether the readout gives the same error?

I’m not on a support contract or in a condition to ask free tips from Dell, though I suppose it can’t hurt to try.

I’m fairly sure that the SPI chip’s contents (which should be just u-boot, a few lifecycle controller logs and settings such as the iDRAC’s dedicated NIC’s MAC) are fine, and it’s the eMMC that’s either dead or otherwise broken, but this is only my intuition. As I’ve been unable to properly dump the SPI chip from the replacement board I got (parts of the dump end up corrupted or with bits stuck), I’ve not attempted to flash that dump over the current content.

I would ask nicely, hopefully they will be nice in return to keep a customer!

Have you tried more dumps, with different software versions/methods, maybe that will help you get one dumped properly.

I’ll give that a try, though considering I’m not a customer, as my servers all come off ebay, I don’t expect them to care.

I suspect the reason the dump doesn’t succeed is due to the programmer, a black ch341a. The new board has a different chip (MX25L3206E) than the bricked one, so I suspect that also matters.

They should care, otherwise your next systems from ebay will not be Dell.

Black programmer should have no problem with either of those roms. Did you try various software versions? Latest one I know of is included in this package 1.31/1.40, but of course newer is not always better so also try older versions too.
Rom version shouldn’t matter either, they both only contain data, if same size, same model item, data contained should be similar and CH341A should be able to dump either just fine once you find software that works with each
https://www.sendspace.com/file/gtcmvd

At this point I have the original board with the damaged iDRAC, a second one I got off eBay but whose RAM slots were busted (and wasn’t worth shipping back, according to the seller) and a third one which works. I have tried dumping each of them using different machines and the various versions you packaged - thanks for that btw, getting the right software for it isn’t as easy as one would naively think - as well as flashrom from Linux, and the result is still the same. For the second and third boards, whose iDRAC works, none of the dumps verifies correctly against the chip it was taken from: some parts are not read correctly and are filled with 0xFF or 0x00, or differ between dumps of the same board.
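To pin down which parts of the chip are reading unreliably, two dumps of the same board can be compared range by range. A minimal sketch (the helper names are made up, and it assumes both dumps have the same length):

```python
def diff_ranges(a: bytes, b: bytes):
    """Return half-open (start, end) ranges where the two dumps differ."""
    if len(a) != len(b):
        raise ValueError("dumps differ in length")
    ranges, start = [], None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y and start is None:
            start = i                      # a differing run begins here
        elif x == y and start is not None:
            ranges.append((start, i))      # the run ended just before i
            start = None
    if start is not None:
        ranges.append((start, len(a)))
    return ranges


def fill_kind(chunk: bytes) -> str:
    """Classify a chunk as 'all-FF', 'all-00' or 'data'."""
    if chunk and chunk.count(0xFF) == len(chunk):
        return "all-FF"
    if chunk and chunk.count(0x00) == len(chunk):
        return "all-00"
    return "data"
```

Running `fill_kind` on each differing range of each dump shows whether the bad reads are blank fills or scattered bit errors. If the differing ranges move around between reads, that would point more towards an unstable connection than towards a damaged region of the chip.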

My doubt stemmed from some reading I did regarding the black ch341a model versus the green one - I’m not good enough at electronics to properly grasp the issue, but it seems the black model has been wired/cloned wrong and provides incorrect power. I noticed that the board’s LED indicator tends to dim lightly once about every second.

You only verify against the chip you are taking image from, any other chip is not going to match. So read, then verify, then if matches save. That’s what you do on the working one, then figure out how to take to non-working, ie is serial or some other numbers needed, or does straight erase, program, verify, test work?
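The read, verify, then save rule can also be enforced in a script when the reads come from a command-line tool: take the same chip's contents several times and only keep the result if every read is identical. A small sketch with a made-up helper name, assuming the reads have already been loaded as bytes:

```python
import hashlib


def stable_read(reads):
    """Return the dump if at least two reads exist and all are identical, else None."""
    if len(reads) < 2:
        return None
    digests = {hashlib.sha256(r).hexdigest() for r in reads}
    return reads[0] if len(digests) == 1 else None
```

A `None` result means the reads disagree, so nothing should be saved or flashed from them yet.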

That info you mentioned about the black one is not valid in the way you think. I used to think that too, but even if it were true, it’s only overvoltage causing the chip to run warmer; it does not affect data. I also found later that this does not really apply to the way we use it anyway, so it can be safely forgotten about.

Keep at it, you will get verified dump. Maybe remove the PCB if you did not solder it to the pins, and then make sure connection on the BIOS chip end is secure and does not move while you are doing the read/verify process before hitting save.

I’m verifying each dump against its own chip, and the verification fails. If the voltage is not an issue that could make certain parts unreadable or get certain bits stuck, then I can only guess there’s an incompatibility between the programmer and the chip. The iDRAC SPI chip is soldered on and not removable, and I don’t trust my hand to be steady enough to re-solder it again.

OK, I was not sure. Is the chip a 1.8V chip? That can matter, so look up its PDF and see (I checked, it’s not, it’s 3.3V). I doubt there is a compatibility issue, but I will check now, as I have some of the same ROMs on hand.
I checked the MX25L3206E with the black CH341A and software 1.30 (erase, blank check, open file, then auto write and verify); I did not choose any chip ID and left the default 25L3205D. I then repeated the same steps with 1.31/1.40, except that I wrote and verified manually since there is no auto option (detect chip, set 4MB, detects chip ID C2201615). No issues with either version.

So the issue you are having has to be some connection issue.
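For reference, the chip ID C2201615 reported above can be decoded by hand. The sketch below assumes the first three bytes are the JEDEC manufacturer, memory type and capacity code, that the capacity code is log2 of the size in bytes (a common convention on SPI NOR parts, not a guarantee), and that the trailing 0x15 is an extra ID byte appended by the software; the function name is made up:

```python
MANUFACTURERS = {0xC2: "Macronix"}  # only the entry needed for this chip


def decode_chip_id(hex_id: str):
    """Split a hex chip ID string into manufacturer, memory type and size."""
    raw = bytes.fromhex(hex_id)
    if len(raw) < 3:
        raise ValueError("need at least 3 ID bytes")
    manufacturer, mem_type, capacity_code = raw[0], raw[1], raw[2]
    return {
        "manufacturer": MANUFACTURERS.get(manufacturer, f"unknown (0x{manufacturer:02X})"),
        "memory_type": mem_type,
        # Assumption: capacity code is log2(size in bytes).
        "size_bytes": 1 << capacity_code,
    }


info = decode_chip_id("C2201615")
print(info["manufacturer"], info["size_bytes"] // 2**20, "MiB")
```

If a programmer reports an ID that changes between detections or decodes to an implausible size, that is another sign the clip is not making solid contact.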

It seems odd to me that it happens only with a specific chip whereas the 4-5 others I’ve tried don’t have this issue. Eventually, in chinese-snailmail-time, I should get a new clip to try.