Faulty RAID controller in login node #130

Closed
opened 2025-07-08 18:04:00 +02:00 by rarias · 2 comments
Owner

The HW RAID controller seems to be having problems. It only manages to boot correctly about one time in ten.

I attempted to upgrade the firmware, but from EFI the update seems to get stuck. From Linux, it failed as well:

# dmesg
...
[ 1828.423255] megaraid_sas 0000:01:00.0: Iop2SysDoorbellIntfor scsi0
[ 1829.427269] megaraid_sas 0000:01:00.0: Found FW in FAULT state, will reset adapter scsi0.
[ 1829.427275] megaraid_sas 0000:01:00.0: resetting fusion adapter scsi0.
[ 1834.927600] megaraid_sas 0000:01:00.0: Waiting for FW to come to ready state
[ 1854.272685] megaraid_sas 0000:01:00.0: FW now in Ready state
[ 1854.272919] megaraid_sas 0000:01:00.0: Current firmware maximum commands: 928         LDIO threshold: 0
[ 1854.440701] megaraid_sas 0000:01:00.0: Init cmd success
[ 1854.512705] megaraid_sas 0000:01:00.0: firmware type : Legacy(64 VD) firmware
[ 1854.512710] megaraid_sas 0000:01:00.0: controller type       : MR(1024MB)
[ 1854.512712] megaraid_sas 0000:01:00.0: Online Controller Reset(OCR)  : Enabled
[ 1854.512713] megaraid_sas 0000:01:00.0: Secure JBOD support   : No
[ 1854.897079] megaraid_sas 0000:01:00.0: Jbod map is not supported megasas_setup_jbod_map 4949
[ 1854.897092] megaraid_sas 0000:01:00.0: Reset successful for scsi0.
[ 1854.904809] megaraid_sas 0000:01:00.0: 27168 (805305604s/0x0020/CRIT) - Controller encountered a fatal error and was reset


# /opt/MegaRAID/storcli/storcli64 /c0 download file=MR_614p6.rom
Download Completed.
Flashing image to adapter...
CLI Version = 007.3404.0000.0000 April 18, 2025
Operating system = Linux 4.4.49-92.14-default
Controller = 0
Status = Failure
Description = command sequence incorrect or previous operation terminated
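
To quantify how often the controller firmware faults, the kernel log can be scanned for the reset message shown above. This is only a minimal sketch: the function name `count_fw_faults` is hypothetical, and the match string is taken from the single dmesg excerpt in this issue, so other kernel versions may word it differently.

```shell
# Count how many times megaraid_sas reported the firmware in FAULT state.
# Reads kernel log text on stdin, so it works with live dmesg or a saved log.
# "|| true" keeps the exit status at 0 when grep finds no match (it still prints 0).
count_fw_faults() {
    grep -c 'megaraid_sas .*Found FW in FAULT state' || true
}
```

Possible usage: `dmesg | count_fw_faults`.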
Author
Owner

Heh, second attempt worked!

# /opt/MegaRAID/storcli/storcli64 /c0 download file=MR_614p6.rom
Download Completed.
Flashing image to adapter...
CLI Version = 007.3404.0000.0000 April 18, 2025
Operating system = Linux 4.4.49-92.14-default
Controller = 0
Status = Success
Description = F/W Flash Completed. Please reboot the system for the changes to take effect

Current package version = 24.3.0-0062
New package version = 24.21.0-0132
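
Since the flash failed once and then succeeded when repeated, the retry could be scripted roughly as below. This is a sketch under assumptions, not a vetted procedure: the function names, the `STORCLI` and `RETRY_DELAY` variables, and the default of three attempts are all hypothetical, and nothing here establishes that repeating a flash on a faulting controller is safe. Only the `storcli64 /c0 download file=...` invocation and the `Status = ...` output line come from the logs above.

```shell
# Path to storcli as used in this issue; overridable for testing (assumption).
STORCLI=${STORCLI:-/opt/MegaRAID/storcli/storcli64}

# Extract the value of the first "Status = ..." line from storcli output on stdin.
flash_status() {
    sed -n 's/^Status = //p' | head -n 1
}

# Try to flash ROM file $1 up to $2 times (default 3).
# Returns 0 on the first attempt that reports "Status = Success", 1 otherwise.
flash_with_retry() {
    rom=$1
    max=${2:-3}
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        status=$("$STORCLI" /c0 download file="$rom" | flash_status)
        echo "attempt $attempt: ${status:-unknown}"
        if [ "$status" = "Success" ]; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Pause before the next attempt (delay in seconds is a guess).
        sleep "${RETRY_DELAY:-5}"
    done
    return 1
}
```

Possible usage: `flash_with_retry MR_614p6.rom 3`, followed by a reboot if it reports success.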
Author
Owner

Seven reboots later and it seems to be working okay. My hypothesis was that the flash memory cells, which store charge in a floating gate (https://en.wikipedia.org/wiki/Flash_memory#Principles_of_operation), had been leaking charge over the years (the firmware was 10 years old), so they were starting to produce incorrect reads.

Flashing new firmware rewrites all the cells and so restores their charge, which should reduce the probability of read errors. This is just a hypothesis (and a long shot!), but it seems to be consistent with the observations.
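
As a rough sanity check on "seven reboots later": if each boot still succeeded only about one time in ten, and boots were independent (an assumption, not something measured here), seven clean boots in a row would be very unlikely. The helper name below is hypothetical.

```shell
# Probability that n consecutive boots all succeed when each boot succeeds
# independently with probability p. Independence is an assumption.
consecutive_boot_probability() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.1e\n", p ^ n }'
}

# With the pre-flash success rate of about 1/10 and seven reboots:
consecutive_boot_probability 0.1 7
```

The result is on the order of one in ten million, so the run of good boots is hard to explain by luck alone, although it does not confirm the charge-leak explanation itself.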

rarias added the hwiorepair labels 2025-07-09 10:55:01 +02:00
Reference: rarias/jungle#130