Xeon08 storage error in DRAM memory #8
dmesg outputs the following error when running a context-switch benchmark. It seems fairly common. It might be related to my new fcs system call, or to the context switch itself. I need to take a closer look.
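Something like this should pull only the EDAC lines out of the kernel log (the exact message text depends on the kernel and EDAC driver, so treat the filter as a sketch):

# show only the EDAC corrected-error reports, with readable timestamps
dmesg -T | grep -i edac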
After rebooting and running the context switch benchmark
Only showing today's logs
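(In case it is useful, a filter along these lines should limit the output to today's kernel messages, assuming systemd-journald is collecting them:)

# kernel ring buffer entries from today only, EDAC lines
journalctl -k --since today | grep -i edac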
Today I spotted the EDAC error while compiling nosv, so fcs was not being used.
The provided logs suggest that one of the RAM modules is getting these errors, and we should replace it. Do you have an idea of which one is the bad module? From the log it seems hard to determine.
I'm thinking we can swap the suspected bad module with one from the oss01 node, which currently doesn't have a power supply (see https://pm.bsc.es/gitlab/rarias/owl/-/issues/2#note_72526) and check that the error doesn't happen again. I will also make a resource petition to get a new one, but that will take a bit longer.
I have not yet run the memtest86+ check; I hope it will tell me the module number. I will run it tonight. Once we know the faulty module, using the oss01 RAM module sounds good to me. But I also think we should tell someone who cares that we are touching the hardware, no?
I'm not sure if memtest86+ is able to detect those ECC errors. They are provided to the host as statistics, because the ECC hardware already corrected them. In fact, rebooting the machine will cause you to lose the information about which module is bad, as reported by:
grep . /sys/devices/system/edac/mc/*/ce_count
Post those counters before rebooting, so we are sure it is only the mc1 module.
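For reference, sysfs also breaks the counters down per DIMM, which may already narrow it to a slot (exact paths can differ between kernel versions):

# per-controller and per-DIMM corrected error counts
grep . /sys/devices/system/edac/mc/mc*/ce_count
grep . /sys/devices/system/edac/mc/mc*/dimm*/dimm_ce_count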
Yes, I discussed it with @vbeltran, but nobody else seems to care about these nodes anymore. The oss* nodes have been constantly rebooting for years.
But the point is that there are cells failing. If a single cell fails, ECC can correct it. If multiple cells fail, it won't, and memtest will detect it. Anyways, here is today's count; module 1 is still failing:
Yeah, we know module mc1 is bad. Do you know which physical DIMM is that? I will try to replace it next week.
I'm looking into it, but it doesn't seem clear to me at this point.
The relevant sysfs doc is here and here. But it all seems to point to the same slot, which makes no sense to me.
Intel has a troubleshooting guide for these errors here, and they provide a utility to decode the DIMM location (sysinfo for the 61X chipset), but the tool is no longer online, the links are broken. I think we can decode the location manually using our server board guide here, in table 75. But I first need to see the logged event in the SEL. I will post it the next time I see it.
Anyways, the guide recommends first updating the BIOS and then reseating the RAM modules. I guess we could do this for all RAM modules before anything else.
Those SEL events can be seen in the BIOS, which I believe decodes them properly.
Can you dump your
dmidecode -t memory
? A pragmatic approach is to remove one module and check that its label is no longer there:
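Something along these lines should list the labels to compare against (assuming dmidecode reports part and serial numbers on this board):

# slot locator, size, part and serial number for each DIMM
dmidecode -t memory | grep -E 'Locator|Size|Part Number|Serial Number'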
From your Slack log, you have two at 303 and 305:
https://www.intel.com/content/www/us/en/download/19034/system-event-log-sel-viewer-utility-for-intel-server-boards-and-intel-server-systems.html
Maybe you will get lucky with ipmiutil, see this post.
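As a sketch, either tool should be able to list the SEL entries (output format and record IDs will differ):

# list the system event log from the BMC
ipmiutil sel
ipmitool sel elist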
Last two bytes are 0, so DIMM 1.A?
xeon08-dmidecode-2023-04-24.txt
The DIMM modules have an LED, which should be turned on for the bad module:
That's convenient :) Since I rebooted the machine last week, no more errors have occurred, so maybe these LEDs are turned off. I will let you know once I detect a new error.
Okay, then I think we should wait until the error LED is on before replacing, so we are sure we are removing the bad module.
We have some more errors!
and dmesg
but nothing on ipmitool
However, I need to restart, so we will have to wait a little bit more :)
New record!
Intel took down the download page of selviewer: https://www.intel.com/content/www/us/en/404.html?ref=https://www.intel.com/content/www/us/en/download/19034/system-event-log-sel-viewer-utility-for-intel-server-boards-and-intel-server-systems.html
Thankfully I got a copy of the software: selviewer_v14_1_build32_allos.zip

Correction: my copy was broken. It is available here: https://drivers.softpedia.com/get/MOTHERBOARD/Intel/Intel-S2600WT2-Server-Board-SEL-Viewer-Utility-14-1-32.shtml
I think the Linux reporting is unable to determine precisely which module is bad:
So we should try the details reported by the BIOS, which should be accurate. With the ipmiutil tool I can dump the raw ECC event:
This could be decoded by the selviewer program from Intel, but I was unable to do so.
However, table 74 of the https://pm.bsc.es/gitlab/rarias/jungle/-/blob/master/doc/SEL_TroubleshootingGuide.pdf file explains the details of the ECC code, so we can decode it manually:
Here is the raw code:
First, we can see that the numbering starts at 1, as the memory code (0x0c) matches at position 11 and sensor number (0x02) at 12.
The important part is the location of the DIMM, bytes 15 and 16, which are all 0 (suspicious). Assuming the data is good, we can decode it as:
So it is either A1, B1, C1 or D1.
Now, I'm not sure if we can use the rank to further identify the slot in that group.
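In case it helps, a minimal shell sketch for pulling those two bytes out of a raw record pasted as space-separated hex (the argument here is a placeholder, not real data):

# print bytes 15 and 16 of a 16-byte SEL record, numbered from 1 as in table 74
sel_dimm_bytes() { set -- $1; printf 'byte15=0x%s byte16=0x%s\n' "${15}" "${16}"; }
# usage: sel_dimm_bytes "<16 hex bytes separated by spaces>"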
The BIOS also has a SEL decoder, so we can try to read the error from there via the serial port to verify it.
I also put a link to the reseat procedure recommended by Intel: https://www.intel.com/content/www/us/en/support/articles/000024007/server-products.html
They also recommend enabling the AMT memory test included in the BIOS for further investigation. Here are more details: https://www.intel.com/content/dam/support/us/en/documents/server-products/intel-active-system-console-and-intel-multi-server-manger-replacement.pdf
Digging into the ipmiutil source code, I can see that they implement the DIMM decoding logic, but it is only available if they can determine the version of the BIOS, which can only be accessed from the node itself, not via IPMI.
So, running the command on the node reveals the DIMM directly:
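(For reference, this should be just the plain SEL listing run locally on xeon08, so ipmiutil can read the BIOS version from SMBIOS; roughly:)

# run on the node itself, not through the network BMC interface
ipmiutil sel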
The bad DIMM is A1.
Awesome! Let me know when you are available to reseat the module!
I think I would not reseat it for now, as 1) that wouldn't confirm the bad module is A1 and 2) it may introduce other errors from other modules. Instead, I will replace it with one of the good DIMM modules from owl1, which is disassembled. This way we can verify that no more ECC errors are occurring in eudy and be sure that the bad module is A1. If ECC errors continue to happen, the bad module was another one (unlikely).
I plan to replace it next Tuesday if that is okay for you.
I will keep track of the bad module and reseat it in owl1, so errors will appear there once I reassemble it. If they become uncorrectable I will either remove it completely or order another one.
Perfect for me! I will come with you to lend a hand!
I removed these two DIMM modules from oss01, which can be used to replace the bad module:
Replaced by the one on the bottom (they have the same numbers, but they differ in the square matrix code). Here is the bad RAM from A1:
No more errors for a week:
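(The check is the same counter read as before, which should now stay at zero:)

grep . /sys/devices/system/edac/mc/*/ce_count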
Closing for now. If they appear again, reopen the issue and we will take a closer look.