Xeon08 storage error in DRAM memory #8

Closed
opened 2023-04-18 11:00:07 +02:00 by arocanon · 34 comments
arocanon commented 2023-04-18 11:00:07 +02:00 (Migrated from pm.bsc.es)

dmesg outputs the following error when running a context-switch benchmark. It seems fairly common. It might be related to my new fcs system call, or to the context switch itself. I need to take a closer look.

[442217.414300] mce: [Hardware Error]: Machine check events logged
[442217.414469] {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0                                                                       
[442217.414471] {10}[Hardware Error]: It has been corrected by h/w and requires no further action                                                                     
[442217.414473] {10}[Hardware Error]: event severity: corrected
[442217.414474] {10}[Hardware Error]:  Error 0, type: corrected
[442217.414476] {10}[Hardware Error]:  fru_text: DIMM ??
[442217.414477] {10}[Hardware Error]:   section_type: memory error
[442217.414479] {10}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)                                                              
[442217.414481] {10}[Hardware Error]:   node:0 
[450875.146974] RAS: Soft-offlining pfn: 0x1a87b2
[450875.146980] mce: [Hardware Error]: Machine check events logged
[450875.146983] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[450875.146985] EDAC sbridge MC1: CPU 0: Machine Check Event: 0 Bank 9: 8c00004e000800c0                                                                              
[450875.146987] EDAC sbridge MC1: TSC a8de8302d033a
[450875.146989] EDAC sbridge MC1: ADDR 1a87b2000
[450875.146991] EDAC sbridge MC1: MISC 900000040005c8c
[450875.146992] EDAC sbridge MC1: PROCESSOR 0:406f1 TIME 1681762851 SOCKET 0 APIC 0
[450875.147004] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 page:0x1a87b2 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:255 \xffffffc0\xffffff82Fj\xffffffcd\xffffffff\xffffffff)                                                                
[450875.147148] soft_offline_page: 0x1a87b2 page already poisoned
[450875.147156] {11}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0                                                                       
[450875.147157] {11}[Hardware Error]: It has been corrected by h/w and requires no further action                                                                     
[450875.147159] {11}[Hardware Error]: event severity: corrected
[450875.147160] {11}[Hardware Error]:  Error 0, type: corrected
[450875.147162] {11}[Hardware Error]:  fru_text: DIMM ??
[450875.147163] {11}[Hardware Error]:   section_type: memory error
[450875.147164] {11}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)                                                              
[450875.147166] {11}[Hardware Error]:   node:0 
[459532.936119] mce: [Hardware Error]: Machine check events logged
[459532.936270] {12}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0                                                                       
[459532.936272] {12}[Hardware Error]: It has been corrected by h/w and requires no further action                                                                     
[459532.936273] {12}[Hardware Error]: event severity: corrected
[459532.936275] {12}[Hardware Error]:  Error 0, type: corrected
[459532.936276] {12}[Hardware Error]:  fru_text: DIMM ??
[459532.936278] {12}[Hardware Error]:   section_type: memory error
[459532.936279] {12}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)                                                              
[459532.936281] {12}[Hardware Error]:   node:0 
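As a side note, the raw `Bank 9: 8c00004e000800c0` status word above can be split by hand. A minimal sketch, assuming the standard Intel MCi_STATUS layout (the field names are my reading of the status word, not something EDAC prints):

```shell
# Decode a few MCi_STATUS fields (Intel MCA layout): VAL, OVER, UC,
# MISCV, ADDRV and the 16-bit MCA error code in the low bits.
decode_mce() {
  local s=$(( $1 ))
  printf 'VAL=%d OVER=%d UC=%d MISCV=%d ADDRV=%d mca_code=0x%04x\n' \
    $(( (s >> 63) & 1 )) $(( (s >> 62) & 1 )) $(( (s >> 61) & 1 )) \
    $(( (s >> 59) & 1 )) $(( (s >> 58) & 1 )) $(( s & 0xffff ))
}
decode_mce 0x8c00004e000800c0   # the scrubbing error above
decode_mce 0x8c00004000010090   # the read error seen later in the thread
```

Both decode as valid (VAL=1), corrected (UC=0) events with a valid address and MISC field, and the low 16 bits match the `err_code:0008:00c0` (scrub) and `0001:0090` (read) values in the EDAC lines.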
rarias commented 2023-04-18 15:06:57 +02:00 (Migrated from pm.bsc.es)

mentioned in issue #3

arocanon commented 2023-04-18 16:55:55 +02:00 (Migrated from pm.bsc.es)

After rebooting and running the context switch benchmark

[   33.653895] RPC: Registered tcp transport module.
[   33.653897] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  970.462846] Scheduler tracepoints stat_sleep, stat_iowait, stat_blocked and stat_runtime require the kernel parameter schedstats=enable or kernel.sched_schedstats=1
[ 2041.635890] mce: [Hardware Error]: Machine check events logged
[ 2091.203655] RAS: Soft-offlining pfn: 0x1a87b2
[ 2091.203665] mce: [Hardware Error]: Machine check events logged
[ 2091.203669] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[ 2091.203672] EDAC sbridge MC1: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090                                                                               
[ 2091.203678] EDAC sbridge MC1: TSC b288562e101f8
[ 2091.203680] EDAC sbridge MC1: ADDR 1a87b2dc0
[ 2091.203683] EDAC sbridge MC1: MISC 4406ea886
[ 2091.203685] EDAC sbridge MC1: PROCESSOR 0:406f1 TIME 1681828387 SOCKET 0 APIC 0
[ 2091.203712] EDAC MC1: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1a87b2 offset:0xdc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0 row:0x1505 col:0x2b0 bank_addr:3 bank_group:3)                                                                      
[ 2091.211035] soft_offline: 0x1a87b2: invalidated

+arocanon@xeon08:~/bsc/projects/sc-bench/exp/fcs/log/plots/plot-1> grep . /sys/devices/system/edac/mc/*/ce_count
/sys/devices/system/edac/mc/mc0/ce_count:0
/sys/devices/system/edac/mc/mc1/ce_count:1
/sys/devices/system/edac/mc/mc2/ce_count:0
/sys/devices/system/edac/mc/mc3/ce_count:0
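For completeness, the same check can be wrapped in a small helper that also shows the uncorrected counters (`ue_count`); this is a hypothetical convenience function, using the standard EDAC sysfs paths:

```shell
# Print per-controller corrected (CE) and uncorrected (UE) counts.
# $1 overrides the sysfs root, which is handy for testing.
edac_counts() {
  local root=${1:-/sys/devices/system/edac/mc}
  local mc
  for mc in "$root"/mc*; do
    printf '%s ce=%s ue=%s\n' "${mc##*/}" \
      "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
  done
}
# edac_counts        # defaults to the real sysfs tree
```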
arocanon commented 2023-04-18 17:06:08 +02:00 (Migrated from pm.bsc.es)

Only showing today's logs

ipmitool sel list
 306 | 04/18/2023 | 13:31:39 | System Event #0x83 | OEM System boot event | Asserted
 307 | 04/18/2023 | 13:34:30 | System Event #0x83 | OEM System boot event | Asserted
 308 | 04/18/2023 | 13:41:34 | System Event #0x83 | OEM System boot event | Asserted
 309 | 04/18/2023 | 13:44:31 | System Event #0x83 | OEM System boot event | Asserted
 30a | 04/18/2023 | 13:58:10 | System Event #0x83 | OEM System boot event | Asserted

arocanon commented 2023-04-20 15:38:15 +02:00 (Migrated from pm.bsc.es)

Today I spotted the EDAC error while compiling nosv, so fcs was not being used.

rarias commented 2023-04-20 16:04:35 +02:00 (Migrated from pm.bsc.es)

The provided logs suggest that one of the RAM modules is getting these errors, and we should replace it. Do you have an idea of which one is the bad module? From the log it seems hard to determine.

I'm thinking we can swap the suspected bad module with one from the oss01 node, which currently doesn't have a power supply (see https://pm.bsc.es/gitlab/rarias/owl/-/issues/2#note_72526) and check that the error doesn't happen again. I will also make a resource petition to get a new one, but that will take a bit longer.

arocanon commented 2023-04-20 16:41:41 +02:00 (Migrated from pm.bsc.es)

I have not yet run the memtest86+ check; I hope to know the module number after running it. I will run it tonight. Once we know the faulty module, using the oss01 RAM module sounds good to me. But I also think that we should tell someone who cares that we are touching the hardware, no?

rarias commented 2023-04-20 17:03:09 +02:00 (Migrated from pm.bsc.es)

I'm not sure if memtest86+ is able to detect those ECC errors. They are reported to the host as statistics, because the ECC hardware has already corrected them. In fact, rebooting the machine will make you lose the information about which module is bad, as reported by `grep . /sys/devices/system/edac/mc/*/ce_count`.

Post those counters before rebooting, so we are sure it is only the mc1 module.

> But I also think that we should tell someone who cares that we are touching the hardware, no?

Yes, I commented it with @vbeltran, but nobody else seems to care about these nodes anymore. The oss* nodes have been constantly rebooting for the past years.

arocanon commented 2023-04-20 18:02:43 +02:00 (Migrated from pm.bsc.es)

But the point is that there are cells failing. If a single cell fails, ECC can correct it. If multiple cells fail, it won't, and memtest will detect it. Anyway, here is today's count; module 1 is still failing:

+arocanon@xeon08:~/bsc/projects/sc-bench/benchmarks/fcs/nested> grep . /sys/devices/system/edac/mc/*/ce_count
/sys/devices/system/edac/mc/mc0/ce_count:0
/sys/devices/system/edac/mc/mc1/ce_count:1
/sys/devices/system/edac/mc/mc2/ce_count:0
/sys/devices/system/edac/mc/mc3/ce_count:0
arocanon commented 2023-04-21 10:17:49 +02:00 (Migrated from pm.bsc.es)

![image](/uploads/e4c8656f5c11d11e5d08cc1f097306cb/image.png)
rarias commented 2023-04-21 10:54:25 +02:00 (Migrated from pm.bsc.es)

> But the point is that there are cells failing.

Yeah, we know module mc1 is bad. Do you know which physical DIMM is that? I will try to replace it next week.

arocanon commented 2023-04-21 13:10:04 +02:00 (Migrated from pm.bsc.es)

I'm looking into it, but it doesn't seem clear to me at this point.

The relevant sysfs docs are [here](https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-edac) and [here](https://www.kernel.org/doc/html/latest/driver-api/edac.html). But it all seems to point to the same slot, which makes no sense to me.

+arocanon@xeon08:~> grep . /sys/devices/system/edac/mc/mc*/dimm*/dimm_location
/sys/devices/system/edac/mc/mc0/dimm0/dimm_location:channel 0 slot 0 
/sys/devices/system/edac/mc/mc0/dimm3/dimm_location:channel 1 slot 0 
/sys/devices/system/edac/mc/mc1/dimm0/dimm_location:channel 0 slot 0 
/sys/devices/system/edac/mc/mc1/dimm3/dimm_location:channel 1 slot 0 
/sys/devices/system/edac/mc/mc2/dimm0/dimm_location:channel 0 slot 0 
/sys/devices/system/edac/mc/mc2/dimm3/dimm_location:channel 1 slot 0 
/sys/devices/system/edac/mc/mc3/dimm0/dimm_location:channel 0 slot 0 
/sys/devices/system/edac/mc/mc3/dimm3/dimm_location:channel 1 slot 0 

Intel has a troubleshooting guide for these errors [here](https://www.intel.com/content/www/us/en/support/articles/000024007/server-products.html), and they provide a utility to decode the DIMM location (sysinfo for the 61X chipset), but the tool is no longer online; the links are broken. I think we can decode the location manually using our server board guide [here](https://www.intel.com/content/dam/support/us/en/documents/server-products/SEL_TroubleshootingGuide.pdf), table 75. But I first need to see the logged event in the SEL. I will post it the next time I see it.

Anyway, the guide recommends first updating the BIOS and then reseating the RAM modules. I guess we could do this for all RAM modules before anything else.

rarias commented 2023-04-21 13:23:07 +02:00 (Migrated from pm.bsc.es)

Those SEL events can be seen in the BIOS, which I believe decodes them properly.

Can you dump your dmidecode -t memory?

A pragmatic approach is to remove one and check that the label is no longer there:

ssh xeon08 'grep . /sys/devices/system/edac/mc/mc*/dimm*/dimm_label'
/sys/devices/system/edac/mc/mc0/dimm0/dimm_label:CPU_SrcID#1_Ha#0_Chan#0_DIMM#0
/sys/devices/system/edac/mc/mc0/dimm3/dimm_label:CPU_SrcID#1_Ha#0_Chan#1_DIMM#0
/sys/devices/system/edac/mc/mc1/dimm0/dimm_label:CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 <-- bad
/sys/devices/system/edac/mc/mc1/dimm3/dimm_label:CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
/sys/devices/system/edac/mc/mc2/dimm0/dimm_label:CPU_SrcID#1_Ha#1_Chan#0_DIMM#0
/sys/devices/system/edac/mc/mc2/dimm3/dimm_label:CPU_SrcID#1_Ha#1_Chan#1_DIMM#0
/sys/devices/system/edac/mc/mc3/dimm0/dimm_label:CPU_SrcID#0_Ha#1_Chan#0_DIMM#0
/sys/devices/system/edac/mc/mc3/dimm3/dimm_label:CPU_SrcID#0_Ha#1_Chan#1_DIMM#0

> 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0

rarias commented 2023-04-21 13:29:33 +02:00 (Migrated from pm.bsc.es)

From your Slack log, you have two, at 303 and 305:

303 | 04/11/2023 | 15:04:33 | Memory #0x02 | Correctable ECC | Asserted
304 | 04/12/2023 | 15:06:16 | System Event #0x83 | OEM System boot event | Asserted
305 | 04/15/2023 | 07:53:53 | Memory #0x02 | Correctable ECC | Asserted
rarias commented 2023-04-21 13:41:37 +02:00 (Migrated from pm.bsc.es)
https://www.intel.com/content/www/us/en/download/19034/system-event-log-sel-viewer-utility-for-intel-server-boards-and-intel-server-systems.html
rarias commented 2023-04-21 13:48:13 +02:00 (Migrated from pm.bsc.es)

Maybe you will get lucky with ipmiutil; see [this post](https://www.xkyle.com/an-ipmi-sel-viewing-shootout/).

rarias commented 2023-04-21 14:05:00 +02:00 (Migrated from pm.bsc.es)
[nix-shell:~]$ sudo ipmiutil sel -l5 -N xeon08-ipmi0
ipmiutil sel version 3.16
-- BMC version 1.43, IPMI version 2.0
SEL Ver 37 Support 07, Size = 3639 records (Used=781, Free=2858)
RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
030d 04/21/23 11:18:42 INF EFI  System Event #83  OEM System Booted 6f [01 ff ff]
030c 04/20/23 19:13:21 MIN Bios Memory #02  Correctable ECC, DIMM(0) 6f [a0 00 00]
030b 04/20/23 19:05:38 INF EFI  System Event #83  OEM System Booted 6f [01 ff ff]
030a 04/18/23 15:58:10 INF EFI  System Event #83  OEM System Booted 6f [01 ff ff]
0309 04/18/23 15:44:31 INF EFI  System Event #83  OEM System Booted 6f [01 ff ff]
ipmiutil sel, completed successfully

[nix-shell:~]$ sudo ipmiutil sel -l5 -r -N xeon08-ipmi0
ipmiutil sel version 3.16
-- BMC version 1.43, IPMI version 2.0
SEL Ver 37 Support 07, Size = 3639 records (Used=781, Free=2858)
RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
0d 03 02 f2 54 42 64 01 00 04 12 83 6f 01 ff ff
0c 03 02 b1 72 41 64 33 00 04 0c 02 6f a0 00 00
0b 03 02 e2 70 41 64 01 00 04 12 83 6f 01 ff ff
0a 03 02 f2 a1 3e 64 01 00 04 12 83 6f 01 ff ff
09 03 02 bf 9e 3e 64 01 00 04 12 83 6f 01 ff ff
ipmiutil sel, completed successfully

rarias commented 2023-04-21 14:08:36 +02:00 (Migrated from pm.bsc.es)

Last two bytes are 0, so DIMM 1.A?
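For reference, the raw records above can be split into their IPMI SEL fields by byte position; a minimal sketch (field offsets per the IPMI 2.0 spec; the meaning of the event-data bytes is board-specific):

```shell
# Split one raw 16-byte SEL record (as printed by "ipmiutil sel -r")
# into record ID (little endian), record type, sensor type/number and
# the three event-data bytes whose last two are being reasoned about.
parse_sel() {
  set -- $1  # word-split the 16 hex bytes into $1..$16
  printf 'id=0x%s%s type=0x%s sensor_type=0x%s sensor=0x%s evt_data=%s %s %s\n' \
    "$2" "$1" "$3" "${11}" "${12}" "${14}" "${15}" "${16}"
}
parse_sel '0c 03 02 b1 72 41 64 33 00 04 0c 02 6f a0 00 00'
```

The ECC record indeed carries sensor type 0x0c (Memory) and event data `a0 00 00`; whether data byte 3 encodes the DIMM number depends on the board's SEL conventions.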

rarias commented 2023-04-24 09:57:52 +02:00 (Migrated from pm.bsc.es)
[xeon08-dmidecode-2023-04-24.txt](/uploads/9c413574ec8a77503c3741352c5951f9/xeon08-dmidecode-2023-04-24.txt)
rarias commented 2023-04-24 10:28:54 +02:00 (Migrated from pm.bsc.es)

The DIMM modules have an LED, which should be lit on the bad module:

![dimm](/uploads/199833e1c8a0a1cf3eb5a17f39894fa1/dimm.png)
arocanon commented 2023-04-24 10:49:28 +02:00 (Migrated from pm.bsc.es)

That's convenient :) Since I rebooted the machine last week, no more errors have occurred, so maybe these LEDs are turned off now. I will let you know once I detect a new error.

rarias commented 2023-04-24 11:35:28 +02:00 (Migrated from pm.bsc.es)

Okay, then I think we should wait until the error LED is on before replacing, so we are sure we are removing the bad module.

arocanon commented 2023-05-04 09:47:22 +02:00 (Migrated from pm.bsc.es)

We have some more errors!

+arocanon@xeon08:~>  grep . /sys/devices/system/edac/mc/*/ce_count
/sys/devices/system/edac/mc/mc0/ce_count:0
/sys/devices/system/edac/mc/mc1/ce_count:4
/sys/devices/system/edac/mc/mc2/ce_count:0
/sys/devices/system/edac/mc/mc3/ce_count:0

and dmesg

[   34.395215] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 3635.203403] mce: [Hardware Error]: Machine check events logged
[ 3635.204027] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 3635.204034] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 3635.204039] {1}[Hardware Error]: event severity: corrected
[ 3635.204044] {1}[Hardware Error]:  Error 0, type: corrected
[ 3635.204049] {1}[Hardware Error]:  fru_text: DIMM ??
[ 3635.204054] {1}[Hardware Error]:   section_type: memory error
[ 3635.204058] {1}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[ 3635.204065] {1}[Hardware Error]:   node:0 
[ 6287.406916] RAS: Soft-offlining pfn: 0x1a87b2
[ 6287.407118] mce: [Hardware Error]: Machine check events logged
[ 6287.407128] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[ 6287.407136] EDAC sbridge MC1: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
[ 6287.407145] EDAC sbridge MC1: TSC 126573873627ce 
[ 6287.407150] EDAC sbridge MC1: ADDR 1a87b2dc0 
[ 6287.407155] EDAC sbridge MC1: MISC 1426e0086 
[ 6287.407159] EDAC sbridge MC1: PROCESSOR 0:406f1 TIME 1682613781 SOCKET 0 APIC 0
[ 6287.407196] EDAC MC1: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1a87b2 offset:0xdc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0 row:0x1505 col:0x2b0 bank_addr:3 bank_group:3)
[ 6287.413450] soft offline: 0x1a87b2: page migration failed 1, type 0x2ffff800002004(uptodate|private|node=0|zone=2|lastcpupid=0x1ffff)
[ 6479.353473] mce: [Hardware Error]: Machine check events logged
[ 6499.868899] RAS: Soft-offlining pfn: 0x1a87b2
[ 6499.868917] mce: [Hardware Error]: Machine check events logged
[ 6499.868922] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[ 6499.868927] EDAC sbridge MC1: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
[ 6499.868934] EDAC sbridge MC1: TSC 1265f3d8d11868 
[ 6499.868937] EDAC sbridge MC1: ADDR 1a87b2dc0 
[ 6499.868942] EDAC sbridge MC1: MISC 1406e1a86 
[ 6499.868946] EDAC sbridge MC1: PROCESSOR 0:406f1 TIME 1682613994 SOCKET 0 APIC 0
[ 6499.868978] EDAC MC1: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1a87b2 offset:0xdc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0 row:0x1505 col:0x2b0 bank_addr:3 bank_group:3)
[16099.595476] mce: [Hardware Error]: Machine check events logged
[16099.595755] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[16099.595759] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[16099.595761] {2}[Hardware Error]: event severity: corrected
[16099.595763] {2}[Hardware Error]:  Error 0, type: corrected
[16099.595766] {2}[Hardware Error]:  fru_text: DIMM ??
[16099.595768] {2}[Hardware Error]:   section_type: memory error
[16099.595770] {2}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[16099.595774] {2}[Hardware Error]:   node:0 
[24758.719599] RAS: Soft-offlining pfn: 0x1a87b2
[24758.719612] mce: [Hardware Error]: Machine check events logged
[24758.719618] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[24758.719622] EDAC sbridge MC1: CPU 0: Machine Check Event: 0 Bank 9: 8c00004e000800c0
[24758.719629] EDAC sbridge MC1: TSC 1291077b3f9352 
[24758.719633] EDAC sbridge MC1: ADDR 1a87b2000 
[24758.719637] EDAC sbridge MC1: MISC 900000040005c8c 
[24758.719641] EDAC sbridge MC1: PROCESSOR 0:406f1 TIME 1682632253 SOCKET 0 APIC 0
[24758.719660] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 page:0x1a87b2 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:255  BZ\xffffffb3\xffffffff\xffffffff\xffffffff\xffffffff)
[24758.719686] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[24758.719692] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[24758.719697] {3}[Hardware Error]: event severity: corrected
[24758.719702] {3}[Hardware Error]:  Error 0, type: corrected
[24758.719706] {3}[Hardware Error]:  fru_text: DIMM ??
[24758.719711] {3}[Hardware Error]:   section_type: memory error
[24758.719715] {3}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[24758.719721] {3}[Hardware Error]:   node:0 
[24758.719735] soft_offline_page: 0x1a87b2 page already poisoned
[33418.945711] mce: [Hardware Error]: Machine check events logged
[33418.945988] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[33418.945992] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[33418.945994] {4}[Hardware Error]: event severity: corrected
[33418.945997] {4}[Hardware Error]:  Error 0, type: corrected
[33418.945999] {4}[Hardware Error]:  fru_text: DIMM ??
[33418.946001] {4}[Hardware Error]:   section_type: memory error
[33418.946004] {4}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[33418.946007] {4}[Hardware Error]:   node:0 
[42078.571826] RAS: Soft-offlining pfn: 0x1a87b2
[42078.571832] mce: [Hardware Error]: Machine check events logged
[42078.571835] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[42078.571836] EDAC sbridge MC1: CPU 0: Machine Check Event: 0 Bank 9: 8c00004e000800c0
[42078.571839] EDAC sbridge MC1: TSC 12b9e3ff83a6c1 
[42078.571841] EDAC sbridge MC1: ADDR 1a87b2000 
[42078.571843] EDAC sbridge MC1: MISC 900000040005c8c 
[42078.571845] EDAC sbridge MC1: PROCESSOR 0:406f1 TIME 1682649572 SOCKET 0 APIC 0
[42078.571855] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 page:0x1a87b2 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:255 )
[42078.572094] soft_offline_page: 0x1a87b2 page already poisoned
[42078.572105] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[42078.572108] {5}[Hardware Error]: It has been corrected by h/w and requires no further action
[42078.572110] {5}[Hardware Error]: event severity: corrected
[42078.572113] {5}[Hardware Error]:  Error 0, type: corrected
[42078.572115] {5}[Hardware Error]:  fru_text: DIMM ??
[42078.572117] {5}[Hardware Error]:   section_type: memory error
[42078.572119] {5}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[42078.572122] {5}[Hardware Error]:   node:0 
[91737.360963] perf: interrupt took too long (2525 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[91737.390844] perf: interrupt took too long (3167 > 3156), lowering kernel.perf_event_max_sample_rate to 63000
[91737.401513] perf: interrupt took too long (3960 > 3958), lowering kernel.perf_event_max_sample_rate to 50500
[91737.624695] perf: interrupt took too long (4967 > 4950), lowering kernel.perf_event_max_sample_rate to 40250
[91738.047826] perf: interrupt took too long (6214 > 6208), lowering kernel.perf_event_max_sample_rate to 32000
[394311.162143] mce: [Hardware Error]: Machine check events logged
[394311.162175] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[394311.162181] {6}[Hardware Error]: It has been corrected by h/w and requires no further action
[394311.162186] {6}[Hardware Error]: event severity: corrected
[394311.162191] {6}[Hardware Error]:  Error 0, type: corrected
[394311.162196] {6}[Hardware Error]:  fru_text: DIMM ??
[394311.162201] {6}[Hardware Error]:   section_type: memory error
[394311.162206] {6}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[394311.162212] {6}[Hardware Error]:   node:0 
[577801.806820] process '/usr/sbin/grub2-probe' started with executable stack
[577801.858552] device-mapper: uevent: version 1.0.3
[577801.858881] device-mapper: ioctl: 4.47.0-ioctl (2022-07-28) initialised: dm-devel@redhat.com
[577803.829449] fuse: init (API version 7.38)

But nothing shows up on ipmitool.

arocanon commented 2023-05-04 09:54:04 +02:00 (Migrated from pm.bsc.es)

However, I need to restart, so we will have to wait a little bit more :)

arocanon commented 2023-09-12 10:52:03 +02:00 (Migrated from pm.bsc.es)

New record!

(ins)eudy$ grep . /sys/devices/system/edac/mc/*/ce_count
/sys/devices/system/edac/mc/mc0/ce_count:0
/sys/devices/system/edac/mc/mc1/ce_count:100
/sys/devices/system/edac/mc/mc2/ce_count:0
/sys/devices/system/edac/mc/mc3/ce_count:0
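For tracking counts like the ones above over time, the same EDAC sysfs files can be read programmatically. This is a minimal sketch (not part of the original workflow); the base path is a parameter so it can be pointed at a test tree:

```python
from pathlib import Path

def read_ce_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: corrected_error_count} from EDAC sysfs.

    Equivalent to: grep . /sys/devices/system/edac/mc/*/ce_count
    """
    counts = {}
    for f in sorted(Path(base).glob("mc*/ce_count")):
        counts[f.parent.name] = int(f.read_text().strip())
    return counts

if __name__ == "__main__":
    for mc, count in read_ce_counts().items():
        print(f"{mc}: {count}")
```

Running it periodically (e.g. from cron) and diffing the results would show when a controller starts accumulating corrected errors.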
rarias commented 2023-09-12 13:03:27 +02:00 (Migrated from pm.bsc.es)
Intel took down the download page of selviewer: https://www.intel.com/content/www/us/en/404.html?ref=https://www.intel.com/content/www/us/en/download/19034/system-event-log-sel-viewer-utility-for-intel-server-boards-and-intel-server-systems.html

~~Thankfully I got a copy of the software: [selviewer_v14_1_build32_allos.zip](/uploads/f013fcf6ba6245218c06f794c24a1507/selviewer_v14_1_build32_allos.zip)~~

Correction, my copy was broken. It is available here: https://drivers.softpedia.com/get/MOTHERBOARD/Intel/Intel-S2600WT2-Server-Board-SEL-Viewer-Utility-14-1-32.shtml
rarias commented 2023-09-12 14:09:26 +02:00 (Migrated from pm.bsc.es)

I think the Linux reporting is unable to determine precisely which module is bad:

1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0

fru_text: DIMM ??

So we should try the details reported by the BIOS, which should be accurate. With the ipmiutil tool I can dump the raw ECC event:

[nix-shell:~/jungle]$ ipmiutil sel -l5 -N xeon08-ipmi0
ipmiutil sel version 3.16
-- BMC version 1.43, IPMI version 2.0
SEL Ver 37 Support 07, Size = 3639 records (Used=860, Free=2779)
RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
035c 08/23/23 17:15:24 MIN Bios Memory #02  Correctable ECC, DIMM(0) 6f [a0 00 00]
035b 08/08/23 14:47:23 INF EFI  System Event #83  OEM System Booted 6f [01 ff ff]
035a 08/08/23 14:46:20 INF BMC  Drive Slot #f3  Drive present 6f [00 ff ff]
0359 08/08/23 14:46:20 INF BMC  Drive Slot #f2  Drive present 6f [00 ff ff]
0358 08/08/23 14:46:20 INF BMC  Drive Slot #f1  Drive present 6f [00 ff ff]
ipmiutil sel, completed successfully

[nix-shell:~/jungle]$ ipmiutil sel -r -l5 -N xeon08-ipmi0
ipmiutil sel version 3.16
-- BMC version 1.43, IPMI version 2.0
SEL Ver 37 Support 07, Size = 3639 records (Used=860, Free=2779)
RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
5c 03 02 8c 22 e6 64 33 00 04 0c 02 6f a0 00 00
5b 03 02 5b 39 d2 64 01 00 04 12 83 6f 01 ff ff
5a 03 02 1c 39 d2 64 20 00 04 0d f3 6f 00 ff ff
59 03 02 1c 39 d2 64 20 00 04 0d f2 6f 00 ff ff
58 03 02 1c 39 d2 64 20 00 04 0d f1 6f 00 ff ff
ipmiutil sel, completed successfully

This could be decoded with the selviewer program from Intel, but I was unable to do so.

However, table 74 of https://pm.bsc.es/gitlab/rarias/jungle/-/blob/master/doc/SEL_TroubleshootingGuide.pdf explains the details of the ECC code, so we can decode it manually:

sel

Here is the raw code:

Position   1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
Value      5c 03 02 8c 22 e6 64 33 00 04 0c 02 6f a0 00 00

First, we can see that the numbering starts at 1, as the memory code (0x0c) matches at position 11 and sensor number (0x02) at 12.

The important part is the location of the DIMM: bytes 15 and 16, which are all 0 (suspicious). Assuming the data is good, we can decode it as:

Rank number 0b00 =  0
Socket ID 0b00 = CPU 1 (left)
Channel 0b00 = "Channel A, B, C, D for CPU1"
DIMM 0b00 = "DIMM 1 on Channel"

So it is either A1, B1, C1, or D1.

Now, I'm not sure if we can use the rank to further identify the slot in that group.
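The manual decode above can be sketched as code. Note that the 2-bit field packing into byte 16 below is my assumption of how table 74 lays out socket, channel, DIMM, and rank, not a verified mapping; since the record's location bytes are all zero here, any layout yields 0 for every field, but a nonzero decode should be checked against the guide:

```python
# Raw SEL record from `ipmiutil sel -r` for the ECC event (16 bytes).
record = bytes.fromhex("5c03028c22e6643300040c026fa00000")

def decode_ecc_record(rec):
    """Pick apart the fields discussed above (1-based positions 11-16).

    ASSUMPTION: four 2-bit fields (socket, channel, DIMM, rank) packed
    high-to-low into byte 16. Verify against table 74 of the SEL
    Troubleshooting Guide before trusting a nonzero result.
    """
    data3 = rec[15]                    # position 16: DIMM location bits
    return {
        "sensor_type": rec[10],        # position 11: 0x0c = memory
        "sensor_num":  rec[11],        # position 12: sensor #02
        "socket":  (data3 >> 6) & 0b11,
        "channel": (data3 >> 4) & 0b11,
        "dimm":    (data3 >> 2) & 0b11,
        "rank":    (data3 >> 0) & 0b11,
    }

print(decode_ecc_record(record))
```

This confirms the sanity check in the text: the memory sensor type (0x0c) sits at position 11 and the sensor number (0x02) at 12, so the numbering is 1-based.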

The BIOS also has a SEL decoder, so we can try to read the error from there via the serial port to verify it.

Here is also a link to the reseat procedure recommended by Intel: https://www.intel.com/content/www/us/en/support/articles/000024007/server-products.html

They also recommend enabling the AMT memory test included in the BIOS for further investigation. Here are more details: https://www.intel.com/content/dam/support/us/en/documents/server-products/intel-active-system-console-and-intel-multi-server-manger-replacement.pdf

  • A. If performance is not impacted, no further action is required. The error was corrected by the ECC
    mechanism.
    Recommended action: Increase the threshold at which the SEL records correctable errors. The BIOS
    defaults to <10>; the recommendation is <500> when there is no performance impact.
    F2 > Advanced > Memory Configuration > Memory RAS and Performance
    Configuration > Correctable Error Threshold <500>

  • B. If system performance has degraded, test the memory for potential issues.
    Recommended action: Run the Advanced Memory Test. See Chapter 5 for details.

rarias commented 2023-09-12 14:11:13 +02:00 (Migrated from pm.bsc.es)

Digging into the ipmiutil source code, I can see that they implement the DIMM decoding logic, but it is only available if they can determine the version of the BIOS, which can only be accessed from the node itself, not via IPMI.

So, running the command on the node reveals the DIMM directly:

eudy$ sudo /nix/store/1k8xg1y5lrpsfq90as3kgflsdj5x9vsl-ipmiutil-3.1.6/bin/ipmiutil sel -l5
ipmiutil sel version 3.16
-- BMC version 1.43, IPMI version 2.0
SEL Ver 37 Support 07, Size = 3639 records (Used=860, Free=2779)
RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
035c 08/23/23 17:15:24 MIN Bios Memory #02  Correctable ECC, NODE 1/DIMM_A1 6f [a0 00 00]
035b 08/08/23 14:47:23 INF EFI  System Event #83  OEM System Booted 6f [01 ff ff]
035a 08/08/23 14:46:20 INF BMC  Drive Slot #f3  Drive present 6f [00 ff ff]
0359 08/08/23 14:46:20 INF BMC  Drive Slot #f2  Drive present 6f [00 ff ff]
0358 08/08/23 14:46:20 INF BMC  Drive Slot #f1  Drive present 6f [00 ff ff]
ipmiutil sel, completed successfully

The bad DIMM is A1.

arocanon commented 2023-09-12 15:12:13 +02:00 (Migrated from pm.bsc.es)

Awesome! Let me know when you are available to reseat the module!

rarias commented 2023-09-12 15:20:58 +02:00 (Migrated from pm.bsc.es)

I think I would not reseat it for now, as 1) that wouldn't confirm that the bad module is A1, and 2) it may introduce other errors from other modules. Instead, I will replace it with one of the good DIMM modules from owl1, which is disassembled. This way we can verify that no more ECC errors occur in eudy and be sure that the bad module is A1. If ECC errors continue to happen, the bad module was another one (unlikely).

I plan to replace it next Tuesday if that is okay with you.

I will keep track of the bad module and reseat it in owl1, so errors will appear there once I reassemble it. If they become uncorrectable I will either remove it completely or order another one.

arocanon commented 2023-09-12 15:24:30 +02:00 (Migrated from pm.bsc.es)

Perfect for me! I will come with you to lend a hand!

rarias commented 2023-09-13 16:20:59 +02:00 (Migrated from pm.bsc.es)

I removed these two DIMM modules from oss01, which can be used to replace the bad module:

A906A030-DDA8-452E-8C83-09ACC621F719

rarias commented 2023-09-19 17:30:16 +02:00 (Migrated from pm.bsc.es)

Replaced by the one on the bottom (they have the same numbers, but differ in the square matrix code). Here is the bad RAM from A1:

D6D11093-9DCE-43ED-A395-0DBA9B311B2A

D40AB944-1A65-48C9-9030-4196DBF78A4B

FE2D8528-1C46-42A5-8856-5767591F75C2

C520E133-8A5B-4939-B2FE-571B0D6FA8B0

rarias commented 2023-09-26 17:14:38 +02:00 (Migrated from pm.bsc.es)

No more errors for a week:

hut% ssh eudy uptime
 17:14:00  up 7 days  1:10,  0 users,  load average: 0,00, 0,00, 0,00

hut% ssh eudy grep . '/sys/devices/system/edac/mc/*/ce_count'
/sys/devices/system/edac/mc/mc0/ce_count:0
/sys/devices/system/edac/mc/mc1/ce_count:0
/sys/devices/system/edac/mc/mc2/ce_count:0
/sys/devices/system/edac/mc/mc3/ce_count:0
rarias commented 2023-09-26 17:22:20 +02:00 (Migrated from pm.bsc.es)

Closing for now. If they appear again, reopen the issue and we will take a closer look.

Reference: rarias/jungle#8