BMC of lake2 has erratic behavior #43

Open
opened 2023-09-26 17:38:50 +02:00 by rarias · 1 comment
rarias commented 2023-09-26 17:38:50 +02:00 (Migrated from pm.bsc.es)

The fans in lake2 keep ramping their speed up and down, and the node runs quite warm. From time to time the BMC fails to reply to the IPMI-over-LAN commands that collect statistics, leaving holes in the Grafana plots:

image

This seems to happen about 2 minutes after the BMC starts replying:

```
hut% ping lake2-ipmi
PING lake2-ipmi (10.0.40.143) 56(84) bytes of data.
64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=1 ttl=64 time=0.451 ms
64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=2 ttl=64 time=0.410 ms
64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=3 ttl=64 time=6.41 ms
64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=4 ttl=64 time=0.351 ms
...
64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=114 ttl=64 time=995 ms
64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=115 ttl=64 time=2015 ms
64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=116 ttl=64 time=966 ms
64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=117 ttl=64 time=1985 ms
64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=118 ttl=64 time=937 ms
64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=119 ttl=64 time=1956 ms
64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=120 ttl=64 time=908 ms
From hut (10.0.40.7) icmp_seq=129 Destination Host Unreachable
ping: sendmsg: No route to host
From hut (10.0.40.7) icmp_seq=130 Destination Host Unreachable
From hut (10.0.40.7) icmp_seq=131 Destination Host Unreachable
From hut (10.0.40.7) icmp_seq=133 Destination Host Unreachable
From hut (10.0.40.7) icmp_seq=134 Destination Host Unreachable
From hut (10.0.40.7) icmp_seq=135 Destination Host Unreachable
```

I tried a cold BMC reset, but it had no effect.
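
To pin down when the latency starts to degrade, the ping output can be scanned for the first slow reply and for the probes reported as unreachable. This is a minimal Python sketch, assuming the iputils output format shown in this issue; the function name `summarize_ping` and the 100 ms threshold are hypothetical choices for illustration, not something that runs on hut.

```python
import re

# Matches successful replies, e.g. "... icmp_seq=114 ttl=64 time=995 ms"
REPLY = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")
# Matches failures, e.g. "From hut (10.0.40.7) icmp_seq=129 Destination Host Unreachable"
UNREACHABLE = re.compile(r"icmp_seq=(\d+) Destination Host Unreachable")


def summarize_ping(lines, threshold_ms=100.0):
    """Return ((seq, rtt_ms) of the first slow reply or None, list of unreachable seqs)."""
    first_slow = None
    unreachable = []
    for line in lines:
        reply = REPLY.search(line)
        if reply:
            seq, rtt = int(reply.group(1)), float(reply.group(2))
            # Remember only the first reply slower than the (assumed) threshold.
            if first_slow is None and rtt > threshold_ms:
                first_slow = (seq, rtt)
            continue
        lost = UNREACHABLE.search(line)
        if lost:
            unreachable.append(int(lost.group(1)))
    return first_slow, unreachable


# A few lines copied from the ping output in this issue.
sample = [
    "64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=4 ttl=64 time=0.351 ms",
    "64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=114 ttl=64 time=995 ms",
    "64 bytes from lake2-ipmi (10.0.40.143): icmp_seq=115 ttl=64 time=2015 ms",
    "From hut (10.0.40.7) icmp_seq=129 Destination Host Unreachable",
    "From hut (10.0.40.7) icmp_seq=130 Destination Host Unreachable",
]

first_slow, unreachable = summarize_ping(sample)
print(first_slow, unreachable)
```

With ping's default one-second interval, the sequence number of the first slow reply roughly gives the seconds elapsed since the ping started. The excerpt above elides the replies between icmp_seq=4 and icmp_seq=114, so the real onset may be earlier than what these sample lines show.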


It is also affecting the IPMI watchdog:

```
Apr 30 13:07:37 lake2 kernel: ipmi_si dmi-ipmi-si.0: KCS in invalid state 6
Apr 30 13:07:37 lake2 kernel: IPMI Watchdog: response: Error d5 on cmd 22
Apr 30 13:07:37 lake2 systemd[1]: Failed to ping hardware watchdog, ignoring: Invalid argument
Apr 30 13:07:37 lake2 kernel: ipmi_si dmi-ipmi-si.0: KCS in invalid state 6
Apr 30 13:07:37 lake2 kernel: IPMI Watchdog: response: Error d5 on cmd 22
Apr 30 13:07:37 lake2 systemd[1]: Failed to ping hardware watchdog, ignoring: Invalid argument
```
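
To see how often this happens over a longer period, the three message types in the log above can be counted. This is a rough Python sketch, assuming the lines have been exported as text (for example from the journal); the pattern names and the `summarize` helper are hypothetical. If the kernel message is read as hex, `d5` on `cmd 22` would correspond in the IPMI specification to the completion code "command not supported in present state" for Reset Watchdog Timer, but that reading is an assumption and has not been checked against this BMC.

```python
import re
from collections import Counter

# One pattern per message type seen in the log excerpt of this issue.
PATTERNS = {
    "kcs_invalid_state": re.compile(r"ipmi_si .*KCS in invalid state \d+"),
    "watchdog_error": re.compile(r"IPMI Watchdog: response: Error \w+ on cmd \w+"),
    "systemd_ping_failed": re.compile(r"Failed to ping hardware watchdog"),
}


def summarize(lines):
    """Count how many lines match each known watchdog-related message type."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# The six lines from the excerpt above.
sample = [
    "Apr 30 13:07:37 lake2 kernel: ipmi_si dmi-ipmi-si.0: KCS in invalid state 6",
    "Apr 30 13:07:37 lake2 kernel: IPMI Watchdog: response: Error d5 on cmd 22",
    "Apr 30 13:07:37 lake2 systemd[1]: Failed to ping hardware watchdog, ignoring: Invalid argument",
    "Apr 30 13:07:37 lake2 kernel: ipmi_si dmi-ipmi-si.0: KCS in invalid state 6",
    "Apr 30 13:07:37 lake2 kernel: IPMI Watchdog: response: Error d5 on cmd 22",
    "Apr 30 13:07:37 lake2 systemd[1]: Failed to ping hardware watchdog, ignoring: Invalid argument",
]

counts = summarize(sample)
for name in PATTERNS:
    print(name, counts[name])
```

Comparing these counts with the times of the gaps in the Grafana plots could show whether the KCS errors on the host side coincide with the LAN interface becoming unresponsive.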
Reference: rarias/jungle#43