Add watchdog service #31

Closed
opened 2023-08-29 16:52:46 +02:00 by rarias · 2 comments
rarias commented 2023-08-29 16:52:46 +02:00 (Migrated from pm.bsc.es)

All nodes must have a watchdog, as a kernel panic can crash a node and leave it unresponsive.
rarias commented 2023-08-30 13:25:30 +02:00 (Migrated from pm.bsc.es)

Apparently, the watchdog is disabled by the BIOS:

[   11.461342] iTCO_wdt iTCO_wdt.1.auto: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS

Loading the ipmi_watchdog module successfully enables a watchdog, but it relies on the BMC, and the BMC is not working properly on lake2, the very machine that is having kernel panics (poetic):

[   66.933605] IPMI Watchdog: response: Error ff on cmd 22
[  121.981073] IPMI Watchdog: response: Error ff on cmd 24
[  177.007542] IPMI Watchdog: response: Error ff on cmd 22
[  192.263945] usb 3-9: reset full-speed USB device number 2 using xhci_hcd
[  207.701825] usb 3-9: device descriptor read/64, error -110
[  222.239953] usb 3-9: device descriptor read/64, error -110
[  222.461953] usb 3-9: reset full-speed USB device number 2 using xhci_hcd
[  234.583037] IPMI Watchdog: response: Error ff on cmd 24
[  237.688993] FS-Cache: Loaded
[  237.974586] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=1 'systemd'
[  241.687465] igb 0000:03:00.0 eno1: igb: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[  341.417610] systemd-journald[1056]: File /var/log/journal/31283075e95246c3b6b6b9d02832cb65/user-1880.journal corrupted or uncleanly shut down, renaming and replacing.
[  570.218220] usb 3-9: reset full-speed USB device number 2 using xhci_hcd
[  586.073148] usb 3-9: device descriptor read/64, error -110
[  600.099266] usb 3-9: device descriptor read/64, error -110
[  600.321273] usb 3-9: reset full-speed USB device number 2 using xhci_hcd
[  948.222586] usb 3-9: reset full-speed USB device number 2 using xhci_hcd
[  963.932527] usb 3-9: device descriptor read/64, error -110
[  977.958650] usb 3-9: device descriptor read/64, error -110
[  978.180655] usb 3-9: reset full-speed USB device number 2 using xhci_hcd
[ 1326.231422] usb 3-9: reset full-speed USB device number 2 using xhci_hcd
[ 1341.792319] usb 3-9: device descriptor read/64, error -110
[ 1356.330462] usb 3-9: device descriptor read/64, error -110
[ 1356.552462] usb 3-9: reset full-speed USB device number 2 using xhci_hcd
[ 1704.236035] usb 3-9: reset full-speed USB device number 2 using xhci_hcd
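
To check other nodes for the same two symptoms, the kernel log can be scanned for the messages quoted above. This is only a minimal sketch: the function name `summarize_watchdog_log` is hypothetical, and the patterns are taken from the lines in this comment, so other kernel versions may word them differently.

```python
import re

# Patterns copied from the kernel messages quoted in this issue.
# Assumption: other kernels/boards may phrase these messages differently.
ITCO_DISABLED = re.compile(r"iTCO_wdt .*device disabled by hardware/BIOS")
IPMI_ERROR = re.compile(r"IPMI Watchdog: response: Error \w+ on cmd \w+")


def summarize_watchdog_log(lines):
    """Summarize watchdog-related problems found in kernel log lines.

    Returns a dict with:
      itco_disabled: True if iTCO_wdt reported being disabled by hardware/BIOS
      ipmi_errors:   number of IPMI watchdog error responses seen
    """
    summary = {"itco_disabled": False, "ipmi_errors": 0}
    for line in lines:
        if ITCO_DISABLED.search(line):
            summary["itco_disabled"] = True
        elif IPMI_ERROR.search(line):
            summary["ipmi_errors"] += 1
    return summary
```

The function accepts any iterable of log lines, for example the output of `dmesg` split by line. A node that reports `itco_disabled` as true together with a non-zero `ipmi_errors` count would be in the same situation as lake2: neither watchdog path is usable.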
rarias commented 2023-09-12 12:33:43 +02:00 (Migrated from pm.bsc.es)

mentioned in merge request !20

Reference: rarias/jungle#31