Restart slurmd on failure #180

Manually merged
rarias merged 1 commit from restart-slurmd into master 2025-09-30 17:26:58 +02:00
Owner

A failure to reach the control node can cause slurmd to fail, and the
unit then remains in the failed state until it is manually restored.
Instead, try to restart the service every 30 seconds, forever:

    owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
    Restart=on-failure
    RestartUSec=30s
    owl1% pgrep slurmd
    5903
    owl1% sudo kill -SEGV 5903
    owl1% pgrep slurmd
    6137
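
For readers unfamiliar with how these values are set, here is a minimal sketch of a NixOS module fragment that would produce the output above. Only `Restart = "on-failure";` appears in the diff; the `systemd.services.slurmd.serviceConfig` path and the `RestartSec = 30;` line are assumptions inferred from the `RestartUSec=30s` value, not a copy of the actual change.

```nix
{
  # Assumed location: the standard NixOS option path for unit settings.
  # The real change may live inside an existing slurm module instead.
  systemd.services.slurmd.serviceConfig = {
    # Restart when slurmd exits with a non-zero status or is killed by a
    # signal such as SIGSEGV (the case exercised with `kill -SEGV` above).
    Restart = "on-failure";

    # Assumed: wait 30 seconds between attempts; systemd reports this
    # setting as RestartUSec=30s.
    RestartSec = 30;
  };
}
```

The "forever" part relies on systemd's start rate limit not being reached. With the default limit of 5 starts per 10 seconds, one attempt every 30 seconds should never trip it, assuming those defaults are not overridden elsewhere.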

Fixes: #177

CC @varcila

rarias added 1 commit 2025-09-29 19:26:50 +02:00
A failure to reach the control node can cause slurmd to fail, and the
unit then remains in the failed state until it is manually restored.
Instead, try to restart the service every 30 seconds, forever:

    owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
    Restart=on-failure
    RestartUSec=30s
    owl1% pgrep slurmd
    5903
    owl1% sudo kill -SEGV 5903
    owl1% pgrep slurmd
    6137

Fixes: #177
rarias requested review from abonerib 2025-09-29 19:26:57 +02:00
abonerib reviewed 2025-09-29 19:44:22 +02:00
@@ -15,0 +16,4 @@
# If slurmd fails to contact the control server it will fail, causing the
# node to remain out of service until manually restarted. Always try to
# restart it.
Restart = "on-failure";
Collaborator

Are there any situations where we want a clean exit to happen, or could we use `always`?
rarias marked this conversation as resolved
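
For context on the question above, a sketch of the alternative the reviewer mentions. With `on-failure`, systemd restarts the unit after a non-zero exit, a fatal signal, a timeout, or a watchdog expiry, but leaves it stopped after a clean exit. With `always`, a clean exit also triggers a restart. In both cases an explicit `systemctl stop` is honored. The option path and the 30-second delay are the same assumptions as in the description; this is not what the PR is known to have merged.

```nix
{
  # Hypothetical variant, shown only to illustrate the reviewer's question.
  systemd.services.slurmd.serviceConfig = {
    # Restart after any exit, including a clean one with status 0.
    Restart = "always";

    # Assumed: same 30 second delay between attempts.
    RestartSec = 30;
  };
}
```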
rarias force-pushed restart-slurmd from 8e3634f062 to fdb148a0da 2025-09-30 15:18:53 +02:00 Compare
abonerib approved these changes 2025-09-30 15:22:49 +02:00
rarias force-pushed restart-slurmd from fdb148a0da to 79940876c3 2025-09-30 17:20:51 +02:00 Compare
rarias manually merged commit 79940876c3 into master 2025-09-30 17:26:58 +02:00
Reference: rarias/jungle#180