Restart slurmd on failure #180
A failure to reach the control node can cause slurmd to fail, and the
unit then remains in the failed state until it is manually restored.
Instead, try to restart the service every 30 seconds, forever:

    owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
    Restart=on-failure
    RestartUSec=30s
    owl1% pgrep slurmd
    5903
    owl1% sudo kill -SEGV 5903
    owl1% pgrep slurmd
    6137

Fixes: #177
CC @varcila

    @@ -15,0 +16,4 @@
    # If slurmd fails to contact the control server it will fail, causing the
    # node to remain out of service until manually restarted. Always try to
    # restart it.
    Restart = "on-failure";

Are there any situations where we want a clean exit to happen, or could we
use "always"?

8e3634f062 to fdb148a0da
fdb148a0da to 79940876c3
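
For reference, a minimal sketch of how the restart policy discussed above might look as a standalone NixOS module. The diff hunk only shows the `Restart` line, so the attribute path `systemd.services.slurmd.serviceConfig` and the `RestartSec = 30` setting are assumptions, inferred from the `RestartUSec=30s` value reported by `systemctl show`. They are not copied from the actual change.

```nix
{ ... }:
{
  # Assumed attribute path; the real change may set this elsewhere.
  systemd.services.slurmd.serviceConfig = {
    # Restart after an unclean exit: non-zero exit status, termination
    # by a signal such as SIGSEGV, or a timeout. A clean exit leaves the
    # unit stopped. "always" would restart after a clean exit as well.
    Restart = "on-failure";

    # Wait 30 seconds between restart attempts (assumed origin of the
    # RestartUSec=30s shown by systemctl).
    RestartSec = 30;
  };
}
```

With a 30 second delay between attempts, systemd's default start rate limit (5 starts within 10 seconds) should never be reached, so the retries can continue indefinitely. Under either policy an explicit `systemctl stop slurmd` is not followed by a restart, so the choice between `on-failure` and `always` only matters when slurmd exits cleanly on its own.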