To reduce the traffic on the secondary Ethernet device we need to use
the physical device directly instead of the virtual one. For now, use
host mode and check later whether we can revert it.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
It is already included in the base list of packages, which is now only
"perf" and doesn't depend on the kernel version.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
The option 'systemd.watchdog.runtimeTime' has been renamed to
'systemd.settings.Manager.RuntimeWatchdogSec'.
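As a rough sketch of the migration, assuming a runtime watchdog of 30
seconds (the value is a placeholder, not taken from this repository):

```nix
{
  # Old option name, now renamed:
  # systemd.watchdog.runtimeTime = "30s";

  # New option name. The "30s" value is a hypothetical example.
  systemd.settings.Manager.RuntimeWatchdogSec = "30s";
}
```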
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
We are seeing a lot of failed attempts from the same IPs:
    apex% sudo journalctl -u sshd -b0 | grep 'Failed password' | wc -l
    2441
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Currently the owl nodes are located at the top of the rack, and turning
them off causes a large temperature increase in that region, which
accumulates heat from the whole rack. To maximize airflow we will leave
them on at all times. This also makes allocations immediate, at the
extra cost of around 200 W.
In the future, if we include more nodes in SLURM we can configure those
to turn off if needed.
Fixes: rarias/jungle#156
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Allows any user to send mail from the robot account, as long as the
user is added to the mail-robot group.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
By default, salloc will open a new shell in the *current* node instead
of in the allocated node. This often causes users to leave the extra
shell running once the allocation ends. Repeating this process several
times causes chains of shells.
By running the shell in the remote node, once the allocation ends the
shell finishes as well.
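One way to get this behaviour, following the Slurm FAQ entry linked
below, is the use_interactive_step launch parameter. A minimal sketch,
assuming the setting is passed through the NixOS
services.slurm.extraConfig option (the actual change may set it
elsewhere):

```nix
{
  # Assumption: slurm.conf is generated by the NixOS Slurm module.
  services.slurm.extraConfig = ''
    # Make salloc start the interactive shell on the allocated node.
    LaunchParameters=use_interactive_step
  '';
}
```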
Fixes: rarias/jungle#174
See: https://slurm.schedmd.com/faq.html#prompt
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Avoids adding an extra flake input only to fetch a single module and
package.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Tested-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
A failure to reach the control node can cause slurmd to fail, and the
unit then remains in the failed state until it is manually restarted.
Instead, try to restart the service every 30 seconds, forever:
    owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
    Restart=on-failure
    RestartUSec=30s
    owl1% pgrep slurmd
    5903
    owl1% sudo kill -SEGV 5903
    owl1% pgrep slurmd
    6137
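The unit settings shown above could be expressed in NixOS roughly as
follows. This is a sketch that assumes the override is applied directly
to the slurmd unit; the actual module may structure it differently:

```nix
{
  systemd.services.slurmd.serviceConfig = {
    # Restart the daemon when it exits uncleanly or is killed by a signal.
    Restart = "on-failure";
    # Wait 30 seconds between restart attempts.
    RestartSec = 30;
  };
}
```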
Fixes: rarias/jungle#177
Reviewed-by: Aleix Boné <abonerib@bsc.es>