The owl nodes are currently located at the top of the rack, where heat from
the whole rack accumulates, so turning them off causes a large temperature
increase in that region. To maximize airflow we will leave them on at all
times. This also makes allocations immediate, at an extra cost of around
200 W.
In the future, if we include more nodes in SLURM, we can configure those
to turn off if needed.
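If that happens, a slurm.conf snippet along these lines could exclude the owl
nodes from power saving (a sketch only; the node range and the use of
services.slurm.extraConfig are assumptions, not the current config):

services.slurm.extraConfig = ''
  # Power off nodes after 30 minutes idle, but keep the owl nodes always
  # on so allocations there stay immediate.
  SuspendTime=1800
  SuspendExcNodes=owl[1-2]
'';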
Fixes: rarias/jungle#156
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Allows any user to send mail from the robot account, as long as they are
added to the mail-robot group.
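A minimal sketch of the idea in the Nix config (only the group name comes
from this change; the credentials path and the use of tmpfiles are
assumptions):

users.groups.mail-robot = { };

# Hypothetical: members of mail-robot can read the robot credentials and
# therefore authenticate as the robot account when sending mail.
systemd.tmpfiles.rules = [
  "z /etc/mail-robot/password 0440 root mail-robot -"
];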
Reviewed-by: Aleix Boné <abonerib@bsc.es>
By default, salloc opens a new shell on the *current* node instead of on
the allocated node. This often causes users to leave the extra shell running
after the allocation ends; repeating this several times creates chains of
nested shells.
By running the shell on the allocated node, the shell finishes as soon as
the allocation ends.
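Per the FAQ linked below, this behavior can be enabled with the
use_interactive_step launch parameter; a sketch of how it could look in the
Nix config (the option path is an assumption):

services.slurm.extraConfig = ''
  # Launch the interactive step on the first allocated node, so salloc
  # gives the user a shell there instead of on the submission node.
  LaunchParameters=use_interactive_step
'';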
Fixes: rarias/jungle#174
See: https://slurm.schedmd.com/faq.html#prompt
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Avoids adding an extra flake input only to fetch a single module and
package.
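One possible pattern, sketched hypothetically (URL, path and hash are
placeholders; not necessarily what this change does), is to fetch the source
directly where it is needed:

let
  src = builtins.fetchTarball {
    url = "https://example.org/repo.tar.gz";  # placeholder
    sha256 = "<hash>";                        # placeholder
  };
in {
  imports = [ (src + "/module.nix") ];
}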
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Tested-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
A failure to reach the control node can cause slurmd to fail, and the unit
then remains in the failed state until it is manually restarted. Instead,
try to restart the service every 30 seconds, forever:
owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
Restart=on-failure
RestartUSec=30s
owl1% pgrep slurmd
5903
owl1% sudo kill -SEGV 5903
owl1% pgrep slurmd
6137
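The same settings expressed in the NixOS module system (the exact placement
in our config is an assumption; the values match the output above):

systemd.services.slurmd.serviceConfig = {
  Restart = "on-failure";
  # With 30 s between attempts the default start rate limit (5 starts per
  # 10 s) is never reached, so the unit keeps retrying indefinitely.
  RestartSec = "30s";
};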
Fixes: rarias/jungle#177
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Allows direct contact via the VPN when accessing from fox, but uses the
Internet when accessing from the rest of the machines.
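A hypothetical sketch of the idea (the target host and address are
placeholders; only the behavior is taken from this change):

# Only fox resolves the host to its VPN address; the rest of the machines
# keep resolving it publicly, so traffic goes over the Internet.
networking.hosts = lib.mkIf (config.networking.hostName == "fox") {
  "10.0.40.1" = [ "example-host" ];
};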
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
The StartLimitBurst and StartLimitIntervalSec options belong in the [Unit]
section; when placed in [Service] they are ignored:
> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.
When using [Unit], the limits are properly set:
apex% systemctl show power-policy.service | grep StartLimit
StartLimitIntervalUSec=10min
StartLimitBurst=10
StartLimitAction=none
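In the NixOS module system this corresponds to unitConfig rather than
serviceConfig (a sketch, assuming the unit is declared this way):

systemd.services.power-policy = {
  # unitConfig lands in [Unit]; putting these keys in serviceConfig
  # ([Service]) gets them ignored, as the warning above shows.
  unitConfig = {
    StartLimitIntervalSec = "10min";
    StartLimitBurst = 10;
  };
};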
Reviewed-by: Aleix Boné <abonerib@bsc.es>
On all machines, turn the machine back on as soon as power is restored. We
cannot rely on restoring the previous state, as we will shut the machines
down before the power is cut to prevent damage to the power supply
monitoring circuit.
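One way to enforce this is through the BMC power restore policy; a hedged
sketch with ipmitool (whether it is actually applied from the power-policy
unit or by other means is an assumption):

systemd.services.power-policy = {
  description = "Set the BMC power restore policy to always-on";
  wantedBy = [ "multi-user.target" ];
  serviceConfig.Type = "oneshot";
  path = [ pkgs.ipmitool ];
  # Power the machine back on as soon as AC power returns, regardless of
  # the state it was in when the power was lost.
  script = "ipmitool chassis policy always-on";
};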
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>