A failure to reach the control node can cause slurmd to fail and the
unit remains in the failed state until is manually restarted. Instead,
try to restart the service every 30 seconds, forever:
owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
Restart=on-failure
RestartUSec=30s
owl1% pgrep slurmd
5903
owl1% sudo kill -SEGV 5903
owl1% pgrep slurmd
6137
Fixes: rarias/jungle#177
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Allows direct contact via the VPN when accessing from fox, but use
Internet when using the rest of the machines.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
The StartLimitBurst and StartLimitIntervalSec options belong to the
[Unit] section, otherwise they are ignored in [Service]:
> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.
When using [Unit], the limits are properly set:
apex% systemctl show power-policy.service | grep StartLimit
StartLimitIntervalUSec=10min
StartLimitBurst=10
StartLimitAction=none
Reviewed-by: Aleix Boné <abonerib@bsc.es>
In all machines, as soon as we recover the power, turn the machine back
on. We cannot rely on the previous state as we will shut them down
before the power is cut to prevent damage on the power supply
monitoring circuit.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
The UPC has different dates for the yearly power cut, and Fox can
recover properly from a power loss, so we don't need to have it turned
off before the power cut. Simply disabling the timer is enough.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as standalone program or
built with cudainfo.gpuCheck so it is executed inside the build sandbox
to see if it also works fine. It uses the autoAddDriverRunpath hook to
inject in the runpath the location of the library directory for CUDA
libraries.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
This allows running derivations which depend on cuda runtime without
breaking the sandbox. We only need to add `requiredSystemFeatures = [ "cuda" ];`
to the derivation.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
The NVIDIA GTX 960 from 2016 has the Maxwell architecture, and NixOS
suggests using the proprietary driver for older than Turing:
> It is suggested to use the open source kernel modules on Turing or
> later GPUs (RTX series, GTX 16xx), and the closed source modules
> otherwise.
Reviewed-by: Aleix Boné <abonerib@bsc.es>