A failure to reach the control node can cause slurmd to fail and the
unit remains in the failed state until is manually restarted. Instead,
try to restart the service every 30 seconds, forever:
owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
Restart=on-failure
RestartUSec=30s
owl1% pgrep slurmd
5903
owl1% sudo kill -SEGV 5903
owl1% pgrep slurmd
6137
Fixes: rarias/jungle#177
Reviewed-by: Aleix Boné <abonerib@bsc.es>
We change the search procedure so it detects NixOS from /etc/os-release
and uses "libnuma.so" when calling dlopen, instead of harcoding a full
path to /usr. The full patch of libnuma is stored in the runpath, so
dlopen can find it.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Tested-by: Vincent Arcila <vincent.arcila@bsc.es>
It tries to dlopen libcrypt.so.1 and libstdc++.so.6, so we make sure
they are available by adding them to the runpath.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
The hrtimer_init() is now done via hrtimer_setup() with the callback
function as argument.
See: https://lwn.net/Articles/996598/
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Currently, we can use bscpkgs similarly to nixpkgs either through
the flake outputs or with import bscpkgs:
```nix
# currently supported:
bscpkgs.legacyPackages.x86_64-linux.hello
let pkgs = import bscpkgs { system = "x86_64-linux"; }; in pkgs.hello
```
The missing piece is nixpkgs.lib (not pkgs.lib, the system agnostic
one). The workaround is to do bscpkgs.inputs.nixpkgs.lib instead. We can
simplify this by forwarding the lib to our outputs.
This enables us to use bscpkgs as a drop-in
replacing the inputs to our flake from nixpkgs to bscpkgs.
(inputs.nixpkgs.url = "<*BSC*pkgs url>").
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
Tested-by: Aleix Boné <abonerib@bsc.es>
The rdma-core driver.h include is no longer installed:
56dd87acd2
So ibv_read_sysfs_file() is not defined. As the symbols is still
distributed, we simply add the missing prototype manually.
Similarly, the gaspi_get_system_mem() function is not available from the
gaspi public headers, so we define it in the max_mem.c test.
Fixes: rarias/bscpkgs#7
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Tested-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
Allows direct contact via the VPN when accessing from fox, but use
Internet when using the rest of the machines.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
To support the new instrumentation for HWC it would be useful to already
build nOS-V with PAPI support enabled. The enablePapi switch allows it
to be disabled with `nosv.override { enablePapi = false; }`.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Tested-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
The StartLimitBurst and StartLimitIntervalSec options belong to the
[Unit] section, otherwise they are ignored in [Service]:
> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.
When using [Unit], the limits are properly set:
apex% systemctl show power-policy.service | grep StartLimit
StartLimitIntervalUSec=10min
StartLimitBurst=10
StartLimitAction=none
Reviewed-by: Aleix Boné <abonerib@bsc.es>
In all machines, as soon as we recover the power, turn the machine back
on. We cannot rely on the previous state as we will shut them down
before the power is cut to prevent damage on the power supply
monitoring circuit.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>