458 Commits

Author SHA1 Message Date
ef914953d4 Restart slurmd on failure
A failure to reach the control node can cause slurmd to fail, and the
unit remains in the failed state until it is manually restarted. Instead,
try to restart the service every 30 seconds, forever:

    owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
    Restart=on-failure
    RestartUSec=30s
    owl1% pgrep slurmd
    5903
    owl1% sudo kill -SEGV 5903
    owl1% pgrep slurmd
    6137

Fixes: #177
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-29 19:17:33 +02:00
98abb3edf2 Lower connect timeout when using hut substituter
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-09-29 09:41:34 +02:00
0cbcdcbe38 Use hut substituter in all nodes
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-09-25 17:10:10 +02:00
fce7cb795c Remove machine access for user csiringo
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-09-29 17:30:02 +02:00
bf69d242d0 Mount apex /home via NFS in raccoon
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 13:48:50 +02:00
e4c0f95906 Remove extra SSH jump configuration
We now have direct visibility among nodes so we don't need any extra
SSH configuration to reach them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-25 15:15:43 +02:00
57f6f7bb10 Add raccoon peer to wireguard
It routes traffic from fox, apex and the compute nodes so that we can
reach the git servers and tent.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-25 15:01:33 +02:00
9c39ce006a Add raccoon host key
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 13:26:56 +02:00
405a7a7415 Restrict fox peer to a single IP
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 13:20:54 +02:00
04b094a627 Use lowercase peer hostnames
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 13:18:12 +02:00
f2c38f9316 Share a public folder for documents
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-17 13:08:48 +02:00
3d344a5a4d Fix AMDuProfPcm so it finds libnuma.so
We change the search procedure so it detects NixOS from /etc/os-release
and uses "libnuma.so" when calling dlopen, instead of hardcoding a full
path to /usr. The full path of libnuma is stored in the runpath, so
dlopen can find it.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
Tested-by: Vincent Arcila <vincent.arcila@bsc.es>
2025-09-18 13:15:44 +02:00
e50fb05df7 Add amd_hsmp module in fox for AMD uProf
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-18 11:44:49 +02:00
66068bc412 Fix hidden dependencies for AMDuProfSys
It tries to dlopen libcrypt.so.1 and libstdc++.so.6, so we make sure
they are available by adding them to the runpath.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-16 15:57:04 +02:00
ff5db631f7 Disable NMI watchdog in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-16 15:53:28 +02:00
e8a3d6d647 Fix amd-uprof dependencies with patchelf
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-05 13:01:11 +02:00
6c544f79c4 Fix hrtimer new interface
The work of hrtimer_init() is now done by hrtimer_setup(), which takes
the callback function as an argument.

See: https://lwn.net/Articles/996598/
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-04 12:20:42 +02:00
3b7cf58aad Use CFLAGS_MODULE instead of EXTRA_CFLAGS
Fixes the build in Linux 6.15.6, as it was not able to find the include
files.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-04 12:00:33 +02:00
87bae5b9df Add AMD uProf module and enable it in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-20 15:51:46 +02:00
6f958c14cd Add AMD uProf package and driver
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-20 14:55:43 +02:00
dcffeed542 Mount home via NFS from apex in fox
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 13:24:06 +02:00
a22d0d4135 Allow access to NFS via wireguard subnet
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 13:16:27 +02:00
7d4ebd8495 Use 10.106.0.0/24 subnet to avoid collisions
The second octet, 106, is the ASCII code for 'j' (jungle):

	% printf j | od -t d
	0000000         106
	0000001
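
The same value can be cross-checked without od. The sketch below relies on the POSIX printf rule that a leading quote in a numeric argument yields the code of the following character; ascii_code is a hypothetical helper name, not something from the repository.

```shell
# ascii_code is a hypothetical helper used only for illustration.
# With a leading single quote, printf '%d' prints the numeric code
# of the character that follows the quote.
ascii_code() {
    printf '%d\n' "'$1"
}

ascii_code j   # prints 106
```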

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 11:12:25 +02:00
3a917f75c7 Revert "Remove pam_slurm_adopt from fox"
This reverts commit 64a52801ed.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-02 17:12:56 +02:00
7657b860a8 Enable fail2ban in fox
Protect fox against SSH brute-force attacks:

    fox% sudo lastb | head
    root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
    root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
    root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
    root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
    root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
    root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
    root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
    root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
    root     ssh:notty    200.124.28.102   Mon Sep  1 11:24 - 11:24  (00:00)
    root     ssh:notty    200.124.28.102   Mon Sep  1 11:24 - 11:24  (00:00)
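
To see how many failed logins come from each address, output like the above can be summarized per source. This is only a sketch: count_by_ip is a hypothetical helper, and it assumes the address is the third whitespace-separated field of lines whose second field starts with "ssh:", as in the lines shown. The sample input reuses two of those lines plus an illustrative trailer line.

```shell
# count_by_ip is a hypothetical helper, not part of the repository.
# It reads lastb-style lines on stdin, keeps only SSH entries, and
# prints the number of failed attempts per source address, highest first.
count_by_ip() {
    awk '$2 ~ /^ssh:/ { print $3 }' | sort | uniq -c | sort -rn
}

# Sample input; on a real machine this would be: sudo lastb | count_by_ip
count_by_ip <<'EOF'
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:24 - 11:24  (00:00)

btmp begins Mon Sep  1 11:24:00 2025
EOF
```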

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-01 11:25:29 +02:00
50ae3ab4f0 Accept connections from apex to fox slurmd
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:55:53 +02:00
02e2470c1a Accept fox connection to slurm controller
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:46:24 +02:00
3f67bc4a2e Add fox machine to SLURM
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:40:43 +02:00
71a23ec68b Rekey secrets with trusted fox key
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:39:28 +02:00
11f52da199 Trust fox for compute node secrets
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:35:51 +02:00
f1a98190b5 Make apex host specific to each machine
Allows direct contact via the VPN when accessing apex from fox, while
the rest of the machines reach it over the Internet.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:29:14 +02:00
2fbf3ee8b6 Add local host fox in apex
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:11:19 +02:00
dd4ad901df Enable wireguard in apex
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 13:52:05 +02:00
c9669408c5 Add wireguard server in fox
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 13:38:47 +02:00
ddfb26be5a Use writeShellScript for suspend.sh and resume.sh
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-29 12:02:12 +02:00
1b21a398a8 Add firewall rules to slurm server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:59:21 +02:00
4d16e794cd Remove hut from slurm
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:43:12 +02:00
38a45f20b4 Only configure apex as slurm server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:37:21 +02:00
0cc76fc98d Split slurm configuration for client and server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:36:52 +02:00
70da186d15 Move slurm control server to apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 11:56:20 +02:00
d71831016e Fix typo in csiringo ssh key
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-08-27 17:21:23 +02:00
0fb3cec09c Enable nix-ld in weasel
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-16 16:20:40 +02:00
5ccfc2411f Add csiringo user with access to apex and weasel
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-08-27 12:42:08 +02:00
dbb7e1fe36 Access gitlab via raccoon in fox
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-27 15:20:34 +02:00
d1f58a62f5 Move StartLimit* options to unit section
The StartLimitBurst and StartLimitIntervalSec options belong in the
[Unit] section. When placed in [Service] they are ignored:

> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.

When using [Unit], the limits are properly set:

  apex% systemctl show power-policy.service | grep StartLimit
  StartLimitIntervalUSec=10min
  StartLimitBurst=10
  StartLimitAction=none
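
The placement can be illustrated with a throwaway unit file. This is a sketch only: the file name, description and ExecStart path are hypothetical and do not reflect the real power-policy.service; only the two StartLimit values are taken from the output above. The startlimit_sections helper is also hypothetical and simply reports which section each StartLimit* key sits in.

```shell
# Hypothetical unit written to a temporary directory for illustration;
# it is never installed or started.
unit_dir=$(mktemp -d)
cat > "$unit_dir/example.service" <<'EOF'
[Unit]
Description=Example service (hypothetical)
StartLimitIntervalSec=10min
StartLimitBurst=10

[Service]
ExecStart=/path/to/program
Restart=on-failure
EOF

# startlimit_sections (hypothetical helper) prints each StartLimit* key
# prefixed by the section header it appears under.
startlimit_sections() {
    awk '/^\[/ { section = $0 } /^StartLimit/ { print section, $0 }' "$1"
}

startlimit_sections "$unit_dir/example.service"
```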

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-24 12:21:05 +02:00
0642df0bbd Set power policy to always turn on
On all machines, turn the machine back on as soon as power is restored.
We cannot rely on the previous state, as we will shut the machines down
before the power is cut to prevent damage to the power supply
monitoring circuit.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 15:25:47 +02:00
3d7e8b8a07 Add NixOS module to control power policy
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 14:07:06 +02:00
2e429bf09e Move August shutdown to 3rd at 22h
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 13:42:57 +02:00
9e22760628 Disable automatic August shutdown for Fox
The UPC has different dates for the yearly power cut, and Fox can
recover properly from a power loss, so we don't need to have it turned
off before the power cut. Simply disabling the timer is enough.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 13:40:33 +02:00
8bb09dd061 Add cudainfo program to test CUDA
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as a standalone program
or built with cudainfo.gpuCheck, so that it is executed inside the build
sandbox to check that it also works there. It uses the
autoAddDriverRunpath hook to add the directory containing the CUDA
libraries to the runpath.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-22 15:24:55 +02:00