dbeb973863
Mount apex /home via NFS in raccoon
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
cb8f27071d
Remove extra SSH jump configuration
...
We now have direct visibility among nodes so we don't need any extra
SSH configuration to reach them.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
4021e98def
Add raccoon peer to wireguard
...
It routes traffic from fox, apex and the compute nodes so that we can
reach the git servers and tent.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
5afcdbba5e
Add raccoon host key
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
71b24e1b28
Restrict fox peer to a single IP
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
407e93eff3
Use lowercase peer hostnames
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
36c5befc5b
Share a public folder for documents
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
59fd50fbee
Fix AMDuProfPcm so it finds libnuma.so
...
We change the search procedure so it detects NixOS from /etc/os-release
and uses "libnuma.so" when calling dlopen, instead of harcoding a full
path to /usr. The full patch of libnuma is stored in the runpath, so
dlopen can find it.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Tested-by: Vincent Arcila <vincent.arcila@bsc.es>
2025-10-01 16:40:18 +02:00
113097f2fd
Add amd_hsmp module in fox for AMD uProf
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
6d4ec8dedc
Add AMD uProf section to fox documentation
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
9fcb7c4caa
Fix hidden dependencies for AMDuProfSys
...
It tries to dlopen libcrypt.so.1 and libstdc++.so.6, so we make sure
they are available by adding them to the runpath.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
723860aea0
Disable NMI watchdog in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
9ea0d5f7ac
Fix amd-uprof dependencies with patchelf
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
ceb93bdd6b
Fix hrtimer new interface
...
The hrtimer_init() is now done via hrtimer_setup() with the callback
function as argument.
See: https://lwn.net/Articles/996598/
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
7f8a8c8ea3
Use CFLAGS_MODULE instead of EXTRA_CFLAGS
...
Fixes the build in Linux 6.15.6, as it was not able to find the include
files.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
814e3b4f0b
Add AMD uProf module and enable it in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
db57b52db4
Add AMD uProf package and driver
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
d43d432a98
Add /nfs/home to fox documentation
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
f793d3efdc
Mount home via NFS from apex in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
76c36b1c41
Allow access to NFS via wireguard subnet
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
0de9a4c9c4
Use 10.106.0.0/24 subnet to avoid collisions
...
The 106 byte is the code for 'j' (jungle) in ASCII:
% printf j | od -t d
0000000 106
0000001
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
c1dea8fdf0
Update fox documentation for SLURM and FS
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
85996dd334
Revert "Remove pam_slurm_adopt from fox"
...
This reverts commit 64a52801ed8d5c4a57650c2c434254a9986c1901.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
35cac2964c
Enable fail2ban in fox
...
Protect fox against ssh bruteforce attacks:
fox% sudo lastb | head
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
2a1782c8ad
Accept connections from apex to fox slurmd
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
6bde1db500
Accept fox connection to slurm controller
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
cf0692795c
Add fox machine to SLURM
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
cbfaa4cb6f
Rekey secrets with trusted fox key
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
7e60c2c0dd
Trust fox for compute node secrets
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
d41aeba923
Make apex host specific to each machine
...
Allows direct contact via the VPN when accessing from fox, but use
Internet when using the rest of the machines.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
8de6b0c1c3
Add local host fox in apex
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
710982378b
Enable wireguard in apex
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
725a826ea4
Add wireguard server in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
c8444b696a
Use writeShellScript for suspend.sh and resume.sh
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
b5d22b2d27
Add firewall rules to slurm server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
8d0b759524
Remove hut from slurm
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
6a99b096e8
Only configure apex as slurm server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
9daa2a2649
Split slurm configuration for client and server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
86ec606121
Move slurm control server to apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
b1a2effdd6
Fix typo in csiringo ssh key
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:18 +02:00
96a85c107b
Enable nix-ld in weasel
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:18 +02:00
f19f343330
Add csiringo user with access to apex and weasel
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:18 +02:00
3e378ae523
Access gitlab via raccoon in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
5eb2e2516a
Move StartLimit* options to unit section
...
The StartLimitBurst and StartLimitIntervalSec options belong to the
[Unit] section, otherwise they are ignored in [Service]:
> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.
When using [Unit], the limits are properly set:
apex% systemctl show power-policy.service | grep StartLimit
StartLimitIntervalUSec=10min
StartLimitBurst=10
StartLimitAction=none
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
0805684887
Set power policy to always turn on
...
In all machines, as soon as we recover the power, turn the machine back
on. We cannot rely on the previous state as we will shut them down
before the power is cut to prevent damage on the power supply
monitoring circuit.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
a433a1686b
Add NixOS module to control power policy
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
9fa66aaa41
Move August shutdown to 3rd at 22h
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
49ffca7492
Disable automatic August shutdown for Fox
...
The UPC has different dates for the yearly power cut, and Fox can
recover properly from a power loss, so we don't need to have it turned
off before the power cut. Simply disabling the timer is enough.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
eaeff6a564
Add cudainfo program to test CUDA
...
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as standalone program or
built with cudainfo.gpuCheck so it is executed inside the build sandbox
to see if it also works fine. It uses the autoAddDriverRunpath hook to
inject in the runpath the location of the library directory for CUDA
libraries.
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
4db07cb9c3
Add missing symlink in cuda sandbox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00