d89e556ba9
Remove machine access for user csiringo
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:18 +02:00
350e0ebecf
Mount apex /home via NFS in raccoon
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
b3c8e96f6a
Remove extra SSH jump configuration
...
We now have direct visibility among nodes so we don't need any extra
SSH configuration to reach them.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
c069f36013
Add raccoon peer to wireguard
...
It routes traffic from fox, apex and the compute nodes so that we can
reach the git servers and tent.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
2c8b8354a3
Add raccoon host key
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
f2c0dcd437
Restrict fox peer to a single IP
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
32d3d26abd
Use lowercase peer hostnames
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
f587b5cb69
Share a public folder for documents
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
a5ace9b663
Fix AMDuProfPcm so it finds libnuma.so
...
We change the search procedure so it detects NixOS from /etc/os-release
and uses "libnuma.so" when calling dlopen, instead of harcoding a full
path to /usr. The full patch of libnuma is stored in the runpath, so
dlopen can find it.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Tested-by: Vincent Arcila <vincent.arcila@bsc.es >
2025-10-01 16:40:18 +02:00
eccefeac10
Add amd_hsmp module in fox for AMD uProf
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
51f54a7821
Fix hidden dependencies for AMDuProfSys
...
It tries to dlopen libcrypt.so.1 and libstdc++.so.6, so we make sure
they are available by adding them to the runpath.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
fc8f7b7a6d
Disable NMI watchdog in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
6abfdee535
Fix amd-uprof dependencies with patchelf
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
60acbeddcd
Fix hrtimer new interface
...
The hrtimer_init() is now done via hrtimer_setup() with the callback
function as argument.
See: https://lwn.net/Articles/996598/
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
29a67a6822
Use CFLAGS_MODULE instead of EXTRA_CFLAGS
...
Fixes the build in Linux 6.15.6, as it was not able to find the include
files.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
bb79a44202
Add AMD uProf module and enable it in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
5ce012ab8c
Add AMD uProf package and driver
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
d5b6199b82
Mount home via NFS from apex in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
e259b6cd2a
Allow access to NFS via wireguard subnet
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
6ad78795d6
Use 10.106.0.0/24 subnet to avoid collisions
...
The 106 byte is the code for 'j' (jungle) in ASCII:
% printf j | od -t d
0000000 106
0000001
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
333e24d80b
Revert "Remove pam_slurm_adopt from fox"
...
This reverts commit 64a52801ed .
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
8ebc51b33e
Enable fail2ban in fox
...
Protect fox against ssh bruteforce attacks:
fox% sudo lastb | head
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
11d389f5c3
Accept connections from apex to fox slurmd
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
064e57d53f
Accept fox connection to slurm controller
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
871abab0eb
Add fox machine to SLURM
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
6f10e0ca89
Rekey secrets with trusted fox key
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
0315616af8
Trust fox for compute node secrets
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
b0fb3c7be3
Make apex host specific to each machine
...
Allows direct contact via the VPN when accessing from fox, but use
Internet when using the rest of the machines.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
9912348c74
Add local host fox in apex
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
e19694fae2
Enable wireguard in apex
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
6c0010e100
Add wireguard server in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
1f74cf2482
Use writeShellScript for suspend.sh and resume.sh
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
5a51b28268
Add firewall rules to slurm server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
46ef8f8aea
Remove hut from slurm
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
f0f5712cd0
Only configure apex as slurm server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
d5d065772f
Split slurm configuration for client and server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
076684b97f
Move slurm control server to apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
3d99e9282a
Fix typo in csiringo ssh key
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:18 +02:00
b043d21161
Enable nix-ld in weasel
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:18 +02:00
8c89093e40
Add csiringo user with access to apex and weasel
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:18 +02:00
06496a3b06
Access gitlab via raccoon in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
a1fd57b67d
Move StartLimit* options to unit section
...
The StartLimitBurst and StartLimitIntervalSec options belong to the
[Unit] section, otherwise they are ignored in [Service]:
> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.
When using [Unit], the limits are properly set:
apex% systemctl show power-policy.service | grep StartLimit
StartLimitIntervalUSec=10min
StartLimitBurst=10
StartLimitAction=none
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
c48e261b41
Set power policy to always turn on
...
In all machines, as soon as we recover the power, turn the machine back
on. We cannot rely on the previous state as we will shut them down
before the power is cut to prevent damage on the power supply
monitoring circuit.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
3e0156d7a0
Add NixOS module to control power policy
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
ef28582bd8
Move August shutdown to 3rd at 22h
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
91fd1756e8
Disable automatic August shutdown for Fox
...
The UPC has different dates for the yearly power cut, and Fox can
recover properly from a power loss, so we don't need to have it turned
off before the power cut. Simply disabling the timer is enough.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
f7bf8e632d
Add cudainfo program to test CUDA
...
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as standalone program or
built with cudainfo.gpuCheck so it is executed inside the build sandbox
to see if it also works fine. It uses the autoAddDriverRunpath hook to
inject in the runpath the location of the library directory for CUDA
libraries.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
c57aaef2ce
Add missing symlink in cuda sandbox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
df9cec496a
Enable cuda systemFeature in raccoon and fox
...
This allows running derivations which depend on cuda runtime without
breaking the sandbox. We only need to add `requiredSystemFeatures = [ "cuda" ];`
to the derivation.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:17 +02:00
348ebb5053
Move shared nvidia settings to a separate module
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:17 +02:00