8e0345b866
Add acinca user
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
7350983bb3
Restart slurmd on failure
...
A failure to reach the control node can cause slurmd to fail, and the
unit remains in the failed state until it is manually restarted. Instead,
try to restart the service every 30 seconds, forever:
owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
Restart=on-failure
RestartUSec=30s
owl1% pgrep slurmd
5903
owl1% sudo kill -SEGV 5903
owl1% pgrep slurmd
6137
Fixes: #177
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
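The check shown in the slurmd restart commit above (7350983bb3) can be scripted. This is a minimal sketch, assuming the properties arrive as the KEY=VALUE lines printed by `systemctl show`; the helper name `parse_show` and the variable names are illustrative and not part of the repository:

```python
def parse_show(output: str) -> dict[str, str]:
    """Parse `systemctl show` style KEY=VALUE lines into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that do not contain '='
            props[key.strip()] = value.strip()
    return props


# Sample text copied from the commit message above.
sample = "Restart=on-failure\nRestartUSec=30s\n"
props = parse_show(sample)

# True when the unit restarts on failure with a 30 second delay.
restart_configured = (
    props.get("Restart") == "on-failure" and props.get("RestartUSec") == "30s"
)
print(restart_configured)
```

On a real node the sample string would be replaced by the captured output of the `systemctl show slurmd` command from the commit message.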
7c34907f52
Lower connect timeout when using hut substituter
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:18 +02:00
15098cb89f
Use hut substituter in all nodes
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:18 +02:00
31a3ac4a4d
Remove machine access for user csiringo
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:18 +02:00
8cbb9dd58a
Add web post update for 2025
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
dbeb973863
Mount apex /home via NFS in raccoon
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
cb8f27071d
Remove extra SSH jump configuration
...
We now have direct visibility among nodes so we don't need any extra
SSH configuration to reach them.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
4021e98def
Add raccoon peer to wireguard
...
It routes traffic from fox, apex and the compute nodes so that we can
reach the git servers and tent.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
5afcdbba5e
Add raccoon host key
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
71b24e1b28
Restrict fox peer to a single IP
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
407e93eff3
Use lowercase peer hostnames
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
36c5befc5b
Share a public folder for documents
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
59fd50fbee
Fix AMDuProfPcm so it finds libnuma.so
...
We change the search procedure so it detects NixOS from /etc/os-release
and uses "libnuma.so" when calling dlopen, instead of hardcoding a full
path to /usr. The full path of libnuma is stored in the runpath, so
dlopen can find it.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Tested-by: Vincent Arcila <vincent.arcila@bsc.es >
2025-10-01 16:40:18 +02:00
113097f2fd
Add amd_hsmp module in fox for AMD uProf
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
6d4ec8dedc
Add AMD uProf section to fox documentation
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
9fcb7c4caa
Fix hidden dependencies for AMDuProfSys
...
It tries to dlopen libcrypt.so.1 and libstdc++.so.6, so we make sure
they are available by adding them to the runpath.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
723860aea0
Disable NMI watchdog in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
9ea0d5f7ac
Fix amd-uprof dependencies with patchelf
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
ceb93bdd6b
Fix hrtimer new interface
...
The hrtimer_init() is now done via hrtimer_setup() with the callback
function as argument.
See: https://lwn.net/Articles/996598/
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
7f8a8c8ea3
Use CFLAGS_MODULE instead of EXTRA_CFLAGS
...
Fixes the build in Linux 6.15.6, as it was not able to find the include
files.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
814e3b4f0b
Add AMD uProf module and enable it in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
db57b52db4
Add AMD uProf package and driver
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
d43d432a98
Add /nfs/home to fox documentation
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
f793d3efdc
Mount home via NFS from apex in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
76c36b1c41
Allow access to NFS via wireguard subnet
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
0de9a4c9c4
Use 10.106.0.0/24 subnet to avoid collisions
...
The 106 byte is the code for 'j' (jungle) in ASCII:
% printf j | od -t d
0000000 106
0000001
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
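The choice of subnet in the commit above (0de9a4c9c4) can be reproduced in a few lines. A small sketch using only the Python standard library; the variable names are illustrative:

```python
import ipaddress

# The second octet is the ASCII code of 'j' (for "jungle").
octet = ord("j")

# Build the /24 network from that octet and count its addresses.
subnet = ipaddress.ip_network(f"10.{octet}.0.0/24")
print(octet, subnet, subnet.num_addresses)
```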
c1dea8fdf0
Update fox documentation for SLURM and FS
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
85996dd334
Revert "Remove pam_slurm_adopt from fox"
...
This reverts commit 64a52801ed.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
35cac2964c
Enable fail2ban in fox
...
Protect fox against ssh bruteforce attacks:
fox% sudo lastb | head
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00)
root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
2a1782c8ad
Accept connections from apex to fox slurmd
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
6bde1db500
Accept fox connection to slurm controller
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
cf0692795c
Add fox machine to SLURM
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
cbfaa4cb6f
Rekey secrets with trusted fox key
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
7e60c2c0dd
Trust fox for compute node secrets
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
d41aeba923
Make apex host specific to each machine
...
Allows direct contact via the VPN when accessing from fox, but use
Internet when using the rest of the machines.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
8de6b0c1c3
Add local host fox in apex
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
710982378b
Enable wireguard in apex
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
725a826ea4
Add wireguard server in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
c8444b696a
Use writeShellScript for suspend.sh and resume.sh
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
b5d22b2d27
Add firewall rules to slurm server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
8d0b759524
Remove hut from slurm
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
6a99b096e8
Only configure apex as slurm server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
9daa2a2649
Split slurm configuration for client and server
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
86ec606121
Move slurm control server to apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
b1a2effdd6
Fix typo in csiringo ssh key
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:18 +02:00
96a85c107b
Enable nix-ld in weasel
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:18 +02:00
f19f343330
Add csiringo user with access to apex and weasel
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:18 +02:00
3e378ae523
Access gitlab via raccoon in fox
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:18 +02:00
5eb2e2516a
Move StartLimit* options to unit section
...
The StartLimitBurst and StartLimitIntervalSec options belong to the
[Unit] section, otherwise they are ignored in [Service]:
> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.
When using [Unit], the limits are properly set:
apex% systemctl show power-policy.service | grep StartLimit
StartLimitIntervalUSec=10min
StartLimitBurst=10
StartLimitAction=none
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
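The mistake fixed in the commit above (5eb2e2516a) is easy to detect before deploying a unit. This is a hedged sketch, not the check the repository uses: it parses unit text with Python's `configparser` and reports the two StartLimit keys named in the message when they appear under [Service]. The function name, the unit contents and the `ExecStart` value are hypothetical.

```python
import configparser

# Keys that the commit message says belong in [Unit].
UNIT_ONLY_KEYS = {"StartLimitBurst", "StartLimitIntervalSec"}


def misplaced_start_limits(unit_text: str) -> list[str]:
    """Return the StartLimit* keys found in [Service] instead of [Unit]."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep the case of systemd keys
    parser.read_string(unit_text)
    if not parser.has_section("Service"):
        return []
    return sorted(key for key in parser["Service"] if key in UNIT_ONLY_KEYS)


# Hypothetical unit with the limits in the wrong section.
bad = (
    "[Unit]\nDescription=example\n\n"
    "[Service]\nExecStart=/bin/true\n"
    "StartLimitBurst=10\nStartLimitIntervalSec=10min\n"
)

# Same unit with the limits moved to [Unit].
good = (
    "[Unit]\nDescription=example\n"
    "StartLimitBurst=10\nStartLimitIntervalSec=10min\n\n"
    "[Service]\nExecStart=/bin/true\n"
)

print(misplaced_start_limits(bad))
print(misplaced_start_limits(good))
```

`configparser` only approximates the systemd unit syntax (it does not handle every systemd quoting or continuation rule), so this is suitable as a quick lint rather than a full parser.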
0805684887
Set power policy to always turn on
...
In all machines, as soon as we recover the power, turn the machine back
on. We cannot rely on the previous state as we will shut them down
before the power is cut to prevent damage on the power supply
monitoring circuit.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
a433a1686b
Add NixOS module to control power policy
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
9fa66aaa41
Move August shutdown to 3rd at 22h
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:18 +02:00
49ffca7492
Disable automatic August shutdown for Fox
...
The UPC has different dates for the yearly power cut, and Fox can
recover properly from a power loss, so we don't need to have it turned
off before the power cut. Simply disabling the timer is enough.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
eaeff6a564
Add cudainfo program to test CUDA
...
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as standalone program or
built with cudainfo.gpuCheck so it is executed inside the build sandbox
to see if it also works fine. It uses the autoAddDriverRunpath hook to
inject in the runpath the location of the library directory for CUDA
libraries.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
4db07cb9c3
Add missing symlink in cuda sandbox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
e2d8a17fad
Enable cuda systemFeature in raccoon and fox
...
This allows running derivations which depend on cuda runtime without
breaking the sandbox. We only need to add `requiredSystemFeatures = [ "cuda" ];`
to the derivation.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:17 +02:00
a05ef0c3eb
Move shared nvidia settings to a separate module
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:17 +02:00
ef2f2115de
Replace xeon07 by hut in ssh config
...
The xeon07 machine has been renamed to hut.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:17 +02:00
deb3370cdc
Enable automatic Nix GC in raccoon
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
1d2dca5869
Select proprietary NVIDIA driver in raccoon
...
The NVIDIA GTX 960 from 2016 has the Maxwell architecture, and NixOS
suggests using the proprietary driver for older than Turing:
> It is suggested to use the open source kernel modules on Turing or
> later GPUs (RTX series, GTX 16xx), and the closed source modules
> otherwise.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
111c13d200
Enable open source NVidia driver in fox
...
It is recommended for newer versions.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
ef166563d8
Remove option allowUnfree from fox and raccoon
...
It is already set to true for all machines.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
9e5ce89600
Ban another scanner trying to connect via SSH
...
It is constantly spamming our logs:
apex# journalctl | grep 'Connection closed by 84.88.52.176' | wc -l
2255
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
9c11076a43
Update weasel IPMI hostname for monitoring
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
a81aebc788
Remove merged MPICH patch
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
0a39169d17
Remove package ix as it is gone
...
Fails with: "error: ix has been removed from Nixpkgs, as the ix.io
pastebin has been offline since Dec. 2023".
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
957da4b1fd
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/f6291c5935fdc4e0bef208cfc0dcab7e3f7a1c41?narHash=sha256-b%2Buqzj%2BWa6xgMS9aNbX4I%2BsXeb5biPDi39VgvSFqFvU%3D' (2024-08-10)
→ 'github:ryantm/agenix/531beac616433bac6f9e2a19feb8e99a22a66baf?narHash=sha256-9P1FziAwl5%2B3edkfFcr5HeGtQUtrSdk/MksX39GieoA%3D' (2025-06-17)
• Updated input 'agenix/darwin':
'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d?narHash=sha256-gzGLZSiOhf155FW7262kdHo2YDeugp3VuIFb4/GGng0%3D' (2023-11-24)
→ 'github:lnl7/nix-darwin/43975d782b418ebf4969e9ccba82466728c2851b?narHash=sha256-dyN%2BteG9G82G%2Bm%2BPX/aSAagkC%2BvUv0SgUw3XkPhQodQ%3D' (2025-04-12)
• Updated input 'agenix/home-manager':
'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1?narHash=sha256-7ulcXOk63TIT2lVDSExj7XzFx09LpdSAPtvgtM7yQPE%3D' (2023-12-20)
→ 'github:nix-community/home-manager/abfad3d2958c9e6300a883bd443512c55dfeb1be?narHash=sha256-YZCh2o9Ua1n9uCvrvi5pRxtuVNml8X2a03qIFfRKpFs%3D' (2025-04-24)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=6782fc6c5b5a29e84a7f2c2d1064f4bcb1288c0f' (2024-11-29)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=9d1944c658929b6f98b3f3803fead4d1b91c4405' (2025-06-11)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/9c6b49aeac36e2ed73a8c472f1546f6d9cf1addc?narHash=sha256-i/UJ5I7HoqmFMwZEH6vAvBxOrjjOJNU739lnZnhUln8%3D' (2025-01-14)
→ 'github:NixOS/nixpkgs/dfcd5b901dbab46c9c6e80b265648481aafb01f8?narHash=sha256-Kt1UIPi7kZqkSc5HVj6UY5YLHHEzPBkgpNUByuyxtlw%3D' (2025-07-13)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
acd8fd6c51
Upgrade nixpkgs to nixos 25.05
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
99741382ab
Silently ban OpenVAS BSC scanner from apex
...
It is spamming our logs with refused connection lines:
apex% sudo journalctl -b0 | grep 'refused connection.*SRC=192.168.8.16' | wc -l
13945
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
d487241db2
Rotate anavarro password and SSH key
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
24128e22f4
Add weasel machine configuration
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
a36b403bd4
Remove extra flush commands on firewall stop
...
They are not needed as they are already flushed when the firewall
starts or stops.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
c7a52e2999
Prevent accidental use of nftables
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
88d2de454b
Add proxy configuration for internal hosts
...
Access internal hosts via apex proxy. From the compute nodes we first
open an SSH connection to apex, and then tunnel it through the HTTP
proxy with netcat.
This way we allow reaching internal GitLab repositories without
requiring the user to have credentials in the remote host, while we can
use multiple remotes to provide redundancy.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
1899ad89db
Remove unused blackbox configuration modules
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
cd37f86b09
Use IPv4 in blackbox probes
...
Otherwise they simply fail as IPv6 doesn't work.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
d5a9086afa
Make NFS mount async to improve latency
...
Don't wait to flush writes, as we don't care about consistency on a
crash:
> This option allows the NFS server to violate the NFS protocol and
> reply to requests before any changes made by that request have been
> committed to stable storage (e.g. disc drive).
>
> Using this option usually improves performance, but at the cost that
> an unclean server restart (i.e. a crash) can cause data to be lost or
> corrupted.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
7be793da3d
Disable root_squash from NFS
...
Allows root to read files in the NFS export, so we can directly run
`nixos-rebuild switch` from /home.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
327c0d61a2
Remove SSH proxy to access BSC clusters
...
We now have direct connection to them.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
f70074b216
Add users to apex machine
...
They need to be able to login to apex to access any other machine from
the SSF rack.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
fd99204913
Remove proxy from hut HTTP probes
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
6826a372d5
Remove proxy configuration from environment
...
All machines now have a direct connection to the outside world.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
3d02a231a5
Add storcli utility to apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
52bd019cdd
Add new configuration for apex
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
57b9450a59
Add pmartin1 user with access to fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
daba3eca18
Add access to fox for rpenacob user
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
c47c880ff2
Revert "Only allow Vincent to access fox for now"
...
This reverts commit efac36b186.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
5318fb1a0d
Add all terminfo files in environment
...
Fixes problems with the kitty terminal when opening vim or kakoune.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:17 +02:00
ad834cebd6
Monitor Fox BMC with ICMP probes too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
43c20c544d
Restrict DAC VPN to fox-ipmi machine only
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
5928c68720
Monitor fox via VPN
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
8eb14acdf9
Add OpenVPN service to connect to fox BMC
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
d83ad5decb
Add ac.upc.edu as name search server
...
Allows referring to fox.ac.upc.edu directly as fox.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
bc64d62a7f
Update access instructions
...
We no longer need to request a petition through BSC, as we will be in
charge of the login. Remove link to the old repository as well and
prefer only email.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
d07855e5c5
Disable kptr_restrict in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
98b882507b
Disable NUMA balancing in fox
...
See: https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#numa-balancing
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
e6f526bbc6
Load amd_uncore module in fox
...
Needed for L3 events in perf.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
df70515bc8
Enable SSH X11 forwarding
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
dfee33038b
Disable registration in Gitea
...
Get rid of all the spam accounts they are trying to register.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
6e316e57be
Enable msmtp configuration in tent
...
Allows gitea to send notifications via email.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
28e094d4c1
Add GitLab runner with debian docker for PM
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
1210d96ae9
Monitor nix-daemon in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
d5227db996
Move nix-daemon exporter to modules
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
eeb8557b96
Add p service for pastes
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
e94ef5a08d
Enable public-inbox service in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
e1359d134e
Enable gitea in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
2e94b6795b
Add bsc.es to resolve domain names
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
965dca422a
Monitor AXLE machine too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
03209b6bfc
Use IPv4 for blackbox exporter
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
a4328fe380
Add public html files to tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
085b92ce0f
Add docker GitLab runner for BSC GitLab
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
96d7f186d2
Add GitLab shell runner in tent for PM
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
01891b9bef
Enable jungle robot emails for Grafana in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
71bbdd5922
Add tent key for nix-serve
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
c51ca035b7
Remove jungle nix cache from tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
8b2c9dcacd
Enable nix cache
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
017b57670b
Serve Grafana from subpath
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
de99ff3414
Add nginx server in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
5d25805c6a
Add monitoring in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
ef2f31510c
Disable nix garbage collector in tent
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:17 +02:00
59961d1351
Rekey secrets with tent keys
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
3cb9563738
Add tent host key and admin keys
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
2006a5fb05
Create directories in /vault/home for tent users
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
ddcf158758
Add software RAID in tent using 3 disks
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
df0ce98526
Add access to tent to all hut users too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
0022dfab63
Add hut SSH configuration from outside SSF LAN
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
5dce13c512
Don't use proxy in base preset
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
144d87008d
Add tent machine from xeon04
...
We moved the tent machine to the server room in the BSC building and it
is now directly connected to raccoon via NAT.
Fixes: #106
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
7a312bd01c
Create specific SSF rack configuration
...
Allow xeon machines to optionally inherit SSF configuration such as the
NFS mount point and the network configuration.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
7905969874
Only allow Vincent to access fox for now
...
Needed to run benchmarks without interference.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
3797a8ecaf
Use performance governor in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
9c3274d068
Add hut as nix cache in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
ce4c653eb2
Use extra- for substituters and trusted-public-keys
...
From the nix manual:
> A configuration setting usually overrides any previous value. However,
> for settings that take a list of items, you can prefix the name of the
> setting by extra- to append to the previous value.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:17 +02:00
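The difference between overriding and appending described in the commit above (ce4c653eb2) can be modelled with a toy example. This sketch only mirrors the behaviour stated in the quoted manual text; it is not how Nix is implemented, and `apply_setting` and the `cache.example.org` URL are hypothetical.

```python
def apply_setting(conf: dict[str, list[str]], name: str, values: list[str]) -> None:
    """Apply one list-valued setting following the quoted manual text."""
    if name.startswith("extra-"):
        # extra-NAME appends to the previous value of NAME.
        base = name[len("extra-"):]
        conf.setdefault(base, []).extend(values)
    else:
        # A plain NAME overrides any previous value.
        conf[name] = list(values)


conf: dict[str, list[str]] = {}
apply_setting(conf, "substituters", ["https://cache.nixos.org/"])
apply_setting(conf, "extra-substituters", ["https://cache.example.org/"])
appended = list(conf["substituters"])  # both caches are kept

# Without the extra- prefix, the earlier entries are lost.
apply_setting(conf, "substituters", ["https://cache.example.org/"])
overridden = list(conf["substituters"])
print(appended, overridden)
```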
d76f38b502
Use DHCP for Ethernet in fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
d6b7421f3f
Use UPC time servers as others are blocked
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
2760355358
Create tracing group and add arocanon in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
a8c68a630f
Extend perf support in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
0b330e5274
Enable nixdebuginfod in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
3e080465a4
Make raccoon use performance governor
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
ac46401243
Enable binfmt emulation in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
54a30e063c
Disable nix garbage collector in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
384ef9e9df
Add dbautist user to raccoon machine
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
fe344ea31a
Add node exporter monitoring in raccoon
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
0271ba399f
Allow X11 forwarding via SSH
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
d53c7a3acb
Enable linger for user rarias
...
Allows services to run without a login session.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
a3e48bc83c
Only proxy SSH git remotes via hut in xeon
...
Other machines like raccoon have direct access.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
c2bfe806fe
Add machine map file
...
Documents the location, board and serial numbers so we can track the
machines if they move around. Some information is unknown.
Using the Nix language to encode the machines location and properties
allows us to later use that information in the configuration of the
machines themselves.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
6f4fe9bb22
Remove fox monitoring via IPMI
...
We will need to set up a VPN to be able to access fox in its new
location, so for now we simply remove the IPMI monitoring.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
dad5225486
Monitor fox, gateway and UPC anella via ICMP
...
Fox should reply once the machine is connected to the UPC network.
Monitoring also the gateway and UPC anella allows us to estimate if the
whole network is down or just fox.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
1b090785a0
Update configuration for UPC network
...
The fox machine will be placed in the UPC network, so we update the
configuration with the new IP and gateway. We won't be able to reach hut
directly so we also remove the host entry and proxy.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
dbedeb3613
Disable home via NFS in fox
...
It won't be accessible anymore as we won't be in the same LAN.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
8887b9e1f8
Rekey all secrets
...
Fox is no longer able to use munge or ceph, so we remove the key and
rekey them.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
2af538e99b
Rotate fox SSH host key
...
Prevent decrypting old secrets by reading the git history.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
7f16280fd5
Distrust fox SSH key
...
We will no longer share secrets with fox until we can regain trust in it.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
f7cad2381c
Remove Ceph module from fox
...
It will no longer be accessible from the UPC.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
40ef1d4886
Remove fox from SLURM
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
3e0674f872
Remove pam_slurm_adopt from fox
...
We will no longer be able to use SLURM from jungle.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
5dbc100738
Add UPC temperature sensor monitoring
...
These sensors are part of their air quality measurements, which just
happen to be very close to our server room.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
371b8b6a23
Add meteocat exporter
...
Allows us to track ambient temperature changes and estimate the
temperature delta between the server room and exterior temperature.
We should be able to predict when we would need to stop the machines due
to excessive temperature as summer approaches.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
102d8706dd
Add custom nix-daemon exporter
...
Allows us to see which derivations are being built in realtime. It is a
bit of a hack, but it seems to work. We simply look at the environment
of the child processes of nix-daemon (usually bash) and then look for
the $name variable which should hold the current derivation being
built. Needs root to be able to read the environ file of the different
nix-daemon processes as they are owned by the nixbld* users.
See: https://discourse.nixos.org/t/query-ongoing-builds/23486
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
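The lookup described in the commit above (102d8706dd) comes down to parsing the NUL-separated environment of a builder process. The following is a minimal sketch of that single step, not the exporter itself: `derivation_name` is a hypothetical helper, and the sample bytes are made up for illustration. On a real system the bytes would come from the `/proc/<pid>/environ` file of a child of nix-daemon, which needs root as the message explains.

```python
from typing import Optional


def derivation_name(environ: bytes) -> Optional[str]:
    """Return the value of `name` from a /proc/<pid>/environ blob, if any."""
    for entry in environ.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep and key == b"name":
            return value.decode("utf-8", errors="replace")
    return None


# Made-up environment of a builder process (entries are NUL-separated).
sample = b"HOME=/homeless-shelter\0name=hello-2.12.1\0NIX_BUILD_CORES=8\0"

print(derivation_name(sample))
print(derivation_name(b"HOME=/homeless-shelter\0"))
```

Matching the key exactly (rather than searching for the substring `name=`) avoids false hits on variables such as `pname`.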
edca5972ee
Set keep-outputs to true in all machines
...
From the documentation of keep-outputs, setting it to true would prevent
the GC from removing build time dependencies:
> If true, the garbage collector will keep the outputs of non-garbage
> derivations. If false (default), outputs will be deleted unless they are
> GC roots themselves (or reachable from other roots).
>
> In general, outputs must be registered as roots separately. However,
> even if the output of a derivation is registered as a root, the
> collector will still delete store paths that are used only at build time
> (e.g., the C compiler, or source tarballs downloaded from the network).
> To prevent it from doing so, set this option to true.
See: https://nix.dev/manual/nix/2.24/command-ref/conf-file.html#conf-keep-outputs
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:17 +02:00
b7e1e4faa8
Add raccoon node exporter monitoring
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
f2cfc4a415
Increase data retention to 5 years
...
Now that we have more space, we can extend the retention time to 5 years
to hold the monitoring metrics. For a year we have:
# du -sh /var/lib/prometheus2
13G /var/lib/prometheus2
So we can expect it to increase to about 65 GiB. In the future we may
want to reduce some acquisition frequencies.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
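The 65 GiB estimate in the retention commit above is a linear extrapolation of the 13 GiB measured for one year. A short Python sketch of that arithmetic, assuming the ingest rate stays constant over the five years (the function name is hypothetical):

```python
def projected_size_gib(size_per_year_gib: float, retention_years: int) -> float:
    """Linear extrapolation: assumes metrics are ingested at a constant rate."""
    return size_per_year_gib * retention_years


# 13 GiB measured for one year, retention extended to 5 years.
print(projected_size_gib(13, 5))  # 65
```

Adding new exporters or targets would raise the yearly rate, so the figure is better read as a lower bound.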
41e83fc5ee
Don't forward any docker traffic
...
Access to the 23080 local port will be done by applying the INPUT rules,
which pass through nixos-fw.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
166508fc4f
Allow traffic from docker to enter port 23080
...
Before:
hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
+ true
+ nc -w 3 -v 10.0.40.7 23080
nc: 10.0.40.7 (10.0.40.7:23080): Operation timed out
After:
hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
+ true
+ nc -w 3 -v 10.0.40.7 23080
10.0.40.7 (10.0.40.7:23080) open
Fixes: #94
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
b1795ba5be
Add bscpm04.bsc.es SSH host and public key
...
Allows fetching repositories from hut and other machines in jungle
without the need to do any extra configuration.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
dcfd74f387
Add nix cache documentation section
...
Include usage from NixOS and non-NixOS hosts and a test with curl to
ensure it can be reached.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:17 +02:00
03cdf10cbc
Use hut nix cache in owl1, owl2 and raccoon
...
For owl1 and owl2 directly connect to hut via LAN with HTTP, but for
raccoon pass via the proxy using jungle.bsc.es with HTTPS. There is no
risk of tampering as packages are signed.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:17 +02:00
5b14172646
Clean all iptables rules on stop
...
Prevents the "iptables: Chain already exists." error by making sure that
we don't leave any chain on start. The ideal solution is to use
iptables-restore instead, which will do the right job. But this needs to
be changed in NixOS entirely.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
85d4e8ad5c
Make nginx listen on all interfaces
...
Needed for local hosts to contact the nix cache via HTTP directly.
We also allow the incoming traffic on port 80.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
0ecf221730
Fix nginx /cache regex
...
`nix-serve` does not handle duplicates in the path:
```
hut$ curl http://127.0.0.1:5000/nix-cache-info
StoreDir: /nix/store
WantMassQuery: 1
Priority: 30
hut$ curl http://127.0.0.1:5000//nix-cache-info
File not found.
```
This meant that the cache was not accessible via:
`curl https://jungle.bsc.es/cache/nix-cache-info` but
`curl https://jungle.bsc.es/cachenix-cache-info` worked.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:17 +02:00
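The symptom described in the /cache regex commit above can be reproduced with two rewrite rules. The Python sketch below is only a model: both patterns are assumptions chosen to match the observed behaviour, not the actual nginx configuration.

```python
import re


def rewrite_loose(path: str) -> str:
    # Hypothetical old rule: strips "/cache" and forwards "/" plus the rest,
    # so the slash that follows "/cache" ends up duplicated.
    return re.sub(r"^/cache(.*)$", r"/\1", path)


def rewrite_strict(path: str) -> str:
    # Hypothetical fixed rule: the slash after "/cache" is part of the match,
    # so exactly one leading slash reaches the backend.
    return re.sub(r"^/cache/(.*)$", r"/\1", path)


print(rewrite_loose("/cache/nix-cache-info"))   # //nix-cache-info
print(rewrite_loose("/cachenix-cache-info"))    # /nix-cache-info
print(rewrite_strict("/cache/nix-cache-info"))  # /nix-cache-info
```

With the loose rule only the malformed URL yields a path that `nix-serve` accepts, which matches the two `curl` results quoted in the commit message.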
61df5d4ddb
Add new GitLab runner for gitlab.bsc.es
...
It uses docker based on alpine and the host nix store, so we can perform
builds but isolate them from the system.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
fc75b32c5f
Remove SLURM partition all
...
We no longer have homogeneous nodes so it doesn't make much sense to
allocate a mix of them.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
97c1fb240d
Add varcila user to hut and fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
2938acc3e4
Adjust fox slurm config after disabling SMT
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
a886d6c943
Add abonerib user to fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
5abdc0da89
Don't move doc in web output
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
9839206f4e
Add quickstart guide
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
a7aa3b79a1
Reject SSH connections without SLURM allocation
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
094801a362
Add users to fox
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
e8d65e70e9
Add dalvare1 user
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
4adfc0297f
Add fox page in jungle website
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
8b64f53dac
Mount NVME disks in /nvme{0,1}
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
6af214dfa3
Exclude fox from being suspended by slurm
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
226dba428e
Use IPMI host names instead of IP addresses
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
b18a9f99ef
Add fox IPMI monitoring
...
Use agenix to store the credentials safely.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:17 +02:00
c0f5db745b
Add new fox machine
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
7d84c9e088
Update PM GitLab tokens to new URL
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
7e5211d049
Fix MPICH build by fetching upstream patches too
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
c0531bbe8a
Fix papermod theme in website for new hugo
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
32367fbc07
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
→ 'github:ryantm/agenix/f6291c5935fdc4e0bef208cfc0dcab7e3f7a1c41' (2024-08-10)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709' (2024-04-24)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=6782fc6c5b5a29e84a7f2c2d1064f4bcb1288c0f' (2024-11-29)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)
→ 'github:NixOS/nixpkgs/9c6b49aeac36e2ed73a8c472f1546f6d9cf1addc' (2025-01-14)
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
ae2194debe
Set nixpkgs to track nixos-24.11
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
693a96878a
Add script to monitor GPFS
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
37ed60eb09
Add BSC machines to ssh config
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
23b58839de
Collect statistics from logged users
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
93546953aa
Add custom GPFS exporter for MN5
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
3ce5dd7c68
Remove exception to fetch task endpoint
...
It causes the request to go to the website rather than the Gitea
service.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
74c0ad07ad
Use SSD for boot, then switch to NVME
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
4d7c8378bf
Use NVME as root
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
2d487e9722
Keep host header for Grafana requests
...
This was breaking requests due to CSRF check.
See: https://github.com/grafana/grafana/issues/45117#issuecomment-1033842787
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
53e7ce6b64
Ignore logging requests from the gitea runner
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
0a3429ed8f
Log the client IP not the proxy
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
3ebd3852f6
Ignore misc directory
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
8789f6d1fe
Create paste directories in /ceph/p
...
Ensure that all hut users have a paste directory in /ceph/p owned by
themselves. We need to wait for the ceph mount point to create them, so
we use a systemd service that waits for the remote-fs.target.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
a5b512dd67
Add paste documentation in jungle website
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
b8db4ad3cd
Add p command to paste files
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
8a150555a6
Use nginx to serve website and other services
...
Instead of using multiple tunnels to forward all our services to the VM
that serves jungle.bsc.es, just use nginx to redirect the traffic from
hut. This allows adding custom rules for paths that are not possible
otherwise.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
e76e10ec19
Mount the NVME disk in /nvme
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
c4e12872d9
Delay nix-gc until /home is mounted
...
Prevents starting the garbage collector before the remote FS are
mounted, in particular /home. Otherwise, all the gcroots which have
symlinks in /home will be considered stale and they will be removed.
See: #79
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
7c381b2b65
Add dbautist user with access to hut
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
92482721b4
Set the serial console to ttyS1 in raccoon
...
Apparently the ttyS0 console doesn't exist but ttyS1 does:
raccoon% sudo stty -F /dev/ttyS0
stty: /dev/ttyS0: Input/output error
raccoon% sudo stty -F /dev/ttyS1
speed 9600 baud; line = 0;
-brkint -imaxbel
The dmesg line agrees:
00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
The console configuration is then moved from base to xeon to allow
changing it for the raccoon machine.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
fd23d84da8
Remove setLdLibraryPath and driSupport options
...
They have been removed from NixOS. The "hardware.opengl" group is now
renamed to "hardware.graphics".
See: 98cef4c273
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
291feda7ff
Add documentation section about GRUB chain loading
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
6f22f683a9
Add 10 min shutdown jitter to avoid spikes
...
The shutdown timer will fire at slightly different times for the
different nodes, so we slowly decrease the power consumption.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
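The effect of the 10 minute jitter in the commit above can be illustrated by drawing one random delay per node. This Python sketch assumes a uniform distribution over the window; the node list and the function name are illustrative, and the mechanism actually used to apply the jitter is not shown in the commit message.

```python
import random

JITTER_SECONDS = 10 * 60  # 10 minute window, as in the commit title


def shutdown_delays(nodes, seed=None):
    """Draw one delay per node, uniformly within the jitter window."""
    rng = random.Random(seed)
    return {node: rng.uniform(0, JITTER_SECONDS) for node in nodes}


# Example node names; a fixed seed makes the run reproducible.
delays = shutdown_delays(["hut", "owl1", "owl2"], seed=0)
for node, delay in sorted(delays.items(), key=lambda kv: kv[1]):
    print(f"{node}: shuts down {delay:.0f} s after the timer fires")
```

Because each node picks its own offset, the nodes power off one by one inside the window instead of all at once.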
240b26d82e
Don't mount the nix store in owl nodes
...
Initially we planned to run jobs in those nodes by sharing the same nix
store from hut. However, these nodes are now used to build packages
which are not available in hut. Users also ssh to the nodes, which
doesn't mount the hut store, so it doesn't make much sense to keep
mounting it.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
fd76a36c36
Emulate other architectures in owl nodes too
...
Allows cross-compilation of packages for RISC-V that are known to try to
run RISC-V programs in the host.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
f1373e5227
Program shutdown for August 2nd for all machines
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
d8ca283b80
Enable debuginfod daemon in owl nodes
...
WARNING: This will introduce noise, as the daemon wakes up from time to
time to check for new packages.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
cdabc58c09
Set gitea and grafana log level to warn
...
Prevents filling the journal logs with information messages.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
c737993c9c
Set default SLURM job time limit to one hour
...
Prevents endless jobs from being left forever, while allowing users to
request a larger time limit.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
60c094030b
Allow other jobs to run in unused cores
...
The current select mechanism was using the memory too as a consumable
resource, which by default only sets 1 MiB per node. As each job already
requests 1 MiB, it prevents other jobs from running.
As we are not really concerned with memory usage, we only use the unused
cores in the select criteria.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
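The scheduling problem described in the commit above can be shown with a toy model: when memory is tracked as a consumable resource and the node only advertises 1 MiB, the first job that requests 1 MiB exhausts it. The Python sketch below is not SLURM code; the `Node` class, the 8-core count and the function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    free_cores: int
    free_mem_mib: int


def try_allocate(node, cores, mem_mib, memory_is_consumable):
    """Place a job on the node if the tracked resources allow it."""
    if cores > node.free_cores:
        return False
    if memory_is_consumable:
        if mem_mib > node.free_mem_mib:
            return False
        node.free_mem_mib -= mem_mib
    node.free_cores -= cores
    return True


def run_two_jobs(memory_is_consumable):
    # Hypothetical node: 8 cores, and the default of 1 MiB of memory.
    node = Node(free_cores=8, free_mem_mib=1)
    # Two single-core jobs, each requesting 1 MiB.
    return [try_allocate(node, 1, 1, memory_is_consumable) for _ in range(2)]


print(run_two_jobs(memory_is_consumable=True))   # [True, False]
print(run_two_jobs(memory_is_consumable=False))  # [True, True]
```

Tracking only cores lets the second job use the idle cores, which is the behaviour the commit aims for.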
8b8fc73225
Use authentication tokens for PM GitLab runner
...
Starting with GitLab 16, there is a new mechanism to authenticate the
runners via authentication tokens, so use it instead. Older tokens and
runners are also removed, as they are no longer used.
With the new way of managing tokens, both the tags and the locked state
are managed from the GitLab web page.
See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
f095067deb
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
→ 'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
→ 'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
90d44a95eb
Allow ptrace to any process of the same user
...
Allows users to attach GDB to their own processes, without requiring
running the program with GDB from the start. It is only available in
compute nodes, the storage nodes continue with the restricted settings.
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
4873c881a9
Add abonerib user to hut, raccoon, owl1 and owl2
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
d6d7516e12
Grant rpenacob access to owl1 and owl2 nodes
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
ae96be6915
Access private repositories via hut SSH proxy
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
bb10c47c2e
Set the default proxy to point to hut
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
87598e74ae
Allow incoming traffic to hut proxy
...
Reviewed-by: Aleix Boné <abonerib@bsc.es >
2025-10-01 16:40:16 +02:00
6a3728e6c6
eudy: koro: fcs: Fix fcs unprotected cpuid call
...
smp_processor_id() was called in a preemptible context, which could
invalidate the returned value. However, this was not harmful, because
fcs threads in nosv are pinned.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:16 +02:00
7b4cbc57e4
Add support for armv7 emulation in hut
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
80eb17c065
Monitor raccoon machine via IPMI
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
338551ec0a
Move vlopez user to jungleUsers for koro host
...
Access to other machines can be easily added into the "hosts" attribute
without the need to replicate the configuration.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
efea776c91
Add raccoon motd file
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
c8604bf3e0
Split xeon specific configuration from base
...
To accommodate the raccoon knights workstation, some of the configuration
pulled by m/common/main.nix has to be removed. To solve it, the xeon
specific parts are placed into m/common/xeon.nix and only the common
configuration is at m/common/base.nix.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
970cdf8dbd
Control user access to each machine
...
The users.jungleUsers configuration option behaves like the users.users
option, but defines the list attribute `hosts` for each user, which
filters users so that each user can only access those hosts.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
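The per-host filtering described in the commit above amounts to keeping, on each machine, only the users whose `hosts` list contains that machine. A minimal Python sketch of that rule follows; the user names and the `users_for_host` helper are hypothetical, and the real option is implemented in the NixOS configuration rather than in Python.

```python
def users_for_host(jungle_users, host):
    """Keep only the users whose `hosts` list includes the given host."""
    return {
        name: cfg
        for name, cfg in jungle_users.items()
        if host in cfg.get("hosts", [])
    }


# Hypothetical users; only the `hosts` attribute matters for the filter.
jungle_users = {
    "alice": {"hosts": ["hut", "owl1", "owl2"]},
    "bob": {"hosts": ["koro"]},
}

print(sorted(users_for_host(jungle_users, "hut")))   # ['alice']
print(sorted(users_for_host(jungle_users, "koro")))  # ['bob']
```

Granting a user access to another machine then only requires appending the machine name to that user's `hosts` list.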
7312a91271
Add PostgreSQL DB for performance test results
...
The database will hold the performance results of the execution of the
benchmarks. We follow the same setup on knights3 for now.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
b89c656ff7
Enable Grafana email alerts
...
Allows sending Grafana alerts via email too, so we have a redundant
mechanism in case Slack fails to deliver them.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
27bb7cd69e
Enable mail notification in Gitea
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
e0e9dc62d5
Add msmtp to send notifications via email
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
bb86b04fce
Allow Ceph traffic to lake2
2025-10-01 16:40:16 +02:00
93ad04299d
Fix meta in posts entries
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
24056305c7
Fix bogus separator
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
b43aca80ca
Manually add links to the menu
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
39378d9544
Add link to Gitea in the website
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
0dec0ee519
Collect Gitea metrics in Prometheus
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
b15130744a
Add Gitea service
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
c4f539caf6
Add firewall rules for Ceph and monitoring
...
The firewall was blocking the monitoring traffic from hut and the Ceph
traffic among OSDs. The rules only allow connecting from the specific
host that they are supposed to be coming from.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
cafd7ea682
Add workaround for MPICH 4.2.0
...
See: https://github.com/pmodels/mpich/issues/6946
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
08e9db0c3e
Fix SLURM bug in rank integer sign expansion
...
See: https://bugs.schedmd.com/show_bug.cgi?id=19324
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
ab4cab97ba
Merge pmix outputs for MPICH
...
MPICH expects headers and libraries to be present in the same directory.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
a4bf90ddfc
Remove nixseparatedebuginfod input
...
It has been integrated in nixpkgs, so is no longer required.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
ebfbca4b48
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
→ 'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
• Updated input 'agenix/darwin':
'github:lnl7/nix-darwin/87b9d090ad39b25b2400029c64825fc2a8868943' (2023-01-09)
→ 'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d' (2023-11-24)
• Updated input 'agenix/home-manager':
'github:nix-community/home-manager/32d3e39c491e2f91152c84f8ad8b003420eab0a1' (2023-04-22)
→ 'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1' (2023-12-20)
• Added input 'agenix/systems':
'github:nix-systems/default/da67096a3b9bf56a91d16901293e51ba5b49a27e' (2023-04-09)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b' (2023-11-22)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709' (2024-04-24)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)
→ 'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
• Updated input 'nixseparatedebuginfod':
'github:symphorien/nixseparatedebuginfod/232591f5274501b76dbcd83076a57760237fcd64' (2023-11-05)
→ 'github:symphorien/nixseparatedebuginfod/98d79461660f595637fa710d59a654f242b4c3f7' (2024-03-07)
• Removed input 'nixseparatedebuginfod'
• Removed input 'nixseparatedebuginfod/flake-utils'
• Removed input 'nixseparatedebuginfod/flake-utils/systems'
• Removed input 'nixseparatedebuginfod/nixpkgs'
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
6be7365916
Use google.com probe instead of bsc.es
...
The main website of the BSC is failing every day around 3:00 AM for
almost one hour, so it is not a very good target. Instead, google.com is
used which should be more reliable. The same robots.txt path is fetched,
as it is smaller than the main page.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
8ae2870fb3
Add another HTTPS probe for bsc.es
...
As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we
cannot detect a problem in our proxy or in the BSC one. Adding another
target like bsc.es that doesn't use the ops proxy allows us to discern
where the problem lies.
Instead of monitoring https://www.bsc.es/ directly, which will trigger
the whole Drupal server and take a whole second, we just fetch robots.txt
so the overhead on the server is minimal (and returns in less than 10 ms).
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
0899424de9
Move slurm client in a separate module
...
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:16 +02:00
baa8347753
Enable public-inbox at jungle.bsc.es/lists
...
The public-inbox service fetches emails from the sourcehut mailing lists
and displays them on the web. The idea is to reduce the dependency on
external services and add a secondary storage for the mailing lists in
case sourcehut goes down or changes the current free plans.
The service is available at https://jungle.bsc.es/lists/ and is open to
the public. It currently mirrors the bscpkgs and jungle mailing lists.
We also edited the CSS to improve the readability and have larger fonts
by default.
The service for public-inbox produced by NixOS is not well configured to
fetch emails from an IMAP mail server, so we also manually edit the
service file to enable the network.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
777704a9ce
Monitor https://pm.bsc.es/gitlab/ too
...
The GitLab instance is in the /gitlab endpoint and may fail
independently of https://pm.bsc.es/.
Cc: Víctor López <victor.lopez@bsc.es >
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
024a31dd1b
Enable nixseparatedebuginfod module
...
The module is only enabled on Hut and Eudy because we noticed activity
on the debuginfod service even if no debug session was active.
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es >
2025-10-01 16:40:16 +02:00
b8fbb6380e
Use tmpfs in /tmp
...
The /tmp directory was using the SSD disk which is not erased across
boots. Nix will use /tmp to perform the builds, so we want it to be as
fast as possible. In general, all the machines have enough space to
handle large builds like LLVM.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
afae708a48
Enable runners for pm.bsc.es/gitlab too
...
The old runners for the PM gitlab were disabled in configuration in the
last outage, but they remained working until we reboot the node. With
this change we enable the runners for both PM and gitlab.bsc.es.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
4efd74aad6
Remove complete ceph package from hut
...
Only the ceph-client is needed.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
29cdfba328
Fix warning in slurm exporter using vendorHash
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
59d9d19891
Remove old Ceph package overlay
...
The Ceph package is now integrated in upstream nixpkgs.
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
50cf081345
flake.lock: Update
...
Flake lock file updates:
• Updated input 'agenix':
'github:ryantm/agenix/d8c973fd228949736dedf61b7f8cc1ece3236792' (2023-07-24)
→ 'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
• Updated input 'bscpkgs':
'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538' (2023-10-31)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b' (2023-11-22)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
→ 'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
388a10b666
BSC packages are no longer in bsc attribute
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
e7555bb1cf
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
→ 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538' (2023-10-31)
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
0882ad7ecc
Switch bscpkgs URL to sourcehut
...
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
6ac5225ddb
Monitor anella instead of gw.bsc.es
...
The target gw.bsc.es doesn't reply to our ICMP probes from hut. However,
the anella hop in the tracepath is a good candidate to identify cuts
between the login and the provider and between the provider and external
hosts like Google or Cloudflare DNS.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
cd6983223e
Add ICMP probes
...
These probes check if we can reach several targets via ICMP, which is
not proxied, so they can be used to see if ICMP forwarding is working in
the login node.
In particular, we test if we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping
only from the intranet and the login node (ssfhead).
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
fb8a0cb0a3
Enable proxy for Grafana too
...
The alerts need to contact the slack endpoint, so we add the proxy
environment variables to the grafana systemd service.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
a8c0ce5d06
Make blackbox exporter use the proxy
...
By default it was trying to reach the targets using the default gateway,
but since the electrical cut of 2023-10-20, the login node has not
enabled forwarding again. So better if we don't rely on it.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es >
2025-10-01 16:40:16 +02:00
9957a0269d
Don't log SLURM connection attempts from ssfhead
2025-10-01 16:40:16 +02:00
ca0937859d
Add docker runner too
2025-10-01 16:40:16 +02:00
4d362351cb
Monitor gitlab.bsc.es too
2025-10-01 16:40:16 +02:00
e9b4d87d9f
Monitor PM webpage via blackbox
2025-10-01 16:40:16 +02:00
457e403258
Temporarily disable pm runners
2025-10-01 16:40:16 +02:00
32b9cc17a9
Add runner for gitlab.bsc.es
2025-10-01 16:40:16 +02:00
fbabc06641
Allow anonymous access to grafana
2025-10-01 16:40:16 +02:00
7b67b2b703
Remove user/group when using DynamicUsers
2025-10-01 16:40:16 +02:00
ce964b9b65
Set the SLURM_CONF variable
2025-10-01 16:40:16 +02:00
b84066fde5
Enable slurm-exporter service
2025-10-01 16:40:16 +02:00
38e23068f2
Add prometheus-slurm-exporter package
2025-10-01 16:40:16 +02:00
6b2351bfe7
Document the hut shared nix store for SLURM
2025-10-01 16:40:16 +02:00
b84d1d5e26
Mount the hut nix store for SLURM jobs
2025-10-01 16:40:16 +02:00
6d8fd353d0
Enable direnv integration
2025-10-01 16:40:16 +02:00
642507b255
Remove bscpkgs from the registry and nixPath
...
This is done to prevent accidental evaluations where the nixpkgs input
of bscpkgs is still pointing to a different version than the one
specified in the jungle flake. Instead use jungle#bscpkgs.X to get a
package from bscpkgs.
2025-10-01 16:40:16 +02:00
128136c137
Add bscpkgs and nixpkgs top level attributes
...
Allows the evaluation of packages of the intermediate overlays.
2025-10-01 16:40:16 +02:00
1242aad9a3
Use hut packages as the default package set
...
Allows the user to directly access nixpkgs and bscpkgs from the top
level as `nix build jungle#htop` and `nix build jungle#bsc.ovni`.
2025-10-01 16:40:16 +02:00
0ca0da9ffe
Don't fetch registry flakes from the net
2025-10-01 16:40:16 +02:00
03822c8b26
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
2025-10-01 16:40:16 +02:00
77276fb6c1
Revert "Update slurm to 23.02.05.1"
...
This reverts commit aaefddc44a.
2025-10-01 16:40:16 +02:00
1b296f2ce7
Open ports in firewall of compute nodes
2025-10-01 16:40:16 +02:00
798fa002cc
Update slurm to 23.02.05.1
2025-10-01 16:40:16 +02:00
44667e8e40
Monitor storage nodes via IPMI too
2025-10-01 16:40:16 +02:00
668a65b9c6
Specify the space available in /ceph
2025-10-01 16:40:16 +02:00
73f18d5801
Add update post to website
2025-10-01 16:40:16 +02:00
627c912b87
Enable fstrim service
2025-10-01 16:40:16 +02:00
66b5074ff1
Serve the nix store from hut
2025-10-01 16:40:16 +02:00
79446cebcb
Add encrypted munge key with agenix
2025-10-01 16:40:16 +02:00
061fc60939
Remove unused large port hole in firewall
2025-10-01 16:40:16 +02:00
09ac1d6c13
Make exporters listen in localhost only
2025-10-01 16:40:16 +02:00
a6324e47e8
Allow only some ports for srun
2025-10-01 16:40:16 +02:00
2f258e1cdd
Block ssfhead from reaching our slurm daemon
2025-10-01 16:40:16 +02:00
4c88f9a783
Poweroff idle slurm nodes after 1 hour
2025-10-01 16:40:16 +02:00
01140353c6
Add IB and IPMI node host names
2025-10-01 16:40:16 +02:00
c38a01c8dc
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
2025-10-01 16:40:16 +02:00
aa52236a80
Unlock ovni gitlab runners
2025-10-01 16:40:16 +02:00
3a31dcd58b
Update email contact to jungle mail list
2025-10-01 16:40:16 +02:00
d57849a954
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=18d64c352c10f9ce74aabddeba5a5db02b74ec27' (2023-08-31)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/d680ded26da5cf104dd2735a51e88d2d8f487b4d' (2023-08-19)
→ 'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
2025-10-01 16:40:16 +02:00
6850bf3a71
Add agenix to all nodes
2025-10-01 16:40:16 +02:00
aa92294907
Add agenix module to ceph
2025-10-01 16:40:16 +02:00
da92154d33
Remove old secrets
2025-10-01 16:40:16 +02:00
adec7f80fd
Mount /ceph in owl1 and owl2
2025-10-01 16:40:16 +02:00
8a0034a867
Warn about the owl2 omnipath device
2025-10-01 16:40:16 +02:00
6828273c05
Clean owl2 configuration
2025-10-01 16:40:16 +02:00
8cedffe040
Move the ceph client config to an external module
2025-10-01 16:40:16 +02:00
8a027d8b09
Reorganize secrets and ssh keys
...
The agenix tool needs to read the secrets from a standalone file, but
we also need the same information for the SSH keys.
2025-10-01 16:40:16 +02:00
1f32b8409a
Add anavarro user
2025-10-01 16:40:16 +02:00
bc51564a88
Set zsh inc_append_history option
2025-10-01 16:40:16 +02:00
8ba4f910c3
Set zsh shell for rarias
2025-10-01 16:40:16 +02:00
515fa49ed0
Enable zsh and fix key bindings
2025-10-01 16:40:16 +02:00
c63fa494d5
Keep a log over time with the config commits
2025-10-01 16:40:16 +02:00
bf53f01f0a
Configure bscpkgs.nixpkgs to follow nixpkgs
2025-10-01 16:40:16 +02:00
a6d3f43b98
Store nixos config in /etc/nixos/config.rev
2025-10-01 16:40:16 +02:00
76e6ae2f00
Enable binary emulation for other architectures
2025-10-01 16:40:16 +02:00
409efacf5b
Enable watchdog
2025-10-01 16:40:16 +02:00
e1e879178d
Enable all osd on boot in lake2
2025-10-01 16:40:16 +02:00
042ca9e882
Scrape lake2 too
2025-10-01 16:40:16 +02:00
9241bda0ac
Also enable monitoring in lake2
2025-10-01 16:40:16 +02:00
005a1be48a
Scrape metrics from bay
2025-10-01 16:40:16 +02:00
f86114f33e
Add monitoring in the bay node
2025-10-01 16:40:16 +02:00
af29f639e2
Add fio tool
2025-10-01 16:40:16 +02:00
0fe025e8be
Add ceph tools in hut too
2025-10-01 16:40:16 +02:00
6a429fda1b
Switch ceph logs to journal
2025-10-01 16:40:16 +02:00
4cea250cf4
Update ceph to 18.2.0 in overlay
2025-10-01 16:40:16 +02:00
3b823ee478
Move pkgs overlay to overlay.nix
2025-10-01 16:40:16 +02:00
d9dea762de
Enable ceph osd daemons in lake2
2025-10-01 16:40:16 +02:00
80efd57a11
Add the lake2 hostname to the hosts
2025-10-01 16:40:16 +02:00
cced6b0dc0
Use the sda for lake2
2025-10-01 16:40:16 +02:00
b63b450111
Remove netboot module
2025-10-01 16:40:16 +02:00
81baeee5b1
Disable pixiecore in hut for now
2025-10-01 16:40:16 +02:00
686f750c06
Add PXE helper
2025-10-01 16:40:16 +02:00
33155fcb62
Enable netboot again for PXE
2025-10-01 16:40:16 +02:00
6e89b3f936
Specify the disk by path
2025-10-01 16:40:16 +02:00
f0f67f374e
Prepare lake2 config after bootstrap
...
The disk ID is different under NixOS.
2025-10-01 16:40:16 +02:00
7443a192c6
Add lake2 bootstrap config
2025-10-01 16:40:16 +02:00
25f06db5f1
Add section to enable serial console
2025-10-01 16:40:16 +02:00
3c83996e26
Add agenix to PATH in hut
2025-10-01 16:40:16 +02:00
a4fc3d131a
Store ceph secret key in age
...
This allows a node to mount the ceph FS without any extra ceph
configuration in /etc/ceph.
2025-10-01 16:40:16 +02:00
660a8ae163
Add rarias key for secrets
2025-10-01 16:40:16 +02:00
91270b26bb
Add ceph metrics to prometheus
2025-10-01 16:40:16 +02:00
94ce6fedf9
Mount the ceph filesystem in hut
2025-10-01 16:40:16 +02:00
817c98d37b
Add ceph config in bay
2025-10-01 16:40:16 +02:00
9cd013c4ed
Add the bay host name
2025-10-01 16:40:16 +02:00
f707650724
Remove netboot and fixes
2025-10-01 16:40:15 +02:00
9c152ec9cc
Add bay node
2025-10-01 16:40:15 +02:00
7e789cd062
Update flake
2025-10-01 16:40:15 +02:00
8fcb5a1079
Monitor power from other nodes via LAN
2025-10-01 16:40:15 +02:00
b80656228d
Increase prometheus retention time to one year
2025-10-01 16:40:15 +02:00
cd6e6de2ad
Don't set all_proxy
2025-10-01 16:40:15 +02:00
ca78313752
Update nixpkgs to fix docker problem
2025-10-01 16:40:15 +02:00
ae2007e2fe
Allow access to devices for node_exporter
2025-10-01 16:40:15 +02:00
d8e366b444
GRUB version no longer needed
2025-10-01 16:40:15 +02:00
a18d56b3ae
Upgrade flake: nixpkgs, bscpkgs and agenix
2025-10-01 16:40:15 +02:00
8c1bf6db42
Kill slurmd remaining processes on upgrade
2025-10-01 16:40:15 +02:00
17f9a957eb
Add details to request access in the web
2025-10-01 16:40:15 +02:00
a094093d95
koro: Add vlopez user
2025-10-01 16:40:15 +02:00
cbe53a6f0a
Add koro node
2025-10-01 16:40:15 +02:00
1b8c3eb554
eudy: Add fcsv3 and intermediate versions for testing
2025-10-01 16:40:15 +02:00
c0335c1f95
eudy: Enable memory overcommit
2025-10-01 16:40:15 +02:00
d5857c0f7d
eudy: disable all cpu mitigations
2025-10-01 16:40:15 +02:00
d88caa8610
Add jungle.bsc.es hugo website
2025-10-01 16:40:15 +02:00
9097811cc0
Enable NTP using the BSC time server
2025-10-01 16:40:15 +02:00
83acd40880
Add the ssfhead node as gateway
2025-10-01 16:40:15 +02:00
ba75bf8249
Use our host names first by default
2025-10-01 16:40:15 +02:00
e9845cc76a
Add DNS tools to resolve hosts
2025-10-01 16:40:15 +02:00
d5951483ee
Lower perf_event_paranoid to -1
2025-10-01 16:40:15 +02:00
937d8a7637
Set perf paranoid to 0 by default
2025-10-01 16:40:15 +02:00
798e01f9e6
Add perf to packages
2025-10-01 16:40:15 +02:00
2ca7e7383e
Allow srun to specify the cpu binding
...
The task/affinity plugin needs to be selected.
2025-10-01 16:40:15 +02:00
b610f12133
Move authorized keys to users.nix
2025-10-01 16:40:15 +02:00
b6aaeb8158
Add rpenacob user
2025-10-01 16:40:15 +02:00
3d0f86ac07
Add osumb to the system packages
2025-10-01 16:40:15 +02:00
221ccef956
flake.lock: Update
...
Flake lock file updates:
• Updated input 'bscpkgs':
'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=c775ee4d6f76aded05b08ae13924c302f18f9b2c' (2023-04-26)
→ 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=cbe9af5d042e9d5585fe2acef65a1347c68b2fbd' (2023-06-16)
2025-10-01 16:40:15 +02:00
9f03799d34
Set mpi to mpich by default in bscpkgs
2025-10-01 16:40:15 +02:00
3a9615fce4
Add missing parameter to extend
2025-10-01 16:40:15 +02:00
45d7b31c0a
Use explicit order in overlays
2025-10-01 16:40:15 +02:00
73b33c4d6c
Replace mpi inside bsc attribute
2025-10-01 16:40:15 +02:00
bae3c75222
Add mpich overlay
2025-10-01 16:40:15 +02:00
f51e910aff
Add comments in slurm config
2025-10-01 16:40:15 +02:00
8b7ffc914a
Add eudy host key to known hosts
2025-10-01 16:40:15 +02:00
afb2bea1c9
Rename xeon08 to eudy
...
From Eudyptula, a little penguin.
2025-10-01 16:40:15 +02:00
7bbc526671
Update rebuild script for all nodes
2025-10-01 16:40:15 +02:00
4afe3121e6
Add ssh host keys
2025-10-01 16:40:15 +02:00
39f15a1b4f
Set the name of the slurm cluster to jungle
2025-10-01 16:40:15 +02:00
3fab341dc8
Change owl hostnames
2025-10-01 16:40:15 +02:00
6ec7353a27
Add owl and all partition
2025-10-01 16:40:15 +02:00
d679fd6314
Simplify flake and expose host pkgs
...
The configuration of the machines is now moved to m/
2025-10-01 16:40:15 +02:00
218acd6848
Rename xeon07 to hut
2025-10-01 16:40:15 +02:00
68805b337d
Remove profiles older than 30 days with gc
2025-10-01 16:40:15 +02:00
eae7ed4e10
Add ncdu to system packages
2025-10-01 16:40:15 +02:00
eb9b5e570f
Move arocanon user from xeon08 to common
2025-10-01 16:40:15 +02:00
0c4b85be3b
xeon08: Add config for kernel non-voluntary preemption
2025-10-01 16:40:15 +02:00
49ccf0a3f3
xeon08: Add perf
2025-10-01 16:40:15 +02:00
459b29924c
xeon08: Enable lttng lockdep tracepoints
2025-10-01 16:40:15 +02:00
0d1a7e59ee
xeon08: Add lttng module and tools
2025-10-01 16:40:15 +02:00
b959c72979
Serve grafana in https://jungle.bsc.es/grafana
2025-10-01 16:40:15 +02:00
68f7c02555
Add tree command
2025-10-01 16:40:15 +02:00
26caa390cb
Add file to system packages
2025-10-01 16:40:15 +02:00
f0af9d87a0
Add gnumake to system packages
2025-10-01 16:40:15 +02:00
038340c5d2
Add cmake to system packages
2025-10-01 16:40:15 +02:00
471d44f013
Add ix to common packages
2025-10-01 16:40:15 +02:00
af03554610
Improve documentation
2025-10-01 16:40:15 +02:00
35898c68e7
Add gitignore
2025-10-01 16:40:15 +02:00
769317c5a7
Set intel_pstate=passive and disable frequency boost
2025-10-01 16:40:14 +02:00
a9f9a1e8e5
Add xeon08 basic config
2025-10-01 16:40:14 +02:00
0a8097231d
Add nixos-config.nix to easily enable nix repl
2025-10-01 16:40:14 +02:00
be90239bbc
Automatically resume restarted nodes in SLURM
2025-10-01 16:40:14 +02:00
f2fc2af77a
Allow public dashboards in grafana
2025-10-01 16:40:14 +02:00
7828d616fc
Add hal ssh key
2025-10-01 16:40:14 +02:00
61ea1af68b
Increase the number of CPUs to 56 for nOS-V docker
2025-10-01 16:40:14 +02:00
81243cbc95
Allow 5 concurrent builds in the gitlab-runner
2025-10-01 16:40:14 +02:00
847d516d6d
Simplify bash prompt
2025-10-01 16:40:14 +02:00
a6c060a25b
Roll back to bash as default shell
...
Zsh doesn't behave properly; it needs further configuration.
2025-10-01 16:40:14 +02:00
70d2255634
Use pmix by default in slurm
2025-10-01 16:40:14 +02:00
7b176a3780
Increase locked memory to 1 GiB
2025-10-01 16:40:14 +02:00
e7bfba54bc
Use the latest kernel
2025-10-01 16:40:14 +02:00
30a12729f1
Disable osnoise and hwlat tracer for now
...
Reuse nix cache to avoid rebuilding the kernel.
2025-10-01 16:40:14 +02:00
8851dbe212
Update nixpkgs to nixos-unstable
2025-10-01 16:40:14 +02:00
b7d52423c9
Update nixpkgs
2025-10-01 16:40:14 +02:00
32b84a5add
Update ib interface name in xeon02
...
It seems to be plugged into another PCI port.
2025-10-01 16:40:14 +02:00
9267ca51fa
Add steps in install documentation
2025-10-01 16:40:14 +02:00
5c9f5e845f
Add minimal netboot module to build kexec image
2025-10-01 16:40:14 +02:00
2d204c5cc0
Add xeon02 configuration
2025-10-01 16:40:14 +02:00
99481e6203
Refactor slurm configuration into compute/control
2025-10-01 16:40:14 +02:00
8b4dda98af
Lock flakes and add inputs
2025-10-01 16:40:14 +02:00
0c296c1825
Test flakes
2025-10-01 16:40:14 +02:00
513c182d24
Enable slurm in xeon01
2025-10-01 16:40:14 +02:00
a2fe4b552a
Use xeon07 as control machine
2025-10-01 16:40:14 +02:00
1b030ab76d
Remove xeon07 overlay to load upstream slurm
2025-10-01 16:40:14 +02:00
f0abbac1c7
Add script to rebuild configuration
2025-10-01 16:40:14 +02:00
6f80ebb483
Add configuration for xeon01
2025-10-01 16:40:14 +02:00
ce00282704
Load overlays from /config
2025-10-01 16:40:14 +02:00
7829ba7509
Move net.nix to common
2025-10-01 16:40:14 +02:00
a476b1758b
Remove host specific network options from net.nix
2025-10-01 16:40:14 +02:00
51a77b5213
Move ssh.nix to common
2025-10-01 16:40:14 +02:00
020ce58efd
Move overlays.nix to common
2025-10-01 16:40:14 +02:00
ab978758ba
Move users.nix to common
2025-10-01 16:40:14 +02:00
774350f288
Move common options from configuration.nix
2025-10-01 16:40:14 +02:00
b739c96882
Move the remaining hw config to common
2025-10-01 16:40:14 +02:00
32d782be96
Move boot config to common/boot.nix
2025-10-01 16:40:14 +02:00
1432a26fba
Move filesystems config to common/fs.nix
2025-10-01 16:40:14 +02:00
f5fa915e09
Use partition labels for / and swap
2025-10-01 16:40:14 +02:00
d6a7a87207
Move fs.nix to common
2025-10-01 16:40:14 +02:00
13f72753a5
Move boot.nix to common
2025-10-01 16:40:14 +02:00
fa757aaaab
Move disk selection to configuration.nix
2025-10-01 16:40:14 +02:00
15906e5818
Add common directory
2025-10-01 16:40:14 +02:00