Remove blobs #185

Closed
rarias wants to merge 459 commits from remove-blobs into old-master

459 Commits

Author SHA1 Message Date
8e0345b866 Add acinca user
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
7350983bb3 Restart slurmd on failure
A failure to reach the control node can cause slurmd to fail and the
unit remains in the failed state until is manually restarted. Instead,
try to restart the service every 30 seconds, forever:

    owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
    Restart=on-failure
    RestartUSec=30s
    owl1% pgrep slurmd
    5903
    owl1% sudo kill -SEGV 5903
    owl1% pgrep slurmd
    6137

Fixes: #177
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
7c34907f52 Lower connect timeout when using hut substituter
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:18 +02:00
15098cb89f Use hut substituter in all nodes
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:18 +02:00
31a3ac4a4d Remove machine access for user csiringo
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:18 +02:00
8cbb9dd58a Add web post update for 2025
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
dbeb973863 Mount apex /home via NFS in raccoon
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
cb8f27071d Remove extra SSH jump configuration
We now have direct visibility among nodes so we don't need any extra
SSH configuration to reach them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
4021e98def Add raccoon peer to wireguard
It routes traffic from fox, apex and the compute nodes so that we can
reach the git servers and tent.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
5afcdbba5e Add raccoon host key
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
71b24e1b28 Restrict fox peer to a single IP
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
407e93eff3 Use lowercase peer hostnames
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
36c5befc5b Share a public folder for documents
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
59fd50fbee Fix AMDuProfPcm so it finds libnuma.so
We change the search procedure so it detects NixOS from /etc/os-release
and uses "libnuma.so" when calling dlopen, instead of harcoding a full
path to /usr. The full patch of libnuma is stored in the runpath, so
dlopen can find it.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
Tested-by: Vincent Arcila <vincent.arcila@bsc.es>
2025-10-01 16:40:18 +02:00
113097f2fd Add amd_hsmp module in fox for AMD uProf
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
6d4ec8dedc Add AMD uProf section to fox documentation
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
9fcb7c4caa Fix hidden dependencies for AMDuProfSys
It tries to dlopen libcrypt.so.1 and libstdc++.so.6, so we make sure
they are available by adding them to the runpath.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
723860aea0 Disable NMI watchdog in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
9ea0d5f7ac Fix amd-uprof dependencies with patchelf
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
ceb93bdd6b Fix hrtimer new interface
The hrtimer_init() is now done via hrtimer_setup() with the callback
function as argument.

See: https://lwn.net/Articles/996598/
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
7f8a8c8ea3 Use CFLAGS_MODULE instead of EXTRA_CFLAGS
Fixes the build in Linux 6.15.6, as it was not able to find the include
files.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
814e3b4f0b Add AMD uProf module and enable it in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
db57b52db4 Add AMD uProf package and driver
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
d43d432a98 Add /nfs/home to fox documentation
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
f793d3efdc Mount home via NFS from apex in fox
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
76c36b1c41 Allow access to NFS via wireguard subnet
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
0de9a4c9c4 Use 10.106.0.0/24 subnet to avoid collisions
The 106 byte is the code for 'j' (jungle) in ASCII:

	% printf j | od -t d
	0000000         106
	0000001

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
c1dea8fdf0 Update fox documentation for SLURM and FS
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
85996dd334 Revert "Remove pam_slurm_adopt from fox"
This reverts commit 64a52801ed.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
35cac2964c Enable fail2ban in fox
Protect fox against ssh bruteforce attacks:

fox% sudo lastb | head
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:25 - 11:25  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:24 - 11:24  (00:00)
root     ssh:notty    200.124.28.102   Mon Sep  1 11:24 - 11:24  (00:00)

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
2a1782c8ad Accept connections from apex to fox slurmd
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
6bde1db500 Accept fox connection to slurm controller
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
cf0692795c Add fox machine to SLURM
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
cbfaa4cb6f Rekey secrets with trusted fox key
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
7e60c2c0dd Trust fox for compute node secrets
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
d41aeba923 Make apex host specific to each machine
Allows direct contact via the VPN when accessing from fox, but use
Internet when using the rest of the machines.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
8de6b0c1c3 Add local host fox in apex
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
710982378b Enable wireguard in apex
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
725a826ea4 Add wireguard server in fox
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
c8444b696a Use writeShellScript for suspend.sh and resume.sh
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
b5d22b2d27 Add firewall rules to slurm server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
8d0b759524 Remove hut from slurm
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
6a99b096e8 Only configure apex as slurm server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
9daa2a2649 Split slurm configuration for client and server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
86ec606121 Move slurm control server to apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
b1a2effdd6 Fix typo in csiringo ssh key
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:18 +02:00
96a85c107b Enable nix-ld in weasel
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:18 +02:00
f19f343330 Add csiringo user with access to apex and weasel
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:18 +02:00
3e378ae523 Access gitlab via raccoon in fox
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:18 +02:00
5eb2e2516a Move StartLimit* options to unit section
The StartLimitBurst and StartLimitIntervalSec options belong to the
[Unit] section, otherwise they are ignored in [Service]:

> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.

When using [Unit], the limits are properly set:

  apex% systemctl show power-policy.service | grep StartLimit
  StartLimitIntervalUSec=10min
  StartLimitBurst=10
  StartLimitAction=none

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
0805684887 Set power policy to always turn on
In all machines, as soon as we recover the power, turn the machine back
on. We cannot rely on the previous state as we will shut them down
before the power is cut to prevent damage on the power supply
monitoring circuit.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
a433a1686b Add NixOS module to control power policy
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
9fa66aaa41 Move August shutdown to 3rd at 22h
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:18 +02:00
49ffca7492 Disable automatic August shutdown for Fox
The UPC has different dates for the yearly power cut, and Fox can
recover properly from a power loss, so we don't need to have it turned
off before the power cut. Simply disabling the timer is enough.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
eaeff6a564 Add cudainfo program to test CUDA
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as standalone program or
built with cudainfo.gpuCheck so it is executed inside the build sandbox
to see if it also works fine. It uses the autoAddDriverRunpath hook to
inject in the runpath the location of the library directory for CUDA
libraries.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
4db07cb9c3 Add missing symlink in cuda sandbox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
e2d8a17fad Enable cuda systemFeature in raccoon and fox
This allows running derivations which depend on cuda runtime without
breaking the sandbox. We only need to add `requiredSystemFeatures = [ "cuda" ];`
to the derivation.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:17 +02:00
a05ef0c3eb Move shared nvidia settings to a separate module
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:17 +02:00
ef2f2115de Replace xeon07 by hut in ssh config
The xeon07 machine has been renamed to hut.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:17 +02:00
deb3370cdc Enable automatic Nix GC in raccoon
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
1d2dca5869 Select proprietary NVIDIA driver in raccoon
The NVIDIA GTX 960 from 2016 has the Maxwell architecture, and NixOS
suggests using the proprietary driver for older than Turing:

> It is suggested to use the open source kernel modules on Turing or
> later GPUs (RTX series, GTX 16xx), and the closed source modules
> otherwise.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
111c13d200 Enable open source NVidia driver in fox
It is recommended for newer versions.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
ef166563d8 Remove option allowUnfree from fox and raccoon
It is already set to true for all machines.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
9e5ce89600 Ban another scanner trying to connect via SSH
It is constantly spamming out logs:

apex# journalctl | grep 'Connection closed by 84.88.52.176' | wc -l
2255

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
9c11076a43 Update weasel IPMI hostname for monitoring
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
a81aebc788 Remove merged MPICH patch
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
0a39169d17 Remove package ix as it is gone
Fails with: "error: ix has been removed from Nixpkgs, as the ix.io
pastebin has been offline since Dec. 2023".

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
957da4b1fd flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/f6291c5935fdc4e0bef208cfc0dcab7e3f7a1c41?narHash=sha256-b%2Buqzj%2BWa6xgMS9aNbX4I%2BsXeb5biPDi39VgvSFqFvU%3D' (2024-08-10)
  → 'github:ryantm/agenix/531beac616433bac6f9e2a19feb8e99a22a66baf?narHash=sha256-9P1FziAwl5%2B3edkfFcr5HeGtQUtrSdk/MksX39GieoA%3D' (2025-06-17)
• Updated input 'agenix/darwin':
    'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d?narHash=sha256-gzGLZSiOhf155FW7262kdHo2YDeugp3VuIFb4/GGng0%3D' (2023-11-24)
  → 'github:lnl7/nix-darwin/43975d782b418ebf4969e9ccba82466728c2851b?narHash=sha256-dyN%2BteG9G82G%2Bm%2BPX/aSAagkC%2BvUv0SgUw3XkPhQodQ%3D' (2025-04-12)
• Updated input 'agenix/home-manager':
    'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1?narHash=sha256-7ulcXOk63TIT2lVDSExj7XzFx09LpdSAPtvgtM7yQPE%3D' (2023-12-20)
  → 'github:nix-community/home-manager/abfad3d2958c9e6300a883bd443512c55dfeb1be?narHash=sha256-YZCh2o9Ua1n9uCvrvi5pRxtuVNml8X2a03qIFfRKpFs%3D' (2025-04-24)
• Updated input 'bscpkgs':
    'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=6782fc6c5b5a29e84a7f2c2d1064f4bcb1288c0f' (2024-11-29)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=9d1944c658929b6f98b3f3803fead4d1b91c4405' (2025-06-11)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/9c6b49aeac36e2ed73a8c472f1546f6d9cf1addc?narHash=sha256-i/UJ5I7HoqmFMwZEH6vAvBxOrjjOJNU739lnZnhUln8%3D' (2025-01-14)
  → 'github:NixOS/nixpkgs/dfcd5b901dbab46c9c6e80b265648481aafb01f8?narHash=sha256-Kt1UIPi7kZqkSc5HVj6UY5YLHHEzPBkgpNUByuyxtlw%3D' (2025-07-13)

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
acd8fd6c51 Upgrade nixpkgs to nixos 25.05
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
99741382ab Silently ban OpenVAS BSC scanner from apex
It is spamming our logs with refused connection lines:

apex% sudo journalctl -b0 | grep 'refused connection.*SRC=192.168.8.16' | wc -l
13945

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
d487241db2 Rotate anavarro password and SSH key
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
24128e22f4 Add weasel machine configuration
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
a36b403bd4 Remove extra flush commands on firewall stop
They are not needed as they are already flushed when the firewall
starts or stops.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
c7a52e2999 Prevent accidental use of nftables
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
88d2de454b Add proxy configuration for internal hosts
Access internal hosts via apex proxy. From the compute nodes we first
open an SSH connection to apex, and then tunnel it through the HTTP
proxy with netcat.

This way we allow reaching internal GitLab repositories without
requiring the user to have credentials in the remote host, while we can
use multiple remotes to provide redundancy.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
1899ad89db Remove unused blackbox configuration modules
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
cd37f86b09 Use IPv4 in blackbox probes
Otherwise they simply fail as IPv6 doesn't work.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
d5a9086afa Make NFS mount async to improve latency
Don't wait to flush writes, as we don't care about consistency on a
crash:

> This option allows the NFS server to violate the NFS protocol and
> reply to requests before any changes made by that request have been
> committed to stable storage (e.g. disc drive).
>
> Using this option usually improves performance, but at the cost that
> an unclean server restart (i.e. a crash) can cause data to be lost or
> corrupted.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
7be793da3d Disable root_squash from NFS
Allows root to read files in the NFS export, so we can directly run
`nixos-rebuild switch` from /home.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
327c0d61a2 Remove SSH proxy to access BSC clusters
We now have direct connection to them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
f70074b216 Add users to apex machine
They need to be able to login to apex to access any other machine from
the SSF rack.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
fd99204913 Remove proxy from hut HTTP probes
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
6826a372d5 Remove proxy configuration from environment
All machines have now direct connection with the outside world.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
3d02a231a5 Add storcli utility to apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
52bd019cdd Add new configuration for apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
57b9450a59 Add pmartin1 user with access to fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
daba3eca18 Add access to fox for rpenacob user
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
c47c880ff2 Revert "Only allow Vincent to access fox for now"
This reverts commit efac36b186.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
5318fb1a0d Add all terminfo files in environment
Fixes problems with the kitty terminal when opening vim or kakoune.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:17 +02:00
ad834cebd6 Monitor Fox BMC with ICMP probes too
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
43c20c544d Restrict DAC VPN to fox-ipmi machine only
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
5928c68720 Monitor fox via VPN
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
8eb14acdf9 Add OpenVPN service to connect to fox BMC
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
d83ad5decb Add ac.upc.edu as name search server
Allows referring to fox.ac.upc.edu directly as fox.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
bc64d62a7f Update access instructions
We no longer need to request a petition through BSC, as we will be in
charge of the login. Remove link to the old repository as well and
prefer only email.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
d07855e5c5 Disable kptr_restrict in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
98b882507b Disable NUMA balancing in fox
See: https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#numa-balancing

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
e6f526bbc6 Load amd_uncore module in fox
Needed for L3 events in perf.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
df70515bc8 Enable SSH X11 forwarding
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
dfee33038b Disable registration in Gitea
Get rid of all the spam accounts they are trying to register.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
6e316e57be Enable msmtp configuration in tent
Allows gitea to send notifications via email.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
28e094d4c1 Add GitLab runner with debian docker for PM
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
1210d96ae9 Monitor nix-daemon in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
d5227db996 Move nix-daemon exporter to modules
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
eeb8557b96 Add p service for pastes
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
e94ef5a08d Enable public-inbox service in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
e1359d134e Enable gitea in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
2e94b6795b Add bsc.es to resolve domain names
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
965dca422a Monitor AXLE machine too
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
03209b6bfc Use IPv4 for blackbox exporter
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
a4328fe380 Add public html files to tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
085b92ce0f Add docker GitLab runner for BSC GitLab
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
96d7f186d2 Add GitLab shell runner in tent for PM
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
01891b9bef Enable jungle robot emails for Grafana in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
71bbdd5922 Add tent key for nix-serve
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
c51ca035b7 Remove jungle nix cache from tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
8b2c9dcacd Enable nix cache
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
017b57670b Serve Grafana from subpath
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
de99ff3414 Add nginx server in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
5d25805c6a Add monitoring in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
ef2f31510c Disable nix garbage collector in tent
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:17 +02:00
59961d1351 Rekey secrets with tent keys
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
3cb9563738 Add tent host key and admin keys
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
2006a5fb05 Create directories in /vault/home for tent users
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
ddcf158758 Add software RAID in tent using 3 disks
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
df0ce98526 Add access to tent to all hut users too
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
0022dfab63 Add hut SSH configuration from outside SSF LAN
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
5dce13c512 Don't use proxy in base preset
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
144d87008d Add tent machine from xeon04
We moved the tent machine to the server room in the BSC building and is
now directly connected to the raccoon via NAT.

Fixes: #106
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
7a312bd01c Create specific SSF rack configuration
Allow xeon machines to optionally inherit SSF configuration such as the
NFS mount point and the network configuration.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
7905969874 Only allow Vincent to access fox for now
Needed to run benchmarks without interference.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
3797a8ecaf Use performance governor in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
9c3274d068 Add hut as nix cache in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
ce4c653eb2 Use extra- for substituters and trusted-public-keys
From the nix manual:

> A configuration setting usually overrides any previous value. However,
> for settings that take a list of items, you can prefix the name of the
> setting by extra- to append to the previous value.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:17 +02:00
d76f38b502 Use DHCP for Ethernet in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
d6b7421f3f Use UPC time servers as others are blocked
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
2760355358 Create tracing group and add arocanon in raccoon
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
a8c68a630f Extend perf support in raccoon
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
0b330e5274 Enable nixdebuginfod in raccoon
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
3e080465a4 Make raccoon use performance governor
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
ac46401243 Enable binfmt emulation in raccoon
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
54a30e063c Disable nix garbage collector in raccoon
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
384ef9e9df Add dbautist user to raccoon machine
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
fe344ea31a Add node exporter monitoring in raccoon
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
0271ba399f Allow X11 forwarding via SSH
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
d53c7a3acb Enable linger for user rarias
Allows services to run without a login session.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
a3e48bc83c Only proxy SSH git remotes via hut in xeon
Other machines like raccoon have direct access.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
c2bfe806fe Add machine map file
Documents the location, board and serial numbers so we can track the
machines if they move around. Some information is unkown.

Using the Nix language to encode the machines location and properties
allows us to later use that information in the configuration of the
machines themselves.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
6f4fe9bb22 Remove fox monitoring via IPMI
We will need to setup an VPN to be able to access fox in its new
location, so for now we simply remove the IPMI monitoring.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
dad5225486 Monitor fox, gateway and UPC anella via ICMP
Fox should reply once the machine is connected to the UPC network.
Monitoring also the gateway and UPC anella allows us to estimate if the
whole network is down or just fox.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
1b090785a0 Update configuration for UPC network
The fox machine will be placed in the UPC network, so we update the
configuration with the new IP and gateway. We won't be able to reach hut
directly so we also remove the host entry and proxy.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
dbedeb3613 Disable home via NFS in fox
It won't be accesible anymore as we won't be in the same LAN.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
8887b9e1f8 Rekey all secrets
Fox is no longer able to use munge or ceph, so we remove the key and
rekey them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
2af538e99b Rotate fox SSH host key
Prevent decrypting old secrets by reading the git history.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
7f16280fd5 Distrust fox SSH key
We no longer will share secrets with fox until we can regain our trust.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
f7cad2381c Remove Ceph module from fox
It will no longer be accesible from the UPC.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
40ef1d4886 Remove fox from SLURM
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
3e0674f872 Remove pam_slurm_adopt from fox
We no longer will be able to use SLURM from jungle.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
5dbc100738 Add UPC temperature sensor monitoring
These sensors are part of their air quality measurements, which just
happen to be very close to our server room.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
371b8b6a23 Add meteocat exporter
Allows us to track ambient temperature changes and estimate the
temperature delta between the server room and exterior temperature.
We should be able to predict when we would need to stop the machines due
to excesive temperature as summer approaches.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
102d8706dd Add custom nix-daemon exporter
Allows us to see which derivations are being built in realtime. It is a
bit of a hack, but it seems to work. We simply look at the environment
of the child processes of nix-daemon (usually bash) and then look for
the $name variable which should hold the current derivation being
built. Needs root to be able to read the environ file of the different
nix-daemon processes as they are owned by the nixbld* users.

See: https://discourse.nixos.org/t/query-ongoing-builds/23486
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
edca5972ee Set keep-outputs to true in all machines
From the documentation of keep-outputs, setting it to true would prevent
the GC from removing build time dependencies:

If true, the garbage collector will keep the outputs of non-garbage
derivations. If false (default), outputs will be deleted unless they are
GC roots themselves (or reachable from other roots).

In general, outputs must be registered as roots separately. However,
even if the output of a derivation is registered as a root, the
collector will still delete store paths that are used only at build time
(e.g., the C compiler, or source tarballs downloaded from the network).
To prevent it from doing so, set this option to true.

See: https://nix.dev/manual/nix/2.24/command-ref/conf-file.html#conf-keep-outputs
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:17 +02:00
b7e1e4faa8 Add raccoon node exporter monitoring
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
f2cfc4a415 Increase data retention to 5 years
Now that we have more space, we can extend the retention time to 5 years
to hold the monitoring metrics. For a year we have:

	# du -sh /var/lib/prometheus2
	13G     /var/lib/prometheus2

So we can expect it to increase to about 65 GiB. In the future we may
want to reduce some adquisition frequency.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
41e83fc5ee Don't forward any docker traffic
Access to the 23080 local port will be done by applying the INPUT rules,
which pass through nixos-fw.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
166508fc4f Allow traffic from docker to enter port 23080
Before:

  hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
  + true
  + nc -w 3 -v 10.0.40.7 23080
  nc: 10.0.40.7 (10.0.40.7:23080): Operation timed out

After:

  hut% sudo docker run -it --rm alpine /bin/ash -xc 'true | nc -w 3 -v 10.0.40.7 23080'
  + true
  + nc -w 3 -v 10.0.40.7 23080
  10.0.40.7 (10.0.40.7:23080) open

Fixes: #94
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
b1795ba5be Add bscpm04.bsc.es SSH host and public key
Allows fetching repositories from hut and other machines in jungle
without the need to do any extra configuration.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
dcfd74f387 Add nix cache documentation section
Include usage from NixOS and non-NixOS hosts and a test with curl to
ensure it can be reached.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:17 +02:00
03cdf10cbc Use hut nix cache in owl1, owl2 and raccoon
For owl1 and owl2 directly connect to hut via LAN with HTTP, but for
raccoon pass via the proxy using jungle.bsc.es with HTTPS. There is no
risk of tampering as packages are signed.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:17 +02:00
5b14172646 Clean all iptables rules on stop
Prevents the "iptables: Chain already exists." error by making sure that
we don't leave any chain on start. The ideal solution is to use
iptables-restore instead, which will do the right job. But this needs to
be changed in NixOS entirely.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
85d4e8ad5c Make nginx listen on all interfaces
Needed for local hosts to contact the nix cache via HTTP directly.
We also allow the incoming traffic on port 80.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
0ecf221730 Fix nginx /cache regex
`nix-serve` does not handle duplicates in the path:
```
hut$ curl http://127.0.0.1:5000/nix-cache-info
StoreDir: /nix/store
WantMassQuery: 1
Priority: 30
hut$ curl http://127.0.0.1:5000//nix-cache-info
File not found.
```

This meant that the cache was not accessible via:
`curl https://jungle.bsc.es/cache/nix-cache-info` but
`curl https://jungle.bsc.es/cachenix-cache-info` worked.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:17 +02:00
61df5d4ddb Add new GitLab runner for gitlab.bsc.es
It uses docker based on alpine and the host nix store, so we can perform
builds but isolate them from the system.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
fc75b32c5f Remove SLURM partition all
We no longer have homogeneous nodes so it doesn't make much sense to
allocate a mix of them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
97c1fb240d Add varcila user to hut and fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
2938acc3e4 Adjust fox slurm config after disabling SMT
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
a886d6c943 Add abonerib user to fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
5abdc0da89 Don't move doc in web output
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
9839206f4e Add quickstart guide
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
a7aa3b79a1 Reject SSH connections without SLURM allocation
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
094801a362 Add users to fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
e8d65e70e9 Add dalvare1 user
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
4adfc0297f Add fox page in jungle website
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
8b64f53dac Mount NVME disks in /nvme{0,1}
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
6af214dfa3 Exclude fox from being suspended by slurm
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
226dba428e Use IPMI host names instead of IP addresses
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
b18a9f99ef Add fox IPMI monitoring
Use agenix to store the credentials safely.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:17 +02:00
c0f5db745b Add new fox machine
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
7d84c9e088 Update PM GitLab tokens to new URL
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
7e5211d049 Fix MPICH build by fetching upstream patches too
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
c0531bbe8a Fix papermod theme in website for new hugo
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
32367fbc07 flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
  → 'github:ryantm/agenix/f6291c5935fdc4e0bef208cfc0dcab7e3f7a1c41' (2024-08-10)
• Updated input 'bscpkgs':
    'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709' (2024-04-24)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=6782fc6c5b5a29e84a7f2c2d1064f4bcb1288c0f' (2024-11-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)
  → 'github:NixOS/nixpkgs/9c6b49aeac36e2ed73a8c472f1546f6d9cf1addc' (2025-01-14)

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
ae2194debe Set nixpkgs to track nixos-24.11
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
693a96878a Add script to monitor GPFS
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
37ed60eb09 Add BSC machines to ssh config
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
23b58839de Collect statistics from logged users
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
93546953aa Add custom GPFS exporter for MN5
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
3ce5dd7c68 Remove exception to fetch task endpoint
It causes the request to go to the website rather than the Gitea
service.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
74c0ad07ad Use SSD for boot, then switch to NVME
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
4d7c8378bf Use NVME as root
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
2d487e9722 Keep host header for Grafana requests
This was breaking requests due to CSRF check.

See: https://github.com/grafana/grafana/issues/45117#issuecomment-1033842787
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
53e7ce6b64 Ignore logging requests from the gitea runner
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
0a3429ed8f Log the client IP not the proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
3ebd3852f6 Ignore misc directory
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
8789f6d1fe Create paste directories in /ceph/p
Ensure that all hut users have a paste directory in /ceph/p owned by
themselves. We need to wait for the ceph mount point to create them, so
we use a systemd service that waits for the remote-fs.target.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
a5b512dd67 Add paste documentation in jungle website
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
b8db4ad3cd Add p command to paste files
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
8a150555a6 Use nginx to serve website and other services
Instead of using multiple tunels to forward all our services to the VM
that serves jungle.bsc.es, just use nginx to redirect the traffic from
hut. This allows adding custom rules for paths that are not posible
otherwise.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
e76e10ec19 Mount the NVME disk in /nvme
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
c4e12872d9 Delay nix-gc until /home is mounted
Prevents starting the garbage collector before the remote FS are
mounted, in particular /home. Otherwise, all the gcroots which have
symlinks in /home will be considered stale and they will be removed.

See: #79
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
7c381b2b65 Add dbautist user with access to hut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
92482721b4 Set the serial console to ttyS1 in raccoon
Apparently the ttyS0 console doesn't exist but ttyS1 does:

  raccoon% sudo stty -F /dev/ttyS0
  stty: /dev/ttyS0: Input/output error
  raccoon% sudo stty -F /dev/ttyS1
  speed 9600 baud; line = 0;
  -brkint -imaxbel

The dmesg line agrees:

  00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A

The console configuration is then moved from base to xeon to allow
changing it for the raccoon machine.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
fd23d84da8 Remove setLdLibraryPath and driSupport options
They have been removed from NixOS. The "hardware.opengl" group is now
renamed to "hardware.graphics".

See: 98cef4c273
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
291feda7ff Add documentation section about GRUB chain loading
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
6f22f683a9 Add 10 min shutdown jitter to avoid spikes
The shutdown timer will fire at slightly different times for the
different nodes, so we slowly decrease the power consumption.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
240b26d82e Don't mount the nix store in owl nodes
Initially we planned to run jobs in those nodes by sharing the same nix
store from hut. However, these nodes are now used to build packages
which are not available in hut. Users also ssh to the nodes, which
doesn't mount the hut store, so it doesn't make much sense to keep
mounting it.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
fd76a36c36 Emulate other architectures in owl nodes too
Allows cross-compilation of packages for RISC-V that are known to try to
run RISC-V programs in the host.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
f1373e5227 Program shutdown for August 2nd for all machines
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
d8ca283b80 Enable debuginfod daemon in owl nodes
WARNING: This will introduce noise, as the daemon wakes up from time to
time to check for new packages.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
cdabc58c09 Set gitea and grafana log level to warn
Prevents filling the journal logs with information messages.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
c737993c9c Set default SLURM job time limit to one hour
Prevents enless jobs from being left forever, while allow users to
request a larger time limit.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
60c094030b Allow other jobs to run in unused cores
The current select mechanism was using the memory too as a consumable
resource, which by default only sets 1 MiB per node. As each job already
requests 1 MiB, it prevents other jobs from running.

As we are not really concerned with memory usage, we only use the unused
cores in the select criteria.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
8b8fc73225 Use authentication tokens for PM GitLab runner
Starting with GitLab 16, there is a new mechanism to authenticate the
runners via authentication tokens, so use it instead.  Older tokens and
runners are also removed, as they are no longer used.

With the new way of managing tokens, both the tags and the locked state
are managed from the GitLab web page.

See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
f095067deb flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
  → 'github:ryantm/agenix/de96bd907d5fbc3b14fc33ad37d1b9a3cb15edc6' (2024-07-09)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
  → 'github:NixOS/nixpkgs/693bc46d169f5af9c992095736e82c3488bf7dbb' (2024-07-14)

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
90d44a95eb Allow ptrace to any process of the same user
Allows users to attach GDB to their own processes, without requiring
running the program with GDB from the start. It is only available in
compute nodes, the storage nodes continue with the restricted settings.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
4873c881a9 Add abonerib user to hut, raccon, owl1 and owl2
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
d6d7516e12 Grant rpenacob access to owl1 and owl2 nodes
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
ae96be6915 Access private repositories via hut SSH proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
bb10c47c2e Set the default proxy to point to hut
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
87598e74ae Allow incoming traffic to hut proxy
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-10-01 16:40:16 +02:00
6a3728e6c6 eudy: koro: fcs: Fix fcs unprotected cpuid all
smp_processor_id() was called in a preepmtible context, which could
invalidate the returned value. However, this was not harmful, because
fcs threads in nosv are pinned.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:16 +02:00
7b4cbc57e4 Add support for armv7 emulation in hut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
80eb17c065 Monitor raccoon machine via IPMI
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
338551ec0a Move vlopez user to jungleUsers for koro host
Access to other machines can be easily added into the "hosts" attribute
without the need to replicate the configuration.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
efea776c91 Add raccoon motd file
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
c8604bf3e0 Split xeon specific configuration from base
To accomodate the raccoon knights workstation, some of the configuration
pulled by m/common/main.nix has to be removed. To solve it, the xeon
specific parts are placed into m/common/xeon.nix and only the common
configuration is at m/common/base.nix.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
970cdf8dbd Control user access to each machine
The users.jungleUsers configuration option behaves like the users.users
option, but defines the list attribute `hosts` for each user, which
filters users so that only the user can only access those hosts.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
7312a91271 Add PostgreSQL DB for performance test results
The database will hold the performance results of the execution of the
benchmarks. We follow the same setup on knights3 for now.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
b89c656ff7 Enable Grafana email alerts
Allows sending Grafana alerts via email too, so we have a reduntant
mechanism in case Slack fails to deliver them.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
27bb7cd69e Enable mail notification in Gitea
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
e0e9dc62d5 Add msmtp to send notifications via email
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
bb86b04fce Allow Ceph traffic to lake2 2025-10-01 16:40:16 +02:00
93ad04299d Fix meta in posts entries
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
24056305c7 Fix bogus separator
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
b43aca80ca Manually add links to the menu
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
39378d9544 Add link to Gitea in the website
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
0dec0ee519 Collect Gitea metrics in Prometheus
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
b15130744a Add Gitea service
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
c4f539caf6 Add firewall rules for Ceph and monitoring
The firewall was blocking the monitoring traffic from hut and the Ceph
traffic among OSDs. The rules only allow connecting from the specific
host that they are supposed to be coming from.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
cafd7ea682 Add workaround for MPICH 4.2.0
See: https://github.com/pmodels/mpich/issues/6946

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
08e9db0c3e Fix SLURM bug in rank integer sign expansion
See: https://bugs.schedmd.com/show_bug.cgi?id=19324

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
ab4cab97ba Merge pmix outputs for MPICH
MPICH expects headers and libraries to be present in the same directory.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
a4bf90ddfc Remove nixseparatedebuginfod input
It has been integrated in nixpkgs, so is no longer required.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
ebfbca4b48 flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
  → 'github:ryantm/agenix/1381a759b205dff7a6818733118d02253340fd5e' (2024-04-02)
• Updated input 'agenix/darwin':
    'github:lnl7/nix-darwin/87b9d090ad39b25b2400029c64825fc2a8868943' (2023-01-09)
  → 'github:lnl7/nix-darwin/4b9b83d5a92e8c1fbfd8eb27eda375908c11ec4d' (2023-11-24)
• Updated input 'agenix/home-manager':
    'github:nix-community/home-manager/32d3e39c491e2f91152c84f8ad8b003420eab0a1' (2023-04-22)
  → 'github:nix-community/home-manager/3bfaacf46133c037bb356193bd2f1765d9dc82c1' (2023-12-20)
• Added input 'agenix/systems':
    'github:nix-systems/default/da67096a3b9bf56a91d16901293e51ba5b49a27e' (2023-04-09)
• Updated input 'bscpkgs':
    'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b' (2023-11-22)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=de89197a4a7b162db7df9d41c9d07759d87c5709' (2024-04-24)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)
  → 'github:NixOS/nixpkgs/6143fc5eeb9c4f00163267708e26191d1e918932' (2024-04-21)
• Updated input 'nixseparatedebuginfod':
    'github:symphorien/nixseparatedebuginfod/232591f5274501b76dbcd83076a57760237fcd64' (2023-11-05)
  → 'github:symphorien/nixseparatedebuginfod/98d79461660f595637fa710d59a654f242b4c3f7' (2024-03-07)
• Removed input 'nixseparatedebuginfod'
• Removed input 'nixseparatedebuginfod/flake-utils'
• Removed input 'nixseparatedebuginfod/flake-utils/systems'
• Removed input 'nixseparatedebuginfod/nixpkgs'

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
6be7365916 Use google.com probe instead of bsc.es
The main website of the BSC is failing every day around 3:00 AM for
almost one hour, so it is not a very good target. Instead, google.com is
used which should be more reliable. The same robots.txt path is fetched,
as it is smaller than the main page.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
8ae2870fb3 Add another HTTPS probe for bsc.es
As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we
cannot detect a problem in our proxy or in the BSC one. Adding another
target like bsc.es that doesn't use the ops proxy allows us to discern
where the problem lies.

Instead of monitoring https://www.bsc.es/ directly, which will trigger
the whole Drupal server and take a whole second, we just fetch robots.txt
so the overhead on the server is minimal (and returns in less than 10 ms).

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
0899424de9 Move slurm client in a separate module
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:16 +02:00
baa8347753 Enable public-inbox at jungle.bsc.es/lists
The public-inbox service fetches emails from the sourcehut mailing lists
and displays them on the web. The idea is to reduce the dependency on
external services and add a secondary storage for the mailing lists in
case sourcehut goes down or changes the current free plans.

The service is available in https://jungle.bsc.es/lists/ and is open to
the public. It currently mirrors the bscpkgs and jungle mailing list.

We also edited the CSS to improve the readability and have larger fonts
by default.

The service for public-inbox produced by NixOS is not well configured to
fetch emails from an IMAP mail server, so we also manually edit the
service file to enable the network.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
777704a9ce Monitor https://pm.bsc.es/gitlab/ too
The GitLab instance is in the /gitlab endpoint and may fail
independently of https://pm.bsc.es/.

Cc: Víctor López <victor.lopez@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
024a31dd1b Enable nixseparatedebuginfod module
The module is only enabled on Hut and Eudy because we noticed activity
on the debuginfod service even if no debug session was active.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-10-01 16:40:16 +02:00
b8fbb6380e Use tmpfs in /tmp
The /tmp directory was using the SSD disk which is not erased across
boots. Nix will use /tmp to perform the builds, so we want it to be as
fast as possible. In general, all the machines have enough space to
handle large builds like LLVM.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
afae708a48 Enable runners for pm.bsc.es/gitlab too
The old runners for the PM gitlab were disabled in configuration in the
last outage, but they remained working until we reboot the node. With
this change we enable the runners for both PM and gitlab.bsc.es.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
4efd74aad6 Remove complete ceph package from hut
Only the ceph-client is needed.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
29cdfba328 Fix warning in slurm exporter using vendorHash
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
59d9d19891 Remove old Ceph package overlay
The Ceph package is now integrated in upstream nixpkgs.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
50cf081345 flake.lock: Update
Flake lock file updates:

• Updated input 'agenix':
    'github:ryantm/agenix/d8c973fd228949736dedf61b7f8cc1ece3236792' (2023-07-24)
  → 'github:ryantm/agenix/daf42cb35b2dc614d1551e37f96406e4c4a2d3e4' (2023-10-08)
• Updated input 'bscpkgs':
    'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538' (2023-10-31)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=e148de50d68b3eeafc3389b331cf042075971c4b' (2023-11-22)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
  → 'github:NixOS/nixpkgs/e4ad989506ec7d71f7302cc3067abd82730a4beb' (2023-11-19)

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
388a10b666 BSC packages are no longer in bsc attribute
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
e7555bb1cf flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
  → 'git+https://git.sr.ht/~rodarima/bscpkgs?ref=refs/heads/master&rev=f605f8e5e4a1f392589f1ea2b9ffe2074f72a538' (2023-10-31)

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
0882ad7ecc Switch bscpkgs URL to sourcehut
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
6ac5225ddb Monitor anella instead of gw.bsc.es
The target gw.bsc.es doesn't reply to our ICMP probes from hut. However,
the anella hop in the tracepath is a good candidate to identify cuts
between the login and the provider and between the provider and external
hosts like Google or Cloudflare DNS.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
cd6983223e Add ICMP probes
These probes check if we can reach several targets via ICMP, which is
not proxied, so they can be used to see if ICMP forwarding is working in
the login node.

In particular, we test if we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping
only from the intranet and the login node (ssfhead).

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
fb8a0cb0a3 Enable proxy for Grafana too
The alerts need to contact the slack endpoint, so we add the proxy
environment variables to the grafana systemd service.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
a8c0ce5d06 Make blackbox exporter use the proxy
By default it was trying to reach the targets using the default gateway,
but since the electrical cut of 2023-10-20, the login node has not
enabled forwarding again. So better if we don't rely on it.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
9957a0269d Don't log SLURM connection attempts from ssfhead 2025-10-01 16:40:16 +02:00
ca0937859d Add docker runner too 2025-10-01 16:40:16 +02:00
4d362351cb Monitor gitlab.bsc.es too 2025-10-01 16:40:16 +02:00
e9b4d87d9f Monitor PM webpage via blackbox 2025-10-01 16:40:16 +02:00
457e403258 Temporarily disable pm runners 2025-10-01 16:40:16 +02:00
32b9cc17a9 Add runner for gitlab.bsc.es 2025-10-01 16:40:16 +02:00
fbabc06641 Allow anonymous access to grafana 2025-10-01 16:40:16 +02:00
7b67b2b703 Remove user/group when using DynamicUsers 2025-10-01 16:40:16 +02:00
ce964b9b65 Set the SLURM_CONF variable 2025-10-01 16:40:16 +02:00
b84066fde5 Enable slurm-exporter service 2025-10-01 16:40:16 +02:00
38e23068f2 Add prometheus-slurm-exporter package 2025-10-01 16:40:16 +02:00
6b2351bfe7 Document the hut shared nix store for SLURM 2025-10-01 16:40:16 +02:00
b84d1d5e26 Mount the hut nix store for SLURM jobs 2025-10-01 16:40:16 +02:00
6d8fd353d0 Enable direnv integration 2025-10-01 16:40:16 +02:00
642507b255 Remove bscpkgs from the registry and nixPath
This is done to prevent accidental evaluations where the nixpkgs input
of bscpkgs is still pointing to a different version that the one
specified in the jungle flake. Instead use jungle#bscpkgs.X to get a
package from bscpkgs.
2025-10-01 16:40:16 +02:00
128136c137 Add bscpkgs and nixpkgs top level attributes
Allows the evaluation of packages of the intermediate overlays.
2025-10-01 16:40:16 +02:00
1242aad9a3 Use hut packages as the default package set
Allows the user to directly access nixpkgs and bscpkgs from the top
level as `nix build jungle#htop` and `nix build jungle#bsc.ovni`.
2025-10-01 16:40:16 +02:00
0ca0da9ffe Don't fetch registry flakes from the net 2025-10-01 16:40:16 +02:00
03822c8b26 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
2025-10-01 16:40:16 +02:00
77276fb6c1 Revert "Update slurm to 23.02.05.1"
This reverts commit aaefddc44a.
2025-10-01 16:40:16 +02:00
1b296f2ce7 Open ports in firewall of compute nodes 2025-10-01 16:40:16 +02:00
798fa002cc Update slurm to 23.02.05.1 2025-10-01 16:40:16 +02:00
44667e8e40 Monitor storage nodes via IPMI too 2025-10-01 16:40:16 +02:00
668a65b9c6 Specify the space available in /ceph 2025-10-01 16:40:16 +02:00
73f18d5801 Add update post to website 2025-10-01 16:40:16 +02:00
627c912b87 Enable fstrim service 2025-10-01 16:40:16 +02:00
66b5074ff1 Serve the nix store from hut 2025-10-01 16:40:16 +02:00
79446cebcb Add encrypted munge key with agenix 2025-10-01 16:40:16 +02:00
061fc60939 Remove unused large port hole in firewall 2025-10-01 16:40:16 +02:00
09ac1d6c13 Make exporters listen in localhost only 2025-10-01 16:40:16 +02:00
a6324e47e8 Allow only some ports for srun 2025-10-01 16:40:16 +02:00
2f258e1cdd Block ssfhead from reaching our slurm daemon 2025-10-01 16:40:16 +02:00
4c88f9a783 Poweroff idle slurm nodes after 1 hour 2025-10-01 16:40:16 +02:00
01140353c6 Add IB and IPMI node host names 2025-10-01 16:40:16 +02:00
c38a01c8dc flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
2025-10-01 16:40:16 +02:00
aa52236a80 Unlock ovni gitlab runners 2025-10-01 16:40:16 +02:00
3a31dcd58b Update email contact to jungle mail list 2025-10-01 16:40:16 +02:00
d57849a954 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=18d64c352c10f9ce74aabddeba5a5db02b74ec27' (2023-08-31)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/d680ded26da5cf104dd2735a51e88d2d8f487b4d' (2023-08-19)
  → 'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
2025-10-01 16:40:16 +02:00
6850bf3a71 Add agenix to all nodes 2025-10-01 16:40:16 +02:00
aa92294907 Add agenix module to ceph 2025-10-01 16:40:16 +02:00
da92154d33 Remove old secrets 2025-10-01 16:40:16 +02:00
adec7f80fd Mount /ceph in owl1 and owl2 2025-10-01 16:40:16 +02:00
8a0034a867 Warn about the owl2 omnipath device 2025-10-01 16:40:16 +02:00
6828273c05 Clean owl2 configuration 2025-10-01 16:40:16 +02:00
8cedffe040 Move the ceph client config to an external module 2025-10-01 16:40:16 +02:00
8a027d8b09 Reorganize secrets and ssh keys
The agenix tools needs to read the secrets from a standalone file, but
we also need the same information for the SSH keys.
2025-10-01 16:40:16 +02:00
1f32b8409a Add anavarro user 2025-10-01 16:40:16 +02:00
bc51564a88 Set zsh inc_append_history option 2025-10-01 16:40:16 +02:00
8ba4f910c3 Set zsh shell for rarias 2025-10-01 16:40:16 +02:00
515fa49ed0 Enable zsh and fix key bindings 2025-10-01 16:40:16 +02:00
c63fa494d5 Keep a log over time with the config commits 2025-10-01 16:40:16 +02:00
bf53f01f0a Configure bscpkgs.nixpkgs to follow nixpkgs 2025-10-01 16:40:16 +02:00
a6d3f43b98 Store nixos config in /etc/nixos/config.rev 2025-10-01 16:40:16 +02:00
76e6ae2f00 Enable binary emulation for other architectures 2025-10-01 16:40:16 +02:00
409efacf5b Enable watchdog 2025-10-01 16:40:16 +02:00
e1e879178d Enable all osd on boot in lake2 2025-10-01 16:40:16 +02:00
042ca9e882 Scrape lake2 too 2025-10-01 16:40:16 +02:00
9241bda0ac Also enable monitoring in lake2 2025-10-01 16:40:16 +02:00
005a1be48a Scrape metrics from bay 2025-10-01 16:40:16 +02:00
f86114f33e Add monitoring in the bay node 2025-10-01 16:40:16 +02:00
af29f639e2 Add fio tool 2025-10-01 16:40:16 +02:00
0fe025e8be Add ceph tools in hut too 2025-10-01 16:40:16 +02:00
6a429fda1b Switch ceph logs to journal 2025-10-01 16:40:16 +02:00
4cea250cf4 Update ceph to 18.2.0 in overlay 2025-10-01 16:40:16 +02:00
3b823ee478 Move pkgs overlay to overlay.nix 2025-10-01 16:40:16 +02:00
d9dea762de Enable ceph osd daemons in lake2 2025-10-01 16:40:16 +02:00
80efd57a11 Add the lake2 hostname to the hosts 2025-10-01 16:40:16 +02:00
cced6b0dc0 Use the sda for lake2 2025-10-01 16:40:16 +02:00
b63b450111 Remove netboot module 2025-10-01 16:40:16 +02:00
81baeee5b1 Disable pixiecore in hut for now 2025-10-01 16:40:16 +02:00
686f750c06 Add PXE helper 2025-10-01 16:40:16 +02:00
33155fcb62 Enable netboot again for PXE 2025-10-01 16:40:16 +02:00
6e89b3f936 Specify the disk by path 2025-10-01 16:40:16 +02:00
f0f67f374e Prepare lake2 config after bootstrap
The disk ID is different under NixOS.
2025-10-01 16:40:16 +02:00
7443a192c6 Add lake2 bootstrap config 2025-10-01 16:40:16 +02:00
25f06db5f1 Add section to enable serial console 2025-10-01 16:40:16 +02:00
3c83996e26 Add agenix to PATH in hut 2025-10-01 16:40:16 +02:00
a4fc3d131a Store ceph secret key in age
This allows a node to mount the ceph FS without any extra ceph
configuration in /etc/ceph.
2025-10-01 16:40:16 +02:00
660a8ae163 Add rarias key for secrets 2025-10-01 16:40:16 +02:00
91270b26bb Add ceph metrics to prometheus 2025-10-01 16:40:16 +02:00
94ce6fedf9 Mount the ceph filesystem in hut 2025-10-01 16:40:16 +02:00
817c98d37b Add ceph config in bay 2025-10-01 16:40:16 +02:00
9cd013c4ed Add the bay host name 2025-10-01 16:40:16 +02:00
f707650724 Remove netboot and fixes 2025-10-01 16:40:15 +02:00
9c152ec9cc Add bay node 2025-10-01 16:40:15 +02:00
7e789cd062 Update flake 2025-10-01 16:40:15 +02:00
8fcb5a1079 Monitor power from other nodes via LAN 2025-10-01 16:40:15 +02:00
b80656228d Increase prometheus retention time to one year 2025-10-01 16:40:15 +02:00
cd6e6de2ad Don't set all_proxy 2025-10-01 16:40:15 +02:00
ca78313752 Update nixpkgs to fix docker problem 2025-10-01 16:40:15 +02:00
ae2007e2fe Allow access to devices for node_exporter 2025-10-01 16:40:15 +02:00
d8e366b444 GRUB version no longer needed 2025-10-01 16:40:15 +02:00
a18d56b3ae Upgrade flake: nixpkgs, bscpkgs and agenix 2025-10-01 16:40:15 +02:00
8c1bf6db42 Kill slurmd remaining processes on upgrade 2025-10-01 16:40:15 +02:00
17f9a957eb Add details to request access in the web 2025-10-01 16:40:15 +02:00
a094093d95 koro: Add vlopez user 2025-10-01 16:40:15 +02:00
cbe53a6f0a Add koro node 2025-10-01 16:40:15 +02:00
1b8c3eb554 eudy: Add fcsv3 and intermediate versions for testing 2025-10-01 16:40:15 +02:00
c0335c1f95 eudy: Enable memory overcommit 2025-10-01 16:40:15 +02:00
d5857c0f7d eudy: disable all cpu mitigations 2025-10-01 16:40:15 +02:00
d88caa8610 Add jungle.bsc.es hugo website 2025-10-01 16:40:15 +02:00
9097811cc0 Enable NTP using the BSC time server 2025-10-01 16:40:15 +02:00
83acd40880 Add the ssfhead node as gateway 2025-10-01 16:40:15 +02:00
ba75bf8249 Use our host names first by default 2025-10-01 16:40:15 +02:00
e9845cc76a Add DNS tools to resolve hosts 2025-10-01 16:40:15 +02:00
d5951483ee Lower perf_event_paranoid to -1 2025-10-01 16:40:15 +02:00
937d8a7637 Set perf paranoid to 0 by default 2025-10-01 16:40:15 +02:00
798e01f9e6 Add perf to packages 2025-10-01 16:40:15 +02:00
2ca7e7383e Allow srun to specify the cpu binding
The task/affinity plugin needs to be selected.
2025-10-01 16:40:15 +02:00
b610f12133 Move authorized keys to users.nix 2025-10-01 16:40:15 +02:00
b6aaeb8158 Add rpenacob user 2025-10-01 16:40:15 +02:00
3d0f86ac07 Add osumb to the system packages 2025-10-01 16:40:15 +02:00
221ccef956 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=c775ee4d6f76aded05b08ae13924c302f18f9b2c' (2023-04-26)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs%2fheads%2fmaster&rev=cbe9af5d042e9d5585fe2acef65a1347c68b2fbd' (2023-06-16)
2025-10-01 16:40:15 +02:00
9f03799d34 Set mpi to mpich by default in bscpkgs 2025-10-01 16:40:15 +02:00
3a9615fce4 Add missing parameter to extend 2025-10-01 16:40:15 +02:00
45d7b31c0a Use explicit order in overlays 2025-10-01 16:40:15 +02:00
73b33c4d6c Replace mpi inside bsc attribute 2025-10-01 16:40:15 +02:00
bae3c75222 Add mpich overlay 2025-10-01 16:40:15 +02:00
f51e910aff Add coments in slurm config 2025-10-01 16:40:15 +02:00
8b7ffc914a Add eudy host key to known hosts 2025-10-01 16:40:15 +02:00
afb2bea1c9 Rename xeon08 to eudy
From Eudyptula, a little penguin.
2025-10-01 16:40:15 +02:00
7bbc526671 Update rebuild script for all nodes 2025-10-01 16:40:15 +02:00
4afe3121e6 Add ssh host keys 2025-10-01 16:40:15 +02:00
39f15a1b4f Set the name of the slurm cluster to jungle 2025-10-01 16:40:15 +02:00
3fab341dc8 Change owl hostnames 2025-10-01 16:40:15 +02:00
6ec7353a27 Add owl and all partition 2025-10-01 16:40:15 +02:00
d679fd6314 Simplify flake and expose host pkgs
The configuration of the machines is now moved to m/
2025-10-01 16:40:15 +02:00
218acd6848 Rename xeon07 to hut 2025-10-01 16:40:15 +02:00
68805b337d Remove profiles older than 30 days with gc 2025-10-01 16:40:15 +02:00
eae7ed4e10 Add ncdu to system packages 2025-10-01 16:40:15 +02:00
eb9b5e570f Move arocanon user from xeon08 to common 2025-10-01 16:40:15 +02:00
0c4b85be3b xeon08: Add config for kernel non-voluntary preemption 2025-10-01 16:40:15 +02:00
49ccf0a3f3 xeon08: Add perf 2025-10-01 16:40:15 +02:00
459b29924c xeon08: Enable lttng lockdep tracepoints 2025-10-01 16:40:15 +02:00
0d1a7e59ee xeon08: Add lttng module and tools 2025-10-01 16:40:15 +02:00
b959c72979 Serve grafana in https://jungle.bsc.es/grafana 2025-10-01 16:40:15 +02:00
68f7c02555 Add tree command 2025-10-01 16:40:15 +02:00
26caa390cb Add file to system packages 2025-10-01 16:40:15 +02:00
f0af9d87a0 Add gnumake to system packages 2025-10-01 16:40:15 +02:00
038340c5d2 Add cmake to system packages 2025-10-01 16:40:15 +02:00
471d44f013 Add ix to common packages 2025-10-01 16:40:15 +02:00
af03554610 Improve documentation 2025-10-01 16:40:15 +02:00
35898c68e7 Add gitignore 2025-10-01 16:40:15 +02:00
769317c5a7 Set intel_pstate=passive and disable frequency boost 2025-10-01 16:40:14 +02:00
a9f9a1e8e5 Add xeon08 basic config 2025-10-01 16:40:14 +02:00
0a8097231d Add nixos-config.nix to easily enable nix repl 2025-10-01 16:40:14 +02:00
be90239bbc Automatically resume restarted nodes in SLURM 2025-10-01 16:40:14 +02:00
f2fc2af77a Allow public dashboards in grafana 2025-10-01 16:40:14 +02:00
7828d616fc Add hal ssh key 2025-10-01 16:40:14 +02:00
61ea1af68b Increase the number of CPUs to 56 for nOS-V docker 2025-10-01 16:40:14 +02:00
81243cbc95 Allow 5 concurrent buils in the gitlab-runner 2025-10-01 16:40:14 +02:00
847d516d6d Simplify bash prompt 2025-10-01 16:40:14 +02:00
a6c060a25b Roolback to bash as default shell
Zsh doesn't behave properly, it needs further configuration.
2025-10-01 16:40:14 +02:00
70d2255634 Use pmix by default in slurm 2025-10-01 16:40:14 +02:00
7b176a3780 Increase locked memory to 1 GiB 2025-10-01 16:40:14 +02:00
e7bfba54bc Use the latest kernel 2025-10-01 16:40:14 +02:00
30a12729f1 Disable osnoise and hwlat tracer for now
Reuse nix cache to avoid rebuilding the kernel.
2025-10-01 16:40:14 +02:00
8851dbe212 Update nixpkgs to nixos-unstable 2025-10-01 16:40:14 +02:00
b7d52423c9 Update nixpkgs 2025-10-01 16:40:14 +02:00
32b84a5add Update ib interface name in xeon02
It seems to be plugged in another PCI port
2025-10-01 16:40:14 +02:00
9267ca51fa Add steps in install documentation 2025-10-01 16:40:14 +02:00
5c9f5e845f Add minimal netboot module to build kexec image 2025-10-01 16:40:14 +02:00
2d204c5cc0 Add xeon02 configuration 2025-10-01 16:40:14 +02:00
99481e6203 Refacto slurm configuration into compute/control 2025-10-01 16:40:14 +02:00
8b4dda98af Lock flakes and add inputs 2025-10-01 16:40:14 +02:00
0c296c1825 Test flakes 2025-10-01 16:40:14 +02:00
513c182d24 Enable slurm in xeon01 2025-10-01 16:40:14 +02:00
a2fe4b552a Use xeon07 as control machine 2025-10-01 16:40:14 +02:00
1b030ab76d Remove xeon07 overlay to load upstream slurm 2025-10-01 16:40:14 +02:00
f0abbac1c7 Add script to rebuild configuration 2025-10-01 16:40:14 +02:00
6f80ebb483 Add configuration for xeon01 2025-10-01 16:40:14 +02:00
ce00282704 Load overlays from /config 2025-10-01 16:40:14 +02:00
7829ba7509 Move net.nix to common 2025-10-01 16:40:14 +02:00
a476b1758b Remove host specific network options from net.nix 2025-10-01 16:40:14 +02:00
51a77b5213 Move ssh.nix to common 2025-10-01 16:40:14 +02:00
020ce58efd Move overlays.nix to common 2025-10-01 16:40:14 +02:00
ab978758ba Move users.nix to common 2025-10-01 16:40:14 +02:00
774350f288 Move common options from configuration.nix 2025-10-01 16:40:14 +02:00
b739c96882 Move the remaining hw config to common 2025-10-01 16:40:14 +02:00
32d782be96 Move boot config to common/boot.nix 2025-10-01 16:40:14 +02:00
1432a26fba Move filesystems config to common/fs.nix 2025-10-01 16:40:14 +02:00
f5fa915e09 Use partition labels for / and swap 2025-10-01 16:40:14 +02:00
d6a7a87207 Move fs.nix to common 2025-10-01 16:40:14 +02:00
13f72753a5 Move boot.nix to common 2025-10-01 16:40:14 +02:00
fa757aaaab Move disk selection to configuration.nix 2025-10-01 16:40:14 +02:00
15906e5818 Add common directory 2025-10-01 16:40:14 +02:00