360 Commits

Author SHA1 Message Date
6ac5225ddb Monitor anella instead of gw.bsc.es
The target gw.bsc.es doesn't reply to our ICMP probes from hut. However,
the anella hop in the tracepath is a good candidate to identify cuts
between the login and the provider and between the provider and external
hosts like Google or Cloudflare DNS.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
cd6983223e Add ICMP probes
These probes check if we can reach several targets via ICMP, which is
not proxied, so they can be used to see if ICMP forwarding is working in
the login node.

In particular, we test if we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping
only from the intranet and the login node (ssfhead).

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
fb8a0cb0a3 Enable proxy for Grafana too
The alerts need to contact the Slack endpoint, so we add the proxy
environment variables to the Grafana systemd service.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
a8c0ce5d06 Make blackbox exporter use the proxy
By default it was trying to reach the targets using the default gateway,
but since the electrical cut of 2023-10-20, the login node has not
enabled forwarding again, so it is better not to rely on it.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
9957a0269d Don't log SLURM connection attempts from ssfhead 2025-10-01 16:40:16 +02:00
ca0937859d Add docker runner too 2025-10-01 16:40:16 +02:00
4d362351cb Monitor gitlab.bsc.es too 2025-10-01 16:40:16 +02:00
e9b4d87d9f Monitor PM webpage via blackbox 2025-10-01 16:40:16 +02:00
457e403258 Temporarily disable pm runners 2025-10-01 16:40:16 +02:00
32b9cc17a9 Add runner for gitlab.bsc.es 2025-10-01 16:40:16 +02:00
fbabc06641 Allow anonymous access to grafana 2025-10-01 16:40:16 +02:00
7b67b2b703 Remove user/group when using DynamicUsers 2025-10-01 16:40:16 +02:00
ce964b9b65 Set the SLURM_CONF variable 2025-10-01 16:40:16 +02:00
b84066fde5 Enable slurm-exporter service 2025-10-01 16:40:16 +02:00
38e23068f2 Add prometheus-slurm-exporter package 2025-10-01 16:40:16 +02:00
6b2351bfe7 Document the hut shared nix store for SLURM 2025-10-01 16:40:16 +02:00
b84d1d5e26 Mount the hut nix store for SLURM jobs 2025-10-01 16:40:16 +02:00
6d8fd353d0 Enable direnv integration 2025-10-01 16:40:16 +02:00
642507b255 Remove bscpkgs from the registry and nixPath
This is done to prevent accidental evaluations where the nixpkgs input
of bscpkgs is still pointing to a different version than the one
specified in the jungle flake. Instead use jungle#bscpkgs.X to get a
package from bscpkgs.
2025-10-01 16:40:16 +02:00
128136c137 Add bscpkgs and nixpkgs top level attributes
Allows the evaluation of packages of the intermediate overlays.
2025-10-01 16:40:16 +02:00
1242aad9a3 Use hut packages as the default package set
Allows the user to directly access nixpkgs and bscpkgs from the top
level as `nix build jungle#htop` and `nix build jungle#bsc.ovni`.
2025-10-01 16:40:16 +02:00
0ca0da9ffe Don't fetch registry flakes from the net 2025-10-01 16:40:16 +02:00
03822c8b26 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
2025-10-01 16:40:16 +02:00
77276fb6c1 Revert "Update slurm to 23.02.05.1"
This reverts commit aaefddc44a9073166ac52b8bd56ac96258d3b053.
2025-10-01 16:40:16 +02:00
1b296f2ce7 Open ports in firewall of compute nodes 2025-10-01 16:40:16 +02:00
798fa002cc Update slurm to 23.02.05.1 2025-10-01 16:40:16 +02:00
44667e8e40 Monitor storage nodes via IPMI too 2025-10-01 16:40:16 +02:00
668a65b9c6 Specify the space available in /ceph 2025-10-01 16:40:16 +02:00
73f18d5801 Add update post to website 2025-10-01 16:40:16 +02:00
627c912b87 Enable fstrim service 2025-10-01 16:40:16 +02:00
66b5074ff1 Serve the nix store from hut 2025-10-01 16:40:16 +02:00
79446cebcb Add encrypted munge key with agenix 2025-10-01 16:40:16 +02:00
061fc60939 Remove unused large port hole in firewall 2025-10-01 16:40:16 +02:00
09ac1d6c13 Make exporters listen in localhost only 2025-10-01 16:40:16 +02:00
a6324e47e8 Allow only some ports for srun 2025-10-01 16:40:16 +02:00
2f258e1cdd Block ssfhead from reaching our slurm daemon 2025-10-01 16:40:16 +02:00
4c88f9a783 Poweroff idle slurm nodes after 1 hour 2025-10-01 16:40:16 +02:00
01140353c6 Add IB and IPMI node host names 2025-10-01 16:40:16 +02:00
c38a01c8dc flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
2025-10-01 16:40:16 +02:00
aa52236a80 Unlock ovni gitlab runners 2025-10-01 16:40:16 +02:00
3a31dcd58b Update email contact to jungle mail list 2025-10-01 16:40:16 +02:00
d57849a954 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=18d64c352c10f9ce74aabddeba5a5db02b74ec27' (2023-08-31)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/d680ded26da5cf104dd2735a51e88d2d8f487b4d' (2023-08-19)
  → 'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
2025-10-01 16:40:16 +02:00
6850bf3a71 Add agenix to all nodes 2025-10-01 16:40:16 +02:00
aa92294907 Add agenix module to ceph 2025-10-01 16:40:16 +02:00
da92154d33 Remove old secrets 2025-10-01 16:40:16 +02:00
adec7f80fd Mount /ceph in owl1 and owl2 2025-10-01 16:40:16 +02:00
8a0034a867 Warn about the owl2 omnipath device 2025-10-01 16:40:16 +02:00
6828273c05 Clean owl2 configuration 2025-10-01 16:40:16 +02:00
8cedffe040 Move the ceph client config to an external module 2025-10-01 16:40:16 +02:00
8a027d8b09 Reorganize secrets and ssh keys
The agenix tool needs to read the secrets from a standalone file, but
we also need the same information for the SSH keys.
2025-10-01 16:40:16 +02:00