f9622b19ef
Add ICMP probes
These probes check whether we can reach several targets via ICMP. ICMP
is not proxied, so the probes can be used to see whether ICMP forwarding
is working on the login node.
In particular, we test whether we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway (which responds to
ping only from the intranet), and the login node (ssfhead).
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
f2d26fd2e2
Enable proxy for Grafana too
The alerts need to contact the Slack endpoint, so we add the proxy
environment variables to the Grafana systemd service.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
bf8f0ac583
Make blackbox exporter use the proxy
By default it tried to reach the targets through the default gateway,
but the login node has not re-enabled forwarding since the power cut of
2023-10-20, so it is better not to rely on it.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
33c1da6c40
Add docker runner too
2025-10-01 16:40:16 +02:00
5dc41a86e5
Monitor gitlab.bsc.es too
2025-10-01 16:40:16 +02:00
697c3d884e
Monitor PM webpage via blackbox
2025-10-01 16:40:16 +02:00
5a537c7478
Temporarily disable pm runners
2025-10-01 16:40:16 +02:00
c06b706e49
Add runner for gitlab.bsc.es
2025-10-01 16:40:16 +02:00
270cff123d
Allow anonymous access to grafana
2025-10-01 16:40:16 +02:00
b4ede66387
Enable slurm-exporter service
2025-10-01 16:40:16 +02:00
00068cb11c
Monitor storage nodes via IPMI too
2025-10-01 16:40:16 +02:00
c26cff7bdb
Serve the nix store from hut
2025-10-01 16:40:16 +02:00
3385252f5f
Make exporters listen in localhost only
2025-10-01 16:40:16 +02:00
e35b51cd00
Poweroff idle slurm nodes after 1 hour
2025-10-01 16:40:16 +02:00
ac3817d99b
Unlock ovni gitlab runners
2025-10-01 16:40:16 +02:00
e7aa2d3fe3
Add agenix to all nodes
2025-10-01 16:40:16 +02:00
1f199c73f1
Remove old secrets
2025-10-01 16:40:16 +02:00
758ddc71cb
Move the ceph client config to an external module
2025-10-01 16:40:16 +02:00
224bafd20d
Reorganize secrets and ssh keys
The agenix tool needs to read the secrets from a standalone file, but
we also need the same information for the SSH keys.
2025-10-01 16:40:16 +02:00
32d5adf900
Enable binary emulation for other architectures
2025-10-01 16:40:16 +02:00
6ada09fe91
Scrape lake2 too
2025-10-01 16:40:16 +02:00
0db43352ac
Scrape metrics from bay
2025-10-01 16:40:16 +02:00
7bb16f858e
Add fio tool
2025-10-01 16:40:16 +02:00
cb76e7afa3
Add ceph tools in hut too
2025-10-01 16:40:16 +02:00
4a40098459
Disable pixiecore in hut for now
2025-10-01 16:40:16 +02:00
c360937d52
Add PXE helper
2025-10-01 16:40:16 +02:00
b5c061be41
Add agenix to PATH in hut
2025-10-01 16:40:16 +02:00
33cc03eb34
Store ceph secret key in age
This allows a node to mount the ceph FS without any extra ceph
configuration in /etc/ceph.
2025-10-01 16:40:16 +02:00
ac1783c516
Add rarias key for secrets
2025-10-01 16:40:16 +02:00
71000731c0
Add ceph metrics to prometheus
2025-10-01 16:40:16 +02:00
e320e9ced4
Mount the ceph filesystem in hut
2025-10-01 16:40:16 +02:00
49153acfbd
Monitor power from other nodes via LAN
2025-10-01 16:40:15 +02:00
04c2974a8e
Increase prometheus retention time to one year
2025-10-01 16:40:15 +02:00
5e3470f3bf
Allow access to devices for node_exporter
2025-10-01 16:40:15 +02:00
6ec7353a27
Add owl and all partition
2025-10-01 16:40:15 +02:00
d679fd6314
Simplify flake and expose host pkgs
The configuration of the machines now lives in m/
2025-10-01 16:40:15 +02:00