cd6983223e
Add ICMP probes
These probes check whether we can reach several targets via ICMP, which is
not proxied, so they show whether ICMP forwarding is working on the login
node.
In particular, we test whether we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway (which only responds to
ping from the intranet) and the login node (ssfhead).
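A minimal sketch of how such probes are commonly wired up. It assumes a Prometheus job that hands each target to a blackbox exporter using the usual `__param_target` relabeling. The exporter address `127.0.0.1:9115` and the module name `icmp` are assumptions, not values taken from the repository. The BSC gateway is left out because its address is not given here. Since ICMP cannot go through the HTTP proxy, a failed probe to 8.8.8.8 or 1.1.1.1 while ssfhead still answers suggests that forwarding on the login node is broken.

```python
"""Sketch: a Prometheus scrape job that asks a blackbox exporter to ping targets.

Assumptions (hypothetical, not from the repository): the exporter listens on
127.0.0.1:9115 and defines a module named "icmp" that uses the ICMP prober.
"""
import json

BLACKBOX_ADDR = "127.0.0.1:9115"  # hypothetical exporter address
ICMP_TARGETS = ["8.8.8.8", "1.1.1.1", "ssfhead"]  # BSC gateway omitted: address unknown


def icmp_scrape_job(targets, exporter=BLACKBOX_ADDR, module="icmp"):
    """Return a scrape config dict using the standard blackbox relabeling:
    the listed host becomes the ?target= parameter and the scrape itself
    is sent to the exporter."""
    return {
        "job_name": "blackbox-icmp",
        "metrics_path": "/probe",
        "params": {"module": [module]},
        "static_configs": [{"targets": list(targets)}],
        "relabel_configs": [
            # Pass the listed host to the exporter as ?target=...
            {"source_labels": ["__address__"], "target_label": "__param_target"},
            # Keep the probed host visible as the instance label.
            {"source_labels": ["__param_target"], "target_label": "instance"},
            # Scrape the blackbox exporter instead of the target itself.
            {"target_label": "__address__", "replacement": exporter},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(icmp_scrape_job(ICMP_TARGETS), indent=2))
```

Prometheus reads YAML and JSON is valid YAML, so the printed job could be placed under `scrape_configs` as-is.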
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
fb8a0cb0a3
Enable proxy for Grafana too
The alerts need to contact the Slack endpoint, so we add the proxy
environment variables to the Grafana systemd service.
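A minimal sketch of what the resulting service settings amount to. The repository expresses this in its own configuration. This Python helper only renders an equivalent systemd drop-in with one `Environment=` line per variable. The proxy URL and the `no_proxy` list are hypothetical placeholders. The commit relies on Grafana honoring these conventional variables.

```python
"""Sketch: render a systemd drop-in that sets proxy variables for a service.

The proxy URL and the no_proxy list are hypothetical placeholders.
"""

PROXY_ENV = {
    "http_proxy": "http://proxy.example:3128",   # hypothetical proxy
    "https_proxy": "http://proxy.example:3128",  # hypothetical proxy
    "no_proxy": "localhost,127.0.0.1",           # keep local traffic direct
}


def systemd_dropin(env):
    """Return a [Service] section with one quoted Environment= line per variable."""
    lines = ["[Service]"]
    lines += [f'Environment="{key}={value}"' for key, value in sorted(env.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(systemd_dropin(PROXY_ENV), end="")
```

Keeping `no_proxy` for local addresses means traffic to local services, such as the exporters listening on localhost, is not sent through the proxy.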
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
a8c0ce5d06
Make blackbox exporter use the proxy
By default it tried to reach the targets through the default gateway, but
the login node has not re-enabled forwarding since the power cut of
2023-10-20, so it is better not to rely on it.
Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-10-01 16:40:16 +02:00
ca0937859d
Add docker runner too
2025-10-01 16:40:16 +02:00
4d362351cb
Monitor gitlab.bsc.es too
2025-10-01 16:40:16 +02:00
e9b4d87d9f
Monitor PM webpage via blackbox
2025-10-01 16:40:16 +02:00
457e403258
Temporarily disable pm runners
2025-10-01 16:40:16 +02:00
32b9cc17a9
Add runner for gitlab.bsc.es
2025-10-01 16:40:16 +02:00
fbabc06641
Allow anonymous access to grafana
2025-10-01 16:40:16 +02:00
b84066fde5
Enable slurm-exporter service
2025-10-01 16:40:16 +02:00
44667e8e40
Monitor storage nodes via IPMI too
2025-10-01 16:40:16 +02:00
66b5074ff1
Serve the nix store from hut
2025-10-01 16:40:16 +02:00
09ac1d6c13
Make exporters listen on localhost only
2025-10-01 16:40:16 +02:00
4c88f9a783
Poweroff idle slurm nodes after 1 hour
2025-10-01 16:40:16 +02:00
aa52236a80
Unlock ovni gitlab runners
2025-10-01 16:40:16 +02:00
6850bf3a71
Add agenix to all nodes
2025-10-01 16:40:16 +02:00
da92154d33
Remove old secrets
2025-10-01 16:40:16 +02:00
8cedffe040
Move the ceph client config to an external module
2025-10-01 16:40:16 +02:00
8a027d8b09
Reorganize secrets and ssh keys
The agenix tool needs to read the secrets from a standalone file, but
we also need the same information for the SSH keys.
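A minimal sketch of the idea of keeping one key table and deriving both outputs from it. It produces the per-secret recipient lists that agenix needs and the SSH authorized keys. Host names, key strings, the secret file name and the choice of adding admin keys to every secret are hypothetical. Only the `rarias` user comes from this log.

```python
"""Sketch: derive agenix recipients and SSH authorized keys from one table.

Host names, key strings and secret names are hypothetical placeholders.
"""

# Single source of truth for public keys (placeholders, not real keys).
HOST_KEYS = {
    "hut": "ssh-ed25519 AAAA-placeholder-hut",
    "owl1": "ssh-ed25519 AAAA-placeholder-owl1",
}
ADMIN_KEYS = {
    "rarias": "ssh-ed25519 AAAA-placeholder-rarias",
}

# Which hosts must be able to decrypt each secret (hypothetical mapping).
SECRET_HOSTS = {
    "ceph-user.age": ["hut", "owl1"],
}


def agenix_recipients(secret_hosts, host_keys, admin_keys):
    """Map each secret file to the keys it is encrypted for: the hosts that
    need it plus every admin, so admins can edit and re-key it."""
    return {
        secret: [host_keys[h] for h in hosts] + list(admin_keys.values())
        for secret, hosts in secret_hosts.items()
    }


def authorized_keys(admin_keys):
    """Render authorized_keys lines ("<key> <comment>") from the same table."""
    return "".join(f"{key} {user}\n" for user, key in sorted(admin_keys.items()))
```

Because both outputs read the same dictionaries, rotating a key in one place updates the secret recipients and the SSH access together.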
2025-10-01 16:40:16 +02:00
76e6ae2f00
Enable binary emulation for other architectures
2025-10-01 16:40:16 +02:00
042ca9e882
Scrape lake2 too
2025-10-01 16:40:16 +02:00
005a1be48a
Scrape metrics from bay
2025-10-01 16:40:16 +02:00
af29f639e2
Add fio tool
2025-10-01 16:40:16 +02:00
0fe025e8be
Add ceph tools in hut too
2025-10-01 16:40:16 +02:00
81baeee5b1
Disable pixiecore in hut for now
2025-10-01 16:40:16 +02:00
686f750c06
Add PXE helper
2025-10-01 16:40:16 +02:00
3c83996e26
Add agenix to PATH in hut
2025-10-01 16:40:16 +02:00
a4fc3d131a
Store ceph secret key in age
This allows a node to mount the ceph FS without any extra ceph
configuration in /etc/ceph.
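A minimal sketch of why no `/etc/ceph` configuration is needed. The kernel CephFS client can take the monitor address, client name and a `secretfile=` option directly, and the secret file can be the one agenix decrypts under its default directory `/run/agenix`. The monitor address, client name, mount point and secret name below are hypothetical.

```python
"""Sketch: build a CephFS fstab entry whose key comes from an agenix secret.

Monitor address, client name, mount point and secret name are hypothetical;
/run/agenix is agenix's default directory for decrypted secrets.
"""

AGENIX_DIR = "/run/agenix"


def ceph_fstab_line(monitors, client, mountpoint, secret_name):
    """Return an fstab line for the kernel CephFS client. Every credential
    is passed as a mount option instead of being read from /etc/ceph."""
    options = ",".join([
        f"name={client}",
        f"secretfile={AGENIX_DIR}/{secret_name}",
        "_netdev",  # treat as a network filesystem: mount after the network is up
    ])
    return f"{','.join(monitors)}:/ {mountpoint} ceph {options} 0 0"


if __name__ == "__main__":
    print(ceph_fstab_line(["192.0.2.10:6789"], "user", "/ceph", "ceph-user"))
```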
2025-10-01 16:40:16 +02:00
660a8ae163
Add rarias key for secrets
2025-10-01 16:40:16 +02:00
91270b26bb
Add ceph metrics to prometheus
2025-10-01 16:40:16 +02:00
94ce6fedf9
Mount the ceph filesystem in hut
2025-10-01 16:40:16 +02:00
8fcb5a1079
Monitor power from other nodes via LAN
2025-10-01 16:40:15 +02:00
b80656228d
Increase prometheus retention time to one year
2025-10-01 16:40:15 +02:00
ae2007e2fe
Allow access to devices for node_exporter
2025-10-01 16:40:15 +02:00
6ec7353a27
Add owl and all partition
2025-10-01 16:40:15 +02:00
d679fd6314
Simplify flake and expose host pkgs
The configuration of the machines has been moved to m/
2025-10-01 16:40:15 +02:00