Commit Graph

106 Commits

Author SHA1 Message Date
2953080fb8 Monitor anella instead of gw.bsc.es
The target gw.bsc.es doesn't reply to our ICMP probes from hut. However,
the anella hop in the tracepath is a good candidate to identify cuts
between the login and the provider and between the provider and external
hosts like Google or Cloudflare DNS.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-27 12:46:08 +02:00
9871517be2 Add ICMP probes
These probes check if we can reach several targets via ICMP, which is
not proxied, so they can be used to see if ICMP forwarding is working in
the login node.

In particular, we test if we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping
only from the intranet and the login node (ssfhead).

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 17:13:03 +02:00
736eacaac5 Enable proxy for Grafana too
The alerts need to contact the slack endpoint, so we add the proxy
environment variables to the grafana systemd service.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 16:55:56 +02:00
0e66aad099 Make blackbox exporter use the proxy
By default it was trying to reach the targets using the default gateway,
but since the electrical cut of 2023-10-20, the login node has not
enabled forwarding again. So better if we don't rely on it.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 16:55:24 +02:00
67a4905a0a Don't log SLURM connection attempts from ssfhead 2023-10-06 15:22:04 +02:00
d52d22e0db Add docker runner too 2023-10-06 15:17:07 +02:00
42920c2521 Monitor gitlab.bsc.es too 2023-10-06 15:17:07 +02:00
4acd35e036 Monitor PM webpage via blackbox 2023-10-06 15:17:07 +02:00
621d20db3a Temporarily disable pm runners 2023-10-06 15:17:07 +02:00
0926f6ec1f Add runner for gitlab.bsc.es 2023-10-06 15:17:07 +02:00
61646cb3bd Allow anonymous access to grafana 2023-09-22 10:51:30 +02:00
c0066c4744 Remove user/group when using DynamicUsers 2023-09-22 10:13:06 +02:00
ffd0593f51 Set the SLURM_CONF variable 2023-09-21 22:22:00 +02:00
f49ae0773e Enable slurm-exporter service 2023-09-21 21:40:02 +02:00
8de3d2b149 Mount the hut nix store for SLURM jobs 2023-09-20 19:38:43 +02:00
bc62e28ca3 Enable direnv integration 2023-09-20 09:32:58 +02:00
653d411b9e Remove bscpkgs from the registry and nixPath
This is done to prevent accidental evaluations where the nixpkgs input
of bscpkgs is still pointing to a different version that the one
specified in the jungle flake. Instead use jungle#bscpkgs.X to get a
package from bscpkgs.
2023-09-15 12:00:33 +02:00
a1e8cfea47 Don't fetch registry flakes from the net 2023-09-15 12:00:28 +02:00
e88805947e Open ports in firewall of compute nodes 2023-09-14 15:45:43 +02:00
d9d249411d Monitor storage nodes via IPMI too 2023-09-13 15:57:13 +02:00
10ca572aec Enable fstrim service 2023-09-12 16:39:45 +02:00
75b0f48715 Serve the nix store from hut 2023-09-12 12:19:43 +02:00
19a451db77 Add encrypted munge key with agenix 2023-09-08 19:05:45 +02:00
ec9be9bb62 Remove unused large port hole in firewall 2023-09-08 18:22:48 +02:00
7ddd1977f3 Make exporters listen in localhost only 2023-09-08 18:13:04 +02:00
7050c505b5 Allow only some ports for srun 2023-09-08 17:51:37 +02:00
033a1fe97b Block ssfhead from reaching our slurm daemon 2023-09-08 17:36:28 +02:00
77cb3c494e Poweroff idle slurm nodes after 1 hour 2023-09-08 16:49:53 +02:00
6db5772ac4 Add IB and IPMI node host names 2023-09-08 13:21:37 +02:00
dca274d020 Unlock ovni gitlab runners 2023-09-05 16:59:45 +02:00
02f40a8217 Add agenix to all nodes 2023-09-04 22:10:43 +02:00
77d43b6da9 Add agenix module to ceph 2023-09-04 22:07:07 +02:00
ab55aac5ff Remove old secrets 2023-09-04 22:04:32 +02:00
9b5bfbb7a3 Mount /ceph in owl1 and owl2 2023-09-04 22:00:36 +02:00
a69a71d1b0 Warn about the owl2 omnipath device 2023-09-04 22:00:17 +02:00
98374bd303 Clean owl2 configuration 2023-09-04 21:59:56 +02:00
3b6be8a2fc Move the ceph client config to an external module 2023-09-04 21:59:04 +02:00
2bb366b9ac Reorganize secrets and ssh keys
The agenix tools needs to read the secrets from a standalone file, but
we also need the same information for the SSH keys.
2023-09-04 21:36:31 +02:00
2d16709648 Add anavarro user 2023-09-04 16:00:01 +02:00
9344daa31c Set zsh inc_append_history option 2023-09-03 16:57:53 +02:00
80c98041b5 Set zsh shell for rarias 2023-09-03 16:46:27 +02:00
3418e57907 Enable zsh and fix key bindings 2023-09-03 16:42:04 +02:00
6848b58e39 Keep a log over time with the config commits 2023-09-03 00:02:14 +02:00
f9c77b433a Store nixos config in /etc/nixos/config.rev 2023-09-02 23:37:11 +02:00
9d487845f6 Enable binary emulation for other architectures 2023-08-31 17:27:08 +02:00
3c99c2a662 Enable watchdog 2023-08-30 16:32:17 +02:00
7d09108c9f Enable all osd on boot in lake2 2023-08-30 16:32:17 +02:00
0f0a861896 Scrape lake2 too 2023-08-29 12:33:26 +02:00
beb0d5940e Also enable monitoring in lake2 2023-08-29 12:29:41 +02:00
70321ce237 Scrape metrics from bay 2023-08-29 11:58:00 +02:00