Commit Graph

214 Commits

Author SHA1 Message Date
2953080fb8 Monitor anella instead of gw.bsc.es
The target gw.bsc.es doesn't reply to our ICMP probes from hut. However,
the anella hop in the tracepath is a good candidate to identify cuts
between the login and the provider and between the provider and external
hosts like Google or Cloudflare DNS.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-27 12:46:08 +02:00
9871517be2 Add ICMP probes
These probes check if we can reach several targets via ICMP, which is
not proxied, so they can be used to see if ICMP forwarding is working in
the login node.

In particular, we test if we can reach the Google (8.8.8.8) and
Cloudflare (1.1.1.1) DNS servers, the BSC gateway (which responds to
ping only from the intranet), and the login node (ssfhead).

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 17:13:03 +02:00
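
As a rough illustration of the ICMP probes described in the commit above, the following NixOS sketch defines an `icmp` module for the blackbox exporter and a Prometheus scrape job for the two public DNS targets. It is not taken from the repository: the job name, the timeout and the use of the default exporter port 9115 are assumptions.

```nix
{ pkgs, ... }:
{
  # Blackbox exporter with a single ICMP module (timeout is an assumed value).
  services.prometheus.exporters.blackbox = {
    enable = true;
    listenAddress = "127.0.0.1";
    configFile = pkgs.writeText "blackbox.yml" ''
      modules:
        icmp:
          prober: icmp
          timeout: 5s
    '';
  };

  services.prometheus.scrapeConfigs = [
    {
      job_name = "blackbox-icmp"; # hypothetical job name
      metrics_path = "/probe";
      params.module = [ "icmp" ];
      static_configs = [
        { targets = [ "8.8.8.8" "1.1.1.1" ]; }
      ];
      # Standard blackbox relabeling: pass the target as a query parameter
      # and scrape the local exporter instead of the target itself.
      relabel_configs = [
        { source_labels = [ "__address__" ]; target_label = "__param_target"; }
        { source_labels = [ "__param_target" ]; target_label = "instance"; }
        { target_label = "__address__"; replacement = "127.0.0.1:9115"; }
      ];
    }
  ];
}
```

The resulting `probe_success` metric is 1 when a target answers and 0 otherwise, which is what an alert on lost ICMP forwarding would typically watch.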
736eacaac5 Enable proxy for Grafana too
The alerts need to contact the slack endpoint, so we add the proxy
environment variables to the grafana systemd service.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 16:55:56 +02:00
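
A minimal sketch of how proxy environment variables can be added to the Grafana systemd service in NixOS, as the commit above describes. The proxy address is a placeholder and not the one used in the repository.

```nix
{ ... }:
let
  # Hypothetical proxy address; replace with the real one.
  proxy = "http://proxy.example:3128";
in
{
  # Merged into the environment of the unit created by services.grafana.
  systemd.services.grafana.environment = {
    http_proxy = proxy;
    https_proxy = proxy;
  };
}
```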
0e66aad099 Make blackbox exporter use the proxy
By default it was trying to reach the targets using the default gateway,
but since the electrical cut of 2023-10-20, the login node has not
re-enabled forwarding, so it is better not to rely on it.

Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-10-25 16:55:24 +02:00
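
The commit above does not say which mechanism was used to route the probes through the proxy. One possible approach, shown here only as a sketch, is the `proxy_url` setting of the blackbox exporter HTTP prober. The module name and the proxy address are assumptions.

```nix
{ pkgs, ... }:
{
  services.prometheus.exporters.blackbox.configFile =
    pkgs.writeText "blackbox.yml" ''
      modules:
        http_2xx:
          prober: http
          http:
            # Hypothetical proxy address; replace with the real one.
            proxy_url: "http://proxy.example:3128"
    '';
}
```

This only affects HTTP probes. ICMP probes cannot be proxied this way and still depend on forwarding in the login node.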
67a4905a0a Don't log SLURM connection attempts from ssfhead 2023-10-06 15:22:04 +02:00
d52d22e0db Add docker runner too 2023-10-06 15:17:07 +02:00
42920c2521 Monitor gitlab.bsc.es too 2023-10-06 15:17:07 +02:00
4acd35e036 Monitor PM webpage via blackbox 2023-10-06 15:17:07 +02:00
621d20db3a Temporarily disable pm runners 2023-10-06 15:17:07 +02:00
0926f6ec1f Add runner for gitlab.bsc.es 2023-10-06 15:17:07 +02:00
61646cb3bd Allow anonymous access to grafana 2023-09-22 10:51:30 +02:00
c0066c4744 Remove user/group when using DynamicUsers 2023-09-22 10:13:06 +02:00
ffd0593f51 Set the SLURM_CONF variable 2023-09-21 22:22:00 +02:00
f49ae0773e Enable slurm-exporter service 2023-09-21 21:40:02 +02:00
8fa3fccecb Add prometheus-slurm-exporter package 2023-09-21 21:34:18 +02:00
9ee7111453 Document the hut shared nix store for SLURM 2023-09-21 13:51:42 +02:00
8de3d2b149 Mount the hut nix store for SLURM jobs 2023-09-20 19:38:43 +02:00
bc62e28ca3 Enable direnv integration 2023-09-20 09:32:58 +02:00
d612a5453c Add System Integration Service Guide document 2023-09-19 15:12:59 +02:00
653d411b9e Remove bscpkgs from the registry and nixPath
This is done to prevent accidental evaluations where the nixpkgs input
of bscpkgs is still pointing to a different version than the one
specified in the jungle flake. Instead use jungle#bscpkgs.X to get a
package from bscpkgs.
2023-09-15 12:00:33 +02:00
51c57dbc41 Add bscpkgs and nixpkgs top level attributes
Allows the evaluation of packages of the intermediate overlays.
2023-09-15 12:00:33 +02:00
33cd40160e Use hut packages as the default package set
Allows the user to directly access nixpkgs and bscpkgs from the top
level as `nix build jungle#htop` and `nix build jungle#bsc.ovni`.
2023-09-15 12:00:28 +02:00
a1e8cfea47 Don't fetch registry flakes from the net 2023-09-15 12:00:28 +02:00
5d72ee3da3 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=3a4062ac04be6263c64a481420d8e768c2521b80' (2023-09-14)
2023-09-15 11:50:47 +02:00
fdc6445d47 Revert "Update slurm to 23.02.05.1"
This reverts commit aaefddc44a.
2023-09-14 15:46:18 +02:00
e88805947e Open ports in firewall of compute nodes 2023-09-14 15:45:43 +02:00
aaefddc44a Update slurm to 23.02.05.1 2023-09-13 17:44:24 +02:00
d9d249411d Monitor storage nodes via IPMI too 2023-09-13 15:57:13 +02:00
c07f75c6bb Specify the space available in /ceph 2023-09-13 14:19:59 +02:00
8d449ba20c Add update post to website 2023-09-12 18:13:38 +02:00
10ca572aec Enable fstrim service 2023-09-12 16:39:45 +02:00
75b0f48715 Serve the nix store from hut 2023-09-12 12:19:43 +02:00
19a451db77 Add encrypted munge key with agenix 2023-09-08 19:05:45 +02:00
ec9be9bb62 Remove unused large port hole in firewall 2023-09-08 18:22:48 +02:00
7ddd1977f3 Make exporters listen on localhost only 2023-09-08 18:13:04 +02:00
7050c505b5 Allow only some ports for srun 2023-09-08 17:51:37 +02:00
033a1fe97b Block ssfhead from reaching our slurm daemon 2023-09-08 17:36:28 +02:00
77cb3c494e Poweroff idle slurm nodes after 1 hour 2023-09-08 16:49:53 +02:00
6db5772ac4 Add IB and IPMI node host names 2023-09-08 13:21:37 +02:00
3e347e673c flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=6122fef92701701e1a0622550ac0fc5c2beb5906' (2023-09-07)
2023-09-07 11:13:45 +02:00
dca274d020 Unlock ovni gitlab runners 2023-09-05 16:59:45 +02:00
c33909f32f Update email contact to jungle mail list 2023-09-05 16:10:58 +02:00
64e856e8b9 flake.lock: Update
Flake lock file updates:

• Updated input 'bscpkgs':
    'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=18d64c352c10f9ce74aabddeba5a5db02b74ec27' (2023-08-31)
  → 'git+https://pm.bsc.es/gitlab/rarias/bscpkgs.git?ref=refs/heads/master&rev=ee24b910a1cb95bd222e253da43238e843816f2f' (2023-09-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/d680ded26da5cf104dd2735a51e88d2d8f487b4d' (2023-08-19)
  → 'github:NixOS/nixpkgs/e56990880811a451abd32515698c712788be5720' (2023-09-02)
2023-09-05 15:03:26 +02:00
02f40a8217 Add agenix to all nodes 2023-09-04 22:10:43 +02:00
77d43b6da9 Add agenix module to ceph 2023-09-04 22:07:07 +02:00
ab55aac5ff Remove old secrets 2023-09-04 22:04:32 +02:00
9b5bfbb7a3 Mount /ceph in owl1 and owl2 2023-09-04 22:00:36 +02:00
a69a71d1b0 Warn about the owl2 omnipath device 2023-09-04 22:00:17 +02:00
98374bd303 Clean owl2 configuration 2023-09-04 21:59:56 +02:00
3b6be8a2fc Move the ceph client config to an external module 2023-09-04 21:59:04 +02:00