jungle

Author	SHA1	Message	Date
Rodrigo Arias Mallo	15085c8a05	Enable Grafana email alerts Allows sending Grafana alerts via email too, so we have a redundant mechanism in case Slack fails to deliver them. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 15:57:38 +02:00
Rodrigo Arias Mallo	06748dac1d	Enable mail notification in Gitea Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 10:56:49 +02:00
Rodrigo Arias Mallo	63851306ac	Add msmtp to send notifications via email Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 10:56:20 +02:00
Rodrigo Arias Mallo	db2c6f7e45	Collect Gitea metrics in Prometheus Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-02 17:32:25 +02:00
Rodrigo Arias Mallo	8e8f9e7adb	Add Gitea service Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-02 17:31:51 +02:00
Rodrigo Arias Mallo	005a67deaf	Use google.com probe instead of bsc.es The main website of the BSC is failing every day around 3:00 AM for almost one hour, so it is not a very good target. Instead, google.com is used which should be more reliable. The same robots.txt path is fetched, as it is smaller than the main page. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-03-05 16:52:21 +01:00
Rodrigo Arias Mallo	f8097cb5cb	Add another HTTPS probe for bsc.es As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we cannot detect a problem in our proxy or in the BSC one. Adding another target like bsc.es that doesn't use the ops proxy allows us to discern where the problem lies. Instead of monitoring https://www.bsc.es/ directly, which will trigger the whole Drupal server and take a whole second, we just fetch robots.txt so the overhead on the server is minimal (and returns in less than 10 ms). Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-02-13 12:26:56 +01:00
Aleix Roca Nonell	ff792f5f48	Move slurm client in a separate module Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2024-02-13 11:11:17 +01:00
Rodrigo Arias Mallo	5c48b43ae0	Enable public-inbox at jungle.bsc.es/lists The public-inbox service fetches emails from the sourcehut mailing lists and displays them on the web. The idea is to reduce the dependency on external services and add a secondary storage for the mailing lists in case sourcehut goes down or changes the current free plans. The service is available at https://jungle.bsc.es/lists/ and is open to the public. It currently mirrors the bscpkgs and jungle mailing lists. We also edited the CSS to improve the readability and have larger fonts by default. The service for public-inbox produced by NixOS is not well configured to fetch emails from an IMAP mail server, so we also manually edit the service file to enable the network. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-12-15 11:18:08 +01:00
Rodrigo Arias Mallo	b299ead00b	Monitor https://pm.bsc.es/gitlab/ too The GitLab instance is in the /gitlab endpoint and may fail independently of https://pm.bsc.es/. Cc: Víctor López <victor.lopez@bsc.es> Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-12-05 09:56:28 +01:00
Aleix Roca Nonell	a92432cf5a	Enable nixseparatedebuginfod module The module is only enabled on Hut and Eudy because we noticed activity on the debuginfod service even if no debug session was active. Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2023-12-04 11:04:52 +01:00
Rodrigo Arias Mallo	35a94a9b02	Enable runners for pm.bsc.es/gitlab too The old runners for the PM gitlab were disabled in configuration in the last outage, but they remained working until we rebooted the node. With this change we enable the runners for both PM and gitlab.bsc.es. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-11-24 14:45:23 +01:00
Rodrigo Arias Mallo	2953080fb8	Monitor anella instead of gw.bsc.es The target gw.bsc.es doesn't reply to our ICMP probes from hut. However, the anella hop in the tracepath is a good candidate to identify cuts between the login and the provider and between the provider and external hosts like Google or Cloudflare DNS. Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-27 12:46:08 +02:00
Rodrigo Arias Mallo	9871517be2	Add ICMP probes These probes check if we can reach several targets via ICMP, which is not proxied, so they can be used to see if ICMP forwarding is working in the login node. In particular, we test if we can reach the Google (8.8.8.8) and Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping only from the intranet and the login node (ssfhead). Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-25 17:13:03 +02:00
Rodrigo Arias Mallo	736eacaac5	Enable proxy for Grafana too The alerts need to contact the slack endpoint, so we add the proxy environment variables to the grafana systemd service. Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-25 16:55:56 +02:00
Rodrigo Arias Mallo	0e66aad099	Make blackbox exporter use the proxy By default it was trying to reach the targets using the default gateway, but since the electrical cut of 2023-10-20, the login node has not enabled forwarding again. So better if we don't rely on it. Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-25 16:55:24 +02:00
Rodrigo Arias Mallo	d52d22e0db	Add docker runner too	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	42920c2521	Monitor gitlab.bsc.es too	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	4acd35e036	Monitor PM webpage via blackbox	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	621d20db3a	Temporarily disable pm runners	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	0926f6ec1f	Add runner for gitlab.bsc.es	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	61646cb3bd	Allow anonymous access to grafana	2023-09-22 10:51:30 +02:00
Rodrigo Arias Mallo	f49ae0773e	Enable slurm-exporter service	2023-09-21 21:40:02 +02:00
Rodrigo Arias Mallo	d9d249411d	Monitor storage nodes via IPMI too	2023-09-13 15:57:13 +02:00
Rodrigo Arias Mallo	75b0f48715	Serve the nix store from hut	2023-09-12 12:19:43 +02:00
Rodrigo Arias Mallo	7ddd1977f3	Make exporters listen in localhost only	2023-09-08 18:13:04 +02:00
Rodrigo Arias Mallo	77cb3c494e	Poweroff idle slurm nodes after 1 hour	2023-09-08 16:49:53 +02:00
Rodrigo Arias Mallo	dca274d020	Unlock ovni gitlab runners	2023-09-05 16:59:45 +02:00
Rodrigo Arias Mallo	02f40a8217	Add agenix to all nodes	2023-09-04 22:10:43 +02:00
Rodrigo Arias Mallo	ab55aac5ff	Remove old secrets	2023-09-04 22:04:32 +02:00
Rodrigo Arias Mallo	3b6be8a2fc	Move the ceph client config to an external module	2023-09-04 21:59:04 +02:00
Rodrigo Arias Mallo	2bb366b9ac	Reorganize secrets and ssh keys The agenix tool needs to read the secrets from a standalone file, but we also need the same information for the SSH keys.	2023-09-04 21:36:31 +02:00
Rodrigo Arias Mallo	9d487845f6	Enable binary emulation for other architectures	2023-08-31 17:27:08 +02:00
Rodrigo Arias Mallo	0f0a861896	Scrape lake2 too	2023-08-29 12:33:26 +02:00
Rodrigo Arias Mallo	70321ce237	Scrape metrics from bay	2023-08-29 11:58:00 +02:00
Rodrigo Arias Mallo	fad9df61e1	Add fio tool	2023-08-29 11:27:50 +02:00
Rodrigo Arias Mallo	d2a80c8c18	Add ceph tools in hut too	2023-08-28 17:58:21 +02:00
Rodrigo Arias Mallo	3416416864	Disable pixiecore in hut for now	2023-08-25 13:21:00 +02:00
Rodrigo Arias Mallo	815888fb07	Add PXE helper	2023-08-25 12:05:33 +02:00
Rodrigo Arias Mallo	077eece6b9	Add agenix to PATH in hut	2023-08-23 17:42:50 +02:00
Rodrigo Arias Mallo	b3ef53de51	Store ceph secret key in age This allows a node to mount the ceph FS without any extra ceph configuration in /etc/ceph.	2023-08-23 17:26:44 +02:00
Rodrigo Arias Mallo	e0852ee89b	Add rarias key for secrets	2023-08-23 17:15:26 +02:00
Rodrigo Arias Mallo	dfffc0bdce	Add ceph metrics to prometheus	2023-08-22 16:33:55 +02:00
Rodrigo Arias Mallo	8257c245b1	Mount the ceph filesystem in hut	2023-08-22 16:15:46 +02:00
Rodrigo Arias Mallo	f8fb5fa4ff	Monitor power from other nodes via LAN	2023-08-22 11:28:54 +02:00
Rodrigo Arias Mallo	acf9b71f04	Increase prometheus retention time to one year	2023-08-22 11:28:54 +02:00
Rodrigo Arias Mallo	55d6c17776	Allow access to devices for node_exporter	2023-07-28 13:55:35 +02:00
Rodrigo Arias Mallo	2e95281af5	Add owl and all partition	2023-06-16 11:34:00 +02:00
Rodrigo Arias Mallo	f4ac9f3186	Simplify flake and expose host pkgs The configuration of the machines is now moved to m/	2023-06-16 11:31:31 +02:00

49 Commits