jungle-backup

Author	SHA1	Message	Date
Rodrigo Arias Mallo	448d85ef9d	Move nix-daemon exporter to modules Reviewed-by: Aleix Boné <abonerib@bsc.es> Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-06-18 15:36:09 +02:00
Rodrigo Arias Mallo	9f43a0e13b	Remove fox monitoring via IPMI We will need to set up a VPN to be able to access fox in its new location, so for now we simply remove the IPMI monitoring. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-06-02 11:26:53 +02:00
Rodrigo Arias Mallo	3a3c3050ef	Monitor fox, gateway and UPC anella via ICMP Fox should reply once the machine is connected to the UPC network. Monitoring also the gateway and UPC anella allows us to estimate if the whole network is down or just fox. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-06-02 11:26:51 +02:00
Rodrigo Arias Mallo	7a2f37aaa2	Add UPC temperature sensor monitoring These sensors are part of their air quality measurements, which just happen to be very close to our server room. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-05-29 13:01:37 +02:00
Rodrigo Arias Mallo	aae6585f66	Add meteocat exporter Allows us to track ambient temperature changes and estimate the temperature delta between the server room and exterior temperature. We should be able to predict when we would need to stop the machines due to excessive temperature as summer approaches. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-05-29 13:01:29 +02:00
Rodrigo Arias Mallo	1c15e77c83	Add custom nix-daemon exporter Allows us to see which derivations are being built in realtime. It is a bit of a hack, but it seems to work. We simply look at the environment of the child processes of nix-daemon (usually bash) and then look for the $name variable which should hold the current derivation being built. Needs root to be able to read the environ file of the different nix-daemon processes as they are owned by the nixbld* users. See: https://discourse.nixos.org/t/query-ongoing-builds/23486 Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-05-29 12:57:07 +02:00
Rodrigo Arias Mallo	abeab18270	Add raccoon node exporter monitoring Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-22 14:50:08 +02:00
Rodrigo Arias Mallo	1985b58619	Increase data retention to 5 years Now that we have more space, we can extend the retention time to 5 years to hold the monitoring metrics. For a year we have: # du -sh /var/lib/prometheus2 13G /var/lib/prometheus2 So we can expect it to increase to about 65 GiB. In the future we may want to reduce some acquisition frequency. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-22 14:50:03 +02:00
Rodrigo Arias Mallo	feb2060be7	Use IPMI host names instead of IP addresses Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-08 17:15:01 +02:00
Rodrigo Arias Mallo	00999434c2	Add fox IPMI monitoring Use agenix to store the credentials safely. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-08 17:14:59 +02:00
Rodrigo Arias Mallo	6cad205269	Add script to monitor GPFS Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 15:43:07 +01:00
Rodrigo Arias Mallo	ad4b615211	Collect statistics from logged users Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:23:48 +01:00
Rodrigo Arias Mallo	b4518b59cf	Add custom GPFS exporter for MN5 Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:23:46 +01:00
Rodrigo Arias Mallo	b3e397eb4c	Set gitea and grafana log level to warn Prevents filling the journal logs with information messages. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:27 +02:00
Rodrigo Arias Mallo	1c8efd0877	Monitor raccoon machine via IPMI Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-17 11:12:32 +02:00
Rodrigo Arias Mallo	15085c8a05	Enable Grafana email alerts Allows sending Grafana alerts via email too, so we have a redundant mechanism in case Slack fails to deliver them. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 15:57:38 +02:00
Rodrigo Arias Mallo	db2c6f7e45	Collect Gitea metrics in Prometheus Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-02 17:32:25 +02:00
Rodrigo Arias Mallo	005a67deaf	Use google.com probe instead of bsc.es The main website of the BSC is failing every day around 3:00 AM for almost one hour, so it is not a very good target. Instead, google.com is used which should be more reliable. The same robots.txt path is fetched, as it is smaller than the main page. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-03-05 16:52:21 +01:00
Rodrigo Arias Mallo	f8097cb5cb	Add another HTTPS probe for bsc.es As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we cannot detect a problem in our proxy or in the BSC one. Adding another target like bsc.es that doesn't use the ops proxy allows us to discern where the problem lies. Instead of monitoring https://www.bsc.es/ directly, which will trigger the whole Drupal server and take a whole second, we just fetch robots.txt so the overhead on the server is minimal (and returns in less than 10 ms). Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-02-13 12:26:56 +01:00
Rodrigo Arias Mallo	b299ead00b	Monitor https://pm.bsc.es/gitlab/ too The GitLab instance is in the /gitlab endpoint and may fail independently of https://pm.bsc.es/. Cc: Víctor López <victor.lopez@bsc.es> Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-12-05 09:56:28 +01:00
Rodrigo Arias Mallo	2953080fb8	Monitor anella instead of gw.bsc.es The target gw.bsc.es doesn't reply to our ICMP probes from hut. However, the anella hop in the tracepath is a good candidate to identify cuts between the login and the provider and between the provider and external hosts like Google or Cloudflare DNS. Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-27 12:46:08 +02:00
Rodrigo Arias Mallo	9871517be2	Add ICMP probes These probes check if we can reach several targets via ICMP, which is not proxied, so they can be used to see if ICMP forwarding is working in the login node. In particular, we test if we can reach the Google (8.8.8.8) and Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping only from the intranet and the login node (ssfhead). Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-25 17:13:03 +02:00
Rodrigo Arias Mallo	736eacaac5	Enable proxy for Grafana too The alerts need to contact the slack endpoint, so we add the proxy environment variables to the grafana systemd service. Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-25 16:55:56 +02:00
Rodrigo Arias Mallo	42920c2521	Monitor gitlab.bsc.es too	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	4acd35e036	Monitor PM webpage via blackbox	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	61646cb3bd	Allow anonymous access to grafana	2023-09-22 10:51:30 +02:00
Rodrigo Arias Mallo	f49ae0773e	Enable slurm-exporter service	2023-09-21 21:40:02 +02:00
Rodrigo Arias Mallo	7ddd1977f3	Make exporters listen in localhost only	2023-09-08 18:13:04 +02:00
Rodrigo Arias Mallo	0f0a861896	Scrape lake2 too	2023-08-29 12:33:26 +02:00
Rodrigo Arias Mallo	70321ce237	Scrape metrics from bay	2023-08-29 11:58:00 +02:00
Rodrigo Arias Mallo	dfffc0bdce	Add ceph metrics to prometheus	2023-08-22 16:33:55 +02:00
Rodrigo Arias Mallo	f8fb5fa4ff	Monitor power from other nodes via LAN	2023-08-22 11:28:54 +02:00
Rodrigo Arias Mallo	acf9b71f04	Increase prometheus retention time to one year	2023-08-22 11:28:54 +02:00
Rodrigo Arias Mallo	55d6c17776	Allow access to devices for node_exporter	2023-07-28 13:55:35 +02:00
Rodrigo Arias Mallo	f4ac9f3186	Simplify flake and expose host pkgs The configuration of the machines is now moved to m/	2023-06-16 11:31:31 +02:00

35 Commits