jungle-backup

Author	SHA1	Message	Date
Rodrigo Arias Mallo	b7603053fa	Remove unused blackbox configuration modules Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-15 11:18:30 +02:00
Rodrigo Arias Mallo	3ca55acfdf	Use IPv4 in blackbox probes Otherwise they simply fail as IPv6 doesn't work. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-15 11:18:26 +02:00
Rodrigo Arias Mallo	e8f5ce735e	Remove proxy from hut HTTP probes Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-15 11:18:04 +02:00
Rodrigo Arias Mallo	448d85ef9d	Move nix-daemon exporter to modules Reviewed-by: Aleix Boné <abonerib@bsc.es> Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-06-18 15:36:09 +02:00
Rodrigo Arias Mallo	0317f42613	Create specific SSF rack configuration Allow xeon machines to optionally inherit SSF configuration such as the NFS mount point and the network configuration. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-06-11 16:03:49 +02:00
Rodrigo Arias Mallo	9f43a0e13b	Remove fox monitoring via IPMI We will need to setup an VPN to be able to access fox in its new location, so for now we simply remove the IPMI monitoring. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-06-02 11:26:53 +02:00
Rodrigo Arias Mallo	3a3c3050ef	Monitor fox, gateway and UPC anella via ICMP Fox should reply once the machine is connected to the UPC network. Monitoring also the gateway and UPC anella allows us to estimate if the whole network is down or just fox. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-06-02 11:26:51 +02:00
Rodrigo Arias Mallo	7a2f37aaa2	Add UPC temperature sensor monitoring These sensors are part of their air quality measurements, which just happen to be very close to our server room. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-05-29 13:01:37 +02:00
Rodrigo Arias Mallo	aae6585f66	Add meteocat exporter Allows us to track ambient temperature changes and estimate the temperature delta between the server room and exterior temperature. We should be able to predict when we would need to stop the machines due to excesive temperature as summer approaches. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-05-29 13:01:29 +02:00
Rodrigo Arias Mallo	1c15e77c83	Add custom nix-daemon exporter Allows us to see which derivations are being built in realtime. It is a bit of a hack, but it seems to work. We simply look at the environment of the child processes of nix-daemon (usually bash) and then look for the $name variable which should hold the current derivation being built. Needs root to be able to read the environ file of the different nix-daemon processes as they are owned by the nixbld* users. See: https://discourse.nixos.org/t/query-ongoing-builds/23486 Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-05-29 12:57:07 +02:00
Rodrigo Arias Mallo	abeab18270	Add raccoon node exporter monitoring Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-22 14:50:08 +02:00
Rodrigo Arias Mallo	1985b58619	Increase data retention to 5 years Now that we have more space, we can extend the retention time to 5 years to hold the monitoring metrics. For a year we have: # du -sh /var/lib/prometheus2 13G /var/lib/prometheus2 So we can expect it to increase to about 65 GiB. In the future we may want to reduce some adquisition frequency. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-22 14:50:03 +02:00
Rodrigo Arias Mallo	44bd061823	Don't forward any docker traffic Access to the 23080 local port will be done by applying the INPUT rules, which pass through nixos-fw. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-15 14:16:15 +02:00
Rodrigo Arias Mallo	e8c309f584	Allow traffic from docker to enter port 23080 Before: hut% sudo docker run -it --rm alpine /bin/ash -xc 'true \| nc -w 3 -v 10.0.40.7 23080' + true + nc -w 3 -v 10.0.40.7 23080 nc: 10.0.40.7 (10.0.40.7:23080): Operation timed out After: hut% sudo docker run -it --rm alpine /bin/ash -xc 'true \| nc -w 3 -v 10.0.40.7 23080' + true + nc -w 3 -v 10.0.40.7 23080 10.0.40.7 (10.0.40.7:23080) open Fixes: rarias/jungle#94 Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-15 14:16:10 +02:00
Rodrigo Arias Mallo	9c503fbefb	Clean all iptables rules on stop Prevents the "iptables: Chain already exists." error by making sure that we don't leave any chain on start. The ideal solution is to use iptables-restore instead, which will do the right job. But this needs to be changed in NixOS entirely. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-15 14:08:14 +02:00
Rodrigo Arias Mallo	51b6a8b612	Make nginx listen on all interfaces Needed for local hosts to contact the nix cache via HTTP directly. We also allow the incoming traffic on port 80. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-15 14:08:07 +02:00
Aleix Boné	52213d388d	Fix nginx /cache regex `nix-serve` does not handle duplicates in the path: ``` hut$ curl http://127.0.0.1:5000/nix-cache-info StoreDir: /nix/store WantMassQuery: 1 Priority: 30 hut$ curl http://127.0.0.1:5000//nix-cache-info File not found. ``` This meant that the cache was not accessible via: `curl https://jungle.bsc.es/cache/nix-cache-info` but `curl https://jungle.bsc.es/cachenix-cache-info` worked. Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2025-04-15 14:08:04 +02:00
Rodrigo Arias Mallo	edf744db8d	Add new GitLab runner for gitlab.bsc.es It uses docker based on alpine and the host nix store, so we can perform builds but isolate them from the system. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-08 17:41:18 +02:00
Rodrigo Arias Mallo	fee1d4da7e	Don't move doc in web output Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-08 17:15:19 +02:00
Rodrigo Arias Mallo	feb2060be7	Use IPMI host names instead of IP addresses Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-08 17:15:01 +02:00
Rodrigo Arias Mallo	00999434c2	Add fox IPMI monitoring Use agenix to store the credentials safely. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-04-08 17:14:59 +02:00
Rodrigo Arias Mallo	6cad205269	Add script to monitor GPFS Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 15:43:07 +01:00
Rodrigo Arias Mallo	ad4b615211	Collect statistics from logged users Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:23:48 +01:00
Rodrigo Arias Mallo	b4518b59cf	Add custom GPFS exporter for MN5 Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:23:46 +01:00
Rodrigo Arias Mallo	45dc4124a3	Remove exception to fetch task endpoint It causes the request to go to the website rather than the Gitea service. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:23:43 +01:00
Rodrigo Arias Mallo	bdfe9a48fd	Use SSD for boot, then switch to NVME Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:23:40 +01:00
Rodrigo Arias Mallo	1b337d31f8	Use NVME as root Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:23:37 +01:00
Rodrigo Arias Mallo	717cd5a21e	Keep host header for Grafana requests This was breaking requests due to CSRF check. See: https://github.com/grafana/grafana/issues/45117#issuecomment-1033842787 Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:23:32 +01:00
Rodrigo Arias Mallo	def5955614	Ignore logging requests from the gitea runner Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:23:28 +01:00
Rodrigo Arias Mallo	0e3c975cb5	Log the client IP not the proxy Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:23:22 +01:00
Rodrigo Arias Mallo	36592c44eb	Create paste directories in /ceph/p Ensure that all hut users have a paste directory in /ceph/p owned by themselves. We need to wait for the ceph mount point to create them, so we use a systemd service that waits for the remote-fs.target. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:23:16 +01:00
Rodrigo Arias Mallo	0d2dea94fb	Add p command to paste files Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:23:10 +01:00
Rodrigo Arias Mallo	7f539d7e06	Use nginx to serve website and other services Instead of using multiple tunels to forward all our services to the VM that serves jungle.bsc.es, just use nginx to redirect the traffic from hut. This allows adding custom rules for paths that are not posible otherwise. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:23:07 +01:00
Rodrigo Arias Mallo	f8ec090836	Mount the NVME disk in /nvme Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-01-16 14:22:58 +01:00
Rodrigo Arias Mallo	c8687f7e45	Emulate other architectures in owl nodes too Allows cross-compilation of packages for RISC-V that are known to try to run RISC-V programs in the host. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:39 +02:00
Rodrigo Arias Mallo	b3e397eb4c	Set gitea and grafana log level to warn Prevents filling the journal logs with information messages. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:27 +02:00
Rodrigo Arias Mallo	8ca1d84844	Use authentication tokens for PM GitLab runner Starting with GitLab 16, there is a new mechanism to authenticate the runners via authentication tokens, so use it instead. Older tokens and runners are also removed, as they are no longer used. With the new way of managing tokens, both the tags and the locked state are managed from the GitLab web page. See: https://docs.gitlab.com/ee/ci/runners/new_creation_workflow.html Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:36:16 +02:00
Rodrigo Arias Mallo	958ad1f025	Allow incoming traffic to hut proxy Reviewed-by: Aleix Boné <abonerib@bsc.es>	2024-09-12 08:35:23 +02:00
Rodrigo Arias Mallo	4e2b80defd	Add support for armv7 emulation in hut Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-17 11:12:48 +02:00
Rodrigo Arias Mallo	1c8efd0877	Monitor raccoon machine via IPMI Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-17 11:12:32 +02:00
Rodrigo Arias Mallo	72faf8365b	Split xeon specific configuration from base To accomodate the raccoon knights workstation, some of the configuration pulled by m/common/main.nix has to be removed. To solve it, the xeon specific parts are placed into m/common/xeon.nix and only the common configuration is at m/common/base.nix. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-16 12:35:37 +02:00
Rodrigo Arias Mallo	22cc1d33f7	Add PostgreSQL DB for performance test results The database will hold the performance results of the execution of the benchmarks. We follow the same setup on knights3 for now. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-07-16 12:35:24 +02:00
Rodrigo Arias Mallo	15085c8a05	Enable Grafana email alerts Allows sending Grafana alerts via email too, so we have a reduntant mechanism in case Slack fails to deliver them. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 15:57:38 +02:00
Rodrigo Arias Mallo	06748dac1d	Enable mail notification in Gitea Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 10:56:49 +02:00
Rodrigo Arias Mallo	63851306ac	Add msmtp to send notifications via email Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 10:56:20 +02:00
Rodrigo Arias Mallo	db2c6f7e45	Collect Gitea metrics in Prometheus Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-02 17:32:25 +02:00
Rodrigo Arias Mallo	8e8f9e7adb	Add Gitea service Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-02 17:31:51 +02:00
Rodrigo Arias Mallo	005a67deaf	Use google.com probe instead of bsc.es The main website of the BSC is failing every day around 3:00 AM for almost one hour, so it is not a very good target. Instead, google.com is used which should be more reliable. The same robots.txt path is fetched, as it is smaller than the main page. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-03-05 16:52:21 +01:00
Rodrigo Arias Mallo	f8097cb5cb	Add another HTTPS probe for bsc.es As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we cannot detect a problem in our proxy or in the BSC one. Adding another target like bsc.es that doesn't use the ops proxy allows us to discern where the problem lies. Instead of monitoring https://www.bsc.es/ directly, which will trigger the whole Drupal server and take a whole second, we just fetch robots.txt so the overhead on the server is minimal (and returns in less than 10 ms). Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-02-13 12:26:56 +01:00
Aleix Roca Nonell	ff792f5f48	Move slurm client in a separate module Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2024-02-13 11:11:17 +01:00

1 2

91 Commits