jungle

Author	SHA1	Message	Date
Rodrigo Arias Mallo	37073f017d	Add raccoon workstation with Nvidia GPU The raccoon workstation has an Nvidia GTX 960 GPU which will be used for CUDA experiments. The configuration uses the production Nvidia driver at version 550 which still supports the GPU. The current CUDA 12.2 version is also supported by the driver. The workstation has Internet access directly from the gateway, but name resolution via Google DNS servers seems to be blocked, so we use BSC servers for now. The NixOS system is installed in a partition alongside the old Debian system, until we decide that it is no longer necessary to keep both. The old /home partition is not used as we are using the same UIDs and groups from the xeon machines, which don't match the ones here.	2024-06-07 10:47:10 +02:00
Rodrigo Arias Mallo	4077a87021	Split xeon specific configuration from base To accommodate the raccoon knights workstation, some of the configuration pulled by m/common/main.nix has to be removed. To solve it, the xeon specific parts are placed into m/common/xeon.nix and only the common configuration is at m/common/base.nix.	2024-06-07 10:45:46 +02:00
Rodrigo Arias Mallo	cc31cb30c3	Control user access to each machine The users.jungleUsers configuration option behaves like the users.users option, but defines the list attribute `hosts` for each user, which filters users so that each user can only access those hosts.	2024-06-07 10:45:46 +02:00
Rodrigo Arias Mallo	1dd7550459	Add PostgreSQL DB for performance tests results The database will hold the performance results of the execution of the benchmarks. We follow the same setup on knights3 for now.	2024-06-07 10:44:22 +02:00
Rodrigo Arias Mallo	15085c8a05	Enable Grafana email alerts Allows sending Grafana alerts via email too, so we have a redundant mechanism in case Slack fails to deliver them. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 15:57:38 +02:00
Rodrigo Arias Mallo	06748dac1d	Enable mail notification in Gitea Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 10:56:49 +02:00
Rodrigo Arias Mallo	63851306ac	Add msmtp to send notifications via email Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-31 10:56:20 +02:00
Rodrigo Arias Mallo	2bdc793c8c	Allow Ceph traffic to lake2	2024-05-02 17:43:48 +02:00
Rodrigo Arias Mallo	db2c6f7e45	Collect Gitea metrics in Prometheus Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-02 17:32:25 +02:00
Rodrigo Arias Mallo	8e8f9e7adb	Add Gitea service Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-05-02 17:31:51 +02:00
Rodrigo Arias Mallo	d2adc3a6d3	Add firewall rules for Ceph and monitoring The firewall was blocking the monitoring traffic from hut and the Ceph traffic among OSDs. The rules only allow connecting from the specific host that they are supposed to be coming from. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-04-25 13:25:11 +02:00
Rodrigo Arias Mallo	49be0f208c	Remove nixseparatedebuginfod input It has been integrated in nixpkgs, so is no longer required. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-04-25 13:24:58 +02:00
Rodrigo Arias Mallo	005a67deaf	Use google.com probe instead of bsc.es The main website of the BSC is failing every day around 3:00 AM for almost one hour, so it is not a very good target. Instead, google.com is used which should be more reliable. The same robots.txt path is fetched, as it is smaller than the main page. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-03-05 16:52:21 +01:00
Rodrigo Arias Mallo	f8097cb5cb	Add another HTTPS probe for bsc.es As all other HTTPS probes pass through the opsproxy01.bsc.es proxy, we cannot detect a problem in our proxy or in the BSC one. Adding another target like bsc.es that doesn't use the ops proxy allows us to discern where the problem lies. Instead of monitoring https://www.bsc.es/ directly, which will trigger the whole Drupal server and take a whole second, we just fetch robots.txt so the overhead on the server is minimal (and returns in less than 10 ms). Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2024-02-13 12:26:56 +01:00
Aleix Roca Nonell	ff792f5f48	Move slurm client into a separate module Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2024-02-13 11:11:17 +01:00
Rodrigo Arias Mallo	5c48b43ae0	Enable public-inbox at jungle.bsc.es/lists The public-inbox service fetches emails from the sourcehut mailing lists and displays them on the web. The idea is to reduce the dependency on external services and add a secondary storage for the mailing lists in case sourcehut goes down or changes the current free plans. The service is available at https://jungle.bsc.es/lists/ and is open to the public. It currently mirrors the bscpkgs and jungle mailing lists. We also edited the CSS to improve the readability and have larger fonts by default. The service for public-inbox produced by NixOS is not well configured to fetch emails from an IMAP mail server, so we also manually edit the service file to enable the network. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-12-15 11:18:08 +01:00
Rodrigo Arias Mallo	b299ead00b	Monitor https://pm.bsc.es/gitlab/ too The GitLab instance is in the /gitlab endpoint and may fail independently of https://pm.bsc.es/. Cc: Víctor López <victor.lopez@bsc.es> Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-12-05 09:56:28 +01:00
Aleix Roca Nonell	a92432cf5a	Enable nixseparatedebuginfod module The module is only enabled on Hut and Eudy because we noticed activity on the debuginfod service even if no debug session was active. Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2023-12-04 11:04:52 +01:00
Rodrigo Arias Mallo	82f5d828c2	Use tmpfs in /tmp The /tmp directory was using the SSD disk which is not erased across boots. Nix will use /tmp to perform the builds, so we want it to be as fast as possible. In general, all the machines have enough space to handle large builds like LLVM. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-11-28 12:25:50 +01:00
Rodrigo Arias Mallo	35a94a9b02	Enable runners for pm.bsc.es/gitlab too The old runners for the PM gitlab were disabled in configuration in the last outage, but they remained working until we reboot the node. With this change we enable the runners for both PM and gitlab.bsc.es. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-11-24 14:45:23 +01:00
Rodrigo Arias Mallo	b6bd31e159	Remove complete ceph package from hut Only the ceph-client is needed. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-11-24 12:58:54 +01:00
Rodrigo Arias Mallo	dd341902fc	BSC packages are no longer in bsc attribute Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-11-09 13:40:48 +01:00
Rodrigo Arias Mallo	2953080fb8	Monitor anella instead of gw.bsc.es The target gw.bsc.es doesn't reply to our ICMP probes from hut. However, the anella hop in the tracepath is a good candidate to identify cuts between the login and the provider and between the provider and external hosts like Google or Cloudflare DNS. Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-27 12:46:08 +02:00
Rodrigo Arias Mallo	9871517be2	Add ICMP probes These probes check if we can reach several targets via ICMP, which is not proxied, so they can be used to see if ICMP forwarding is working in the login node. In particular, we test if we can reach the Google (8.8.8.8) and Cloudflare (1.1.1.1) DNS servers, the BSC gateway which responds to ping only from the intranet and the login node (ssfhead). Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-25 17:13:03 +02:00
Rodrigo Arias Mallo	736eacaac5	Enable proxy for Grafana too The alerts need to contact the slack endpoint, so we add the proxy environment variables to the grafana systemd service. Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-25 16:55:56 +02:00
Rodrigo Arias Mallo	0e66aad099	Make blackbox exporter use the proxy By default it was trying to reach the targets using the default gateway, but since the electrical cut of 2023-10-20, the login node has not enabled forwarding again. So better if we don't rely on it. Reviewed-By: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2023-10-25 16:55:24 +02:00
Rodrigo Arias Mallo	67a4905a0a	Don't log SLURM connection attempts from ssfhead	2023-10-06 15:22:04 +02:00
Rodrigo Arias Mallo	d52d22e0db	Add docker runner too	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	42920c2521	Monitor gitlab.bsc.es too	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	4acd35e036	Monitor PM webpage via blackbox	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	621d20db3a	Temporarily disable pm runners	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	0926f6ec1f	Add runner for gitlab.bsc.es	2023-10-06 15:17:07 +02:00
Rodrigo Arias Mallo	61646cb3bd	Allow anonymous access to grafana	2023-09-22 10:51:30 +02:00
Rodrigo Arias Mallo	c0066c4744	Remove user/group when using DynamicUsers	2023-09-22 10:13:06 +02:00
Rodrigo Arias Mallo	ffd0593f51	Set the SLURM_CONF variable	2023-09-21 22:22:00 +02:00
Rodrigo Arias Mallo	f49ae0773e	Enable slurm-exporter service	2023-09-21 21:40:02 +02:00
Rodrigo Arias Mallo	8de3d2b149	Mount the hut nix store for SLURM jobs	2023-09-20 19:38:43 +02:00
Rodrigo Arias Mallo	bc62e28ca3	Enable direnv integration	2023-09-20 09:32:58 +02:00
Rodrigo Arias Mallo	653d411b9e	Remove bscpkgs from the registry and nixPath This is done to prevent accidental evaluations where the nixpkgs input of bscpkgs is still pointing to a different version than the one specified in the jungle flake. Instead use jungle#bscpkgs.X to get a package from bscpkgs.	2023-09-15 12:00:33 +02:00
Rodrigo Arias Mallo	a1e8cfea47	Don't fetch registry flakes from the net	2023-09-15 12:00:28 +02:00
Rodrigo Arias Mallo	e88805947e	Open ports in firewall of compute nodes	2023-09-14 15:45:43 +02:00
Rodrigo Arias Mallo	d9d249411d	Monitor storage nodes via IPMI too	2023-09-13 15:57:13 +02:00
Rodrigo Arias Mallo	10ca572aec	Enable fstrim service	2023-09-12 16:39:45 +02:00
Rodrigo Arias Mallo	75b0f48715	Serve the nix store from hut	2023-09-12 12:19:43 +02:00
Rodrigo Arias Mallo	19a451db77	Add encrypted munge key with agenix	2023-09-08 19:05:45 +02:00
Rodrigo Arias Mallo	ec9be9bb62	Remove unused large port hole in firewall	2023-09-08 18:22:48 +02:00
Rodrigo Arias Mallo	7ddd1977f3	Make exporters listen in localhost only	2023-09-08 18:13:04 +02:00
Rodrigo Arias Mallo	7050c505b5	Allow only some ports for srun	2023-09-08 17:51:37 +02:00
Rodrigo Arias Mallo	033a1fe97b	Block ssfhead from reaching our slurm daemon	2023-09-08 17:36:28 +02:00
Rodrigo Arias Mallo	77cb3c494e	Poweroff idle slurm nodes after 1 hour	2023-09-08 16:49:53 +02:00

128 Commits