jungle-backup

Author	SHA1	Message	Date
Rodrigo Arias Mallo	00fe0f46a1	Add acinca user Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-10-01 12:27:43 +02:00
Rodrigo Arias Mallo	79940876c3	Restart slurmd on failure A failure to reach the control node can cause slurmd to fail and the unit remains in the failed state until it is manually restarted. Instead, try to restart the service every 30 seconds, forever: owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec=' Restart=on-failure RestartUSec=30s owl1% pgrep slurmd 5903 owl1% sudo kill -SEGV 5903 owl1% pgrep slurmd 6137 Fixes: rarias/jungle#177 Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-09-30 17:20:39 +02:00
Aleix Boné	163d19bd05	Lower connect timeout when using hut substituter Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2025-09-29 18:44:48 +02:00
Aleix Boné	360f67cfab	Use hut substituter in all nodes Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2025-09-29 18:44:38 +02:00
Aleix Boné	a402bc880c	Remove machine access for user csiringo Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2025-09-29 18:23:24 +02:00
Rodrigo Arias Mallo	9c3fbc0ec9	Mount apex /home via NFS in raccoon Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-09-26 12:28:53 +02:00
Rodrigo Arias Mallo	3f8e6b9fcd	Remove extra SSH jump configuration We now have direct visibility among nodes so we don't need any extra SSH configuration to reach them. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-09-26 12:28:51 +02:00
Rodrigo Arias Mallo	08e4dda6d2	Add raccoon peer to wireguard It routes traffic from fox, apex and the compute nodes so that we can reach the git servers and tent. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-09-26 12:28:48 +02:00
Rodrigo Arias Mallo	3380ec5e05	Restrict fox peer to a single IP Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-09-26 12:28:43 +02:00
Rodrigo Arias Mallo	e934a2bc9d	Use lowercase peer hostnames Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-09-26 12:28:25 +02:00
Rodrigo Arias Mallo	3387cbcc25	Share a public folder for documents Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-09-19 10:59:40 +02:00
Rodrigo Arias Mallo	ac5f4e4dca	Add amd_hsmp module in fox for AMD uProf Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-09-19 10:54:24 +02:00
Rodrigo Arias Mallo	cad88f92a8	Disable NMI watchdog in fox Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-09-19 10:54:17 +02:00
Rodrigo Arias Mallo	3ab0e13960	Add AMD uProf module and enable it in fox Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-09-19 10:54:05 +02:00
Rodrigo Arias Mallo	2ed881cd89	Mount home via NFS from apex in fox Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-09-03 15:34:02 +02:00
Rodrigo Arias Mallo	2a07df1d30	Allow access to NFS via wireguard subnet Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-09-03 15:33:47 +02:00
Rodrigo Arias Mallo	52380eae59	Use 10.106.0.0/24 subnet to avoid collisions The 106 byte is the code for 'j' (jungle) in ASCII: % printf j | od -t d 0000000 106 0000001 Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-09-03 12:03:13 +02:00
Rodrigo Arias Mallo	3b16b41be3	Revert "Remove pam_slurm_adopt from fox" This reverts commit `64a52801ed`. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-09-03 12:03:06 +02:00
Rodrigo Arias Mallo	ee481deffb	Enable fail2ban in fox Protect fox against ssh bruteforce attacks: fox% sudo lastb | head root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00) root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00) root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00) root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00) root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00) root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00) root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00) root ssh:notty 200.124.28.102 Mon Sep 1 11:25 - 11:25 (00:00) root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00) root ssh:notty 200.124.28.102 Mon Sep 1 11:24 - 11:24 (00:00) Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-09-03 12:03:02 +02:00
Rodrigo Arias Mallo	b1bad25008	Accept connections from apex to fox slurmd Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-09-03 12:03:00 +02:00
Rodrigo Arias Mallo	85f38e17a2	Accept fox connection to slurm controller Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-09-03 12:02:59 +02:00
Rodrigo Arias Mallo	08ab01b89c	Add fox machine to SLURM Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-09-03 12:02:57 +02:00
Rodrigo Arias Mallo	e7490858c6	Make apex host specific to each machine Allows direct contact via the VPN when accessing from fox, but use Internet when using the rest of the machines. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-09-03 12:02:49 +02:00
Rodrigo Arias Mallo	7606030135	Add local host fox in apex Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-09-03 12:02:46 +02:00
Rodrigo Arias Mallo	e55590f59e	Enable wireguard in apex Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-09-03 12:02:43 +02:00
Rodrigo Arias Mallo	c3da39c392	Add wireguard server in fox Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-09-03 12:02:38 +02:00
Rodrigo Arias Mallo	d3889b3339	Use writeShellScript for suspend.sh and resume.sh Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-08-29 12:35:28 +02:00
Rodrigo Arias Mallo	28540d8cf3	Add firewall rules to slurm server Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-08-29 12:35:26 +02:00
Rodrigo Arias Mallo	f847621ceb	Remove hut from slurm Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-08-29 12:35:24 +02:00
Rodrigo Arias Mallo	12fe43f95f	Only configure apex as slurm server Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-08-29 12:35:22 +02:00
Rodrigo Arias Mallo	0e8329eef3	Split slurm configuration for client and server Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-08-29 12:35:20 +02:00
Rodrigo Arias Mallo	df3b21b570	Move slurm control server to apex Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-08-29 12:35:16 +02:00
Aleix Boné	78df61d24a	Fix typo in csiringo ssh key Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2025-08-27 17:44:20 +02:00
Aleix Boné	8e7da73151	Enable nix-ld in weasel Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2025-08-27 16:19:34 +02:00
Aleix Boné	a7e17e40dc	Add csiringo user with access to apex and weasel Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2025-08-27 16:02:26 +02:00
Rodrigo Arias Mallo	0e8bd22347	Access gitlab via raccoon in fox Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>	2025-08-27 15:27:38 +02:00
Rodrigo Arias Mallo	d948f8b752	Move StartLimit* options to unit section The StartLimitBurst and StartLimitIntervalSec options belong to the [Unit] section, otherwise they are ignored in [Service]: > Unknown key 'StartLimitIntervalSec' in section [Service], ignoring. When using [Unit], the limits are properly set: apex% systemctl show power-policy.service | grep StartLimit StartLimitIntervalUSec=10min StartLimitBurst=10 StartLimitAction=none Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-24 14:32:46 +02:00
Rodrigo Arias Mallo	8f7787e217	Set power policy to always turn on In all machines, as soon as we recover the power, turn the machine back on. We cannot rely on the previous state as we will shut them down before the power is cut to prevent damage on the power supply monitoring circuit. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es> Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-24 11:22:38 +02:00
Rodrigo Arias Mallo	30b9b23112	Add NixOS module to control power policy Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es> Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-24 11:22:36 +02:00
Rodrigo Arias Mallo	9a056737de	Move August shutdown to 3rd at 22h Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es> Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-24 11:22:33 +02:00
Rodrigo Arias Mallo	ac700d34a5	Disable automatic August shutdown for Fox The UPC has different dates for the yearly power cut, and Fox can recover properly from a power loss, so we don't need to have it turned off before the power cut. Simply disabling the timer is enough. Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es> Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-24 11:22:10 +02:00
Rodrigo Arias Mallo	9b681ab7ce	Add cudainfo program to test CUDA The cudainfo program checks that we can initialize the CUDA RT library and communicate with the driver. It can be used as standalone program or built with cudainfo.gpuCheck so it is executed inside the build sandbox to see if it also works fine. It uses the autoAddDriverRunpath hook to inject in the runpath the location of the library directory for CUDA libraries. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-23 11:52:09 +02:00
Rodrigo Arias Mallo	9ce394bffd	Add missing symlink in cuda sandbox Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-23 11:51:47 +02:00
Aleix Boné	8cd7b713ca	Enable cuda systemFeature in raccoon and fox This allows running derivations which depend on cuda runtime without breaking the sandbox. We only need to add `requiredSystemFeatures = [ "cuda" ];` to the derivation. Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2025-07-22 17:07:13 +02:00
Aleix Boné	8eed90d2bd	Move shared nvidia settings to a separate module Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2025-07-22 17:06:45 +02:00
Aleix Boné	aee54ef39f	Replace xeon07 by hut in ssh config The xeon07 machine has been renamed to hut. Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>	2025-07-21 18:10:08 +02:00
Rodrigo Arias Mallo	69f7ab701b	Enable automatic Nix GC in raccoon Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-21 17:58:26 +02:00
Rodrigo Arias Mallo	4c9bcebcdc	Select proprietary NVIDIA driver in raccoon The NVIDIA GTX 960 from 2016 has the Maxwell architecture, and NixOS suggests using the proprietary driver for older than Turing: > It is suggested to use the open source kernel modules on Turing or > later GPUs (RTX series, GTX 16xx), and the closed source modules > otherwise. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-21 17:58:21 +02:00
Rodrigo Arias Mallo	86e7c72b9b	Enable open source NVIDIA driver in fox It is recommended for newer versions. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-18 09:57:38 +02:00
Rodrigo Arias Mallo	a7dffc33b5	Remove option allowUnfree from fox and raccoon It is already set to true for all machines. Reviewed-by: Aleix Boné <abonerib@bsc.es>	2025-07-18 09:57:21 +02:00

327 Commits