The owl nodes are currently located at the top of the rack, where heat from
the whole rack accumulates, so turning them off causes a large temperature
increase in that region. To maximize airflow we will leave them on at all
times. This also makes allocations immediate, at an extra cost of around
200 W.
In the future, if we include more nodes in SLURM, we can configure those
to turn off if needed.
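If that happens, a slurm.conf snippet along these lines could exclude the owl
nodes from power saving (a sketch only; the node range and the use of
services.slurm.extraConfig are assumptions, not the current config):

services.slurm.extraConfig = ''
  # Power off nodes after 30 minutes idle, but keep the owl nodes always
  # on so allocations there stay immediate.
  SuspendTime=1800
  SuspendExcNodes=owl[1-2]
'';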
Fixes: rarias/jungle#156
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Allows any user to send mail from the robot account, as long as they are
added to the mail-robot group.
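A minimal sketch of the idea in the Nix config (only the group name comes
from this change; the credentials path and the use of tmpfiles are
assumptions):

users.groups.mail-robot = { };

# Hypothetical: members of mail-robot can read the robot credentials and
# therefore authenticate as the robot account when sending mail.
systemd.tmpfiles.rules = [
  "z /etc/mail-robot/password 0440 root mail-robot -"
];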
Reviewed-by: Aleix Boné <abonerib@bsc.es>
By default, salloc opens a new shell on the *current* node instead of on
the allocated node. This often causes users to leave the extra shell running
after the allocation ends; repeating this several times creates chains of
nested shells.
By running the shell on the allocated node, the shell finishes as soon as
the allocation ends.
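Per the FAQ linked below, this behavior can be enabled with the
use_interactive_step launch parameter; a sketch of how it could look in the
Nix config (the option path is an assumption):

services.slurm.extraConfig = ''
  # Launch the interactive step on the first allocated node, so salloc
  # gives the user a shell there instead of on the submission node.
  LaunchParameters=use_interactive_step
'';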
Fixes: rarias/jungle#174
See: https://slurm.schedmd.com/faq.html#prompt
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Avoids adding an extra flake input only to fetch a single module and
package.
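One possible pattern, sketched hypothetically (URL, path and hash are
placeholders; not necessarily what this change does), is to fetch the source
directly where it is needed:

let
  src = builtins.fetchTarball {
    url = "https://example.org/repo.tar.gz";  # placeholder
    sha256 = "<hash>";                        # placeholder
  };
in {
  imports = [ (src + "/module.nix") ];
}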
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Tested-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
A failure to reach the control node can cause slurmd to fail, and the unit
then remains in the failed state until it is manually restarted. Instead,
try to restart the service every 30 seconds, forever:
owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
Restart=on-failure
RestartUSec=30s
owl1% pgrep slurmd
5903
owl1% sudo kill -SEGV 5903
owl1% pgrep slurmd
6137
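The same settings expressed in the NixOS module system (the exact placement
in our config is an assumption; the values match the output above):

systemd.services.slurmd.serviceConfig = {
  Restart = "on-failure";
  # With 30 s between attempts the default start rate limit (5 starts per
  # 10 s) is never reached, so the unit keeps retrying indefinitely.
  RestartSec = "30s";
};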
Fixes: rarias/jungle#177
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Allows direct contact via the VPN when accessing from fox, but uses the
Internet when accessing from the rest of the machines.
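A hypothetical sketch of the idea (the target host and address are
placeholders; only the behavior is taken from this change):

# Only fox resolves the host to its VPN address; the rest of the machines
# keep resolving it publicly, so traffic goes over the Internet.
networking.hosts = lib.mkIf (config.networking.hostName == "fox") {
  "10.0.40.1" = [ "example-host" ];
};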
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
The StartLimitBurst and StartLimitIntervalSec options belong in the [Unit]
section; when placed in [Service] they are ignored:
> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.
When using [Unit], the limits are properly set:
apex% systemctl show power-policy.service | grep StartLimit
StartLimitIntervalUSec=10min
StartLimitBurst=10
StartLimitAction=none
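In the NixOS module system this corresponds to unitConfig rather than
serviceConfig (a sketch, assuming the unit is declared this way):

systemd.services.power-policy = {
  # unitConfig lands in [Unit]; putting these keys in serviceConfig
  # ([Service]) gets them ignored, as the warning above shows.
  unitConfig = {
    StartLimitIntervalSec = "10min";
    StartLimitBurst = 10;
  };
};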
Reviewed-by: Aleix Boné <abonerib@bsc.es>
On all machines, turn the machine back on as soon as power is restored. We
cannot rely on restoring the previous state, as we will shut the machines
down before the power is cut to prevent damage to the power supply
monitoring circuit.
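One way to enforce this is through the BMC power restore policy; a hedged
sketch with ipmitool (whether it is actually applied from the power-policy
unit or by other means is an assumption):

systemd.services.power-policy = {
  description = "Set the BMC power restore policy to always-on";
  wantedBy = [ "multi-user.target" ];
  serviceConfig.Type = "oneshot";
  path = [ pkgs.ipmitool ];
  # Power the machine back on as soon as AC power returns, regardless of
  # the state it was in when the power was lost.
  script = "ipmitool chassis policy always-on";
};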
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>