A failure to reach the control node can cause slurmd to exit, leaving
the unit in the failed state until it is manually restarted. Instead,
have systemd retry the service every 30 seconds, indefinitely:
owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
Restart=on-failure
RestartUSec=30s
owl1% pgrep slurmd
5903
owl1% sudo kill -SEGV 5903
owl1% pgrep slurmd
6137
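This behavior corresponds to a drop-in override along these lines (a
sketch; the override path assumes the stock unit is named
slurmd.service):

# /etc/systemd/system/slurmd.service.d/override.conf
[Service]
Restart=on-failure
RestartSec=30

With 30 seconds between attempts, systemd's default start rate limit
(5 starts per 10 seconds) is never reached, so no StartLimit*
override should be needed; run systemctl daemon-reload for the
drop-in to take effect.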
Fixes: #177
Reviewed-by: Aleix Boné <abonerib@bsc.es>
The current select mechanism also treated memory as a consumable
resource, and by default a node advertises only 1 MiB of it. Since
each job already requests 1 MiB, a single running job prevents any
other job from starting on the node.
As we are not really concerned with memory usage, use only the unused
cores in the select criteria.
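In slurm.conf this amounts to dropping memory from
SelectTypeParameters (a sketch; select/cons_tres and the previous
CR_Core_Memory value are assumptions about the site configuration):

# slurm.conf
SelectType=select/cons_tres
SelectTypeParameters=CR_Core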
Reviewed-by: Aleix Boné <abonerib@bsc.es>