12 Commits

Author SHA1 Message Date
ef914953d4 Restart slurmd on failure
A failure to reach the control node can cause slurmd to fail and the
unit remains in the failed state until it is manually restarted. Instead,
try to restart the service every 30 seconds, forever:

    owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
    Restart=on-failure
    RestartUSec=30s
    owl1% pgrep slurmd
    5903
    owl1% sudo kill -SEGV 5903
    owl1% pgrep slurmd
    6137
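
For reference, the properties shown above would result from unit
settings like the following. This is only a sketch in plain systemd
syntax, assuming a drop-in for the slurmd unit; the actual change may
set them through a different mechanism:

    # Assumed drop-in for slurmd.service
    [Service]
    Restart=on-failure
    RestartSec=30s

With a 30 second delay between attempts, the default systemd start rate
limit (5 starts within 10 seconds) should never be reached, so the
restarts can continue indefinitely.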

Fixes: #177
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-29 19:17:33 +02:00
0cc76fc98d Split slurm configuration for client and server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:36:52 +02:00
70da186d15 Move slurm control server to apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 11:56:20 +02:00
b386d30380 Remove fox from SLURM
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-26 11:43:16 +02:00
db04825a11 Remove SLURM partition all
We no longer have homogeneous nodes so it doesn't make much sense to
allocate a mix of them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-07 16:17:32 +02:00
5683fe5be1 Adjust fox slurm config after disabling SMT
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-03-28 11:04:19 +01:00
8ff54219f6 Reject SSH connections without SLURM allocation
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-13 14:47:38 +01:00
b046baee48 Exclude fox from being suspended by slurm
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-12 15:02:18 +01:00
a0eae1feea Add new fox machine
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-11 12:56:30 +01:00
be802804d1 Set default SLURM job time limit to one hour
Prevents endless jobs from being left forever, while allowing users to
request a larger time limit.
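
As an illustration only, in plain slurm.conf syntax this corresponds to
setting DefaultTime on a partition. The partition and node names below
are hypothetical:

    # Hypothetical partition; DefaultTime applies when a job sets no --time
    PartitionName=main Nodes=node[1-2] Default=YES DefaultTime=01:00:00

A user can still request more, for example with
sbatch --time=04:00:00 job.sh, up to the partition MaxTime if one is set.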

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-18 11:44:01 +02:00
e1967ccda6 Allow other jobs to run in unused cores
The current select mechanism was also treating memory as a consumable
resource, which by default only sets 1 MiB per node. As each job already
requests 1 MiB, it prevents other jobs from running.

As we are not really concerned with memory usage, we only use the unused
cores in the select criteria.
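
In plain slurm.conf terms, the change amounts to something like the
following sketch. The select plugin shown is an assumption; only the
switch from cores plus memory to cores only comes from the description
above:

    # Before (assumed): cores and memory are both consumable resources
    #SelectTypeParameters=CR_Core_Memory
    # After: only cores are consumable
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core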

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-18 11:19:03 +02:00
df5a5e1668 Move slurm client in a separate module
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-02-09 11:14:34 +01:00