12 Commits

Author SHA1 Message Date
f69629d2da Restart slurmd on failure
A failure to reach the control node can cause slurmd to fail and the
unit remains in the failed state until is manually restarted. Instead,
try to restart the service every 30 seconds, forever:

    owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
    Restart=on-failure
    RestartUSec=30s
    owl1% pgrep slurmd
    5903
    owl1% sudo kill -SEGV 5903
    owl1% pgrep slurmd
    6137

Fixes: #177
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-30 17:20:39 +02:00
f569933732 Split slurm configuration for client and server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-29 12:35:20 +02:00
ee895d2e4f Move slurm control server to apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-29 12:35:16 +02:00
4528b7c2a6 Remove fox from SLURM
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-02 11:26:20 +02:00
2c7211ffa3 Remove SLURM partition all
We no longer have homogeneous nodes so it doesn't make much sense to
allocate a mix of them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:27 +02:00
7c55d10ceb Adjust fox slurm config after disabling SMT
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:23 +02:00
5487a93972 Reject SSH connections without SLURM allocation
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:15 +02:00
f887dacdea Exclude fox from being suspended by slurm
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:04 +02:00
2b04812320 Add new fox machine
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:14:42 +02:00
f158cb63e8 Set default SLURM job time limit to one hour
Prevents enless jobs from being left forever, while allow users to
request a larger time limit.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:24 +02:00
8860f76cad Allow other jobs to run in unused cores
The current select mechanism was using the memory too as a consumable
resource, which by default only sets 1 MiB per node. As each job already
requests 1 MiB, it prevents other jobs from running.

As we are not really concerned with memory usage, we only use the unused
cores in the select criteria.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-09-12 08:36:22 +02:00
7f17fe8874 Move slurm client in a separate module
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-02-13 11:11:17 +01:00