51 Commits

Author SHA1 Message Date
ef914953d4 Restart slurmd on failure
A failure to reach the control node can cause slurmd to fail and the
unit remains in the failed state until it is manually restarted. Instead,
try to restart the service every 30 seconds, forever:

    owl1% systemctl show slurmd | grep -E 'Restart=|RestartUSec='
    Restart=on-failure
    RestartUSec=30s
    owl1% pgrep slurmd
    5903
    owl1% sudo kill -SEGV 5903
    owl1% pgrep slurmd
    6137
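
A minimal sketch of settings that would produce the values shown above,
assuming they are applied through the standard NixOS systemd options (the
actual change in this commit may be written differently):

```nix
{
  # Assumption: merged into the unit that the slurm module already defines.
  systemd.services.slurmd.serviceConfig = {
    Restart = "on-failure"; # reported as Restart=on-failure
    RestartSec = 30;        # in seconds; reported as RestartUSec=30s
  };
}
```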

Fixes: #177
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-29 19:17:33 +02:00
98abb3edf2 Lower connect timeout when using hut substituter
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-09-29 09:41:34 +02:00
e4c0f95906 Remove extra SSH jump configuration
We now have direct visibility among nodes so we don't need any extra
SSH configuration to reach them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-25 15:15:43 +02:00
87bae5b9df Add AMD uProf module and enable it in fox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-20 15:51:46 +02:00
7d4ebd8495 Use 10.106.0.0/24 subnet to avoid collisions
The byte 106 is the ASCII code for 'j' (jungle):

	% printf j | od -t d
	0000000         106
	0000001

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-09-03 11:12:25 +02:00
02e2470c1a Accept fox connection to slurm controller
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:46:24 +02:00
3f67bc4a2e Add fox machine to SLURM
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-08-29 14:40:43 +02:00
ddfb26be5a Use writeShellScript for suspend.sh and resume.sh
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-29 12:02:12 +02:00
1b21a398a8 Add firewall rules to slurm server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:59:21 +02:00
4d16e794cd Remove hut from slurm
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:43:12 +02:00
0cc76fc98d Split slurm configuration for client and server
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 12:36:52 +02:00
70da186d15 Move slurm control server to apex
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-08-27 11:56:20 +02:00
d1f58a62f5 Move StartLimit* options to unit section
The StartLimitBurst and StartLimitIntervalSec options belong to the
[Unit] section, otherwise they are ignored in [Service]:

> Unknown key 'StartLimitIntervalSec' in section [Service], ignoring.

When using [Unit], the limits are properly set:

  apex% systemctl show power-policy.service | grep StartLimit
  StartLimitIntervalUSec=10min
  StartLimitBurst=10
  StartLimitAction=none
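
In NixOS terms this could look like the sketch below. The unit name and both
values come from the output above; placing them under unitConfig is an
assumption about how the module is written:

```nix
{
  systemd.services.power-policy.unitConfig = {
    # Both keys are rendered into the [Unit] section of the generated unit.
    StartLimitBurst = 10;
    StartLimitIntervalSec = "10min";
  };
}
```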

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-24 12:21:05 +02:00
3d7e8b8a07 Add NixOS module to control power policy
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 14:07:06 +02:00
8bb09dd061 Add cudainfo program to test CUDA
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as standalone program or
built with cudainfo.gpuCheck so it is executed inside the build sandbox
to see if it also works there. It uses the autoAddDriverRunpath hook to
inject the location of the CUDA library directory into the runpath.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-22 15:24:55 +02:00
f686797234 Add missing symlink in cuda sandbox
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-21 17:19:25 +02:00
6411a94f77 Enable cuda systemFeature in raccoon and fox
This allows running derivations which depend on cuda runtime without
breaking the sandbox. We only need to add `requiredSystemFeatures = [ "cuda" ];`
to the derivation.
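
As an illustration only, a derivation requesting the feature might look like
this. The derivation name and the build script are hypothetical placeholders;
only the requiredSystemFeatures attribute comes from the text above:

```nix
{ runCommand }:

# Hypothetical example derivation, meant to be built with callPackage.
runCommand "cuda-feature-demo" {
  # Only builders that advertise the "cuda" system feature will build this.
  requiredSystemFeatures = [ "cuda" ];
} ''
  # Placeholder build step; a real check would run a CUDA program here.
  echo ok > $out
''
```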

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-18 11:34:28 +02:00
7b61cfbe54 Move shared nvidia settings to a separate module
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-18 11:31:59 +02:00
4e1fd7b0e0 Replace xeon07 by hut in ssh config
The xeon07 machine has been renamed to hut.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-07-18 10:59:39 +02:00
904bb5f2ba Restrict DAC VPN to fox-ipmi machine only
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-20 14:47:55 +02:00
23310cbfa9 Add OpenVPN service to connect to fox BMC
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-17 14:29:15 +02:00
02da9f1847 Move nix-daemon exporter to modules
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-13 15:09:54 +02:00
996602845c Add p service for pastes
Reviewed-by: Aleix Boné <abonerib@bsc.es>
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-06-13 12:53:58 +02:00
70eecd1e39 Add hut SSH configuration from outside SSF LAN
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-03 17:17:29 +02:00
08953f64fb Use extra- for substituters and trusted-public-keys
From the nix manual:

> A configuration setting usually overrides any previous value. However,
> for settings that take a list of items, you can prefix the name of the
> setting by extra- to append to the previous value.
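
A sketch of what the resulting NixOS configuration could look like. The cache
URL and the public key are hypothetical placeholders, not the values used in
this repository:

```nix
{
  nix.settings = {
    # Appended to the existing lists instead of replacing them.
    extra-substituters = [ "https://cache.example.org" ];
    extra-trusted-public-keys = [ "cache.example.org-1:<base64 public key>" ];
  };
}
```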

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-06-03 17:59:17 +02:00
b386d30380 Remove fox from SLURM
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-26 11:43:16 +02:00
bbf09ab960 Add UPC temperature sensor monitoring
These sensors are part of their air quality measurements, which just
happen to be very close to our server room.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-26 11:24:12 +02:00
3b5781ba63 Add meteocat exporter
Allows us to track ambient temperature changes and estimate the
temperature delta between the server room and exterior temperature.
We should be able to predict when we would need to stop the machines due
to excessive temperature as summer approaches.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-23 15:40:09 +02:00
4ed53d4384 Use hut nix cache in owl1, owl2 and raccoon
The owl1 and owl2 nodes connect directly to hut via LAN with HTTP, while
raccoon goes through the proxy using jungle.bsc.es with HTTPS. There is no
risk of tampering as packages are signed.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2025-02-26 16:03:26 +01:00
db04825a11 Remove SLURM partition all
We no longer have homogeneous nodes so it doesn't make much sense to
allocate a mix of them.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-07 16:17:32 +02:00
5683fe5be1 Adjust fox slurm config after disabling SMT
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-03-28 11:04:19 +01:00
8ff54219f6 Reject SSH connections without SLURM allocation
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-13 14:47:38 +01:00
b046baee48 Exclude fox from being suspended by slurm
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-12 15:02:18 +01:00
a0eae1feea Add new fox machine
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-11 12:56:30 +01:00
3c1be2d4b4 Emulate other architectures in owl nodes too
Allows cross-compilation of packages for RISC-V that are known to try to
run RISC-V programs on the host.
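
For reference, emulation of this kind is usually enabled with the NixOS
binfmt option. The sketch below assumes only RISC-V is listed; the real list
in the owl configuration may contain more systems:

```nix
{
  # Registers a binfmt_misc handler so riscv64 binaries run through emulation.
  boot.binfmt.emulatedSystems = [ "riscv64-linux" ];
}
```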

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-19 17:53:10 +02:00
be802804d1 Set default SLURM job time limit to one hour
Prevents endless jobs from being left forever, while allowing users to
request a larger time limit.
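
One way to express this, assuming the limit is set per partition with the
Slurm DefaultTime parameter. The partition name and node list here are
hypothetical:

```nix
{
  services.slurm.partitionName = [
    # DefaultTime applies when a job does not request a time limit itself.
    "owl Nodes=owl[1-2] Default=YES DefaultTime=01:00:00 State=UP"
  ];
}
```

Users can still ask for more with the --time option of sbatch or srun, up to
the MaxTime of the partition.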

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-18 11:44:01 +02:00
e1967ccda6 Allow other jobs to run in unused cores
The current select mechanism was also using memory as a consumable
resource, which by default is only set to 1 MiB per node. As each job
already requests 1 MiB, it prevents other jobs from running.

As we are not really concerned with memory usage, we only use the unused
cores in the select criteria.
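
A sketch of a core-only selection policy. The choice of the cons_tres plugin
is an assumption; the commit only states that memory is no longer part of the
select criteria:

```nix
{
  services.slurm.extraConfig = ''
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core
  '';
}
```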

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2024-07-18 11:19:03 +02:00
cd3284d1b2 Split xeon specific configuration from base
To accommodate the raccoon knights workstation, some of the configuration
pulled by m/common/main.nix has to be removed. To solve it, the xeon
specific parts are placed into m/common/xeon.nix and only the common
configuration is at m/common/base.nix.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-06-03 09:20:11 +02:00
91a42375e3 Control user access to each machine
The users.jungleUsers configuration option behaves like the users.users
option, but defines the list attribute `hosts` for each user, which
filters users so that each user can only access those hosts.
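
A usage sketch. The user name and the host list are hypothetical; the other
attributes are the usual users.users ones:

```nix
{
  users.jungleUsers.alice = {
    isNormalUser = true;
    home = "/home/alice";
    # The account is only created on the machines listed here.
    hosts = [ "hut" "owl1" "owl2" ];
  };
}
```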

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-06-06 14:06:33 +02:00
b5da1c6521 Remove nixseparatedebuginfod input
It has been integrated in nixpkgs, so it is no longer required.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-14 16:44:21 +01:00
df5a5e1668 Move slurm client in a separate module
Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2024-02-09 11:14:34 +01:00
1c6e5d8f82 Enable nixseparatedebuginfod module
The module is only enabled on Hut and Eudy because we noticed activity
on the debuginfod service even if no debug session was active.

Reviewed-by: Rodrigo Arias Mallo <rodrigo.arias@bsc.es>
2023-12-01 19:57:04 +01:00
4d833d2088 Remove complete ceph package from hut
Only the ceph-client is needed.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-20 12:57:31 +01:00
24c05e5ebf Remove user/group when using DynamicUsers
2023-09-22 10:13:06 +02:00
7aef154dd4 Set the SLURM_CONF variable
2023-09-21 22:18:30 +02:00
4ca4e0fae9 Enable slurm-exporter service
2023-09-21 21:38:34 +02:00
d4c803dbfb Mount the hut nix store for SLURM jobs
2023-09-20 18:26:48 +02:00
722c0b0eaa Open ports in firewall of compute nodes
2023-09-14 15:45:43 +02:00
ae4ad95902 Add agenix to all nodes
2023-09-04 22:09:40 +02:00
3cc7b33c5a Add agenix module to ceph
2023-09-04 22:06:20 +02:00