17 Commits

Author SHA1 Message Date
084d556c56 Add AMD uProf package and driver
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-09-19 10:53:49 +02:00
3269d763aa Add cudainfo program to test CUDA
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as standalone program or
built with cudainfo.gpuCheck so it is executed inside the build sandbox
to see if it also works fine. It uses the autoAddDriverRunpath hook to
inject in the runpath the location of the library directory for CUDA
libraries.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-23 11:52:09 +02:00
0e37ab5fe1 Remove merged MPICH patch
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-16 13:07:12 +02:00
dd15f9c943 Add UPC temperature sensor monitoring
These sensors are part of their air quality measurements, which just
happen to be very close to our server room.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-29 13:01:37 +02:00
4048b3327a Add meteocat exporter
Allows us to track ambient temperature changes and estimate the
temperature delta between the server room and exterior temperature.
We should be able to predict when we would need to stop the machines due
to excesive temperature as summer approaches.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-29 13:01:29 +02:00
5487a93972 Reject SSH connections without SLURM allocation
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-04-08 17:15:15 +02:00
371b0c7e76 Fix MPICH build by fetching upstream patches too
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-16 15:43:13 +01:00
3863fc25a5 Add workaround for MPICH 4.2.0
See: https://github.com/pmodels/mpich/issues/6946

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:08 +02:00
2b26cd2f46 Fix SLURM bug in rank integer sign expansion
See: https://bugs.schedmd.com/show_bug.cgi?id=19324

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:05 +02:00
30f2079f0b Merge pmix outputs for MPICH
MPICH expects headers and libraries to be present in the same directory.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-04-25 13:25:03 +02:00
7afe7344ac Remove old Ceph package overlay
The Ceph package is now integrated in upstream nixpkgs.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-24 12:58:47 +01:00
0d9c99a24e BSC packages are no longer in bsc attribute
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-09 13:40:48 +01:00
9071a4de8b Add prometheus-slurm-exporter package 2023-09-21 21:34:18 +02:00
8dbee06d1d Revert "Update slurm to 23.02.05.1"
This reverts commit 7bfd786c01c36131cd00b90fc6a9503fd1226578.
2023-09-14 15:46:18 +02:00
7bfd786c01 Update slurm to 23.02.05.1 2023-09-13 17:44:24 +02:00
8912d2b9bc Update ceph to 18.2.0 in overlay 2023-08-25 18:20:21 +02:00
b4015ded86 Move pkgs overlay to overlay.nix 2023-08-25 18:12:00 +02:00