23 Commits

Author SHA1 Message Date
6f958c14cd Add AMD uProf package and driver
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-06-20 14:55:43 +02:00
8bb09dd061 Add cudainfo program to test CUDA
The cudainfo program checks that we can initialize the CUDA RT library
and communicate with the driver. It can be used as a standalone program
or built with cudainfo.gpuCheck so that it also runs inside the build
sandbox, to check that it works there too. It uses the
autoAddDriverRunpath hook to add the directory that holds the CUDA
libraries to the runpath.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-07-22 15:24:55 +02:00
c6cc2a7638 Remove merged MPICH patch
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2025-07-15 17:57:22 +02:00
bbf09ab960 Add UPC temperature sensor monitoring
These sensors are part of UPC's air quality measurements and happen to
be very close to our server room.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-26 11:24:12 +02:00
3b5781ba63 Add meteocat exporter
Allows us to track ambient temperature changes and estimate the
temperature delta between the server room and the exterior.
We should be able to predict when we would need to stop the machines due
to excessive temperature as summer approaches.

Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-05-23 15:40:09 +02:00
8ff54219f6 Reject SSH connections without SLURM allocation
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-02-13 14:47:38 +01:00
9b183c4202 Fix MPICH build by fetching upstream patches too
Reviewed-by: Aleix Boné <abonerib@bsc.es>
2025-01-15 13:16:10 +01:00
d93fea8288 Add workaround for MPICH 4.2.0
See: https://github.com/pmodels/mpich/issues/6946

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-15 21:39:43 +01:00
5f69d51134 Fix SLURM bug in rank integer sign expansion
See: https://bugs.schedmd.com/show_bug.cgi?id=19324

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-15 13:12:46 +01:00
a2ec4546df Merge pmix outputs for MPICH
MPICH expects headers and libraries to be present in the same directory.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2024-03-14 16:59:11 +01:00
3d67c17cac Fix warning in slurm exporter using vendorHash
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-20 12:40:24 +01:00
ea2eeff5f9 Remove old Ceph package overlay
The Ceph package is now integrated in upstream nixpkgs.

Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-07 00:02:26 +01:00
2acfd589d4 BSC packages are no longer in bsc attribute
Reviewed-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
2023-11-06 23:03:56 +01:00
7b686d0ea4 Add prometheus-slurm-exporter package
2023-09-21 21:34:18 +02:00
010491618e Revert "Update slurm to 23.02.05.1"
This reverts commit aaefddc44a9073166ac52b8bd56ac96258d3b053.
2023-09-14 15:46:18 +02:00
772e0f00fb Update slurm to 23.02.05.1
2023-09-13 17:44:24 +02:00
3c523572cb Update ceph to 18.2.0 in overlay
2023-08-25 18:12:46 +02:00
7cd15b9732 Move pkgs overlay to overlay.nix
2023-08-25 18:12:00 +02:00
197c93a2be Set mpi to mpich by default in bscpkgs
2023-06-16 16:05:17 +02:00
d9002dd028 Add missing parameter to extend
2023-06-16 16:04:36 +02:00
60ee744a54 Use explicit order in overlays
2023-06-16 16:02:25 +02:00
cd1fde4760 Replace mpi inside bsc attribute
2023-06-16 15:54:55 +02:00
3985e66fa4 Add mpich overlay
2023-06-16 14:16:51 +02:00