Add web post update for 2025

2025-09-26 14:53:37 +02:00
parent 9c3fbc0ec9
commit e8eb47c9b8
2 changed files with 48 additions and 0 deletions

@@ -0,0 +1,48 @@
---
title: "Update 2025-09-26"
author: "Rodrigo Arias Mallo"
date: 2025-09-26
---

This is a summary of notable changes introduced in the last two years. We
continue to keep all machines updated to the latest NixOS release (currently
NixOS 25.05).

### New compute node: fox

We have a new [fox machine](/fox), with two AMD Genoa 9684X CPUs and two NVIDIA
RTX4000 GPUs. Over the last few months we have been running tests, and most of
the components seem to work well. We have configured CUDA to use the NVIDIA
GPUs, as well as AMD uProf to trace performance and energy counters from the
CPUs.

### Upgraded login node: apex

We have upgraded the operating system on the login node to NixOS, which now runs
Linux 6.15.6. During the upgrade, we detected a problem with the RAID
controller that caused a catastrophic failure and prevented the BIOS from
starting.

The `/` and `/home` partitions sit on a RAID 5 managed by a hardware RAID
controller; however, the controller was unable to boot properly before handing
control over to the BIOS. After a long debugging session, we determined that
the flash memory that stores the controller firmware was likely the issue, as
[memory cells](https://en.wikipedia.org/wiki/Flash_memory#Principles_of_operation)
may lose charge over time and can end up corrupting the content. We flashed
the latest firmware so that the memory cells are charged again with the new
bits, and that fixed the problem. Hopefully we will be able to use it for some
more years.

The SLURM server has been moved to apex, so now you can allocate your jobs from
there, including on the new fox machine.

### Moved machines to BSC building

The server room had a temperature issue that had affected our machines since
the end of February 2025. As summer approached, the temperature exceeded the
safe limits for our hardware, so we had to shut down the cluster.

![Room temperature](temp.png)

Since then, we have moved the cluster to BSC premises, where it now rests at a
stable temperature, so hopefully we won't have more unscheduled downtime.

Binary file not shown.
