diff --git a/web/content/posts/2025-09-26/_index.md b/web/content/posts/2025-09-26/_index.md
new file mode 100644
index 0000000..935fcba
--- /dev/null
+++ b/web/content/posts/2025-09-26/_index.md
@@ -0,0 +1,49 @@
+---
+title: "Update 2025-09-26"
+author: "Rodrigo Arias Mallo"
+date: 2025-09-26
+---
+
+This is a summary of notable changes introduced in the last two years. We
+continue to keep all machines updated to the latest NixOS release (currently
+NixOS 25.05).
+
+### New compute node: fox
+
+We have a new [fox machine](/fox), with two AMD Genoa 9684X CPUs and two NVIDIA
+RTX4000 GPUs. Over the last few months we have run several tests, and most of
+the components appear to work well. We have configured CUDA to use the NVIDIA
+GPUs, as well as AMD uProf to trace performance and energy counters from the
+CPUs.
+
+### Upgraded login node: apex
+
+We have upgraded the operating system on the login node to NixOS, which now runs
+Linux 6.15.6. During the upgrade, we detected a problem with the storage
+disks. The `/` and `/home` partitions sit on a
+[RAID 5](https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5),
+transparently handled by a hardware RAID controller, which starts its own
+firmware before passing control to the BIOS to continue the boot sequence. A
+problem during the startup of this firmware prevented the node from even
+reaching the BIOS screen.
+
+After a long debugging session, we determined that the flash memory storing
+the firmware of the hardware controller was the likely culprit, since
+[memory cells](https://en.wikipedia.org/wiki/Flash_memory#Principles_of_operation)
+may lose charge over time, eventually corrupting the content. We flashed the
+latest firmware so the memory cells are charged again with the new bits, which
+fixed the problem. Hopefully we will be able to use it for some more years.
+
+The SLURM server has been moved to apex, which allows users to also submit jobs
+to fox.
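+
+As a quick sketch, a job could be submitted to fox with a batch script along
+these lines, assuming fox is exposed as a Slurm partition named `fox` (the
+partition and job names here are illustrative assumptions):
+
+```sh
+#!/bin/bash
+#SBATCH --job-name=example   # hypothetical job name
+#SBATCH --partition=fox      # assumed name of the fox partition
+#SBATCH --ntasks=1
+
+# Run one task on the allocated node.
+srun hostname
+```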
+
+### Migrated machines to BSC building
+
+The server room had a temperature problem that had been affecting our machines
+since the end of February 2025. As the summer approached, the temperature
+exceeded the safe limits for our hardware, so we had to shut down the cluster.
+
+![Room temperature](temp.png)
+
+Since then, we have moved the cluster to BSC premises, where it now rests at a
+stable temperature, so hopefully we won't have more unscheduled downtime.
diff --git a/web/content/posts/2025-09-26/temp.png b/web/content/posts/2025-09-26/temp.png
new file mode 100644
index 0000000..af28d6d
Binary files /dev/null and b/web/content/posts/2025-09-26/temp.png differ