diff --git a/web/content/posts/2025-09-26/_index.md b/web/content/posts/2025-09-26/_index.md
new file mode 100644
index 00000000..61d05dac
--- /dev/null
+++ b/web/content/posts/2025-09-26/_index.md
@@ -0,0 +1,48 @@
+---
+title: "Update 2025-09-26"
+author: "Rodrigo Arias Mallo"
+date: 2025-09-26
+---
+
+This is a summary of notable changes introduced in the last two years. We
+continue to keep all machines updated to the latest NixOS release (currently
+NixOS 25.05).
+
+### New compute node: fox
+
+We have a new [fox machine](/fox), with two AMD Genoa 9684X CPUs and two NVIDIA
+RTX4000 GPUs. Over the last few months we have been running tests, and most of
+the components seem to work well. We have configured CUDA to use the NVIDIA
+GPUs, as well as AMD uProf to trace performance and energy counters from the
+CPUs.
+
+### Upgraded login node: apex
+
+We have upgraded the operating system on the login node to NixOS, which now
+runs Linux 6.15.6. During the upgrade, we detected a problem with the RAID
+controller that caused a catastrophic failure and prevented the BIOS from
+starting.
+
+The `/` and `/home` partitions sit on a RAID 5 array managed by a hardware
+RAID controller, but the controller failed to initialize properly before
+handing control over to the BIOS. After a long debugging session, we found
+that the flash memory that stores the firmware of the controller was the
+likely culprit, as
+[memory cells](https://en.wikipedia.org/wiki/Flash_memory#Principles_of_operation)
+may lose charge over time and can end up corrupting their content. We flashed
+the latest firmware, which recharged the memory cells with the new bits, and
+that fixed the problem. Hopefully we will be able to use it for a few more years.
+
+The SLURM server has been moved to apex, so you can now allocate your jobs from
+there, including jobs on the new fox machine.
+
+### Machines moved to the BSC building
+
+The server room had a temperature issue that had been affecting our machines
+since the end of February 2025. As the summer approached, the temperature
+exceeded the safe limits for our hardware, so we had to shut down the cluster.
+
+![Room temperature](temp.png)
+
+Since then, we have moved the cluster to the BSC premises, where it now rests at
+a stable temperature, so hopefully we won't have more unscheduled downtime.
diff --git a/web/content/posts/2025-09-26/temp.png b/web/content/posts/2025-09-26/temp.png
new file mode 100644
index 00000000..af28d6da
Binary files a/web/content/posts/2025-09-26/temp.png and b/web/content/posts/2025-09-26/temp.png differ
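
The post above notes that jobs can now be allocated from apex, including on the
new fox machine. A minimal sketch of what that could look like from a shell on
apex, assuming fox is exposed under that node name in SLURM (the exact node and
partition names are assumptions; check `sinfo` for the real ones):

```bash
# List the partitions and nodes known to the SLURM controller on apex.
sinfo

# Request an interactive allocation pinned to the fox node:
# -N 1 asks for one node, -w/--nodelist restricts it to the node named "fox"
# (the node name is an assumption; use whatever sinfo reports).
salloc -N 1 -w fox

# Or run a single command there non-interactively, e.g. to check the GPUs.
srun -N 1 -w fox nvidia-smi
```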