We have a new [fox machine](/fox), with two AMD Genoa 9684X CPUs and two NVIDIA
RTX4000 GPUs. During the last months we have run some tests and most of the
components seem to work well. We have configured CUDA to use the NVIDIA GPUs,
as well as AMD uProf to trace performance and energy counters from the CPUs.

### Upgraded login node: apex

We have upgraded the operating system on the login node to NixOS, which now
runs Linux 6.15.6. During the upgrade, we detected a problem with the storage
disks. The `/` and `/home` partitions sit on a
[RAID 5](https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5),
transparently handled by a hardware RAID controller which starts its own
firmware before passing control to the BIOS to continue the boot sequence. A
problem during the startup of the firmware prevented the node from even
reaching the BIOS screen.

After a long debugging session, we found that the flash memory that stores the
firmware of the hardware controller was likely the issue, since
[memory cells](https://en.wikipedia.org/wiki/Flash_memory#Principles_of_operation)
may lose charge over time and can end up corrupting the content. We flashed
the latest firmware so the memory cells are charged again with the new bits,
and that fixed the problem. Hopefully we will be able to use it for some more
years.

The SLURM server has been moved to apex, which allows users to also submit
jobs to fox.
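As a sketch of what a job submission could look like from apex, here is a
minimal batch script; the partition name `fox` and the GPU `gres` value are
assumptions for illustration, not the cluster's actual configuration (check
`sinfo` for the real partition names):

```shell
#!/bin/bash
# Hypothetical SLURM batch script submitted from the apex login node.
# Partition and gres names below are assumed, not confirmed.
#SBATCH --job-name=gpu-test
#SBATCH --partition=fox        # assumed partition name for the fox machine
#SBATCH --gres=gpu:1           # request one GPU
#SBATCH --time=00:10:00

# Show which GPU the job was allocated.
nvidia-smi
```

It would be submitted with `sbatch job.sh`, and `squeue -u $USER` shows its
state while it waits or runs.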

### Migrated machines to BSC building

The server room had a temperature issue that had been affecting our machines
since the end of February 2025. As the summer approached, the temperature
exceeded the safe limits for our hardware, so we had to shut down the cluster.

![Fox](fox-drawing.png)

Since then, we have moved the cluster to BSC premises, where it now rests at a
stable temperature, so hopefully we won't have more unscheduled downtime.