Add web post update for 2025
Reviewed-by: Aleix Boné <abonerib@bsc.es>
parent 9c3fbc0ec9
commit c441178910
49
web/content/posts/2025-09-26/_index.md
Normal file
@@ -0,0 +1,49 @@
---
title: "Update 2025-09-26"
author: "Rodrigo Arias Mallo"
date: 2025-09-26
---

This is a summary of notable changes introduced in the last two years. We
continue to keep all machines updated to the latest NixOS release (currently
NixOS 25.05).

### New compute node: fox

We have a new [fox machine](/fox), with two AMD Genoa 9684X CPUs and two NVIDIA
RTX4000 GPUs. Over the last few months we have been running tests, and most of
the components seem to work well. We have configured CUDA to use the NVIDIA
GPUs, as well as AMD uProf to trace performance and energy counters from the
CPUs.

### Upgraded login node: apex

We have upgraded the operating system on the login node to NixOS, which now runs
Linux 6.15.6. During the upgrade, we detected a problem with the storage
disks. The `/` and `/home` partitions sit on a
[RAID 5](https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5) array,
transparently handled by a hardware RAID controller that starts its own
firmware before passing control to the BIOS to continue the boot sequence. A
problem during the startup of the firmware prevented the node from even reaching
the BIOS screen.

After a long debugging session, we concluded that the flash memory that stores
the firmware of the hardware controller was the likely cause, since
[memory cells](https://en.wikipedia.org/wiki/Flash_memory#Principles_of_operation)
may lose charge over time and end up corrupting the content. We flashed
the latest firmware so the memory cells were charged again with the new bits,
and that fixed the problem. Hopefully we will be able to use it for some more years.

The SLURM server has been moved to apex, which allows users to submit jobs to
fox as well.

### Migrated machines to BSC building

The server room had a temperature issue that had been affecting our machines
since the end of February 2025. As the summer approached, the temperature
exceeded the safe limits for our hardware, so we had to shut down the cluster.



Since then, we have moved the cluster to BSC premises, where it now rests at a
stable temperature, so hopefully we won't have any more unscheduled downtime.
BIN
web/content/posts/2025-09-26/temp.png
Normal file
Binary file not shown.
Size: 97 KiB