Add web post update for 2025 #176

Manually merged
rarias merged 1 commits from post-update-2025 into master 2025-09-29 18:04:37 +02:00
2 changed files with 49 additions and 0 deletions
Showing only changes of commit c441178910 - Show all commits

@@ -0,0 +1,49 @@
---
title: "Update 2025-09-26"
author: "Rodrigo Arias Mallo"
date: 2025-09-26
---
This is a summary of notable changes introduced in the last two years. We
continue to keep all machines updated to the latest NixOS release (currently
NixOS 25.05).
### New compute node: fox
We have a new [fox machine](/fox), with two AMD Genoa 9684X CPUs and two NVIDIA
RTX4000 GPUs. Over the last few months we have been running tests, and most of
the components seem to work well. We have configured CUDA to use the NVIDIA
GPUs, as well as AMD uProf to trace performance and energy counters from the
rarias marked this conversation as resolved Outdated

I feel like a comma after GPUs would make things more clear.
CPUs.
### Upgraded login node: apex
We have upgraded the operating system on the login node to NixOS, which now runs
Linux 6.15.6. During the upgrade, we detected a problem with the storage
disks. The `/` and `/home` partitions sit on a
rarias marked this conversation as resolved Outdated

the second `that` could be changed to `which` to avoid repetition
[RAID 5](https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5),
handled transparently by a hardware RAID controller, which starts its own
firmware before passing control to the BIOS to continue the boot sequence. A
rarias marked this conversation as resolved Outdated

We are still talking about the RAID controller, so splitting the paragraph is a bit confusing (Unless we change the section header to problems with the raid controller).

I rewrote it to make it more clear.
problem during the startup of the firmware prevented the node from even
reaching the BIOS screen.
After a long debugging session, we concluded that the flash memory that stores
rarias marked this conversation as resolved Outdated

~as~ since memory (as is grammatically correct, but using it here reads as while: e.g. `as/while memory cells lose charge they do X`). (https://writinglawtutors.com/dont-use-as-to-mean-because/)
the firmware of the hardware controller was likely the issue, since
[memory cells](https://en.wikipedia.org/wiki/Flash_memory#Principles_of_operation)
rarias marked this conversation as resolved Outdated

I would drop the first `So` since it's a crutch.
may lose charge over time and can end up corrupting the content. We flashed
the latest firmware so that the memory cells are charged again with the new
bits, and that fixed the problem. Hopefully we will be able to use it for a few
more years.
rarias marked this conversation as resolved Outdated

The rest of the `so`s are fine, although they are a bit repetitive.
The SLURM server has been moved to apex, which also allows users to submit jobs
to fox.
rarias marked this conversation as resolved Outdated

Transferred / Migrated
### Migrated machines to BSC building
rarias marked this conversation as resolved Outdated

had been affecting
The server room had a temperature issue that had been affecting our machines
since the end of February 2025. As the summer approached, the temperature
exceeded the safe limits for our hardware, so we had to shut down the cluster.
![Room temperature](temp.png)
rarias marked this conversation as resolved Outdated

where **it** now
Since then, we have moved the cluster to BSC premises, where it now rests at a
stable temperature, so hopefully we won't have any more unscheduled downtime.

Binary file not shown.
