---
title: "Update 2025-09-26"
author: "Rodrigo Arias Mallo"
date: 2025-09-26
---

This is a summary of notable changes introduced in the last two years. We continue to keep all machines updated to the latest NixOS release (currently NixOS 25.05).
### New compute node: fox

We have a new [fox machine](/fox), with two AMD Genoa 9684X CPUs and two NVIDIA RTX4000 GPUs. Over the last few months we have been running tests, and most of the components seem to work well. We have configured CUDA to use the NVIDIA GPUs, as well as AMD uProf to trace performance and energy counters from the CPUs.
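To check that the GPUs are reachable from a program, a small device query against the CUDA runtime API is usually enough. The following is a minimal sketch of our own (not part of the fox configuration); it assumes only that the CUDA runtime headers are available:

```c
// query.cu: list the CUDA devices visible on this node.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; i++) {
        // Print the name and memory size of each device.
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s, %llu MiB\n", i, prop.name,
               (unsigned long long) (prop.totalGlobalMem >> 20));
    }
    return 0;
}
```

Building it with `nvcc query.cu -o query` and running it on fox should print one line per RTX4000 GPU.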
### Upgraded login node: apex

We have upgraded the operating system on the login node to NixOS, which now runs Linux 6.15.6. During the upgrade, we detected a problem with the RAID controller that caused a catastrophic failure, preventing the BIOS from starting.

The `/` and `/home` partitions sit on a RAID 5 managed by a hardware RAID controller; however, it was unable to boot properly and hand control over to the BIOS. After a long debugging session, we found that the flash memory that stores the firmware of the controller was likely the issue, as [memory cells](https://en.wikipedia.org/wiki/Flash_memory#Principles_of_operation) may lose charge over time and can end up corrupting the content. We flashed the latest firmware, so the memory cells are charged again with the new bits, and that fixed the problem. Hopefully we will be able to use it for a few more years.

The SLURM server has been moved to apex, so you can now allocate your jobs from there, including on the new fox machine.
### Relocated machines to the BSC building

The server room has had a temperature problem that affected our machines since the end of February 2025. As the summer approached, the temperature exceeded the safe limits for our hardware, so we had to shut down the cluster.



Since then, we have moved the cluster to the BSC premises, where it now rests at a stable temperature, so hopefully we won't have any more unscheduled downtime.