forked from rarias/jungle

## Compare commits

10 commits: `c4a63b8ffd...26cd5f768d`
Commits in range (author and date columns were not captured in the export):

- 26cd5f768d
- 751e9bef76
- c0b846d819
- ea4ecd4f0f
- 38ce508447
- 2349128639
- 84008b3fbc
- ca0c05f797
- 1ad98a3b49
- 2e74af2cd8
```diff
@@ -168,7 +168,7 @@
     home = "/home/Computational/csiringo";
     description = "Cesare Siringo";
     group = "Computational";
-    hosts = [ ];
+    hosts = [ "apex" "weasel" ];
     hashedPassword = "$6$0IsZlju8jFukLlAw$VKm0FUXbS.mVmPm3rcJeizTNU4IM5Nmmy21BvzFL.cQwvlGwFI1YWRQm6gsbd4nbg47mPDvYkr/ar0SlgF6GO1";
     openssh.authorizedKeys.keys = [
       "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHA65zvvG50iuFEMf+guRwZB65jlGXfGLF4HO+THFaed csiringo@bsc.es"
```
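The hunk above enables the csiringo account on the apex and weasel hosts. For context, a complete user entry in this style of configuration might look like the sketch below; the `users.users` attribute path and the `isNormalUser` flag are assumptions based on common NixOS modules, and `hosts` appears to be an option specific to this repository:

```nix
# Sketch of a full user entry (assumed surrounding structure).
users.users.csiringo = {
  isNormalUser = true;
  home = "/home/Computational/csiringo";
  description = "Cesare Siringo";
  group = "Computational";
  # Repository-specific option: hosts where this account is enabled
  hosts = [ "apex" "weasel" ];
  # Hash as produced by e.g. `mkpasswd -m sha-512`
  hashedPassword = "...";
  openssh.authorizedKeys.keys = [
    "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHA65zvvG50iuFEMf+guRwZB65jlGXfGLF4HO+THFaed csiringo@bsc.es"
  ];
};
```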
```diff
@@ -6,8 +6,5 @@
{
  extra-substituters = [ "http://hut/cache" ];
  extra-trusted-public-keys = [ "jungle.bsc.es:pEc7MlAT0HEwLQYPtpkPLwRsGf80ZI26aj29zMw/HH0=" ];

  # Set a low timeout in case hut is down
  connect-timeout = 3; # seconds
};
}
```
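These settings point Nix at hut's binary cache, with a short connection timeout so builds fall back to the upstream cache when hut is unreachable. A Nix binary cache answers on `/nix-cache-info`, so the cache can be probed by hand with the same timeout; this is a sketch that assumes the URL from the config above and that you are inside the cluster network:

```shell
# Probe hut's binary cache with the same 3-second budget the Nix
# daemon uses; a timeout here means builds will fall back upstream.
curl --connect-timeout 3 http://hut/cache/nix-cache-info
```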
@@ -1,49 +0,0 @@

---
title: "Update 2025-09-26"
author: "Rodrigo Arias Mallo"
date: 2025-09-26
---

This is a summary of notable changes introduced in the last two years. We
continue to keep all machines updated to the latest NixOS release (currently
NixOS 25.05).

### New compute node: fox

We have a new [fox machine](/fox) with two AMD Genoa 9684X CPUs and two NVIDIA
RTX4000 GPUs. Over the last months we have run several tests, and most of the
components appear to work well. We have configured CUDA to use the NVIDIA
GPUs, as well as AMD uProf to trace performance and energy counters from the
CPUs.

### Upgraded login node: apex

We have upgraded the operating system on the login node to NixOS, which now runs
Linux 6.15.6. During the upgrade, we detected a problem with the storage
disks. The `/` and `/home` partitions sit on a
[RAID 5](https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5),
transparently handled by a hardware RAID controller that starts its own
firmware before passing control to the BIOS to continue the boot sequence. A
problem during the startup of this firmware prevented the node from even
reaching the BIOS screen.

After a long debugging session, we determined that the flash memory storing
the firmware of the hardware controller was the likely culprit, since
[memory cells](https://en.wikipedia.org/wiki/Flash_memory#Principles_of_operation)
may lose charge over time, which can end up corrupting the content. We flashed
the latest firmware, so the memory cells were charged again with the new bits,
and that fixed the problem. Hopefully we will be able to use it for some more
years.

The SLURM server has been moved to apex, which allows users to also submit jobs
to fox.
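With the SLURM server on apex, submission works the usual way from the login node. A minimal job script might look like the sketch below; the partition name `fox` is an assumption based on the node name, not taken from the repository:

```shell
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=fox    # assumed partition name for the fox node
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

srun hostname
```

Submitted from apex with `sbatch job.sh`.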
### Migrated machines to BSC building

The server room had a temperature issue that had been affecting our machines
since the end of February 2025. As the summer approached, the temperature
exceeded the safe limits for our hardware, so we had to shut down the cluster.



Since then, we have moved the cluster to BSC premises, where it now rests at a
stable temperature, so hopefully we won't have more unscheduled downtime.
Binary image file not shown (previous size: 97 KiB).