Add web post update for 2025
Reviewed-by: Aleix Boné <abonerib@bsc.es>
parent b51428ed67
commit 7737468e46
							
								
								
									
content/posts/2025-09-26/_index.md | 49 | Normal file
						
									
								
							| @ -0,0 +1,49 @@ | |||||||
|  | --- | ||||||
|  | title: "Update 2025-09-26" | ||||||
|  | author: "Rodrigo Arias Mallo" | ||||||
|  | date: 2025-09-26 | ||||||
|  | --- | ||||||
|  | 
 | ||||||
|  | This is a summary of notable changes introduced in the last two years. We | ||||||
|  | continue to keep all machines updated to the latest NixOS release (currently | ||||||
|  | NixOS 25.05). | ||||||
|  | 
 | ||||||
|  | ### New compute node: fox | ||||||
|  | 
 | ||||||
|  | We have a new [fox machine](/fox), with two AMD Genoa 9684X CPUs and two NVIDIA | ||||||
|  | RTX4000 GPUs. Over the last few months we have been testing it, and most of | ||||||
|  | the components seem to work well. We have configured CUDA to use the NVIDIA | ||||||
|  | GPUs, as well as AMD uProf to trace performance and energy counters from the | ||||||
|  | CPUs. | ||||||
|  | 
 | ||||||
|  | ### Upgraded login node: apex | ||||||
|  | 
 | ||||||
|  | We have upgraded the operating system on the login node to NixOS, which now runs | ||||||
|  | Linux 6.15.6. During the upgrade, we detected a problem with the storage | ||||||
|  | disks. The `/` and `/home` partitions sit on a | ||||||
|  | [RAID 5](https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5), | ||||||
|  | transparently handled by a RAID hardware controller which starts its own | ||||||
|  | firmware before passing control to the BIOS to continue the boot sequence. A | ||||||
|  | problem during the firmware startup prevented the node from even reaching the | ||||||
|  | BIOS screen. | ||||||
|  | 
 | ||||||
|  | After a long debugging session, we concluded that the flash memory that stores | ||||||
|  | the firmware of the hardware controller was the likely cause, since | ||||||
|  | [memory cells](https://en.wikipedia.org/wiki/Flash_memory#Principles_of_operation) | ||||||
|  | may lose charge over time and end up corrupting the content. We flashed the | ||||||
|  | latest firmware so that the memory cells are charged again with the new bits, | ||||||
|  | and that fixed the problem. Hopefully the node will last a few more years. | ||||||
|  | 
 | ||||||
|  | The SLURM server has been moved to apex, which allows users to also submit jobs | ||||||
|  | to fox. | ||||||
|  | 
 | ||||||
|  | ### Migrated machines to BSC building | ||||||
|  | 
 | ||||||
|  | The server room had a temperature issue that had been affecting our machines | ||||||
|  | since the end of February 2025. As the summer approached, the temperature | ||||||
|  | exceeded the safe limits for our hardware, so we had to shut down the cluster. | ||||||
|  | 
 | ||||||
|  |  | ||||||
|  | 
 | ||||||
|  | Since then, we have moved the cluster to BSC premises, where it now rests at a | ||||||
|  | stable temperature, so hopefully we won't have any more unscheduled downtime. | ||||||
							
								
								
									
										
content/posts/2025-09-26/temp.png | BIN | Normal file
									
								
							
							
						
						
									
										
									
								
							
										
											Binary file not shown.
										
									
								
| After | Size: 97 KiB |