Add update post to website
This commit is contained in:
		
							parent
							
								
									10ca572aec
								
							
						
					
					
						commit
						8d449ba20c
					
				
							
								
								
									
										69
									
								
								web/content/posts/2023-09-12/_index.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										69
									
								
								web/content/posts/2023-09-12/_index.md
									
									
									
									
									
										Normal file
									
								
							| @ -0,0 +1,69 @@ | |||||||
|  | --- | ||||||
|  | title: "Update 2023-09-12" | ||||||
|  | author: "Rodrigo Arias Mallo" | ||||||
|  | date: 2023-09-12 | ||||||
|  | --- | ||||||
|  | 
 | ||||||
|  | This is a summary of notable changes introduced in the jungle cluster in the | ||||||
|  | last months. | ||||||
|  | 
 | ||||||
|  | ### New Ceph filesystem available | ||||||
|  | 
 | ||||||
|  | We have installed the latest [Ceph filesystem][1] (18.2.0) which stores three | ||||||
|  | redundant copies of the data so a failure in one disk doesn't cause data loss. | ||||||
|  | It is mounted in /ceph and available for use in the owl1, owl2 and hut nodes. | ||||||
|  | 
 | ||||||
|  | [1]: https://en.wikipedia.org/wiki/Ceph_(software) | ||||||
|  | 
 | ||||||
|  | The throughput is limited by the 1 Gigabit Ethernet speed, but should be | ||||||
|  | reasonably fast for most workloads. Here is a test with dd which reaches the | ||||||
|  | network limit: | ||||||
|  | 
 | ||||||
|  | ```txt | ||||||
|  | hut% dd if=/dev/urandom of=/ceph/rarias/urandom bs=1M count=1024 | ||||||
|  | 1024+0 records in | ||||||
|  | 1024+0 records out | ||||||
|  | 1073741824 bytes (1,1 GB, 1,0 GiB) copied, 8,98544 s, 119 MB/s | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | ### SLURM power save | ||||||
|  | 
 | ||||||
|  | The SLURM daemon has been configured to power down the nodes after one hour of | ||||||
|  | idling. When a new job is allocated to a node that is powered off, it is | ||||||
|  | automatically turned on and as soon as it becomes available it will execute the | ||||||
|  | job. Here is an example with two nodes that boot and execute a simple job that | ||||||
|  | shows the date. | ||||||
|  | 
 | ||||||
|  | ```txt | ||||||
|  | hut% date; srun -N 2 date | ||||||
|  | 2023-09-12T17:36:09 CEST | ||||||
|  | 2023-09-12T17:38:26 CEST | ||||||
|  | 2023-09-12T17:38:18 CEST | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | You can expect a similar delay (around 2-3 min) while the nodes are starting. | ||||||
|  | Notice that while the nodes are kept on, the delay is not noticeable: | ||||||
|  | 
 | ||||||
|  | ```txt | ||||||
|  | hut% date; srun -N 2 date | ||||||
|  | 2023-09-12T17:40:04 CEST | ||||||
|  | 2023-09-12T17:40:04 CEST | ||||||
|  | 2023-09-12T17:40:04 CEST | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | ### Power and temperature monitoring | ||||||
|  | 
 | ||||||
|  | In the cluster, we monitor the temperature and the power draw of all nodes. This | ||||||
|  | allows us to understand which machines are not being used and turn them off to | ||||||
|  | save energy that otherwise would be wasted. Here is an example where some nodes | ||||||
|  | are powered off to save energy: | ||||||
|  | 
 | ||||||
|  |  | ||||||
|  | 
 | ||||||
|  | We also configured the nodes to work at low CPU frequencies, so the temperature | ||||||
|  | is kept low to increase the lifespan of the node components. Towards these | ||||||
|  | goals, we have configured two alerts that trigger when the CPUs of a node | ||||||
|  | exceeds the limit temperature of 80 °C or when the power draw exceeds 350 W. | ||||||
|  | 
 | ||||||
|  | By keeping the power consumption and temperatures controlled, we can safely | ||||||
|  | incorporate more machines that will only be used on demand. | ||||||
							
								
								
									
										
											BIN
										
									
								
								web/content/posts/2023-09-12/power.png
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
										
											BIN
										
									
								
								web/content/posts/2023-09-12/power.png
									
									
									
									
									
										Normal file
									
								
							
										
											Binary file not shown.
										
									
								
							| After Width: | Height: | Size: 58 KiB | 
		Loading…
	
	
			
			x
			
			
		
	
		Reference in New Issue
	
	Block a user