Powering off unused nodes with SLURM #35

Closed
opened 2023-09-08 13:07:32 +02:00 by rarias · 3 comments
rarias commented 2023-09-08 13:07:32 +02:00 (Migrated from pm.bsc.es)

Having nodes turned on without jobs is a waste of energy (bad for the environment) and reduces the lifespan of the hardware, which is already close to death.

The nodes owl1 and owl2 should be kept powered off until a job is submitted. The Slurm daemon can then turn the nodes on, run the job, wait for a grace period in case more jobs arrive, and turn them back off.
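
A rough sketch of how this could look in `slurm.conf`, using Slurm's power saving parameters. The script paths and the timing values are placeholders chosen for illustration, not the values deployed on this cluster:

```
# Hypothetical power saving fragment for slurm.conf (paths and times are assumptions)
SuspendProgram=/etc/slurm/node-suspend   # called with a hostlist of idle nodes to power off
ResumeProgram=/etc/slurm/node-resume     # called with a hostlist of nodes needed by a job
SuspendTime=300        # seconds a node must stay idle before it is suspended (the grace period)
SuspendTimeout=60      # seconds allowed for a node to finish powering off
ResumeTimeout=300      # seconds allowed for a node to boot and become available
```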

rarias commented 2023-09-08 13:07:33 +02:00 (Migrated from pm.bsc.es)

assigned to @rarias

rarias commented 2023-09-08 16:28:08 +02:00 (Migrated from pm.bsc.es)

Seems to be working fine with an ipmitool script:

hut% sudo scontrol show 'nodes=owl[1-2]'
NodeName=owl1 Arch=x86_64 CoresPerSocket=14
   CPUAlloc=0 CPUEfctv=56 CPUTot=56 CPULoad=0.00
   AvailableFeatures=owl
   ActiveFeatures=owl
   Gres=(null)
   NodeAddr=owl1 NodeHostName=owl1 Version=23.02.4
   OS=Linux 6.5.1 #1-NixOS SMP PREEMPT_DYNAMIC Sat Sep  2 07:13:30 UTC 2023
   RealMemory=1 AllocMem=0 FreeMem=127721 Sockets=2 Boards=1
   State=IDLE+POWERED_DOWN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=owl,all
   BootTime=2023-09-08T16:03:52 SlurmdStartTime=2023-09-08T16:05:52
   LastBusyTime=Unknown ResumeAfterTime=None
   CfgTRES=cpu=56,mem=1M,billing=56
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=owl2 Arch=x86_64 CoresPerSocket=14
   CPUAlloc=0 CPUEfctv=56 CPUTot=56 CPULoad=0.00
   AvailableFeatures=owl
   ActiveFeatures=owl
   Gres=(null)
   NodeAddr=owl2 NodeHostName=owl2 Version=23.02.4
   OS=Linux 6.5.1 #1-NixOS SMP PREEMPT_DYNAMIC Sat Sep  2 07:13:30 UTC 2023
   RealMemory=1 AllocMem=0 FreeMem=127743 Sockets=2 Boards=1
   State=IDLE+POWERED_DOWN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=owl,all
   BootTime=2023-09-08T16:03:44 SlurmdStartTime=2023-09-08T16:06:32
   LastBusyTime=Unknown ResumeAfterTime=None
   CfgTRES=cpu=56,mem=1M,billing=56
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

hut% srun -v -N 2 uptime
srun: defined options
srun: -------------------- --------------------
srun: nodes               : 2
srun: verbose             : 1
srun: -------------------- --------------------
srun: end of defined options
srun: Waiting for resource configuration
<a few seconds later...>
srun: Nodes owl[1-2] are ready for job
srun: jobid 4997: nodes(2):`owl[1-2]', cpu counts: 56(x2)
srun: CpuBindType=(null type)
srun: launching StepId=4997.0 on host owl1, 1 tasks: 0
srun: launching StepId=4997.0 on host owl2, 1 tasks: 1
srun: route/default: init: route default plugin loaded
srun: topology/none: init: topology NONE plugin loaded
srun: Node owl2, 1 tasks started
srun: Node owl1, 1 tasks started
 16:25:27  up   0:00,  0 users,  load average: 0,47, 0,12, 0,04
srun: Received task exit notification for 1 task of StepId=4997.0 (status=0x0000).
srun: owl2: task 1: Completed
 16:25:27  up   0:00,  0 users,  load average: 0,11, 0,03, 0,01
srun: Received task exit notification for 1 task of StepId=4997.0 (status=0x0000).
srun: owl1: task 0: Completed

Subsequent runs start immediately:

hut% srun -N 2 uptime
 16:25:57  up   0:01,  0 users,  load average: 0,07, 0,03, 0,01
 16:25:56  up   0:01,  0 users,  load average: 0,29, 0,11, 0,04
hut% srun -N 2 uptime
 16:25:59  up   0:01,  0 users,  load average: 0,26, 0,11, 0,04
 16:26:00  up   0:01,  0 users,  load average: 0,06, 0,03, 0,01

Once the nodes have been idle for longer than the suspend time, they are powered down again.
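
The ipmitool script itself is not included in this issue. As a minimal sketch of what such a helper might look like: Slurm passes the affected nodes to the resume and suspend programs as a single hostlist expression (for example `owl[1-2]`), which `scontrol show hostnames` expands. The BMC naming scheme (`-ipmi` suffix), the user name, the password file path and the script names `node-resume` and `node-suspend` are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical power helper, meant to be installed under two names
# (node-resume and node-suspend) and referenced from slurm.conf.

# Assumed values: BMC host name suffix, IPMI user and password file.
BMC_SUFFIX="${BMC_SUFFIX:--ipmi}"
IPMI_USER="${IPMI_USER:-admin}"
IPMI_PASSFILE="${IPMI_PASSFILE:-/etc/slurm/ipmi-password}"

# power_nodes ACTION HOSTLIST
# ACTION is an ipmitool chassis power subcommand: "on", "soft" or "off".
power_nodes() {
    action="$1"
    hostlist="$2"
    # Expand a hostlist such as "owl[1-2]" into one node name per line.
    for node in $(scontrol show hostnames "$hostlist"); do
        ipmitool -I lanplus -H "${node}${BMC_SUFFIX}" \
            -U "$IPMI_USER" -f "$IPMI_PASSFILE" \
            chassis power "$action"
    done
}

# Pick the action from the name the script was invoked as.
case "${0##*/}" in
    node-resume)  power_nodes on   "$1" ;;
    node-suspend) power_nodes soft "$1" ;;
esac
```

Here `soft` requests an ACPI shutdown through the BMC rather than cutting power, which gives the operating system a chance to stop cleanly; `off` would be the harder alternative.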

rarias commented 2023-09-08 16:55:54 +02:00 (Migrated from pm.bsc.es)

mentioned in merge request !20

Reference: rarias/jungle#35