Slurmstepd has created 120469 children processes #204

Open
opened 2025-10-22 11:48:47 +02:00 by rarias · 1 comment
Owner

It looks like we have a bug in SLURM (on fox). It has frozen the node, but if I stop the process, I can see this:

fox# ps aux -q 369308
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      369308 1220  0.1 224449156 1185844 ?   Tl   10:16 1078:38 slurmstepd: [178.extern]

fox# ls /proc/369308/task/ | wc -l
120469

GDB fails to attach, as it tries to attach to every inferior, but LLDB does:

(lldb) bt
* thread #1, name = 'slurmstepd', stop reason = signal SIGSTOP
  * frame #0: 0x00007f1585303b9f libc.so.6`wait4 + 111
    frame #1: 0x000000000041e024 slurmstepd`_spawn_job_container + 3612
    frame #2: 0x000000000041f046 slurmstepd`job_manager + 1201
    frame #3: 0x0000000000418373 slurmstepd`main + 11638
    frame #4: 0x00007f158522a47e libc.so.6`__libc_start_call_main + 126
    frame #5: 0x00007f158522a539 libc.so.6`__libc_start_main@@GLIBC_2.34 + 137
    frame #6: 0x00000000004114e5 slurmstepd`_start + 37
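Since attaching a debugger to this many tasks is impractical, the per-task state can also be read straight from /proc without ptrace. Note that the entries under /proc/<pid>/task are the threads (tasks) of that one process. The following is only a sketch: the function name `count_task_states` and the `proc_root` parameter are made up for illustration, and it assumes the usual Linux layout of /proc/<pid>/task/<tid>/stat.

```python
from collections import Counter
from pathlib import Path


def count_task_states(pid, proc_root="/proc"):
    """Count the tasks (threads) of `pid` by their state letter (R, S, T, ...)."""
    states = Counter()
    for task_dir in Path(proc_root, str(pid), "task").iterdir():
        try:
            stat = (task_dir / "stat").read_text()
        except OSError:
            # The task exited between listing the directory and reading it.
            continue
        # The comm field is wrapped in parentheses and may itself contain
        # spaces or ')', so the state is the first field after the last ')'.
        fields = stat.rpartition(")")[2].split()
        if fields:
            states[fields[0]] += 1
    return states
```

For the process above this would be called as `count_task_states(369308)`; the sum of the returned counts should roughly match the `wc -l` figure, and the breakdown shows how many tasks are stopped versus running.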
Author
Owner

It seems to be looping while trying to wait for orphan processes, but it then creates more orphans.

https://github.com/SchedMD/slurm/blob/slurm-24.11/src/slurmd/slurmstepd/req.c#L1589

Oct 22 11:23:16 fox slurmstepd[369308]: [178.extern] debug2: adding tracking of orphaned process 369380
Oct 22 11:23:16 fox slurmstepd[369308]: [178.extern] debug:  _handle_add_extern_pid_internal: for StepId=178.extern, pid 369380
Oct 22 11:23:16 fox slurmstepd[369308]: [178.extern] debug2: adding tracking of orphaned process 369308
Oct 22 11:23:16 fox slurmstepd[369308]: [178.extern] debug:  _handle_add_extern_pid_internal: for StepId=178.extern, pid 369308
Oct 22 11:23:18 fox slurmstepd[369308]: [178.extern] debug2: adding tracking of orphaned process 369380
Oct 22 11:23:18 fox slurmstepd[369308]: [178.extern] debug:  _handle_add_extern_pid_internal: for StepId=178.extern, pid 369380

Also:

Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] debug2: adding tracking of orphaned process 394508
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] debug:  _handle_add_extern_pid_internal: for StepId=178.extern, pid 394508
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] debug:  _handle_add_extern_pid_internal: for StepId=178.extern, pid 394959
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] debug:  _handle_add_extern_pid_internal: for StepId=178.extern, pid 394959
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] debug:  _handle_add_extern_pid_internal: for StepId=178.extern, pid 394959
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] debug2: adding tracking of orphaned process 394959
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] debug2: adding tracking of orphaned process 394959
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] debug:  _handle_add_extern_pid_internal: for StepId=178.extern, pid 394959
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] debug2: adding tracking of orphaned process 394734
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] debug:  _handle_add_extern_pid_internal: for StepId=178.extern, pid 394734
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] debug2: adding tracking of orphaned process 394734
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] debug2: adding tracking of orphaned process 394959
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] debug:  _handle_add_extern_pid_internal: for StepId=178.extern, pid 390884
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: common_file_write_uints: write value '390884' to '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special/cgroup.procs' failed: No such process
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: common_file_write_uints: write value '390884' to '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special/cgroup.procs' failed: No such process
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: common_file_write_uints: write value '390884' to '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special/cgroup.procs' failed: No such process
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: common_file_write_uints: write value '390884' to '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special/cgroup.procs' failed: No such process
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: Unable to move pid 390884 to /sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special cg
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: common_file_write_uints: write value '390884' to '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special/cgroup.procs' failed: No such process
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: Unable to move pid 390884 to /sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special cg
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: common_file_write_uints: write value '390884' to '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special/cgroup.procs' failed: No such process
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: Unable to move pid 390884 to /sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special cg
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: common_file_write_uints: write value '390884' to '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special/cgroup.procs' failed: No such process
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: common_file_write_uints: write value '390884' to '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special/cgroup.procs' failed: No such process
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: Unable to move pid 390884 to /sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special cg
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: Unable to move pid 390884 to /sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special cg
Oct 22 11:11:44 fox slurmstepd[369308]: [178.extern] error: common_file_write_uints: write value '390884' to '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special/cgroup.procs' failed: No such process

...

Oct 22 11:20:31 fox slurmstepd[369308]: [178.extern] debug:  cgroup/v2: common_file_write_uints: cgroup_common.c:294: common_file_write_uints: safe_write (7 of 7) failed: No such process
Oct 22 11:20:31 fox slurmstepd[369308]: [178.extern] error: common_file_write_uints: write value '466094' to '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special/cgroup.procs' failed: No such process
Oct 22 11:20:31 fox slurmstepd[369308]: [178.extern] debug2: adding tracking of orphaned process 369380
Oct 22 11:20:31 fox slurmstepd[369308]: [178.extern] debug:  _handle_add_extern_pid_internal: for StepId=178.extern, pid 369380
Oct 22 11:20:31 fox slurmstepd[369308]: [178.extern] error: Unable to move pid 466094 to /sys/fs/cgroup/system.slice/slurmstepd.scope/job_178/step_extern/user/task_special cg
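To see which PIDs keep being re-added, the "adding tracking of orphaned process" lines can be tallied per PID. In the 11:23:16 excerpt above, one of the tracked PIDs (369308) is the PID of the slurmstepd itself, so the sketch below also flags that case. The function name `tally_orphans` is hypothetical, and the regular expressions only assume the log format quoted in this issue.

```python
import re
from collections import Counter

# Patterns match the journal lines quoted in this issue.
DAEMON_RE = re.compile(r"slurmstepd\[(\d+)\]")
ORPHAN_RE = re.compile(r"adding tracking of orphaned process (\d+)")


def tally_orphans(lines):
    """Return (per-PID counts, set of PIDs where the daemon tracked itself)."""
    counts = Counter()
    self_tracked = set()
    for line in lines:
        orphan = ORPHAN_RE.search(line)
        if not orphan:
            continue  # not an orphan-tracking line
        pid = int(orphan.group(1))
        counts[pid] += 1
        # Compare against the PID in the "slurmstepd[...]" log prefix.
        daemon = DAEMON_RE.search(line)
        if daemon and int(daemon.group(1)) == pid:
            self_tracked.add(pid)
    return counts, self_tracked
```

It accepts any iterable of lines, so an open file holding a saved copy of the journal works as input. `counts.most_common(10)` then lists the PIDs that were re-added most often.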
rarias added the bug, slurm labels 2025-10-22 15:45:40 +02:00
Reference: rarias/jungle#204