Failed to run a job with firewall #38

Closed
opened 2023-09-13 17:28:48 +02:00 by rarias · 3 comments
rarias commented 2023-09-13 17:28:48 +02:00 (Migrated from pm.bsc.es)

After closing the inbound ports, srun fails to launch a job with pmix:

hut% srun -N2 osu_bw
slurmstepd: error:  mpi/pmix_v3: _tcp_connect: owl2 [1]: pmixp_dconn_tcp.c:141: Cannot establish the connection
slurmstepd: error:  mpi/pmix_v3: pmixp_dconn_connect: owl2 [1]: pmixp_dconn.h:245: Cannot establish direct connection to owl1 (0)
slurmstepd: error:  mpi/pmix_v3: _process_extended_hdr: owl2 [1]: pmixp_server.c:733: Unable to connect to 0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 5033.0 ON owl1 CANCELLED AT 2023-09-13T17:27:16 ***
srun: error: owl2: task 1: Killed
srun: error: owl1: task 0: Killed

See https://bugs.schedmd.com/show_bug.cgi?id=3925#c19

This seems to be solved in the latest release:

> It has been fixed in 23.02.5.

But it is not yet in upstream. Let's see if we can add it as an overlay.

rarias commented 2023-09-13 17:28:48 +02:00 (Migrated from pm.bsc.es)

assigned to @rarias

rarias commented 2023-09-14 15:46:51 +02:00 (Migrated from pm.bsc.es)

Still not working with Slurm 23.02.5:

hut% srun -N2 osu_bw
slurmstepd: error:  mpi/pmix_v3: _tcp_connect: owl1 [0]: pmixp_dconn_tcp.c:141: Cannot establish the connection
slurmstepd: error:  mpi/pmix_v3: pmixp_dconn_connect: owl1 [0]: pmixp_dconn.h:245: Cannot establish direct connection to owl2 (1)
slurmstepd: error:  mpi/pmix_v3: _process_extended_hdr: owl1 [0]: pmixp_server.c:733: Unable to connect to 1
slurmstepd: error: *** STEP 5038.0 ON owl1 CANCELLED AT 2023-09-14T15:39:07 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: owl2: task 1: Killed
srun: error: owl1: task 0: Killed
hut% srun --version
slurm 23.02.5

Let's open the inbound ports in the firewall of the compute nodes for now.
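
To check whether the firewall is what blocks the direct connection between the nodes, a TCP probe along these lines could be run from one compute node against the other. This is only a minimal sketch: the helper name `can_connect` is hypothetical, and the port that pmix actually uses for the direct connection is not stated in this issue, so any port number passed to it is a placeholder.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Hypothetical helper, not part of Slurm or PMIx. A False result only means
    the connection failed (filtered by a firewall, refused because nothing is
    listening, or the host name did not resolve); it does not tell which.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution errors.
        return False
```

For example, calling `can_connect("owl1", 5000)` on owl2 while something listens on that (placeholder) port on owl1 should return `False` with the inbound ports closed and `True` once they are open.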

rarias commented 2023-09-14 15:55:27 +02:00 (Migrated from pm.bsc.es)

Closed in !20

Reference: rarias/jungle#38