Run a shell in the allocated node with salloc #208

Manually merged
rarias merged 2 commits from slurm-interactive into master 2025-10-28 12:37:32 +01:00
Owner

By default, salloc will open a new shell in the *current* node instead
of in the allocated node. This often causes users to leave the extra
shell running once the allocation ends. Repeating this process several
times causes chains of shells.

By running the shell in the remote node, once the allocation ends the
shell finishes as well.

Fixes: #174
See: https://slurm.schedmd.com/faq.html#prompt
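
The FAQ entry above describes the `use_interactive_step` launch parameter that makes this work. A minimal sketch of the relevant `slurm.conf` setting (the exact configuration in this repo may differ):

```
# slurm.conf: run the salloc shell on the first allocated node as an
# "interactive" step, instead of spawning it locally on the login node.
LaunchParameters=use_interactive_step
```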


Example:

```
apex% salloc -p fox -t 0:01:00
salloc: Granted job allocation 196
salloc: Nodes fox are ready for job
fox%
salloc: Job 196 has exceeded its time limit and its allocation has been revoked.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 196.interactive ON fox CANCELLED AT 2025-10-24T15:47:51 DUE TO TIME LIMIT ***
srun: error: fox: task 0: Killed
apex%
apex% ps
    PID TTY          TIME CMD
 268880 pts/7    00:00:01 zsh
 271654 pts/7    00:00:00 ps
```

CC @varcila

rarias added 1 commit 2025-10-24 15:44:03 +02:00
Run a shell in the allocated node with salloc
All checks were successful
CI / build:cross (pull_request) Successful in 6s
CI / build:all (pull_request) Successful in 15s
5b041f2339
By default, salloc will open a new shell in the *current* node instead
of in the allocated node. This often causes users to leave the extra
shell running once the allocation ends. Repeating this process several
times causes chains of shells.

By running the shell in the remote node, once the allocation ends the
shell finishes as well.

Fixes: #174
See: https://slurm.schedmd.com/faq.html#prompt
rarias requested review from abonerib 2025-10-24 15:44:11 +02:00
abonerib reviewed 2025-10-24 15:53:05 +02:00
@@ -88,7 +88,7 @@ in {
# LaunchParameters=ulimit_pam_adopt will set RLIMIT_RSS in processes
# adopted by the external step, similar to tasks running in regular steps
Collaborator

I would remove or update the comment
rarias marked this conversation as resolved
rarias force-pushed slurm-interactive from 5b041f2339 to 9c622bb6b7 2025-10-24 15:56:06 +02:00
abonerib approved these changes 2025-10-24 17:13:31 +02:00
Dismissed
Collaborator

I am having trouble logging into fox, I guess related to `LaunchParameters=use_interactive_step`:

This is from my machine:

```
Connection closed by 147.83.30.141 port 22
```

~~I can log in from inside apex, but not when using it as a proxy, though. This is not ideal, but I can live with it if there is no other choice :)~~

Now I can only get a shell when I run `salloc` without the `--no-shell` option; I cannot ssh into fox from other shells.

Collaborator

Can confirm that it's broken:

```
$ ssh -vvv fox
<...>
debug1: Will attempt key: /home/leix/.ssh/id_ed25519 ED25519 SHA256:Jmq7aNH8XDdGy7E9dqfqrc/LRaVqhnFgDgdxlFw/pl8 agent
debug1: Will attempt key: /home/leix/.ssh/id_rsa
debug1: Will attempt key: /home/leix/.ssh/id_ecdsa
debug1: Will attempt key: /home/leix/.ssh/id_ecdsa_sk
debug1: Will attempt key: /home/leix/.ssh/id_ed25519_sk
debug2: pubkey_prepare: done
debug1: Offering public key: /home/leix/.ssh/id_ed25519 ED25519 SHA256:Jmq7aNH8XDdGy7E9dqfqrc/LRaVqhnFgDgdxlFw/pl8 agent
debug3: send packet: type 50
debug2: we sent a publickey packet, wait for reply
debug3: receive packet: type 60
debug1: Server accepts key: /home/leix/.ssh/id_ed25519 ED25519 SHA256:Jmq7aNH8XDdGy7E9dqfqrc/LRaVqhnFgDgdxlFw/pl8 agent
debug3: sign_and_send_pubkey: using publickey-hostbound-v00@openssh.com with ED25519 SHA256:Jmq7aNH8XDdGy7E9dqfqrc/LRaVqhnFgDgdxlFw/pl8
debug3: sign_and_send_pubkey: signing using ssh-ed25519 SHA256:Jmq7aNH8XDdGy7E9dqfqrc/LRaVqhnFgDgdxlFw/pl8
debug3: send packet: type 50
Connection closed by 147.83.30.141 port 22
```

[Full output](https://jungle.bsc.es/p/abonerib/DODh3b6G.txt)

- This was with an active allocation with `--no-shell`. I get the same result (send packet type 50, SSH_MSG_USERAUTH_REQUEST, into a closed connection) regardless of having allocated the node or not.
Author
Owner

> Connection closed by 147.83.30.141 port 22

This is caused by PAM because it doesn't find the `pam_slurm_adopt` module. It happens because I forgot to add the slurm package to the overlay after merging bscpkgs, so it comes with PAM disabled. Should be fixed now:

```
apex% salloc -p fox
salloc: Granted job allocation 211
salloc: Nodes fox are ready for job
fox%

# Another shell in apex
apex% ssh fox date
2025-10-27T11:45:08 CET

# Another machine
hop% ssh fox date
2025-10-27T11:45:44 CET
```
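(As background: `pam_slurm_adopt` is typically wired into the sshd PAM stack as an account module, roughly as sketched below; on fox the PAM configuration is generated by NixOS, so the actual file will differ.)

```
# /etc/pam.d/sshd (sketch): reject ssh logins from users without a job
# on this node, and adopt accepted sessions into the job's cgroup.
account    required    pam_slurm_adopt.so
```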
rarias added 1 commit 2025-10-27 11:41:28 +01:00
Add missing slurm package to overlay
All checks were successful
CI / build:cross (pull_request) Successful in 1m4s
CI / build:all (pull_request) Successful in 13m13s
84b7e316a5
Author
Owner

Note: Using callPackage to do overrides only is not a good idea, as in this case the slurm module uses `override` to change some options, which would fail if we wrap the original package with another callPackage layer. Using a raw import seems to be a good compromise, so we don't pollute the overlay.nix file.
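
A minimal Nix sketch of the pitfall (file names here are hypothetical):

```nix
# overlay.nix (sketch)
final: prev: {
  # BAD: callPackage re-wraps the result with makeOverridable, so
  # `.override` on the resulting attribute now overrides the arguments
  # of ./slurm.nix itself; the NixOS slurm module's
  # `pkgs.slurm.override { ... }` then fails with unexpected arguments.
  #slurm = final.callPackage ./slurm.nix { };

  # OK: a raw import returns the value of ./slurm.nix unchanged, so the
  # derivation keeps its own .override. Here ./slurm.nix would be a
  # function like: pkgs: pkgs.slurm.override { ... }
  slurm = import ./slurm.nix prev;
}
```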
Collaborator

I can confirm that now I can ssh from other terminals from apex, and from my machine, with and without the `--no-shell` option.
rarias requested review from abonerib 2025-10-28 11:35:11 +01:00
abonerib approved these changes 2025-10-28 11:42:48 +01:00
rarias force-pushed slurm-interactive from 84b7e316a5 to a7018250ca 2025-10-28 11:45:27 +01:00
rarias manually merged commit a7018250ca into master 2025-10-28 12:37:32 +01:00