Mount the hut nix store for SLURM jobs #68
Until we transition to a global nix store #42 or fix the overlay problems #41, this is an intermediate solution that allows us to run parallel jobs without needing to copy derivations to the compute nodes.
The trick relies on the private mount namespace that systemd creates for the slurm daemon, which replaces `/nix/store` with a read-only mount of the hut store exported via NFS.
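A minimal sketch of that namespace setup, assuming the hut store is NFS-mounted on the compute nodes at `/mnt/hut/nix/store` (an illustrative path; the actual option values in our slurm module may differ):

```nix
{
  # Sketch: run slurmd in a private mount namespace where the
  # NFS-exported hut store is bound read-only over /nix/store.
  systemd.services.slurmd.serviceConfig = {
    PrivateMounts = true;              # unshare the mount namespace
    BindReadOnlyPaths = [
      "/mnt/hut/nix/store:/nix/store"  # hut store replaces the local one
    ];
  };
}
```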
There are some drawbacks:

- The local binaries in `/run/current-system/sw/bin` are not available, as the overlay FS doesn't work. But at least it allows us to run some jobs in the meantime.
- The `nix build`/`nix shell`/`nix develop` commands run as if executed outside the slurm mount namespace, since they contact the daemon for build operations and the daemon only sees the local store, so nothing they build will appear inside the slurm namespace. The environment must be entered from the hut node first, and then the `srun` command must be launched with all dependencies already present on hut.
It seems to be immune to the overlay FS "caching" problem, where an `ls` of a missing path that later becomes readable still fails.
requested review from @arocanon
assigned to @rarias
I expected to see some config changes for the owl nodes to forward builds to hut. Aren't they needed yet?
The builds on the owl machines are still done locally. This is required to be able to switch to a new system while we keep the profile installed in `/nix/var/nix/profiles/system`, which is loaded by the GRUB script, as we continue to boot from the disk.

The only way to alter the shared nix store seen by the slurm jobs is to issue the `nix build`/`nix shell`/`nix develop` commands from the hut node.
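For reference, forwarding builds from owl to hut would look roughly like the standard Nix remote-build setup sketched below. This is not part of this change, and the hostname, user and key path are assumptions:

```nix
{
  # Hypothetical sketch of forwarding builds to hut; NOT enabled here.
  nix.distributedBuilds = true;
  nix.buildMachines = [{
    hostName = "hut";                  # assumed SSH-reachable host name
    system = "x86_64-linux";
    maxJobs = 4;                       # illustrative value
    sshUser = "root";                  # assumed build user
    sshKey = "/root/.ssh/id_ed25519";  # assumed key path
  }];
}
```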
We could add a bind mount of `/nix/var/daemon-socket` to the slurmd systemd service and connect it to the hut daemon, so that builds can also be done from the compute nodes inside the slurm mount namespace. If this setup proves to work reliably we can try to add this capability later, but for now it lets me begin the ovni CI testing with multiple nodes.
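That follow-up could look roughly like the sketch below, assuming the hut daemon socket directory is exported and mounted at `/mnt/hut/nix/var/daemon-socket` (an illustrative path) and bound over the standard `/nix/var/nix/daemon-socket` location:

```nix
{
  # Hypothetical: let nix commands inside the slurmd namespace talk to
  # the hut daemon by binding its socket directory over the local one.
  systemd.services.slurmd.serviceConfig.BindPaths = [
    "/mnt/hut/nix/var/daemon-socket:/nix/var/nix/daemon-socket"
  ];
}
```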
Good for me, but if we merge this to master, shouldn't we add some doc about this behavior?
added 1 commit

9ee71114 - Document the hut shared nix store for SLURM
Yeah. I added some documentation under the owl page, but this will be better covered by the introductory guide I'm preparing, where I describe how to use the whole cluster to run jobs, build derivations, etc.
Perfect, thank you very much!
resolved all threads
approved this merge request