Shared nix store across compute nodes #23
We should configure the compute nodes to directly access the nix store at hut, so we don't have to copy the closures manually when launching SLURM jobs.
A possible option is to mount the read-only nix store from hut via NFS (or another remote FS) and then set up an overlay on top of it, so the node always has the kernel available to boot:
https://discourse.nixos.org/t/sharing-nix-store-between-containers/9733/16
8346dc04b3/nixos/modules/installer/netboot/netboot.nix (L39-L70)
8346dc04b3/nixos/modules/installer/netboot/netboot.nix (L106-L108)
We may be able to use the new Ceph filesystem for that with an overlay.
So, the ceph FS may not even be needed for now, as we can expose the read-only hut store via NFS (already being exported) and make an overlay in compute nodes using the disk nix store as lower.
This needs to work when hut is down, otherwise we won't be able to boot. The rw store will contain the bootstrap to boot the system.
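As a rough sketch of the "must boot when hut is down" requirement, the NFS mount of the hut store could be declared so that a failed mount does not block boot. This is only an illustration under assumptions: the export `hut:/nix` and the mount point `/mnt/nix-hut` are taken from the manual test later in this thread, while the `nofail` and timeout settings are my own suggestion and have not been tested on the cluster.

```nix
{
  # Read-only NFS mount of the hut nix directory.
  # Device and mount point follow the manual test in this thread.
  fileSystems."/mnt/nix-hut" = {
    device = "hut:/nix";
    fsType = "nfs";
    options = [
      "ro"
      # Assumption: do not fail the boot if hut is unreachable.
      "nofail"
      # Assumption: give up on the mount after a short wait (value is arbitrary).
      "x-systemd.mount-timeout=10s"
    ];
  };
}
```

With something like this, the node would still boot from its local store, and the hut store would simply be missing until the mount succeeds.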
The current problem is that overlayfs refuses to use a read-only directory such as /nix/store as the upper directory (it is read-only due to systemd).
May be interesting: https://talks.nixcon.org/nixcon-2023/talk/GXW3EX/
I was able to mount the hut nix store in owl1 by doing the following procedure:
mount --bind -o remount,rw /nix/store
mount -o ro hut:/nix /mnt/nix-hut
mount -t overlay overlay -o lowerdir=/mnt/nix-hut,upperdir=/nix,workdir=/mnt/nix-work /nix
Mounting only /nix/store fails. I suspect it is because they have to be in the same filesystem, but it is a bind mount. So I mounted /nix directly.
Installing packages is not working well; I broke the nix database:
I'm thinking that we don't need to "see" the derivation of the mount point from nix-daemon. So we can specify a private mount for nix-daemon, and the rest of the system sees /mnt/hut-nix-store + /nix/store, both read-only.
This will allow jobs launched from the login node to run from a shell with all the software loaded. However, nix shell won't work from the compute nodes, as nix-daemon won't see the derivations.
Using a ro store doesn't work either; the daemon seems to be unable to write to it:
Let's try with rw overlay, using a work directory.
Using a rw overlay seems to work:
The derivations still need to be built, so it wouldn't be very practical if users run nix shell/develop from the compute node instead of running it from hut.
Configuring the overlay with fileSystems."/nix/store" causes stage 1 to attempt to mount the filesystem during boot, even with the neededForBoot option set to false, see: 17a46d09ac/nixos/lib/utils.nix (L14)
A simple hack is to add a double slash, so the check fails but the fs can be mounted anyway:
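The exact configuration used for the hack is not shown here, so the following is only a guess at what it could look like. It assumes the check compares the mount point string against a fixed list of paths, so that "/nix//store" no longer matches "/nix/store". The lower directories follow the earlier comments (hut store mounted under /mnt/nix-hut, disk store as lower), and the upper and work directories under /var/lib/nix-overlay are hypothetical names that would have to exist on the same local filesystem.

```nix
{
  # Double slash: same directory for the kernel, but a different string
  # for the path comparison (assumption about how the check works).
  fileSystems."/nix//store" = {
    device = "overlay";
    fsType = "overlay";
    options = [
      # hut store on top of the local disk store, both as read-only lowers.
      "lowerdir=/mnt/nix-hut/store:/nix/store"
      # Hypothetical writable directories; create them beforehand.
      "upperdir=/var/lib/nix-overlay/upper"
      "workdir=/var/lib/nix-overlay/work"
    ];
  };
}
```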
mentioned in merge request !22
It seems to be attempting to mount it in the incorrect order:
The mount unit doesn't seem to have the proper Requires dependency. Maybe we should create a mount systemd unit instead of relying on NixOS fileSystems.
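A minimal sketch of what a dedicated mount unit could look like, using the NixOS `systemd.mounts` option instead of `fileSystems`. The paths are assumptions: /mnt/nix-hut is the NFS mount point from the manual test above, and /var/lib/nix-overlay is a hypothetical local directory for the upper and work dirs. `RequiresMountsFor` is used so systemd adds the Requires and After dependencies on the underlying mounts itself; whether this avoids the ordering problems seen here is untested.

```nix
{
  systemd.mounts = [
    {
      description = "Overlay of the hut and local nix stores";
      what = "overlay";
      where = "/nix/store";
      type = "overlay";
      # Hypothetical upper/work directories on the local disk.
      options = "lowerdir=/mnt/nix-hut/store:/nix/store,upperdir=/var/lib/nix-overlay/upper,workdir=/var/lib/nix-overlay/work";
      # Pull in and order after the mounts that contain these paths.
      unitConfig.RequiresMountsFor = [ "/mnt/nix-hut" "/var/lib/nix-overlay" ];
      wantedBy = [ "multi-user.target" ];
    }
  ];
}
```

Since the unit is not generated from `fileSystems`, stage 1 should not try to mount it, and the dependencies are written explicitly rather than inferred.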
After fixing some boot dependency problems, there is still a cycle in the order of the units at boot:
Using an override file adds the mount point to the list, instead of replacing it: