Issues with systemRequiredFeatures = [ "cuda" ] #147

Closed
opened 2025-07-18 20:38:26 +02:00 by abonerib · 3 comments
Collaborator

In #146 we tried to enable the nix mechanism (nix-required-mounts) that exposes CUDA inside the build sandbox. It did not work.

Findings

libcuda lives in /run/opengl-driver/lib, and /run/opengl-driver is among the paths added to the nvidia mounts (the entry is duplicated, but I don't think that should be an issue):

:p outputs.nixosConfigurations.fox.config.programs.nix-required-mounts.allowedPatterns.nvidia-gpu.paths
[
  "/run/opengl-driver"
  "/dev/dri"
  "/dev/nvidia*"
  "/run/opengl-driver"
  «derivation /nix/store/4l4wafckfgp8k6axkzq6m36jvqij5npg-mesa-25.0.7.drv»
  «derivation /nix/store/frzpf9jz7var510z96bnp68gqwb96b10-nvidia-x11-570.153.02-6.15.6.drv»
  «derivation /nix/store/bac7mbhj4nizfy40gjs0xj5s3lnifdg2-nvidia-vaapi-driver-0.0.13.drv»
]

:p map toString outputs.nixosConfigurations.fox.config.programs.nix-required-mounts.allowedPatterns.nvidia-gpu.paths
[
  "/run/opengl-driver"
  "/dev/dri"
  "/dev/nvidia*"
  "/run/opengl-driver"
  "/nix/store/cpwib3zazj49fm0y04y53w4xkbqsgrgm-mesa-25.0.7"
  "/nix/store/abw75js074viqk8ksprqkz35ca9f2d7z-nvidia-x11-570.153.02-6.15.6"
  "/nix/store/vq74dby3q0s3sillqc6vp73vqbps1zvv-nvidia-vaapi-driver-0.0.13"
]

There is a hook, addDriverRunpath, which adds /run/opengl-driver/lib to the rpath of the binaries. It is used by triton, cudaPackages.saxpy and every other package that sets requiredSystemFeatures = [ "cuda" ];

<...>
saxpy-unstable> autoFixElfFiles: using addDriverRunpath to fix /nix/store/ijbaa3l2fqkkzk9ispb4x361a1kv24ak-saxpy-unstable-2023-07-11/bin/saxpy
saxpy-unstable> Running phase: installCheckPhase
saxpy-unstable> no installcheck target in Makefile, doing nothing
saxpy-unstable> Start
saxpy-unstable> Runtime version: 12080
saxpy-unstable> Driver version: 0
saxpy-unstable> Host memory initialized, copying to the device
saxpy-unstable> CUDA error at cudaMalloc(&xDevice, N * sizeof(float)): CUDA driver version is insufficient for CUDA runtime version

The binary is indeed properly patched, and can run from a shell without problems:

fox$ patchelf --print-rpath /nix/store/ijbaa3l2fqkkzk9ispb4x361a1kv24ak-saxpy-unstable-2023-07-11/bin/saxpy
/run/opengl-driver/lib:/nix/store/v27vvpi4piwjgznd0165462civjj50lv-libcublas-12.8.4.1-lib/lib:/nix/store/i51fmwsd274z8ck17i0xkw37xgf24623-cuda_cudart-12.8.90-lib/lib:/nix/store/zdpby3l6azi78sl83cpad2qjpfj25aqx-glibc-2.40-66/lib:/nix/store/bmi5znnqk4kg2grkrhk6py0irc8phf6l-gcc-14.2.1.20250322-lib/lib

fox$ /nix/store/ijbaa3l2fqkkzk9ispb4x361a1kv24ak-saxpy-unstable-2023-07-11/bin/saxpy
Start
Runtime version: 12080
Driver version: 12080
Host memory initialized, copying to the device
Scheduled a cudaMemcpy, calling the kernel
Scheduled a kernel call
Max error: 0.000000

It seems that /run/opengl-driver is present in the sandbox, and we can see the symlink to the store path containing the machine's graphics drivers, but we cannot follow it any further:

saxpy-unstable> Running phase: installCheckPhase
saxpy-unstable> #### CONTENTS OF /dev:
saxpy-unstable> dri   kvm             nvidia-uvm        nvidia1    pts     stderr  tty
saxpy-unstable> fd    null            nvidia-uvm-tools  nvidiactl  random  stdin   urandom
saxpy-unstable> full  nvidia-modeset  nvidia0           ptmx       shm     stdout  zero
saxpy-unstable> #### CONTENTS OF /run/:
saxpy-unstable> total 4
saxpy-unstable> drwxr-xr-x 2 nobody nogroup 120 Jul 17 09:41 binfmt
saxpy-unstable> lrwxrwxrwx 1 nobody nogroup  60 Jul 17 09:41 opengl-driver -> /nix/store/lv0wz2axhnvpk4zqkxhap5q1793x0n6l-graphics-drivers
saxpy-unstable> #### CONTENTS OF /run/opengl-driver:
saxpy-unstable> ls: cannot access '/run/opengl-driver/': No such file or directory

list_dev> #### CONTENTS OF /run:
list_dev> total 4
list_dev> drwxr-xr-x 2 nobody nogroup 120 Jul 17 09:41 binfmt
list_dev> lrwxrwxrwx 1 nobody nogroup  60 Jul 17 09:41 opengl-driver -> /nix/store/lv0wz2axhnvpk4zqkxhap5q1793x0n6l-graphics-drivers
list_dev> #### CONTENTS OF /run/opengl-driver:
list_dev> ls: cannot access '/run/opengl-driver/': No such file or directory

fox$ ls /nix/store/lv0wz2axhnvpk4zqkxhap5q1793x0n6l-graphics-drivers/ -l
total 20
dr-xr-xr-x 2 root root 4096 Jan  1  1970 bin
lrwxrwxrwx 1 root root   76 Jan  1  1970 etc -> /nix/store/abw75js074viqk8ksprqkz35ca9f2d7z-nvidia-x11-570.153.02-6.15.6/etc
lrwxrwxrwx 1 root root   63 Jan  1  1970 include -> /nix/store/cpwib3zazj49fm0y04y53w4xkbqsgrgm-mesa-25.0.7/include
dr-xr-xr-x 5 root root 4096 Jan  1  1970 lib
dr-xr-xr-x 4 root root 4096 Jan  1  1970 share

strace of ls /run/opengl-driver/ during installCheckPhase:

brk(NULL)                               = 0x56c000
brk(0x58d000)                           = 0x58d000
ioctl(1, TCGETS, 0x7fffffffdb00)        = -1 ENOTTY (Inappropriate ioctl for device)

statx(AT_FDCWD, "/run/opengl-driver/", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW|AT_NO_AUTOMOUNT, STATX_MODE|STATX_NLINK|STATX_UID|STATX_GID|STATX_MTIME|STATX_SIZE, 0x7fffffffd670) = -1 ENOENT (No such file or directory)

fcntl(1, F_GETFL)                       = 0x1 (flags O_WRONLY)
write(2, "ls: ", 4ls: )                     = 4
write(2, "cannot access '/run/opengl-drive"..., 35cannot access '/run/opengl-driver/') = 35
write(2, ": No such file or directory", 27: No such file or directory) = 27
write(2, "\n", 1
)                       = 1
close(1)                                = 0
close(2)                                = 0
exit_group(2)                           = ?

If we hardcode ls -l /nix/store/lv0wz2axhnvpk4zqkxhap5q1793x0n6l-graphics-drivers/ in the same phase, we get the same result.

ls: cannot access '/nix/store/lv0wz2axhnvpk4zqkxhap5q1793x0n6l-graphics-drivers/': No such file or directory

fox$ stat /nix/store/lv0wz2axhnvpk4zqkxhap5q1793x0n6l-graphics-drivers
  File: /nix/store/lv0wz2axhnvpk4zqkxhap5q1793x0n6l-graphics-drivers
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: 8,1     Inode: 20211451    Links: 5
Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-07-17 11:36:03.069431760 +0200
Modify: 1970-01-01 01:00:01.000000000 +0100
Change: 2025-07-17 11:36:03.079431980 +0200
 Birth: 2025-07-17 11:36:02.713423932 +0200

The permissions seem fine, and we can access other /nix/store paths without issues...
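
To reproduce the listing without rebuilding saxpy, a small probe derivation can be used. The following is only a sketch loosely modelled on the list_dev output above: the derivation name and the exact commands are assumptions, not the expression that produced those logs.

```nix
# Hypothetical probe derivation; build it with nix-build on the affected host.
{ pkgs ? import <nixpkgs> { } }:

pkgs.runCommand "list-dev"
  {
    # Ask the builder to apply the nvidia-gpu mounts from nix-required-mounts.
    requiredSystemFeatures = [ "cuda" ];
  }
  ''
    echo "#### CONTENTS OF /dev:"
    ls /dev
    echo "#### CONTENTS OF /run:"
    ls -l /run
    echo "#### CONTENTS OF /run/opengl-driver:"
    # Keep the build going even when the symlink target is not mounted.
    ls -l /run/opengl-driver/ || true
    touch $out
  ''
```

If the symlink target is missing from the sandbox, the last ls should report the same "No such file or directory" error as above while the build still succeeds.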

Things to try

  • programs.nix-required-mounts.allowedPatterns.nvidia-gpu.unsafeFollowSymlinks = true;
  • programs.nix-required-mounts.allowedPatterns.nvidia-gpu.paths = [ "/run/opengl-driver/*" ];
  • hardcode programs.nix-required-mounts.allowedPatterns.nvidia-gpu.paths with /run/opengl-driver/{lib,bin,include...}
  • breakpointHook to get a repl in the sandbox with cntr (https://ryantm.github.io/nixpkgs/hooks/breakpoint/)
  • give up and start a tomato farm
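
As a sketch, the first item could be written as the following NixOS module fragment. It assumes the nvidia-gpu preset is enabled as in #146; lib.mkForce is an assumption, used only in case the preset already defines this option, and can be dropped if there is no conflict. This has not been tested here.

```nix
{ lib, ... }:
{
  # Let nix-required-mounts resolve symlinks such as /run/opengl-driver and
  # mount their targets into the sandbox as well.
  programs.nix-required-mounts.allowedPatterns.nvidia-gpu.unsafeFollowSymlinks =
    lib.mkForce true;
}
```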
abonerib added the confignix labels 2025-07-18 20:38:26 +02:00
Owner

To make more packages available in the sandbox, add them to config.hardware.graphics.extraPackages or config.programs.nix-required-mounts.allowedPatterns.nvidia-gpu.paths. Currently:


nix-repl> :p nixosConfigurations.fox.config.hardware.graphics.extraPackages
[
  «derivation /nix/store/frzpf9jz7var510z96bnp68gqwb96b10-nvidia-x11-570.153.02-6.15.6.drv»
  «derivation /nix/store/bac7mbhj4nizfy40gjs0xj5s3lnifdg2-nvidia-vaapi-driver-0.0.13.drv»
]

nix-repl> :p nixosConfigurations.fox.config.programs.nix-required-mounts.allowedPatterns.nvidia-gpu.paths
[
  "/run/opengl-driver"
  "/dev/dri"
  "/dev/nvidia*"
  "/run/opengl-driver"
  «derivation /nix/store/4l4wafckfgp8k6axkzq6m36jvqij5npg-mesa-25.0.7.drv»
  «derivation /nix/store/frzpf9jz7var510z96bnp68gqwb96b10-nvidia-x11-570.153.02-6.15.6.drv»
  «derivation /nix/store/bac7mbhj4nizfy40gjs0xj5s3lnifdg2-nvidia-vaapi-driver-0.0.13.drv»
]

The /nix/store/lv0wz2axhnvpk4zqkxhap5q1793x0n6l-graphics-drivers/ store path seems to be the output of symlinkJoin over several graphics packages. We can find out where it is defined and add it to the list, but setting unsafeFollowSymlinks would probably do it.

Owner

This patch seems to fix it:

diff --git a/m/module/nvidia.nix b/m/module/nvidia.nix
index 553e08d..2ddbbce 100644
--- a/m/module/nvidia.nix
+++ b/m/module/nvidia.nix
@@ -12,4 +12,8 @@
   # > requiredSystemFeatures = [ "cuda" ];
   programs.nix-required-mounts.enable = true;
   programs.nix-required-mounts.presets.nvidia-gpu.enable = true;
+  # They forgot to add the symlink
+  programs.nix-required-mounts.allowedPatterns.nvidia-gpu.paths = [
+    config.systemd.tmpfiles.settings.graphics-driver."/run/opengl-driver"."L+".argument
+  ];
 }

There is no easy way to get the output path of the buildEnv call, so the patch reads it from the tmpfiles rule that creates the /run/opengl-driver symlink.

Owner

Fixed in #146 (https://jungle.bsc.es/git/rarias/jungle/pulls/146)
Reference: rarias/jungle#147