apptainer: unbreak --nv #279235
Conversation
Sounds like a challenging task. Would love to see it work!
Cc: @misumisumi #230851
Force-pushed 8a43c37 to ea35813, then ea35813 to f24b311.
- ...this way we expose and allow overriding the symlinkJoin constituent components. Also make configTemplate optional
- ...not the deprecated nvidia-docker wrapper
- ...update the rejected patches
Force-pushed f24b311 to 680bbed.
@ShamrockLee @misumisumi I'd appreciate it if you could run some tests at this point. It'd also be helpful if somebody could test podman/containerd/etc., though again, it's probably more efficient to split
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/using-nvidia-container-runtime-with-containerd-on-nixos/27865/30
This currently breaks `virtualisation.docker.enableNvidia`:

```
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: ldcache error: open failed: /sbin/ldconfig: No such file or directory. ...
```

...the cause is likely in the update, rather than in the patches
The new error from `docker run --runtime nvidia --rm ...` is

```
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: /var/lib/docker/runtimes/nvidia did not terminate successfully: exit status 125: unknown.
```

The only thing I get in the journal (with `debug: true` in the `docker.daemon.settings`, and `--debug` in `runtimes.nvidia.runtimeArgs`) is a random CLI tool's `--help`, I think: https://gist.github.com/SomeoneSerge/e6fbe4747d7ea4f6f5de2b9477c5f6d2. I tried grepping for parts of this message (including `failed to read init pid file`) and for `/var/lib/docker/runtimes/nvidia`, but found no matches in libnvidia-container, nvidia-container-toolkit, runc, or moby.

If anybody knows how to make docker, runc, or nvidia-container-runtime write logs, please say so. Otherwise I'm stuck and will probably proceed by merging just the apptainer bits
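On the logging question: one knob that might help (an assumption on my part, going by the upstream nvidia-container-toolkit defaults, not something I've verified on NixOS, where the module generates this file itself) is the `debug` keys in the runtime's `config.toml`:

```toml
# Upstream default location: /etc/nvidia-container-runtime/config.toml
# (on NixOS the module generates this file; the paths below are illustrative).

[nvidia-container-cli]
# libnvidia-container (nvidia-container-cli) debug log
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
# nvidia-container-runtime shim debug log
debug = "/var/log/nvidia-container-runtime.log"
```

If the shim dies before even reading its config, though, these won't produce anything either.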
GitHub code search shows the message could've come from containerd. I thought containerd was somehow separate from docker? Anyway, hoping someone more experienced with the OCI stack can handle this
Actually, this PR didn't introduce the breakage. It turns out I hadn't noticed `docker run --gpus all` failing on me on nixos-unstable already.
Nonetheless, I removed the updates and opened #280087.

CC @jmbaur: just to let you know about this PR before you update the libcontainer-15 draft.
This is bad scoping. I split out the apptainer changes into #280076. I'll see if I have the bandwidth for libnvidia-docker (as I mentioned, I fail to retrieve any logs from the runtime, and I'm not familiar with the ecosystem)
Description of changes
This is work in progress.
Shortcut to the first error:
Progress so far
- `nvidia-container --list` and `nvidia-container info`
Still suffering from:
- `apptainer --nv --nvccli` still doesn't mount any of the libraries from `/run/opengl-driver/lib`, even though nvidia-container-cli is being called, and its `ldcache_resolve` does hit them
- the `nvliblist` method is broken because we still haven't patched out the ldconfig abuse
- Running `nvidia-container-cli --debug --user configure ...` from the shell (outside singularity) still fails, e.g.:
- It's still unclear to me if nvidia-container-cli was ever meant to be used by unprivileged users
- libnvc's logs (`warnx`, etc.) aren't visible; I don't even think apptainer eats them, I think they're never actually printed. The `NVC_DEBUG_FILE` is also ignored. Essentially, `error_setx`
is the only way I found so far to get any logs out of nvidia-container-cli

Motivation
Apptainer's and singularity's GPU support is (mostly) broken again. Note that singularity+setuid used to work (with the fake ldconfig). Recap:
- they invoke ldconfig (`ldconfig -P`) hoping to scan for the sonames listed in `etc/nvliblist.conf`. This isn't new, but we should add a patch making them support at least NixOS (ideally we need a generic POSIX solution accepted upstream); tracking upstream as well: "--nv: make ldconfig optional in the nvliblist.conf" apptainer/apptainer#1894
- `--nvccli` fails unless the setuid bit is unset
- `--nv` aborts unless the image is writable, but doesn't explain why it needs the write access
- `--nvccli` uses nvidia-container-toolkit, which abuses ld.so.cache, obscurely fails trying to manage capabilities, but also silently fails to copy or mount the host libraries into the container (even if `ldcache_resolve` succeeds in locating them)

Naturally, we need to contact apptainer, singularityce, and libnvidia-container about these issues. Ideally, they would reach out to the glibc maintainers and ask what would have been the reasonable approach to locating the host libraries.
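The ldconfig dependency is the crux here: on NixOS there is no `/sbin/ldconfig` and no populated `/etc/ld.so.cache`, so anything resolving sonames through the cache comes up empty. A generic POSIX fallback would be to scan candidate library directories for the sonames directly. The sketch below is purely illustrative (the temp directory and two-entry soname list are stand-ins, not apptainer's actual `etc/nvliblist.conf` handling):

```shell
#!/bin/sh
# Illustrative sketch: resolve sonames by scanning a directory instead of
# parsing `ldconfig -p` output. The temp directory stands in for a host
# library path such as /run/opengl-driver/lib on NixOS.
set -eu

libdir=$(mktemp -d)
# fake "host driver" libraries
touch "$libdir/libcuda.so.1" "$libdir/libnvidia-ml.so.1" "$libdir/unrelated.so"

# two-entry stand-in for the sonames listed in etc/nvliblist.conf
for soname in libcuda.so libnvidia-ml.so; do
  for f in "$libdir/$soname"*; do
    # an unmatched glob expands to itself, hence the -e check
    [ -e "$f" ] && echo "found: $(basename "$f")"
  done
done

rm -r "$libdir"
```

This sidesteps the cache entirely at the cost of the caller having to know which directories to scan, which is exactly the information NixOS exposes via `/run/opengl-driver/lib`.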
Things done
- Built with sandbox set in `nix.conf`? (See Nix manual)
  - `sandbox = relaxed`
  - `sandbox = true`
- Tested, as applicable: `nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"`. Note: all changes have to be committed, also see nixpkgs-review usage
- Tested basic functionality of all binary files (usually in `./result/bin/`)

CC @ShamrockLee