NixOS: Add support for CDI #284507
Conversation
The last commit contains more cleanup than is strictly needed: since we still support Docker < 25, we'll still need the container runtime approach, given that CDI support was only included in Docker 25. When the final PR is submitted I'll drop that cleanup commit.
I like how this cleanup makes it easier to keep track of all the nvidia dependencies.
One thing I'd like to add is that some sort of integration test would probably be a good idea. I'm not sure whether this is possible as they'd probably have to run in a NixOS VM though.
@SomeoneSerge might have some comments on the new layout of the package.nix/default.nix files.
Overall looks much simpler and cleaner than what we had
I feel a bit disoriented, but I think all of my remaining comments are mere nitpicks. We also outlined a few TODOs for the immediate future (correct handling of closures, decisions about /usr/bin, libc priorities, and whether to delete nvidia-container-toolkit-cdi-generate as a separate NixOS module).
I think I'm convinced enough by the public-facing side of the PR. I also think this should just work for most of the basic use-cases (the overlay experience may need more looking into; and yes, I haven't yet tested this branch myself). In the (hopefully unlikely) case this PR does break somebody's workflow, I am online and respond to github pings, and I expect @ereslibre to be available as well, so we can always discuss and fix stuff.
...Is it time?
Likewise here, I'm happy to take ownership of this bit.
So I just tested this on an instance with a T4 GPU on GCP. Seems to "just work" out of the box.
One thing I'm noticing is that the generated CDI file is not symlinked to /etc/cdi/nvidia.yaml as described in the NVIDIA docs. This doesn't seem to be an issue, but higher-level tooling (like the nvidia-operator for k8s) might have trouble with it. I'd expect this to be solvable on the side of said tooling, though, so this seems fine to me (and we can always change it later).
Fantastic work!
I just want to double-check a couple of things, answer all remaining conversations on this PR, and mark it as ready for review; sorry for this last wait. Final stretch :) Thanks a ton for your confirmation @aaronmondal :)
We'll have to check with the nvidia-operator and other components, but in principle runtimes should inspect both /etc/cdi and /var/run/cdi (according to the CDI spec, prioritizing /var/run/cdi over /etc/cdi). With this implementation, users can provide their static configuration, which is placed under /etc/cdi, while the dynamic configuration (only implemented for NVIDIA as of today) is placed under /run/cdi (to which /var/run is linked on NixOS).
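To make that precedence concrete: a minimal hand-written static CDI spec dropped under /etc/cdi might look like the following. This is a sketch following the CDI spec format with an illustrative `example.com/device` kind, not output from the nvidia-container-toolkit:

```json
{
  "cdiVersion": "0.5.0",
  "kind": "example.com/device",
  "devices": [
    {
      "name": "dev0",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/example0" }
        ]
      }
    }
  ]
}
```

A dynamically generated spec for the same device kind under /var/run/cdi would take precedence over a static file like this one.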
Everything seems to be fine; this is the only thing I cannot explain:
What exactly is the meaning of
Anyhow, I think we can move on and fix any issues that might appear due to this change. As I have stated before, I take ownership of this logic and I'm happy to address those issues.
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/docker-rootless-with-nvidia-support/37069/21
I just want to say thanks for the work. I was watching this issue closely, as I would benefit a lot from this. Thanks!
Likewise, I have been looking into this issue and just found this pull request. Thank you for all the work on it; greatly appreciated. I have linked this pull request from some Reddit questions about GPU use with Docker/Podman on NixOS, where others are trying to find their way through it. Many thanks!
Thanks a lot for this @stuzenz. Just a reminder that this feature was updated in subsequent PRs and that
Description of changes
This builds on top of #278969. It provides two main features to NixOS related to the Container Device Interface (CDI):

- `virtualisation.containers.cdi.static`: an attrset of JSON documents. This can be used to provide static, valid CDI JSON files that will be placed under `/etc/cdi`.
- `virtualisation.containers.cdi.dynamic.nvidia.enable`: a boolean. When enabled, a systemd unit uses the nvidia-container-toolkit to generate a CDI spec based on the current machine. The generated JSON file is patched to accommodate the NixOS host system and is placed under `/run/cdi/nvidia-container-toolkit.json`.

As per the CDI spec, CDI-compliant runtimes need to inspect `/etc/cdi` (static configuration) and `/var/run/cdi` (dynamic configuration), prioritizing the dynamic configuration over the static configuration in case of a clash. This implementation is therefore compliant with the specification.