nvidia-container-toolkit: do not shadow docker executable #330197

@ereslibre (Member) commented Jul 26, 2024

Description of changes

At this time, the nvidia-container-toolkit derivation installs a docker executable that shadows the main one. That executable is not meant to forward commands to the original docker binary, so users who have nvidia-container-toolkit in scope get a broken docker command when they try to call it.

Fixes: #293857

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.11 Release Notes (or backporting 23.11 and 24.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.


Comment on lines 59 to 64
subPackages = [
  "cmd/nvidia-ctk"
  "cmd/nvidia-container-runtime"
  "cmd/nvidia-container-runtime-hook"
];

Contributor:

I'm stupid, what does this do?

@ereslibre (Member, Author), Jul 26, 2024:

nvidia-container-toolkit has a lot of binaries in its tree. Some of them are under https://github.com/NVIDIA/nvidia-container-toolkit/tree/a470818ba7d9166be282cd0039dd2fc9b0a34d73/cmd. Others are under https://github.com/NVIDIA/nvidia-container-toolkit/tree/a470818ba7d9166be282cd0039dd2fc9b0a34d73/tools/container.

The binaries under tools/container are the ones that shadow docker. The underlying problem is that, despite being named docker or crio, they are not meant to replace the original binaries; they are named that way because each one sets up the configuration for the corresponding runtime.

I don't believe we need to expose that kind of tooling to users; NixOS should configure the given runtime through regular NixOS options. In my opinion, these tools are an implementation detail.

So with subPackages what we are telling buildGoModule is: instead of building all the binaries that you can find in this tree structure, just build cmd/nvidia-ctk, cmd/nvidia-container-runtime and cmd/nvidia-container-runtime-hook.

To be honest, I would even start with just cmd/nvidia-ctk and add more binaries as we find they are missing, but this seemed a reasonable starting point.
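To make the `subPackages` mechanism concrete, here is a trimmed-down, hypothetical `buildGoModule` expression. It is a sketch, not the actual nixpkgs derivation: the `pname`, the `v`-prefixed tag format, and both hash values are placeholders or assumptions.

```nix
# Hypothetical, trimmed-down derivation illustrating subPackages.
# pname, the rev format and both hashes are placeholders/assumptions.
{ lib, buildGoModule, fetchFromGitHub }:

buildGoModule rec {
  pname = "nvidia-container-toolkit-sketch"; # hypothetical name
  version = "1.15.0-rc.3";

  src = fetchFromGitHub {
    owner = "NVIDIA";
    repo = "nvidia-container-toolkit";
    rev = "v${version}"; # assumption: upstream tags are prefixed with "v"
    hash = lib.fakeHash; # placeholder: the first build reports the real hash
  };

  # Assumption: upstream vendors its Go dependencies in ./vendor,
  # so no separate vendor hash is needed.
  vendorHash = null;

  # By default buildGoModule builds every package in the module, which
  # is how tools/container/* (including the "docker" configurator)
  # ended up in $out/bin. Listing subPackages restricts the build to
  # these directories only.
  subPackages = [
    "cmd/nvidia-ctk"
    "cmd/nvidia-container-runtime"
    "cmd/nvidia-container-runtime-hook"
  ];
}
```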

@ereslibre (Member, Author):

This is the list of binaries produced by nix-build -A nvidia-docker before this change:

❯ tree result/bin
result/bin
├── containerd -> /nix/store/9933i2f5wnj6f994isn242vdnbzdry62-container-toolkit-container-toolkit-1.15.0-rc.3/bin/containerd
├── crio -> /nix/store/9933i2f5wnj6f994isn242vdnbzdry62-container-toolkit-container-toolkit-1.15.0-rc.3/bin/crio
├── docker -> /nix/store/9933i2f5wnj6f994isn242vdnbzdry62-container-toolkit-container-toolkit-1.15.0-rc.3/bin/docker
├── nvidia-container-cli -> /nix/store/570j23jms28cjb603a7hy695nmibfi0n-libnvidia-container-1.9.0/bin/nvidia-container-cli
├── nvidia-container-runtime -> /nix/store/9933i2f5wnj6f994isn242vdnbzdry62-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-container-runtime
├── nvidia-container-runtime.cdi -> /nix/store/9933i2f5wnj6f994isn242vdnbzdry62-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-container-runtime.cdi
├── nvidia-container-runtime-hook -> /nix/store/9933i2f5wnj6f994isn242vdnbzdry62-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-container-runtime-hook
├── nvidia-container-runtime.legacy -> /nix/store/9933i2f5wnj6f994isn242vdnbzdry62-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-container-runtime.legacy
├── nvidia-container-toolkit -> /nix/store/9933i2f5wnj6f994isn242vdnbzdry62-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-container-toolkit
├── nvidia-ctk -> /nix/store/9933i2f5wnj6f994isn242vdnbzdry62-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-ctk
├── nvidia-docker -> /nix/store/lc86iqnadxhc54n34n4g9rd94li39ls8-nvidia-docker-2.5.0/bin/nvidia-docker
├── nvidia-toolkit -> /nix/store/9933i2f5wnj6f994isn242vdnbzdry62-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-toolkit
└── toolkit -> /nix/store/9933i2f5wnj6f994isn242vdnbzdry62-container-toolkit-container-toolkit-1.15.0-rc.3/bin/toolkit

And after this change:

❯ tree result/bin
result/bin
├── nvidia-container-cli -> /nix/store/570j23jms28cjb603a7hy695nmibfi0n-libnvidia-container-1.9.0/bin/nvidia-container-cli
├── nvidia-container-runtime -> /nix/store/iw49hbrnnxg2pf1ynha35gn96mgpg4xx-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-container-runtime
├── nvidia-container-runtime-hook -> /nix/store/iw49hbrnnxg2pf1ynha35gn96mgpg4xx-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-container-runtime-hook
├── nvidia-container-toolkit -> /nix/store/iw49hbrnnxg2pf1ynha35gn96mgpg4xx-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-container-toolkit
├── nvidia-ctk -> /nix/store/iw49hbrnnxg2pf1ynha35gn96mgpg4xx-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-ctk
└── nvidia-docker -> /nix/store/lc86iqnadxhc54n34n4g9rd94li39ls8-nvidia-docker-2.5.0/bin/nvidia-docker

Contributor:

So instead of moving the runtime wrappers to a separate output or prefix, this removes them entirely. I have no idea whether the wrappers still have any users, so I think we should move them instead.

@ereslibre (Member, Author), Jul 26, 2024:

Please note that the only wrapper is nvidia-docker. The rest (containerd, crio, docker) are helper binaries that configure the runtime they are named after: tools that make imperative changes on the host to set up the runtime wrapper, not wrappers themselves.
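For contrast with those imperative helper tools, here is a minimal sketch of the declarative route mentioned earlier in this thread, where NixOS options rather than the docker/containerd/crio configurators set up GPU access. It assumes a NixOS release that ships the `hardware.nvidia-container-toolkit.enable` option (24.05 or later) and omits the NVIDIA driver configuration; it is an illustration, not the configuration introduced by this PR.

```nix
# Sketch of a NixOS module; assumes NixOS 24.05+ and that the NVIDIA
# driver is already configured elsewhere (not shown here).
{ ... }:
{
  virtualisation.docker.enable = true;

  # Assumption: this option generates a CDI specification for the
  # host's NVIDIA GPUs, so no runtime configurator binary has to
  # rewrite the Docker configuration imperatively.
  hardware.nvidia-container-toolkit.enable = true;
}
```

With a CDI-capable Docker (25 or later), a container would then request GPUs by CDI device name, for example `docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi`; depending on the Docker version, CDI support may also need to be enabled in the daemon settings.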

@ereslibre (Member, Author):

Marking this PR as draft until I have double checked everything I want.

ofborg (bot) requested a review from cpcloud, July 26, 2024 16:14
ofborg (bot) added the labels 10.rebuild-darwin: 0 (This PR does not cause any packages to rebuild on Darwin) and 10.rebuild-linux: 1-10, Jul 26, 2024
ereslibre marked this pull request as draft, July 26, 2024 16:18
Although CDI should be used in order to not require container runtime
wrappers anymore, fix the nvidia-container-runtime integration with
Docker for cases when Docker < 25.
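Docker releases older than 25 cannot consume CDI specifications, so GPU access there still goes through the nvidia-container-runtime wrapper registered as a Docker runtime. The sketch below shows one way to register it declaratively through Docker's `runtimes` daemon setting; the `pkgs.nvidia-container-toolkit` attribute and the runtime name `nvidia` are assumptions, and this is not necessarily what the commit above changes.

```nix
# Sketch only: registers nvidia-container-runtime as an extra Docker
# runtime for Docker < 25. The package attribute below is an assumption.
{ pkgs, ... }:
{
  virtualisation.docker = {
    enable = true;
    # Rendered into Docker's daemon.json as
    # { "runtimes": { "nvidia": { "path": "..." } } }.
    daemon.settings.runtimes.nvidia.path =
      "${pkgs.nvidia-container-toolkit}/bin/nvidia-container-runtime";
  };
}
```

Containers would then opt in to the wrapper with `docker run --runtime=nvidia ...`.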
ereslibre closed this, Jul 29, 2024
ereslibre deleted the fix-nvidia-container-toolkit-docker-contamination branch, July 29, 2024 22:10
github-actions (bot) added the labels 6.topic: nixos (Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS) and 8.has: module (update) (This PR changes an existing module in `nixos/`), Jul 29, 2024
@ereslibre (Member, Author):

Ouch, deleted the branch by mistake. Opening another PR since I cannot reopen this one.

Successfully merging this pull request may close these issues:
nvidia-podman: contaminates PATH with a fake "docker" executable; breaks docker-compose