Fix nvidia container toolkit docker contamination #331071

ereslibre
Member

@ereslibre ereslibre commented Jul 30, 2024

Description of changes

This PR implements three main changes, described in the three commit messages further down.

Things done

Tested with the following NixOS options:

virtualisation = {
  containers.enable = true;
  podman.enable = true;
  docker = {
    enable = true;
    # Docker 25 allows the CDI implementation while still making it possible to use the nvidia runtime wrappers.
    package = pkgs.docker_25;
    # For the runtime wrappers/non-CDI case...
    enableNvidia = true;
  };
};
hardware = {
  # Required by the runtime wrappers
  graphics.enable32Bit = true;
  # CDI generation
  nvidia-container-toolkit.enable = true;
};

The following works as expected:

  • Podman
# CDI
❯ podman run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi
Wed Jul 31 08:27:17 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 555.58.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:41:00.0 Off |                  Off |
| 61%   49C    P8             17W /  450W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 ...    On  |   00000000:61:00.0 Off |                  N/A |
|  0%   53C    P8             35W /  250W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

# Wrapper
# Does not apply. CDI is recommended.
  • Docker
# CDI
❯ docker run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi
Wed Jul 31 08:28:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 555.58.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:41:00.0 Off |                  Off |
| 48%   50C    P8             17W /  450W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 ...    On  |   00000000:61:00.0 Off |                  N/A |
|  0%   54C    P8             35W /  250W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

# Wrapper
❯ docker run --rm -it --runtime=nvidia --gpus=all ubuntu:latest nvidia-smi
Wed Jul 31 08:28:43 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 555.58.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:41:00.0 Off |                  Off |
| 43%   49C    P8             16W /  450W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 ...    On  |   00000000:61:00.0 Off |                  N/A |
|  0%   54C    P8             35W /  250W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Fixes: #293857
Fixes: #322400

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.11 Release Notes (or backporting 23.11 and 24.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@github-actions github-actions bot added the labels "6.topic: nixos" (Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS) and "8.has: module (update)" (This PR changes an existing module in `nixos/`) Jul 30, 2024
@ofborg ofborg bot requested a review from cpcloud July 30, 2024 13:21
@ofborg ofborg bot added the labels "10.rebuild-darwin: 0" (This PR does not cause any packages to rebuild on Darwin) and "10.rebuild-linux: 1-10" Jul 30, 2024
@ereslibre ereslibre force-pushed the fix-nvidia-container-toolkit-docker-contamination branch 6 times, most recently from 8ff9584 to a7c05c5 Compare July 31, 2024 08:09
@ereslibre ereslibre requested a review from SomeoneSerge July 31, 2024 08:32
@ereslibre ereslibre marked this pull request as ready for review July 31, 2024 08:39
@ereslibre ereslibre force-pushed the fix-nvidia-container-toolkit-docker-contamination branch from a7c05c5 to f7dcc2d Compare August 2, 2024 10:23
@ereslibre ereslibre marked this pull request as draft August 2, 2024 10:23
@ereslibre ereslibre force-pushed the fix-nvidia-container-toolkit-docker-contamination branch from f7dcc2d to 176a8c9 Compare August 16, 2024 10:41
@ereslibre ereslibre marked this pull request as ready for review August 16, 2024 10:42
@ereslibre ereslibre requested a review from SomeoneSerge August 16, 2024 10:42
At this time, the nvidia-container-toolkit derivation installs a
docker executable that shadows the main one and is not designed to
forward commands to the original docker command. This causes problems
for users who have `nvidia-container-toolkit` in scope and try to
call `docker`.
@ereslibre ereslibre force-pushed the fix-nvidia-container-toolkit-docker-contamination branch from 176a8c9 to af0c5aa Compare August 16, 2024 13:46
Although CDI should be used so that container runtime wrappers are
no longer required, fix the nvidia-container-runtime integration with
Docker for Docker versions older than 25.
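
For readers who only need the wrapper (non-CDI) path that this commit message refers to, the fragment below is a minimal sketch reduced from the configuration tested in the description. It is an assumption that this subset is sufficient on its own; it has not been tested in isolation, and it assumes the NVIDIA driver is already configured elsewhere in the system configuration.

```nix
{
  virtualisation.docker = {
    enable = true;
    # Runtime wrappers / non-CDI case, as in the tested configuration.
    enableNvidia = true;
  };
  # Marked as required by the runtime wrappers in the tested configuration.
  hardware.graphics.enable32Bit = true;
}
```

With the wrappers in place, the invocation from the description is `docker run --rm -it --runtime=nvidia --gpus=all ubuntu:latest nvidia-smi`.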
@ereslibre ereslibre force-pushed the fix-nvidia-container-toolkit-docker-contamination branch from af0c5aa to 0f9a29a Compare August 16, 2024 13:49
Podman has supported CDI since version 4.1.0, and CDI is the
recommended way to expose GPUs to Podman containers.

More information: https://web.archive.org/web/20240729183805/https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-podman
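
As an illustration of the CDI-only Podman setup this commit message recommends, the fragment below is a minimal sketch reduced from the options tested in the description. Treat it as an assumption rather than a verified configuration: it has not been tested in isolation, and it assumes the NVIDIA driver is already configured elsewhere in the system configuration.

```nix
{
  # Podman with CDI only; no runtime wrappers are involved.
  virtualisation = {
    containers.enable = true;
    podman.enable = true;
  };
  # CDI generation, as in the tested configuration.
  hardware.nvidia-container-toolkit.enable = true;
}
```

The GPUs are then requested per container by CDI device name, as in the description: `podman run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi`.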
@ereslibre ereslibre force-pushed the fix-nvidia-container-toolkit-docker-contamination branch from 0f9a29a to 058e8f5 Compare August 16, 2024 16:16
@ereslibre ereslibre requested a review from SomeoneSerge August 16, 2024 18:32
@SomeoneSerge
Contributor

@ofborg build nvidia-docker

@ereslibre
Member Author

@SomeoneSerge: is this good to merge in its current state?

@SomeoneSerge SomeoneSerge merged commit 4c930c0 into NixOS:master Aug 22, 2024
31 of 32 checks passed
Contributor

Backport failed for release-24.05, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release-24.05
git worktree add -d .worktree/backport-331071-to-release-24.05 origin/release-24.05
cd .worktree/backport-331071-to-release-24.05
git switch --create backport-331071-to-release-24.05
git cherry-pick -x df2df4c3a61bb110694e00cde38d871c7761bc08 f7b4d57421d67baf096e0b46168699e774067812 058e8f5ef11cd5291cf03aa4a886e475cf7bdd71

@ereslibre ereslibre deleted the fix-nvidia-container-toolkit-docker-contamination branch August 22, 2024 21:16
Labels
  • 6.topic: nixos (Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS)
  • 8.has: clean-up
  • 8.has: module (update) (This PR changes an existing module in `nixos/`)
  • 10.rebuild-darwin: 0 (This PR does not cause any packages to rebuild on Darwin)
  • 10.rebuild-linux: 1-10