
nvidia-podman: contaminates PATH with a fake "docker" executable; breaks docker-compose #293857

Closed
teto opened this issue Mar 6, 2024 · 26 comments · Fixed by #331071

@teto
Member

teto commented Mar 6, 2024

Describe the bug

I've had errors such as

$ docker compose
No help topic for 'compose'
$ docker-compose 
...works

I thought it was this https://stackoverflow.com/questions/66514436/difference-between-docker-compose-and-docker-compose when I first found out about it, but that was a red herring; some nixpkgs spelunking showed we have been using the Go-based compose plugin for a while now.

Investigating further, I realized that it was because I had the nvidia-podman program installed, which contains a docker executable that is nothing like pkgs.docker. This is confusing and I suppose a mistake. Can we rename it to something else?
cc @SomeoneSerge

On nixos-unstable.

@teto added the 0.kind: bug (Something is broken) label on Mar 6, 2024
@SomeoneSerge
Contributor

SomeoneSerge commented Mar 6, 2024

... nvidia-podman program ... contains a docker executable ... that is nothing like pkgs.docker. This is confusing and I suppose a mistake. Can we rename it to something else ?

My understanding is that the nvidia "runtime wrappers" have been broken for a few months now. Note also that at least nvidia-docker is deprecated by NVIDIA. A few PRs were recently merged that updated and moved around a bunch of stuff (libnvidia-container, nvidia-container-toolkit, the apptainer integration, etc.). The runtime wrappers stayed pretty much as they were, i.e. broken to the best of my knowledge. They're also scheduled for removal.

Recently #284507 was merged, which offers a better alternative that works and is also recommended by upstream. It's only available in nixos-unstable for now (but the next release is in two months). Basically, I doubt it'd be cost-efficient to start fixing the wrappers.

nvidia-podman program ... contains a docker executable

❯ nix build .#nvidia-podman
❯ ./result/bin/docker --help
NAME:
   docker - Update docker config with the nvidia runtime
...
❯ ls -l ./result/bin/docker 
lrwxrwxrwx 2 root root 102 Jan  1  1970 ./result/bin/docker -> /nix/store/cwkiall27a7wg4m07qqx92fs4in5nsfw-container-toolkit-container-toolkit-1.15.0-rc.3/bin/docker

This seems to be something generated by nvidia-container-toolkit, but I'm not sure. CC @aaronmondal @ereslibre

@teto
Member Author

teto commented Mar 6, 2024

I don't know enough about the space to propose a solution. If the package is broken, maybe mark it as broken.
My inner child would rename the executable in postInstall: mv $out/bin/docker $out/bin/docker-but-not-docker.
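
For illustration, a rough sketch of that workaround as an overlay (untested; it assumes the offending binary actually ends up in $out/bin of pkgs.nvidia-container-toolkit, and the new name is made up):

  final: prev: {
    nvidia-container-toolkit = prev.nvidia-container-toolkit.overrideAttrs (old: {
      # hypothetical rename so the config hook no longer shadows the real docker CLI
      postInstall = (old.postInstall or "") + ''
        mv $out/bin/docker $out/bin/nvidia-ctk-docker-hook
      '';
    });
  }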

NB: Thanks for all you do for nixpkgs. I've been looking into LLM + nvidia and your PRs keep coming up.

@SomeoneSerge changed the title from “No help topic for 'compose' caused by nvidia-podman bringing in a fake docker” to “nvidia-podman: contaminates PATH with a fake "docker" executable; breaks docker-compose” on Mar 6, 2024
@SomeoneSerge
Contributor

My inner child would rename the executable in postInstall mv $out/bin/docker $out/bin/docker-but-not-docker.

I'm not sure if what they do is intentional

@ereslibre
Member

ereslibre commented Mar 7, 2024

My inner child would rename the executable in postInstall mv $out/bin/docker $out/bin/docker-but-not-docker.

I am also not super familiar with this specific bit, but from looking at it, it seems to be a helper binary for setting up Docker support for the nvidia-container-toolkit (there are also other helpers, like containerd and crio). I believe we can get rid of all of them on NixOS, given that they perform changes at the system level, and on a NixOS system I'd expect all of this to be handled by the declarative configuration.

@SomeoneSerge
Contributor

I believe we can get rid of all of them on NixOS

Note that nvidia-container-toolkit is part of nixpkgs and may be used outside NixOS. We should maybe split the outputs (install nvidia-ctk in the default output, and docker and the others in a separate one). How do we ensure there aren't any (possibly relative) references to ${nvidia-container-toolkit}/bin/docker that would be broken by the renaming?
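
A rough way to look for such references in a nixpkgs checkout (not exhaustive, since paths can also be constructed programmatically):

$ grep -rn 'nvidia-container-toolkit.*/bin/docker' pkgs nixos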

@ereslibre
Member

I agree with splitting the outputs: keep nvidia-ctk in the default output, and put the rest in one or more separate outputs.

@sophronesis

same issue here, unable to run nvidia-docker

 ~/➤ nvidia-docker
NAME:
   docker - Update docker config with the nvidia runtime

USAGE:
   docker [global options] command [command options] [arguments...]

VERSION:
   1.15.0-rc.3

COMMANDS:
   setup    Trigger docker config to be updated
   cleanup  Trigger any updates made to docker config to be undone
   help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --help, -h     show help (default: false)
   --version, -v  print the version (default: false)

@ereslibre
Member

same issue here, unable to run nvidia-docker

What version of nixpkgs/NixOS are you on? What do you want to achieve by running nvidia-docker manually?

In the nixpkgs tree we still have Docker 24 as the default, but once we can assume Docker 25 as the minimum version, we can get rid of all these hooks in favor of the CDI implementation.

@sophronesis

sophronesis commented Jun 19, 2024

What version of nixpkgs/NixOS are you on?

nixos-unstable (24.11)

What do you want to achieve by running nvidia-docker manually?

run gpu-powered container for deep learning related stuff

docker version

Docker version 24.0.9, build v24.0.9

@ereslibre
Member

ereslibre commented Jun 19, 2024

nixos-unstable (24.11)
run gpu-powered container for deep learning related stuff
Docker version 24.0.9, build v24.0.9

In this case you can set the NixOS options hardware.nvidia-container-toolkit.enable = true and virtualisation.docker.package = pkgs.docker_25. This will automatically configure Docker to use the CDI spec generated by nvidia-container-toolkit.

Then you will be able to do docker run --device=nvidia.com/gpu=all ....
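
A minimal sketch of that configuration as a NixOS module (option names as used on nixos-unstable at the time of writing; adapt to your setup):

  { pkgs, ... }:
  {
    hardware.nvidia-container-toolkit.enable = true;
    virtualisation.docker = {
      enable = true;
      # CDI device requests (--device=nvidia.com/gpu=...) need Docker >= 25
      package = pkgs.docker_25;
    };
  }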

@jl1990

jl1990 commented Jun 20, 2024

Hi,

First of all thank you all for the help.

I tried the proposed solution, but it doesn't work for me even after including your suggested changes.

I have this config:

  hardware.nvidia-container-toolkit.enable = true;
  hardware.nvidia-container-toolkit.mount-nvidia-executables = false;
  virtualisation = {
    docker = {
      enable = true;
      package = pkgs.docker_25;
      enableOnBoot = true;
      extraOptions = "--default-runtime=nvidia";
      rootless = {
        enable = true;
        setSocketVariable = true;
      };
    };
  };

When I try to run an nvidia docker with the following command:

docker run --runtime=nvidia \
--gpus all --rm \
...

If I have the nvidia-container-toolkit package installed I get:

No help topic for 'run'

and if I don't have it installed I get:

docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.

The problem seems to be that when you install nvidia-container-toolkit, its additional binaries shadow the docker command, as the original post mentioned...

@ereslibre
Member

ereslibre commented Jun 20, 2024

When I try to run an nvidia docker with the following command:

Please note that with CDI it's --device=nvidia.com/gpu=all (or 0, 1, ...), not --gpus; --gpus was the flag for the old runtime wrappers. You should also remove --default-runtime=nvidia from extraOptions. With CDI, runtime wrappers are no longer necessary for exposing GPUs or other hardware to containers, which is what will effectively allow us to clean these wrappers up once we can get rid of Docker 24.
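
In other words, roughly:

# old, runtime-wrapper style (not needed with CDI):
$ docker run --rm --gpus all ubuntu:latest nvidia-smi
# new, CDI style:
$ docker run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi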

Also, I recommend setting hardware.nvidia-container-toolkit.mount-nvidia-executables = true; at least while you are figuring out whether it works. With this, you can try docker run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi.

Try to use the binary from Docker directly, not the wrapper that podman installs.

In any case, we should not install a docker binary that gets in the way of the user. I'll have a look at a fix we can apply while we figure out the Docker 24 -> Docker 25 migration and the runtime wrapper removal.

@sophronesis

nixos-unstable (24.11)
run gpu-powered container for deep learning related stuff
Docker version 24.0.9, build v24.0.9

In this case you can set the NixOS options hardware.nvidia-container-toolkit.enable = true and virtualisation.docker.package = pkgs.docker_25. This will automatically configure Docker to use the CDI spec generated by nvidia-container-toolkit.

Then you will be able to do docker run --device=nvidia.com/gpu=all ....

Here is what I'm getting:

 ~/➤ docker run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi
docker: nvidia.com/gpu=all is not an absolute path.
See 'docker run --help'.

@ereslibre
Member

ereslibre commented Jun 21, 2024

@sophronesis I can reproduce this issue when virtualisation.docker.package = pkgs.docker_25 is not set and Docker 24 is used.

Please, let's keep this issue focused on the docker CLI contamination, and open a new issue if you face a different problem.

@jl1990

jl1990 commented Jun 22, 2024

When I try to run an nvidia docker with the following command:

Please note that with CDI it's --device=nvidia.com/gpu=all (or 0, 1, ...), not --gpus; --gpus was the flag for the old runtime wrappers. You should also remove --default-runtime=nvidia from extraOptions. With CDI, runtime wrappers are no longer necessary for exposing GPUs or other hardware to containers, which is what will effectively allow us to clean these wrappers up once we can get rid of Docker 24.

Also, I recommend setting hardware.nvidia-container-toolkit.mount-nvidia-executables = true; at least while you are figuring out whether it works. With this, you can try docker run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi.

Try to use the binary from Docker directly, not the wrapper that podman installs.

In any case, we should not install a docker binary that gets in the way of the user. I'll have a look at a fix we can apply while we figure out the Docker 24 -> Docker 25 migration and the runtime wrapper removal.

Unfortunately no luck yet:

Config:

  hardware.nvidia-container-toolkit.enable = true;
  hardware.nvidia-container-toolkit.mount-nvidia-executables = true;
  virtualisation = {
    docker = {
      enable = true;
      package = pkgs.docker_25;
      enableOnBoot = true;
      rootless = {
        enable = true;
        setSocketVariable = true;
      };
    };
  };

With nvidia-container-toolkit:

➜  mysystem git:(main) ✗ docker run --device=nvidia.com/gpu=all --rm ubuntu:latest nvidia-smi
No help topic for 'run'
mysystem git:(main) ✗ /nix/store/5r352v296svf9phfad6ga6qxn7m5kbmg-docker-25.0.5/bin/docker run --device=nvidia.com/gpu=all --rm ubuntu:latest nvidia-smi
docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].

Without nvidia-container-toolkit:

➜  mysystem git:(main) ✗ docker run --device=nvidia.com/gpu=all --rm ubuntu:latest nvidia-smi
docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].
➜  mysystem git:(main) ✗ /nix/store/5r352v296svf9phfad6ga6qxn7m5kbmg-docker-25.0.5/bin/docker run --device=nvidia.com/gpu=all --rm ubuntu:latest nvidia-smi
docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].

Not sure what can be the issue...

@ahirner
Contributor

ahirner commented Jun 22, 2024

@jl1990 I'm facing quite a similar issue with quite a similar config. What's your output of this?

$ journalctl -b | grep nvidia
Jun 22 14:47:56 barley nvidia-cdi-generator[958]: time="2024-06-22T14:47:56+02:00" level=info msg="Auto-detected mode as \"nvml\""
Jun 22 14:47:56 barley nvidia-cdi-generator[958]: time="2024-06-22T14:47:56+02:00" level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_LIBRARY_NOT_FOUND"
Jun 22 14:48:54 barley nvidia-powerd[5421]: nvidia-powerd version:1.0(build 1)
Jun 22 14:48:54 barley nvidia-powerd[5421]: Error open pid file 13

@ahirner
Contributor

ahirner commented Jun 22, 2024

OK, during the 24 upgrade I removed something that's still needed for the kernel to load the modules:

services.xserver.videoDrivers = ["nvidia"];
hardware.nvidia.nvidiaPersistenced = false;

nvidia-smi sees the GPUs on the host and with sudo docker run, despite:

  virtualisation = {
    docker = {
      enable = true;
      # https://github.com/NixOS/nixpkgs/issues/293857#issuecomment-2177935545
      package = pkgs.docker_25;
      enableOnBoot = true;
      rootless = {
        enable = true;
        setSocketVariable = true;
      };
...

@ereslibre
Member

Not sure what can be the issue...

The issue here is that you are calling the wrapper docker (the docker that is not docker). Your configuration is likely correct, but you have to call the real docker CLI by its full /nix store path until we fix this issue.
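
A quick way to check which docker you are actually invoking (plain shell, nothing NixOS-specific):

$ command -v docker
$ readlink -f "$(command -v docker)"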

@jl1990

jl1990 commented Jun 22, 2024

Not sure what can be the issue...

The issue here is that you are calling the wrapper docker (the docker that is not docker). Your configuration is likely correct, but you have to call the real docker CLI by its full /nix store path until we fix this issue.

Yes, I think this is what I posted in my last response, right?

➜  mysystem git:(main) ✗ /nix/store/5r352v296svf9phfad6ga6qxn7m5kbmg-docker-25.0.5/bin/docker run --device=nvidia.com/gpu=all --rm ubuntu:latest nvidia-smi
docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].

It doesn't work for me either.

@ereslibre
Member

@jl1990 Also, depending on your version of nixpkgs, your error could be fixed by #305312 (comment).

@SomeoneSerge
Contributor

SomeoneSerge commented Jun 22, 2024

@jl1990 @ahirner Please use NixOS Discourse or Matrix for support and questions.
@teto @ereslibre We need to figure out if upstream ever actually intended to deploy their wrapper as $prefix/bin/docker. AFAIU this "issue" might actually be the correct behaviour.

@ereslibre
Member

@teto @ereslibre We need to figure out if upstream ever actually intended to deploy their wrapper as $prefix/bin/docker. AFAIU this "issue" might actually be the correct behaviour.

I might be missing something, because the history is not completely clear to me, but I think it was never intended to replace the Docker CLI; rather, it was meant to be installed as a runtime wrapper, configured in /etc/docker/daemon.json (https://github.com/NVIDIA/nvidia-docker/tree/2bf9c3455d030af0d47cb980262e75154017cd65).
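
For context, that wrapper-based setup registers nvidia as an additional runtime in /etc/docker/daemon.json, roughly like this (the shape used by the old nvidia-docker2 packaging; the path is illustrative):

  {
    "runtimes": {
      "nvidia": {
        "path": "nvidia-container-runtime",
        "runtimeArgs": []
      }
    }
  }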

@jl1990

jl1990 commented Jun 23, 2024

@jl1990 I'm facing quite a similar issue with quite a similar config. What's your output of this?

$ journalctl -b | grep nvidia
Jun 22 14:47:56 barley nvidia-cdi-generator[958]: time="2024-06-22T14:47:56+02:00" level=info msg="Auto-detected mode as \"nvml\""
Jun 22 14:47:56 barley nvidia-cdi-generator[958]: time="2024-06-22T14:47:56+02:00" level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_LIBRARY_NOT_FOUND"
Jun 22 14:48:54 barley nvidia-powerd[5421]: nvidia-powerd version:1.0(build 1)
Jun 22 14:48:54 barley nvidia-powerd[5421]: Error open pid file 13

The result is quite large, so I uploaded it to Pastebin: https://pastebin.com/4fNeETeR

@jl1990 also depending on your version of nixpkgs your error could be fixed by #305312 (comment)

Thanks for the help. I still get the same result after enabling CDI on the nixos-unstable branch.

Edit: running with sudo (and the full nix store path) worked:

➜  mysystem git:(main) ✗ sudo /nix/store/5r352v296svf9phfad6ga6qxn7m5kbmg-docker-25.0.5/bin/docker run --device=nvidia.com/gpu=all --rm ubuntu:latest nvidia-smi
Sun Jun 23 00:23:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.04              Driver Version: 555.52.04      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:01:00.0  On |                  N/A |
| 40%   37C    P8             11W /  260W |    3182MiB /  11264MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

@jl1990 @ahirner Please use NixOS Discourse or Matrix for support and questions. @teto @ereslibre We need to figure out if upstream ever actually intended to deploy their wrapper as $prefix/bin/docker. AFAIU this "issue" might actually be the correct behaviour.

I only tried to provide feedback about the solutions that were suggested here (as they didn't work in my case).
Although I am not a docker/nvidia expert, it also sounds weird to me that the expected behaviour would be to overwrite the docker binary on the machine.

This link, for example, shows that the NVIDIA Container Toolkit should modify the /etc/docker/daemon.json file on the host so Docker can use the NVIDIA container runtime.

And this one shows the expected behaviour and how docker commands should be executed...

If this binary overwrite were intentional, the binary replacing docker should be able to process the same parameters, but it does not; it complains about not understanding the "run" parameter:

No help topic for 'run'

@ereslibre
Member

@jl1990 We are mixing different problems here; please open a new issue or, as @SomeoneSerge mentioned, use Discourse or Matrix. Thank you!

@SomeoneSerge
Contributor

It also sounds weird to me that the expected behaviour would be to overwrite the docker binary on the machine.

Sounds wrong to me too; I've no idea what upstream's intentions were. The offending binary is packaged in pkgs/by-name/nv/nvidia-container-toolkit/package.nix and generated from https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/tools/container/docker/docker.go

@ereslibre we can try moving it to a different output or a deeper prefix; then we can see what references we break
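
A hypothetical fragment of what that split could look like in package.nix (output names and the list of moved files are illustrative only, not the final layout):

  outputs = [ "out" "tools" ];

  postInstall = ''
    # keep nvidia-ctk on PATH in the default output; move the docker/containerd/crio
    # config hooks into a separate output so they stop shadowing the real CLIs
    moveToOutput bin/docker "$tools"
    moveToOutput bin/containerd "$tools"
    moveToOutput bin/crio "$tools"
  '';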

@ereslibre
Member

I have created #330197 to fix this shadowing. We can follow up on that PR; also, please give a heads-up if you find that anything is missing or wrong.
