
nvidia-podman: contaminates PATH with a fake "docker" executable; breaks docker-compose #293857

Closed
teto opened this issue Mar 6, 2024 · 26 comments · Fixed by #331071

@teto
Member

teto commented Mar 6, 2024

Describe the bug

I've had errors such as

$ docker compose
No help topic for 'compose'
$ docker-compose 
...works

I thought it was this https://stackoverflow.com/questions/66514436/difference-between-docker-compose-and-docker-compose when I first found out about it, but that was a red herring; some nixpkgs spelunking showed we have been using the Go-based compose plugin for a while now.

Investigating further, I realized that it was because I had the nvidia-podman program installed, which contains a docker executable that is nothing like pkgs.docker. This is confusing and I suppose a mistake. Can we rename it to something else?
cc @SomeoneSerge

On nixos-unstable.

@teto added the 0.kind: bug (Something is broken) label on Mar 6, 2024
@SomeoneSerge
Contributor

SomeoneSerge commented Mar 6, 2024

... nvidia-podman program ... contains a docker executable ... that is nothing like pkgs.docker. This is confusing and I suppose a mistake. Can we rename it to something else ?

My understanding is that the nvidia "runtime wrappers" have been broken for a few months now. Note also that at least nvidia-docker is deprecated by NVIDIA. A few PRs were recently merged that updated and moved around a bunch of stuff (libnvidia-container, nvidia-container-toolkit, the apptainer integration, etc.). The runtime wrappers stayed pretty much as they were, i.e. broken to the best of my knowledge. They're also scheduled for removal.

Recently #284507 was merged, which offers a better alternative that works and is also recommended by upstream. It's only available in nixos-unstable for now (but the next release is in two months). Basically, I doubt it'd be cost-efficient to start fixing the wrappers.

nvidia-podman program ... contains a docker executable

❯ nix build .#nvidia-podman
❯ ./result/bin/docker --help
NAME:
   docker - Update docker config with the nvidia runtime
...
❯ ls -l ./result/bin/docker 
lrwxrwxrwx 2 root root 102 Jan  1  1970 ./result/bin/docker -> /nix/store/cwkiall27a7wg4m07qqx92fs4in5nsfw-container-toolkit-container-toolkit-1.15.0-rc.3/bin/docker

This seems to be something generated by nvidia-container-toolkit, but I'm not sure. CC @aaronmondal @ereslibre

@teto
Member Author

teto commented Mar 6, 2024

I don't know enough about the space to propose a solution. If the package is broken, maybe mark it as broken.
My inner child would rename the executable in postInstall: mv $out/bin/docker $out/bin/docker-but-not-docker.
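
For illustration, a rough sketch of that workaround as an overlay (untested; it assumes the offending binary actually ends up in $out/bin of pkgs.nvidia-container-toolkit, and the new name is made up):

  final: prev: {
    nvidia-container-toolkit = prev.nvidia-container-toolkit.overrideAttrs (old: {
      # hypothetical rename so the config hook no longer shadows the real docker CLI
      postInstall = (old.postInstall or "") + ''
        mv $out/bin/docker $out/bin/nvidia-ctk-docker-hook
      '';
    });
  }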

NB: Thanks for all you do for nixpkgs. I've been looking into LLM + nvidia and your PRs keep coming up.

@SomeoneSerge changed the title from “No help topic for 'compose' caused by nvidia-podman bringing in a fake docker” to “nvidia-podman: contaminates PATH with a fake "docker" executable; breaks docker-compose” on Mar 6, 2024
@SomeoneSerge
Contributor

My inner child would rename the executable in postInstall mv $out/bin/docker $out/bin/docker-but-not-docker.

I'm not sure if what they do is intentional

@ereslibre
Member

ereslibre commented Mar 7, 2024

My inner child would rename the executable in postInstall mv $out/bin/docker $out/bin/docker-but-not-docker.

I am also not super familiar with this specific bit, but from looking at it, it seems to be a helper binary for setting up Docker support for the nvidia-container-toolkit (there are also other helpers, like containerd and crio). I believe we can get rid of all of them on NixOS, given that they perform changes at the system level, and on a NixOS system I'd expect all of this to be handled by the declarative configuration.

@SomeoneSerge
Contributor

I believe we can get rid of all of them on NixOS

Note that nvidia-container-toolkit is part of nixpkgs and may be used outside NixOS. We should maybe split the outputs (install nvidia-ctk in the default output, and docker and the others in a separate one). How do we ensure there aren't any (possibly relative) references to ${nvidia-container-toolkit}/bin/docker that would be broken by the renaming?
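
A rough way to look for such references in a nixpkgs checkout (not exhaustive, since paths can also be constructed programmatically):

$ grep -rn 'nvidia-container-toolkit.*/bin/docker' pkgs nixos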

@ereslibre
Member

I agree with splitting the outputs: keep nvidia-ctk in the default output, and put the rest in one or more separate outputs.

@sophronesis

same issue here, unable to run nvidia-docker

 ~/➤ nvidia-docker
NAME:
   docker - Update docker config with the nvidia runtime

USAGE:
   docker [global options] command [command options] [arguments...]

VERSION:
   1.15.0-rc.3

COMMANDS:
   setup    Trigger docker config to be updated
   cleanup  Trigger any updates made to docker config to be undone
   help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --help, -h     show help (default: false)
   --version, -v  print the version (default: false)

@ereslibre
Member

same issue here, unable to run nvidia-docker

What version of nixpkgs/NixOS are you on? What do you want to achieve by running nvidia-docker manually?

In the nixpkgs tree we still have Docker 24 as the default, but once we can assume Docker 25 as the minimum version, we can get rid of all these hooks in favor of the CDI implementation.

@sophronesis

sophronesis commented Jun 19, 2024

What version of nixpkgs/NixOS are you on?

nixos-unstable (24.11)

What do you want to achieve by running nvidia-docker manually?

run gpu-powered container for deep learning related stuff

docker version

Docker version 24.0.9, build v24.0.9

@ereslibre
Member

ereslibre commented Jun 19, 2024

nixos-unstable (24.11)
run gpu-powered container for deep learning related stuff
Docker version 24.0.9, build v24.0.9

In this case you can set the NixOS options hardware.nvidia-container-toolkit.enable = true and virtualisation.docker.package = pkgs.docker_25. This will automatically configure Docker to use the CDI spec generated by nvidia-container-toolkit.

Then you will be able to do docker run --device=nvidia.com/gpu=all ....
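
A minimal sketch of that configuration as a NixOS module (option names as used on nixos-unstable at the time of writing; adapt to your setup):

  { pkgs, ... }:
  {
    hardware.nvidia-container-toolkit.enable = true;
    virtualisation.docker = {
      enable = true;
      # CDI device requests (--device=nvidia.com/gpu=...) need Docker >= 25
      package = pkgs.docker_25;
    };
  }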

@jl1990

jl1990 commented Jun 20, 2024

Hi,

First of all thank you all for the help.

I tried the proposed solution, but it doesn't work for me even after including your suggested changes.

I have this config:

  hardware.nvidia-container-toolkit.enable = true;
  hardware.nvidia-container-toolkit.mount-nvidia-executables = false;
  virtualisation = {
    docker = {
      enable = true;
      package = pkgs.docker_25;
      enableOnBoot = true;
      extraOptions = "--default-runtime=nvidia";
      rootless = {
        enable = true;
        setSocketVariable = true;
      };
    };
  };

When I try to run an nvidia docker with the following command:

docker run --runtime=nvidia \
--gpus all --rm \
...

If I have the nvidia-container-toolkit package installed I get:

No help topic for 'run'

and if I don't have it installed I get:

docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.

The problem seems to be that when you install nvidia-container-toolkit, its additional binaries shadow the docker command, as the original post mentioned...

@ereslibre
Member

ereslibre commented Jun 20, 2024

When I try to run an nvidia docker with the following command:

Please note that with CDI it's --device=nvidia.com/gpu=all (or 0, 1, ...), not --gpus; --gpus was the flag for the old runtime wrappers. You should also remove --default-runtime=nvidia from extraOptions. With CDI, runtime wrappers are no longer necessary for exposing GPUs or other hardware to containers, which is what will effectively allow us to clean these wrappers up once we can get rid of Docker 24.
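
In other words, roughly:

# old, runtime-wrapper style (not needed with CDI):
$ docker run --rm --gpus all ubuntu:latest nvidia-smi
# new, CDI style:
$ docker run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi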

Also, I recommend setting hardware.nvidia-container-toolkit.mount-nvidia-executables = true; at least while you are figuring out whether it works. With this, you can try docker run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi.

Try to use the binary from Docker directly, not the wrapper that podman installs.

In any case, we should not install a docker binary that gets in the way of the user. I'll have a look at a fix we can apply while we figure out the Docker 24 -> Docker 25 migration and the runtime wrapper removal.

@sophronesis

nixos-unstable (24.11)
run gpu-powered container for deep learning related stuff
Docker version 24.0.9, build v24.0.9

In this case you can set the NixOS options hardware.nvidia-container-toolkit.enable = true and virtualisation.docker.package = pkgs.docker_25. This will automatically configure Docker to use the CDI spec generated by nvidia-container-toolkit.

Then you will be able to do docker run --device=nvidia.com/gpu=all ....

Here is what I'm getting:

 ~/➤ docker run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi
docker: nvidia.com/gpu=all is not an absolute path.
See 'docker run --help'.

@ereslibre
Member

ereslibre commented Jun 21, 2024

@sophronesis I can reproduce this issue when virtualisation.docker.package = pkgs.docker_25 is not set and Docker 24 is used.

Please, let's keep this issue focused on the docker CLI contamination, and open a new issue if you face a different problem.

@jl1990

jl1990 commented Jun 22, 2024

When I try to run an nvidia docker with the following command:

Please note that with CDI it's --device=nvidia.com/gpu=all (or 0, 1, ...), not --gpus; --gpus was the flag for the old runtime wrappers. You should also remove --default-runtime=nvidia from extraOptions. With CDI, runtime wrappers are no longer necessary for exposing GPUs or other hardware to containers, which is what will effectively allow us to clean these wrappers up once we can get rid of Docker 24.

Also, I recommend setting hardware.nvidia-container-toolkit.mount-nvidia-executables = true; at least while you are figuring out whether it works. With this, you can try docker run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi.

Try to use the binary from Docker directly, not the wrapper that podman installs.

In any case, we should not install a docker binary that gets in the way of the user. I'll have a look at a fix we can apply while we figure out the Docker 24 -> Docker 25 migration and the runtime wrapper removal.

Unfortunately no luck yet:

Config:

  hardware.nvidia-container-toolkit.enable = true;
  hardware.nvidia-container-toolkit.mount-nvidia-executables = true;
  virtualisation = {
    docker = {
      enable = true;
      package = pkgs.docker_25;
      enableOnBoot = true;
      rootless = {
        enable = true;
        setSocketVariable = true;
      };
    };
  };

With nvidia-container-toolkit:

➜  mysystem git:(main) ✗ docker run --device=nvidia.com/gpu=all --rm ubuntu:latest nvidia-smi
No help topic for 'run'
mysystem git:(main) ✗ /nix/store/5r352v296svf9phfad6ga6qxn7m5kbmg-docker-25.0.5/bin/docker run --device=nvidia.com/gpu=all --rm ubuntu:latest nvidia-smi
docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].

Without nvidia-container-toolkit:

➜  mysystem git:(main) ✗ docker run --device=nvidia.com/gpu=all --rm ubuntu:latest nvidia-smi
docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].
➜  mysystem git:(main) ✗ /nix/store/5r352v296svf9phfad6ga6qxn7m5kbmg-docker-25.0.5/bin/docker run --device=nvidia.com/gpu=all --rm ubuntu:latest nvidia-smi
docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].

Not sure what can be the issue...

@ahirner
Contributor

ahirner commented Jun 22, 2024

@jl1990 I'm facing quite a similar issue with quite a similar config. What's your output of this?

$ journalctl -b | grep nvidia
Jun 22 14:47:56 barley nvidia-cdi-generator[958]: time="2024-06-22T14:47:56+02:00" level=info msg="Auto-detected mode as \"nvml\""
Jun 22 14:47:56 barley nvidia-cdi-generator[958]: time="2024-06-22T14:47:56+02:00" level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_LIBRARY_NOT_FOUND"
Jun 22 14:48:54 barley nvidia-powerd[5421]: nvidia-powerd version:1.0(build 1)
Jun 22 14:48:54 barley nvidia-powerd[5421]: Error open pid file 13

@ahirner
Contributor

ahirner commented Jun 22, 2024

OK, during the 24 upgrade I removed something that's still needed for the kernel to load the modules:

services.xserver.videoDrivers = ["nvidia"];
hardware.nvidia.nvidiaPersistenced = false;

nvidia-smi sees the GPUs on the host and with sudo docker run, despite:

  virtualisation = {
    docker = {
      enable = true;
      # https://github.com/NixOS/nixpkgs/issues/293857#issuecomment-2177935545
      package = pkgs.docker_25;
      enableOnBoot = true;
      rootless = {
        enable = true;
        setSocketVariable = true;
      };
...

@ereslibre
Member

Not sure what can be the issue...

The issue here is that you are calling the wrapper docker (the docker that is not docker). Your configuration is likely correct, but you have to call the real docker CLI by its full /nix store path until we fix this issue.
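
A quick way to check which docker you are actually invoking (plain shell, nothing NixOS-specific):

$ command -v docker
$ readlink -f "$(command -v docker)"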

@jl1990

jl1990 commented Jun 22, 2024

Not sure what can be the issue...

The issue here is that you are calling the wrapper docker (the docker that is not docker). Your configuration is likely correct, but you have to call the real docker CLI by its full /nix store path until we fix this issue.

Yes, I think this is what I posted in my last response, right?

➜  mysystem git:(main) ✗ /nix/store/5r352v296svf9phfad6ga6qxn7m5kbmg-docker-25.0.5/bin/docker run --device=nvidia.com/gpu=all --rm ubuntu:latest nvidia-smi
docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].

It doesn't work for me either.

@ereslibre
Member

@jl1990 Also, depending on your version of nixpkgs, your error could be fixed by #305312 (comment).

@SomeoneSerge
Contributor

SomeoneSerge commented Jun 22, 2024

@jl1990 @ahirner Please use NixOS Discourse or Matrix for support and questions.
@teto @ereslibre We need to figure out if upstream ever actually intended to deploy their wrapper as $prefix/bin/docker. AFAIU this "issue" might actually be the correct behaviour.

@ereslibre
Member

@teto @ereslibre We need to figure out if upstream ever actually intended to deploy their wrapper as $prefix/bin/docker. AFAIU this "issue" might actually be the correct behaviour.

I might be missing something, because the history is not completely clear to me, but I think it was never intended to replace the Docker CLI; rather, it was meant to be installed as a runtime wrapper, configured in /etc/docker/daemon.json (https://github.com/NVIDIA/nvidia-docker/tree/2bf9c3455d030af0d47cb980262e75154017cd65).
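
For context, that wrapper-based setup registers nvidia as an additional runtime in /etc/docker/daemon.json, roughly like this (the shape used by the old nvidia-docker2 packaging; the path is illustrative):

  {
    "runtimes": {
      "nvidia": {
        "path": "nvidia-container-runtime",
        "runtimeArgs": []
      }
    }
  }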

@jl1990

jl1990 commented Jun 23, 2024

@jl1990 I'm facing quite a similar issue with quite a similar config. What's your output of this?

$ journalctl -b | grep nvidia
Jun 22 14:47:56 barley nvidia-cdi-generator[958]: time="2024-06-22T14:47:56+02:00" level=info msg="Auto-detected mode as \"nvml\""
Jun 22 14:47:56 barley nvidia-cdi-generator[958]: time="2024-06-22T14:47:56+02:00" level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_LIBRARY_NOT_FOUND"
Jun 22 14:48:54 barley nvidia-powerd[5421]: nvidia-powerd version:1.0(build 1)
Jun 22 14:48:54 barley nvidia-powerd[5421]: Error open pid file 13

The result is quite large, so I uploaded it to Pastebin: https://pastebin.com/4fNeETeR

@jl1990 also depending on your version of nixpkgs your error could be fixed by #305312 (comment)

Thanks for the help. I still get the same result after enabling CDI on the nixos-unstable branch.

Edit: running with sudo (and the full nix store path) worked:

➜  mysystem git:(main) ✗ sudo /nix/store/5r352v296svf9phfad6ga6qxn7m5kbmg-docker-25.0.5/bin/docker run --device=nvidia.com/gpu=all --rm ubuntu:latest nvidia-smi
Sun Jun 23 00:23:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.04              Driver Version: 555.52.04      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:01:00.0  On |                  N/A |
| 40%   37C    P8             11W /  260W |    3182MiB /  11264MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

@jl1990 @ahirner Please use NixOS Discourse or Matrix for support and questions. @teto @ereslibre We need to figure out if upstream ever actually intended to deploy their wrapper as $prefix/bin/docker. AFAIU this "issue" might actually be the correct behaviour.

I only tried to provide feedback about the solutions that were suggested here (as they didn't work in my case).
Although I am not a docker/nvidia expert, it also sounds weird to me that the expected behaviour would be to overwrite the docker binary on the machine.

This link, for example, shows that the NVIDIA Container Toolkit should modify the /etc/docker/daemon.json file on the host so Docker can use the NVIDIA container runtime.

And this one shows the expected behaviour and how docker commands should be executed...

If this binary overwrite were intentional, the binary replacing docker should be able to process the same parameters, but it does not; it complains about not understanding the "run" parameter:

No help topic for 'run'

@ereslibre
Member

@jl1990 We are mixing different problems here; please open a new issue or, as @SomeoneSerge mentioned, use Discourse or Matrix. Thank you!

@SomeoneSerge
Contributor

It also sounds weird to me that the expected behaviour would be to overwrite the docker binary on the machine.

Sounds wrong to me too; I've no idea what upstream's intentions were. The offending binary is packaged in pkgs/by-name/nv/nvidia-container-toolkit/package.nix and generated from https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/tools/container/docker/docker.go

@ereslibre we can try moving it to a different output or a deeper prefix; then we can see what references we break
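
A hypothetical fragment of what that split could look like in package.nix (output names and the list of moved files are illustrative only, not the final layout):

  outputs = [ "out" "tools" ];

  postInstall = ''
    # keep nvidia-ctk on PATH in the default output; move the docker/containerd/crio
    # config hooks into a separate output so they stop shadowing the real CLIs
    moveToOutput bin/docker "$tools"
    moveToOutput bin/containerd "$tools"
    moveToOutput bin/crio "$tools"
  '';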

@ereslibre
Member

I have created #330197 to fix this shadowing. We can follow up on that PR; also, please give a heads-up if you find that anything is missing or wrong.
