nvidia-docker error: "ldconfig failed with error code: 1: unknown" #173953

Open
aidalgol opened this issue May 22, 2022 · 4 comments
Labels
0.kind: bug Something is broken

Comments

@aidalgol
Contributor

Describe the bug

nvidia-docker fails to run any image.

$ sudo -g docker docker run --privileged --gpus all -it --rm hello-world
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /nix/store/aaaa...-glibc-2.34-210-bin/bin/ldconfig failed with error code: 1: unknown.

Steps To Reproduce

  1. Set virtualisation.docker = { enable = true; enableNvidia = true; } in the system configuration.nix.
  2. Rebuild and switch to new profile.
  3. Ensure plain docker works by running
    sudo -g docker docker run -it --rm hello-world
    If not, you may need to set virtualisation.docker.storageDriver explicitly, depending on your filesystem setup.
  4. Run sudo -g docker docker run --privileged --gpus all -it --rm hello-world
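
For reference, step 1 corresponds roughly to the configuration.nix fragment below. This is a minimal sketch that uses only the options named in the steps above. The storageDriver line is commented out, and its value is only an example, since the right driver depends on your filesystem setup. Depending on the NixOS release, the module may require further settings (for instance, the NVIDIA driver itself has to be configured); those are not shown.

```nix
# /etc/nixos/configuration.nix (fragment)
{ config, pkgs, ... }:
{
  virtualisation.docker = {
    enable = true;        # plain Docker daemon
    enableNvidia = true;  # the option this issue is about

    # Only needed if plain `docker run` fails in step 3; the value here
    # is an example, pick the driver that matches your filesystem.
    # storageDriver = "overlay2";
  };
}
```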

Expected behavior

For docker to run the hello-world container in privileged mode with host GPUs exposed.

Additional context

This appears to be at least partly an upstream bug, going by some issue reports I found.

I cannot tell whether the NixOS package/module simply needs to be updated, or whether there is NixOS-specific work that needs to be done in upstream.

Notify maintainers

  • @averelld (author of commit that added the virtualisation.docker.enableNvidia option)
  • @cpcloud (nvidia-docker maintainer)

Metadata

$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.15.39, NixOS, 22.05 (Quokka), 22.05.20220518.48037fd`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.8.1`
 - channels(aidan): `"nixpkgs-22.05pre378171.ff691ed9ba2"`
 - channels(root): `"nixos"`
 - nixpkgs: `/home/aidan/.nix-defexpr/channels/nixpkgs`
aidalgol added the "0.kind: bug" label on May 22, 2022
@akiross

akiross commented Jun 30, 2022

I have a similar issue, though I'm not sure it's related. I'm using podman as well as docker; the failure seems tied to UIDs and permissions, and the ldcache error is similar, so I hope this provides a hint.

I can run an nvidia container with the nvidia runtime:

$ podman run -it --rm --runtime nvidia docker.io/nvidia/cudagl:11.4.2-devel id
uid=0(root) gid=0(root) groups=0(root)

but I cannot do so if uids are mapped in a different way:

$ podman run -it --rm --runtime nvidia --uidmap 1000:0:1 --uidmap 0:1:1000 docker.io/nvidia/cudagl:11.4.2-devel id
Error: OCI runtime error: nvidia: time="2022-06-30T12:31:20+02:00" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /nix/store/<hash>-glibc-2.34-210-bin/bin/ldconfig failed with error code: 1\n"

Regarding docker itself, I can also reproduce the error using the hello-world image:

$ sudo -g docker docker run --privileged --gpus all -it --rm hello-world
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /nix/store/h0cnbmfcn93xm5dg2x27ixhag1cwndga-glibc-2.34-210-bin/bin/ldconfig failed with error code: 1: unknown.

but I can run nvidia-smi with docker by using the NVIDIA image instead:

$ sudo -g docker docker run --privileged --gpus all -it --rm docker.io/nvidia/cudagl:11.4.2-devel nvidia-smi
Thu Jun 30 10:50:40 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+

I hope it helps.

$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.15.49, NixOS, 22.05 (Quokka), 22.05.20220626.cd90e77`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.9.1`
 - channels(root): `"nixos-22.05"`
 - channels(akiross): `""`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

@Avi-D-coder

Are there any workarounds to this?
I'm still getting this issue.

I can use the Ubuntu-based NVIDIA containers, but I cannot run a Nix Docker container built with pkgs.dockerTools, even if the container contains nothing but bash.

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /nix/store/lp8qrhb6hs42jwbapzq20l05jf4kyicq-glibc-2.35-224-bin/bin/ldconfig failed with error code: 1: unknown.
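
For anyone trying to reproduce this, a bash-only image of the kind described might look like the sketch below. It is not the commenter's actual expression: the image name "bash-only" and the file layout are hypothetical, and it assumes a nixpkgs channel is available as <nixpkgs>.

```nix
# bash-only.nix: hypothetical minimal image for reproducing the error.
{ pkgs ? import <nixpkgs> { } }:

pkgs.dockerTools.buildLayeredImage {
  name = "bash-only";   # illustrative name, not from the thread
  tag = "latest";

  # The image contains nothing but an interactive bash.
  contents = [ pkgs.bashInteractive ];

  config.Cmd = [ "${pkgs.bashInteractive}/bin/bash" ];
}
```

Building it with nix-build bash-only.nix, loading the result with docker load < result, and then running it with --gpus all should exercise the same nvidia-container-cli hook that fails above.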

@Avi-D-coder

The only workaround I found was to use one of the NVIDIA Ubuntu Docker images as the base image.

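        # Excerpt from inside a let ... in block; commonPackages, transformersDependencies,
        # and devDependencies are defined elsewhere and are not shown.
        nvidiaUbuntuImage = pkgs.dockerTools.pullImage {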
          imageName = "nvidia/cuda";
          imageDigest = "sha256:1d36277d7f886815b2548cc457e3d510006c0252359e9b28c92ed617f28edd72";
          sha256 = "sha256-jX1QspVF2XIpdth13TECICeJut53oDEMOvi2Z8ySo88=";
        };
      in
      {
        packages.devContainer = pkgs.dockerTools.buildLayeredImage {
          name = "transformers-dev";
          tag = "latest";
          fromImage = nvidiaUbuntuImage;
          contents = commonPackages ++ transformersDependencies ++ devDependencies;
          config = {
            Cmd = [ "/bin/bash" ];
            Env = [
              "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu:${pkgs.ncurses}/lib:${pkgs.glibc}/lib:${pkgs.zlib}/lib:${pkgs.cudatoolkit}/lib64:${pkgs.cudatoolkit.lib}/lib:${pkgs.linuxPackages.nvidia_x11}/lib"
            ];
          };
        };
      }

@colonelpanic8
Contributor

colonelpanic8 commented Sep 27, 2024

This only applies if you're trying to access your GPU the "old" way. With CDI, you should run your image with:

docker run --rm --device=nvidia.com/gpu=all -it image-name:latest bash

If you're using NixOS, you can enable support for this by setting hardware.nvidia-container-toolkit.enable = true; after that, virtualisation.docker.enableNvidia = true is no longer needed.
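
Put together, the CDI-based setup described in this comment could look like the following configuration.nix fragment. This is a sketch rather than a verified configuration: it assumes the NVIDIA driver is already configured elsewhere in the system configuration and that the installed Docker version supports CDI devices.

```nix
# /etc/nixos/configuration.nix (fragment)
{ config, pkgs, ... }:
{
  virtualisation.docker.enable = true;

  # Replaces the older virtualisation.docker.enableNvidia option.
  hardware.nvidia-container-toolkit.enable = true;
}
```

After rebuilding, the GPU is requested with --device=nvidia.com/gpu=all as in the docker run command above, rather than with --gpus all.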
