nvidia-container-toolkit's cdi not completing before docker.service leading to failing containers which require nvidia gpu on boot #353059

Open
LuminarLeaf opened this issue Nov 2, 2024 · 1 comment

LuminarLeaf commented Nov 2, 2024

Describe the bug

The service created by enabling the option hardware.nvidia-container-toolkit.enable does not complete before the Docker service starts. As a result, containers with restart: always that require the CDI fail on boot and show Exited (255) as their status.

Steps To Reproduce

Steps to reproduce the behavior:

  1. Set hardware.nvidia-container-toolkit.enable to true.
  2. Use Compose to create a Docker container that uses the CDI, with restart set to always or unless-stopped:
    services:
      gpu-test:
        image: ubuntu
        container_name: gpu-test-container
        restart: always
        deploy:
          resources:
            reservations:
              devices:
                - driver: cdi
                  device_ids:
                    - nvidia.com/gpu=all
        command: ["nvidia-smi"]
  3. Start the Compose service.
  4. Reboot the machine.
  5. Check the status with docker ps -a.
  6. The container shows Exited (255) as its status.
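
The check in step 5 can also be scripted. The following is only a minimal sketch, assuming the JSON output of `docker ps -a --format '{{json .}}'` carries `Names` and `Status` fields and that the status text starts with `Exited (<code>)`; the function names `exited_with` and `list_failed_containers` are made up for this illustration.

```python
import json
import re
import subprocess

# Matches a status such as "Exited (255) 2 minutes ago" (assumed format).
EXIT_RE = re.compile(r"^Exited \((\d+)\)")


def exited_with(ps_json_lines, code=255):
    """Return the names of containers whose Status reports the given exit code.

    `ps_json_lines` is an iterable of JSON lines, one per container.
    """
    names = []
    for line in ps_json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the command output
        entry = json.loads(line)
        match = EXIT_RE.match(entry.get("Status", ""))
        if match and int(match.group(1)) == code:
            names.append(entry.get("Names", ""))
    return names


def list_failed_containers(code=255):
    """Hypothetical helper: query the local Docker daemon and filter by exit code."""
    out = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{json .}}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return exited_with(out.splitlines(), code)
```

Calling `list_failed_containers()` after a reboot should list the GPU containers from the compose file if the bug is present. The parsing is kept separate from the `docker` call so it can be tried on saved output.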

Expected behavior

The container runs normally without exiting.

Additional context

I checked my logs with journalctl -b 0 -u docker.service. They show the error "could not select device driver \"cdi\" with capabilities: [[]]" for the affected containers, while the rest of the containers start normally.
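
To pull the affected container IDs out of the journal, something like the following sketch may help. It assumes the dockerd log line format shown in the logs in this report (`msg="failed to start container" container=<id> error="..."`); the function name `cdi_failures` is hypothetical.

```python
import re

# Pattern for dockerd's "failed to start container" lines, based on the
# format seen in this report's journal output.
FAILED_RE = re.compile(
    r'msg="failed to start container" '
    r'container=(?P<id>[0-9a-f]+) '
    r'error="(?P<error>.*)"\s*$'
)


def cdi_failures(journal_lines):
    """Return short (12-character) IDs of containers that failed on the CDI driver."""
    failures = []
    for line in journal_lines:
        match = FAILED_RE.search(line)
        # The journal shows the driver name with escaped quotes: \"cdi\"
        if match and r'device driver \"cdi\"' in match.group("error"):
            failures.append(match.group("id")[:12])
    return failures
```

The function takes any iterable of lines, so the output of journalctl -b 0 -u docker.service can be saved to a file and passed in as `open(path)`. The short IDs it returns correspond to the first column of docker ps -a.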

My config: https://github.com/LuminarLeaf/arboretum

Relevant truncated logs follow; the complete logs are at https://pastebin.com/YXuE5TJE
(logs with the immich compose: https://pastebin.com/r03cZ6rd)

Nov 02 00:46:10 maple systemd[1]: Starting Docker Application Container Engine...
Nov 02 00:46:10 maple dockerd[2887]: time="2024-11-02T00:46:10.262470523+05:30" level=info msg="Starting up"
Nov 02 00:46:10 maple dockerd[2887]: time="2024-11-02T00:46:10.262898926+05:30" level=info msg="containerd not running, starting managed containerd"
Nov 02 00:46:10 maple dockerd[2887]: time="2024-11-02T00:46:10.265245217+05:30" level=info msg="started new containerd process" address=/var/run/docker/containerd/containerd.sock module=libcontainerd pid=2961
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.327658883+05:30" level=info msg="starting containerd" revision=v1.7.23 version=v1.7.23
...
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.347355835+05:30" level=info msg="metadata content store policy set" policy=shared
...
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.353238543+05:30" level=info msg="NRI interface is disabled by configuration."
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.353580808+05:30" level=info msg=serving... address=/var/run/docker/containerd/containerd-debug.sock
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.353616214+05:30" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock.ttrpc
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.353648080+05:30" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.353666151+05:30" level=info msg="containerd successfully booted in 0.026905s"
Nov 02 00:46:11 maple dockerd[2887]: time="2024-11-02T00:46:11.373722916+05:30" level=info msg="[graphdriver] using prior storage driver: overlay2"
Nov 02 00:46:11 maple dockerd[2887]: time="2024-11-02T00:46:11.424693769+05:30" level=info msg="Loading containers: start."
...
Nov 02 00:46:11 maple dockerd[2887]: time="2024-11-02T00:46:11.936838556+05:30" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
...
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.082494607+05:30" level=error msg="failed to start container" container=7b065b2dc8ba0ad754fa746ae8c31c570587ff0370de7430e89e52816740b2d1 error="could not select device driver \"cdi\" with capabilities: [[]]"
...
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.155547318+05:30" level=error msg="failed to start container" container=bf56a748eee87b979698ac3deb077ab939ff864dff5db0f1ed38e28017c677e0 error="could not select device driver \"cdi\" with capabilities: [[]]"
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.231282510+05:30" level=info msg="Loading containers: done."
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.243088411+05:30" level=info msg="Docker daemon" commit=v27.3.1 containerd-snapshotter=false storage-driver=overlay2 version=27.3.1
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.245801163+05:30" level=info msg="Daemon has completed initialization"
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:C 01 Nov 2024 19:16:12.254 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:C 01 Nov 2024 19:16:12.254 # Redis version=6.2.16, bits=64, commit=00000000, modified=0, pid=1, just started
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:C 01 Nov 2024 19:16:12.254 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.254 * Increased maximum number of open files to 10032 (it was originally set to 1024).
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.254 * monotonic clock: POSIX clock_gettime
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 * Running mode=standalone, port=6379.
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 # Server initialized
...
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 * Loading RDB produced by version 6.2.16
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 * RDB age 30 seconds
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 * RDB memory usage when created 2.01 Mb
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 # Done loading RDB, keys loaded: 15, keys expired: 14.
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 * DB loaded from disk: 0.000 seconds
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 * Ready to accept connections
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.264908734+05:30" level=info msg="API listen on /run/docker.sock"
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.264967446+05:30" level=info msg="API listen on /run/docker.sock"
Nov 02 00:46:12 maple systemd[1]: Started Docker Application Container Engine.
Nov 02 00:46:12 maple 232e1e1f077e[2887]: 
Nov 02 00:46:12 maple 232e1e1f077e[2887]: PostgreSQL Database directory appears to contain a database; Skipping initialization
Nov 02 00:46:12 maple 232e1e1f077e[2887]: 
Nov 02 00:46:12 maple 232e1e1f077e[2887]: 2024-11-01 19:16:12.376 UTC [1] LOG:  redirecting log output to logging collector process
Nov 02 00:46:12 maple 232e1e1f077e[2887]: 2024-11-01 19:16:12.376 UTC [1] HINT:  Future log output will appear in directory "log".

Notify maintainers

There's no meta.maintainers for nvidia, but @ereslibre is the most recent contributor to the nvidia-container-toolkit code.

Metadata

 - system: `"x86_64-linux"`
 - host os: `Linux 6.6.58, NixOS, 24.11 (Vicuna), 24.11.20241029.807e915`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.24.9`
 - channels(root): `"nixos-24.05"`
 - nixpkgs: `/nix/store/m68ikm8045fj7ys7qvgr608z9l70hh1k-source`

LuminarLeaf added the "0.kind: bug" label Nov 2, 2024
LuminarLeaf changed the title from "nvidia-container-toolkit's cdi not completing before docker.service" to "nvidia-container-toolkit's cdi not completing before docker.service leading to failing containers which require nvidia gpu on boot" Nov 2, 2024
ereslibre self-assigned this Nov 2, 2024

ereslibre (Member) commented

Hello @LuminarLeaf! First of all, thanks a lot for your very detailed description and bug report.

I was able to reproduce the issue, and I have identified a fix in Docker upstream that will resolve this problem! I will follow up on this issue :)
