The service created by enabling the option hardware.nvidia-container-toolkit does not complete before the Docker service starts. As a result, containers with restart: always that require CDI fail to start at boot and show Exited (255) as their status.
Steps To Reproduce
Steps to reproduce the behavior:
1. Set hardware.nvidia-container-toolkit.enable to true.
2. Use compose to create a Docker container that uses CDI, with restart set to always or unless-stopped.
3. After a reboot, docker ps -a shows the container with Exited (255) as its status.
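For reference, the NixOS side of the first step might look like the fragment below. This is only a minimal sketch: hardware.nvidia-container-toolkit.enable is the option named in this report, while virtualisation.docker.enable is an assumption about how Docker is enabled on the host. NVIDIA driver setup and the compose file itself are left out.

```nix
{
  # Option named in this report; it creates the service that prepares CDI.
  hardware.nvidia-container-toolkit.enable = true;

  # Assumption: Docker is enabled through the standard NixOS option.
  virtualisation.docker.enable = true;
}
```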
Expected behavior
The container runs normally without exiting.
Additional context
I checked my logs with journalctl -b 0 -u docker.service. They show an error for the affected containers, "could not select device driver \"cdi\" with capabilities: [[]]", while the rest of the containers start normally. Relevant truncated logs follow:
Nov 02 00:46:10 maple systemd[1]: Starting Docker Application Container Engine...
Nov 02 00:46:10 maple dockerd[2887]: time="2024-11-02T00:46:10.262470523+05:30" level=info msg="Starting up"
Nov 02 00:46:10 maple dockerd[2887]: time="2024-11-02T00:46:10.262898926+05:30" level=info msg="containerd not running, starting managed containerd"
Nov 02 00:46:10 maple dockerd[2887]: time="2024-11-02T00:46:10.265245217+05:30" level=info msg="started new containerd process" address=/var/run/docker/containerd/containerd.sock module=libcontainerd pid=2961
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.327658883+05:30" level=info msg="starting containerd" revision=v1.7.23 version=v1.7.23
...
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.347355835+05:30" level=info msg="metadata content store policy set" policy=shared
...
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.353238543+05:30" level=info msg="NRI interface is disabled by configuration."
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.353580808+05:30" level=info msg=serving... address=/var/run/docker/containerd/containerd-debug.sock
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.353616214+05:30" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock.ttrpc
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.353648080+05:30" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock
Nov 02 00:46:10 maple dockerd[2961]: time="2024-11-02T00:46:10.353666151+05:30" level=info msg="containerd successfully booted in 0.026905s"
Nov 02 00:46:11 maple dockerd[2887]: time="2024-11-02T00:46:11.373722916+05:30" level=info msg="[graphdriver] using prior storage driver: overlay2"
Nov 02 00:46:11 maple dockerd[2887]: time="2024-11-02T00:46:11.424693769+05:30" level=info msg="Loading containers: start."
...
Nov 02 00:46:11 maple dockerd[2887]: time="2024-11-02T00:46:11.936838556+05:30" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
...
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.082494607+05:30" level=error msg="failed to start container" container=7b065b2dc8ba0ad754fa746ae8c31c570587ff0370de7430e89e52816740b2d1 error="could not select device driver \"cdi\" with capabilities: [[]]"
...
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.155547318+05:30" level=error msg="failed to start container" container=bf56a748eee87b979698ac3deb077ab939ff864dff5db0f1ed38e28017c677e0 error="could not select device driver \"cdi\" with capabilities: [[]]"
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.231282510+05:30" level=info msg="Loading containers: done."
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.243088411+05:30" level=info msg="Docker daemon" commit=v27.3.1 containerd-snapshotter=false storage-driver=overlay2 version=27.3.1
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.245801163+05:30" level=info msg="Daemon has completed initialization"
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:C 01 Nov 2024 19:16:12.254 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:C 01 Nov 2024 19:16:12.254 # Redis version=6.2.16, bits=64, commit=00000000, modified=0, pid=1, just started
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:C 01 Nov 2024 19:16:12.254 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.254 * Increased maximum number of open files to 10032 (it was originally set to 1024).
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.254 * monotonic clock: POSIX clock_gettime
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 * Running mode=standalone, port=6379.
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 # Server initialized
...
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 * Loading RDB produced by version 6.2.16
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 * RDB age 30 seconds
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 * RDB memory usage when created 2.01 Mb
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 # Done loading RDB, keys loaded: 15, keys expired: 14.
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 * DB loaded from disk: 0.000 seconds
Nov 02 00:46:12 maple a9a2e66f5f33[2887]: 1:M 01 Nov 2024 19:16:12.255 * Ready to accept connections
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.264908734+05:30" level=info msg="API listen on /run/docker.sock"
Nov 02 00:46:12 maple dockerd[2887]: time="2024-11-02T00:46:12.264967446+05:30" level=info msg="API listen on /run/docker.sock"
Nov 02 00:46:12 maple systemd[1]: Started Docker Application Container Engine.
Nov 02 00:46:12 maple 232e1e1f077e[2887]:
Nov 02 00:46:12 maple 232e1e1f077e[2887]: PostgreSQL Database directory appears to contain a database; Skipping initialization
Nov 02 00:46:12 maple 232e1e1f077e[2887]:
Nov 02 00:46:12 maple 232e1e1f077e[2887]: 2024-11-01 19:16:12.376 UTC [1] LOG: redirecting log output to logging collector process
Nov 02 00:46:12 maple 232e1e1f077e[2887]: 2024-11-01 19:16:12.376 UTC [1] HINT: Future log output will appear in directory "log".
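The errors above are consistent with dockerd loading containers before the CDI setup has finished. A possible workaround, sketched here and not verified on this machine, is to order docker.service after the unit created by the toolkit option. The unit name nvidia-container-toolkit-cdi-generator.service is an assumption and should be checked with systemctl list-units before use; systemd.services.<name>.after and .wants are standard NixOS options.

```nix
{
  systemd.services.docker = {
    # Assumed unit name; confirm it with `systemctl list-units` first.
    # `after` delays dockerd until the unit has finished starting,
    # `wants` makes sure the unit is pulled in when docker.service starts.
    after = [ "nvidia-container-toolkit-cdi-generator.service" ];
    wants = [ "nvidia-container-toolkit-cdi-generator.service" ];
  };
}
```

If the assumed unit is a oneshot service, systemd treats it as started only once its process has exited, so this ordering would make dockerd wait for it to complete.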
Notify maintainers
There's no meta.maintainers for nvidia but @ereslibre is the most recent contributor to the nvidia-container-toolkit code.
LuminarLeaf changed the title from "nvidia-container-toolkit's cdi not completing before docker.service" to "nvidia-container-toolkit's cdi not completing before docker.service leading to failing containers which require nvidia gpu on boot" on Nov 2, 2024.
My config: https://github.com/LuminarLeaf/arboretum
Complete logs: https://pastebin.com/YXuE5TJE
Logs with immich compose: https://pastebin.com/r03cZ6rd