
suse tumbleweed & nvidia-container-toolkit & could not select device driver "" #1377

Closed
s4s0l opened this issue Aug 31, 2020 · 5 comments

s4s0l commented Aug 31, 2020

1. Issue or feature description

On Tumbleweed (I know it's not supported) I'm unable to run:

→ docker run --rm --gpus all nvidia/cuda:latest nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Generally I'm not expecting a solution; rather, I would like to understand how all of this is supposed to work together. My current finding is that my Docker is not using /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json at all: I can put any nonsense in there and nothing complains about it. I would like to understand why. Since there is very little documentation on the Docker side about how it uses OCI hooks, I do not know where to look for an explanation. What is supposed to pick the hook up, and under what circumstances? Is it Docker itself, runc, or something else? I see that some packages on other distros place hooks in different paths, like `/usr/share/containers/docker/...' or '/etc/containers/...'. I tried different versions of runc and a few other random things, but after reading the documentation for Docker, the NVIDIA repos, and the OCI specs I still cannot figure out how it is supposed to work. I would appreciate it if someone could find a moment to write down how the NVIDIA tools are integrated with Docker. How does Docker pick the GPU "driver"? What makes the NVIDIA hook trigger only for containers started with '--gpus'? And so on.
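
For reference, the hook file in question follows the OCI hooks 1.0.0 schema; as shipped by the nvidia-container-toolkit package it looks roughly like this (exact contents vary by version):

{
    "version": "1.0.0",
    "hook": {
        "path": "/usr/bin/nvidia-container-toolkit",
        "args": ["nvidia-container-toolkit", "prestart"]
    },
    "when": {
        "always": true,
        "commands": [".*"]
    },
    "stages": ["prestart"]
}

As far as I can tell, Docker never reads that hooks.d directory at all (it is scanned by the podman/CRI-O family of engines), which would explain why editing the file changes nothing here.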

The driver itself seems to be running fine as far as I can tell (games, CUDA-based ML, Blender). The issues I could find all relate to Docker not being restarted after installing the toolkit, or to Docker being installed via snap; neither is my case.

2. Steps to reproduce the issue

Install Docker, the NVIDIA drivers, and nvidia-container-toolkit, then run a container with --gpus.

3. Information

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
    nvidia-container-cli.txt
  • Kernel version from uname -a
→ uname -a
Linux sasol-desktop 5.8.2-1-default #1 SMP Wed Aug 19 09:43:15 UTC 2020 (71b519a) x86_64 x86_64 x86_64 GNU/Linux
  • Any relevant kernel output lines from dmesg
→ dmesg | grep nvidia
[   22.224202] nvidia: loading out-of-tree module taints kernel.
[   22.224210] nvidia: module license 'NVIDIA' taints kernel.
[   22.237123] audit: type=1400 audit(1598803048.338:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1283 comm="apparmor_parser"
[   22.237125] audit: type=1400 audit(1598803048.338:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1283 comm="apparmor_parser"
[   22.335086] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   22.368261] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[   22.368745] nvidia 0000:65:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[   22.565981] nvidia-uvm: Loaded the UVM driver, major device number 236.
[   22.873668] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  450.57  Sun Jul  5 14:52:29 UTC 2020
[   22.996273] [drm] [nvidia-drm] [GPU ID 0x00006500] Loading driver
[   22.996276] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:65:00.0 on minor 0
[   23.052745] nvidia-gpu 0000:65:00.3: i2c timeout error e0000000
[   29.000700] caller _nv000743rm+0x1af/0x200 [nvidia] mapping multiple BARs

→ rpm -qa '*nvidia*'
kernel-firmware-nvidia-20200807-1.2.noarch
libnvidia-container-static-1.1.1-1.3.x86_64
nvidia-container-toolkit-0.0+git.1580519869.60f165a-1.4.x86_64
libnvidia-container-devel-1.1.1-1.3.x86_64
libnvidia-container1-1.1.1-1.3.x86_64
nvidia-gfxG05-kmp-default-450.57_k5.7.9_1-38.2.x86_64
nvidia-glG05-450.57-38.1.x86_64
nvidia-computeG05-450.57-38.1.x86_64
x11-video-nvidiaG05-450.57-38.1.x86_64
libnvidia-container-tools-1.1.1-1.3.x86_64
  • NVIDIA container library version from nvidia-container-cli -V
→ nvidia-container-cli -V
version: 1.1.1
build date: 2020-08-25T14:52+00:00
build revision: 1.1.1
build compiler: gcc-10 10.2.1 20200805 [revision dda1e9d08434def88ed86557d08b23251332c5aa]
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -I/usr/include/tirpc -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
  • NVIDIA container library logs
    No logs were created when running the container.
  • Docker command, image and tag used
    any with --gpus

klueska commented Aug 31, 2020

This is an error from Docker itself, before it ever even tries to invoke the NVIDIA stack.
My guess is that you have a mismatch between the versions of your docker-cli and your actual docker packages.
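
A quick way to check is to compare the versions the client and the daemon report; the --gpus option needs Docker 19.03 or newer (API 1.40) on both sides:

# both the Client and the Server/Engine sections should report 19.03 or newer
docker version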


klueska commented Aug 31, 2020

To check that the NVIDIA stack is actually working, you can try the environment variable API instead of the --gpus option (though this will also require you to install the nvidia-container-runtime package).

docker run --runtime=nvidia --rm -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:latest nvidia-smi


klueska commented Aug 31, 2020

This is how the stack fits together:
#1268 (comment)


s4s0l commented Aug 31, 2020

Thanks for the guidelines. IMO the comment above should be part of README.md as-is; it's just worth it.

TL;DR: the sles15.1 repo works on Tumbleweed and fixes my problem.

On Tumbleweed the NVIDIA container tooling comes from the main repo, but there is no nvidia-container-runtime.
I took a look at its sources and found that, as far as I could tell, there is nothing special in its RPM spec file
that could harm Tumbleweed. The same goes for libnvidia-container and everything else in the nvidia-docker repos. So I just
went with the sles15.1 repo and upgraded everything, as the packages in the Tumbleweed repos were a little out of date.

It works.

I still feel like my original question remains unresolved: how does Docker "know" the NVIDIA tooling is installed? At this point it's a purely academic question.

For any SUSE newbie encountering the same problem, below is what I did.

  sudo zypper rm libnvidia-container-static libnvidia-container-devel libnvidia-container-tools libnvidia-container1 nvidia-container-toolkit
  sudo zypper ar https://nvidia.github.io/nvidia-docker/sles15.1/nvidia-docker.repo
  sudo zypper in libnvidia-container1 nvidia-container-runtime

After that:

  → docker run --rm --gpus all nvidia/cuda:latest nvidia-smi
Mon Aug 31 20:20:38 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:65:00.0  On |                  N/A |
|  0%   50C    P8    17W / 120W |    965MiB /  5941MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

The following step is not necessary, as it only registers the nvidia runtime. The same can also be achieved by modifying /etc/docker/daemon.json, but I did it this way just for fun.
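
For reference, the daemon.json route would be the standard runtimes entry documented for nvidia-container-runtime (followed by a restart of the docker service), something like:

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}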

In /usr/lib/systemd/system/docker.service, add --add-runtime nvidia=/usr/bin/nvidia-container-runtime to the Docker start command so that it looks like:

ExecStart=/usr/bin/dockerd --add-runtime nvidia=/usr/bin/nvidia-container-runtime --add-runtime oci=/usr/sbin/docker-runc $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS

Then:

sudo systemctl daemon-reload
sudo systemctl restart docker

After that:

  → docker run --runtime=nvidia --rm -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:latest nvidia-smi
Mon Aug 31 20:06:38 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:65:00.0  On |                  N/A |
|  0%   50C    P8    16W / 120W |    990MiB /  5941MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

s4s0l closed this as completed Aug 31, 2020
@Medoalmasry

@s4s0l I can NOT thank you enough. I have been delving down this rabbit hole for 2 days. Thank you
