This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Manual setup of nvidia-docker2 for Debian 11? #1537

Closed
DenizUgur opened this issue Aug 16, 2021 · 19 comments

Comments

@DenizUgur

Is it possible to set up nvidia-docker2 on Debian 11 before a stable release is available? Or are there plans to add Debian 11 support in the near future?

@elezar
Member

elezar commented Aug 23, 2021

@DenizUgur is there a reason that the existing debian packages cannot be used? In many cases (e.g. ubuntu18.04 or greater) the packages for later releases are the same as the earlier packages.

@DenizUgur
Author

Hi @elezar, I'm not sure, but I believe something differs between Debian 10 and 11 concerning cgroup v2 (#1447).

@Herbrant

Hi! I have the same problem. Any news about a Debian 11 package?
Debian 11 is now the stable release, so I think it should be officially supported.

@wentasah

I was able to create debian11 packages myself by adding debian11 to the Makefile and running make debian11. I also rebooted the system with systemd.unified_cgroup_hierarchy=false on the kernel command line.
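
For anyone unsure which cgroup mode their host is currently in, a quick check before and after changing the kernel command line may help. This is only a minimal sketch, not part of the NVIDIA tooling: the function name `cgroup_mode` is made up for illustration, and it assumes the usual layout in which a cgroup v2 (unified) mount exposes a `cgroup.controllers` file at its root.

```shell
#!/bin/sh
# Hypothetical helper: report which cgroup layout is mounted at a given root.
# A unified (v2) hierarchy has a cgroup.controllers file at its top level;
# anything else is treated here as v1 or hybrid.
cgroup_mode() {
    root="${1:-/sys/fs/cgroup}"
    if [ -f "$root/cgroup.controllers" ]; then
        echo "v2"
    else
        echo "v1-or-hybrid"
    fi
}

echo "cgroup mode: $(cgroup_mode)"

# Show whether the override mentioned above is present on the running kernel's
# command line; print a note if it is not.
grep -o 'systemd.unified_cgroup_hierarchy=[^ ]*' /proc/cmdline \
    || echo "no cgroup override on kernel command line"
```

If the first line reports v2 and GPU containers fail with the older packages, booting with the parameter from the comment above is the workaround described in this thread.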

@elezar
Member

elezar commented Aug 30, 2021

@wentasah which changes to the Makefile were required apart from adding debian11 to the targets list? If that was all that was required, the generated debian11 packages should not differ significantly from the existing debian10 packages, which could be used instead. The setting you specified is out of scope for libnvidia-container.

@wentasah

I'm not very familiar with the NVIDIA container stuff; I just needed GPU access from within the containers. I didn't succeed with the debian10 packages, but that was perhaps due to cgroup v2 (I'm not sure now).

I just added debian11 to Makefiles of all needed packages (container-runtime, container-toolkit, libnvidia-container, nvidia-docker). For libnvidia-container I also needed a small workaround (found here):

diff --git a/mk/Dockerfile.debian b/mk/Dockerfile.debian
index 8e8a560..8de0b19 100644
--- a/mk/Dockerfile.debian
+++ b/mk/Dockerfile.debian
@@ -36,7 +36,8 @@ ENV WITH_LIBELF=${WITH_LIBELF}
 ENV WITH_TIRPC=${WITH_TIRPC}
 ENV WITH_SECCOMP=${WITH_SECCOMP}
 
-RUN make distclean && make -j"$(nproc)"
+RUN make distclean && make -j"$(nproc)" || mv -v 'deps/src/elftoolchain-0.7.1/libelf/name libelf.so.1' deps/src/elftoolchain-0.7.1/libelf/libelf.so.1
+RUN make install
 
 ENV DIST_DIR /dist
 VOLUME $DIST_DIR
diff --git a/mk/docker.mk b/mk/docker.mk
index efcfaed..24ae10c 100644
--- a/mk/docker.mk
+++ b/mk/docker.mk
@@ -27,14 +27,14 @@ DIST_DIR     ?= $(CURDIR)/dist
 MAKE_DIR     ?= $(CURDIR)/mk
 
 # Supported OSs by architecture
-AMD64_TARGETS := ubuntu20.04 ubuntu18.04 ubuntu16.04 debian10 debian9
+AMD64_TARGETS := ubuntu20.04 ubuntu18.04 ubuntu16.04 debian11 debian10 debian9
 X86_64_TARGETS := centos7 centos8 rhel7 rhel8 amazonlinux1 amazonlinux2 opensuse-leap15.1
 PPC64LE_TARGETS := ubuntu18.04 ubuntu16.04 centos7 centos8 rhel7 rhel8
 ARM64_TARGETS := ubuntu18.04

It would be better if it worked with cgroup v2, but that's not critical for me (at least for now).
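
For the other repositories, where only the targets list needs to change, the edit could be scripted along these lines. This is a rough sketch under assumptions: the helper name `add_debian11_target` is invented, and it assumes the file has an `AMD64_TARGETS := ... debian10 ...` line shaped like the one in the diff above, which may not hold for every repository.

```shell
#!/bin/sh
# Hypothetical helper: add debian11 in front of debian10 on the AMD64_TARGETS
# line of a docker.mk-style file. Does nothing if debian11 is already listed.
add_debian11_target() {
    mk_file="$1"
    if grep -q '^AMD64_TARGETS := .*debian11' "$mk_file"; then
        return 0
    fi
    # Write to a temporary file and move it into place, so the original is
    # only replaced when sed succeeds.
    sed 's/^\(AMD64_TARGETS := .*\)debian10/\1debian11 debian10/' "$mk_file" > "$mk_file.tmp" \
        && mv "$mk_file.tmp" "$mk_file"
}

# Example (path taken from the diff above; run from a libnvidia-container checkout):
#   add_debian11_target mk/docker.mk && make debian11
```

Running the helper twice is harmless because of the early return, which makes it safe to loop over several checkouts.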

@DenizUgur
Author

I've tested what @wentasah suggested but it didn't work as expected.

[Screenshot: Screen Shot 2021-08-30 at 12 58 05]

@wentasah

wentasah commented Aug 30, 2021

Not sure why it doesn't work, but this is what I see on my system:

user@server:~$ cat /etc/debian_version 
11.0
user@server:~$ docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Mon Aug 30 11:24:21 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:18:00.0 Off |                  N/A |
| 26%   35C    P0    19W / 250W |      0MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
user@server:~$ dlocate libnvidia-ml.so
libnvidia-ml1:amd64: /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.460.91.03
libnvidia-ml1:amd64: /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so
libnvidia-ml1:amd64: /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.1
nvidia-cuda-dev:amd64: /usr/lib/x86_64-linux-gnu/stubs/libnvidia-ml.so

libnvidia-ml1 is the official Debian package.

@DenizUgur
Author

Alright, I've found the solution in #1163. I can now confirm that CUDA is working as expected. I've tested it with Python.

>>> import torch
>>> torch.cuda.is_available()
True

Since it is possible to set up nvidia-docker on Debian 11, I'm closing this issue. Thank you @wentasah for your workaround.

@Herbrant

Herbrant commented Sep 2, 2021

Nice! But do you know anything about an official Debian package?

@klueska
Contributor

klueska commented Oct 11, 2021

Please see this comment for an update regarding cgroupv2 support and debian 11 support:
#1549 (comment)

@jelmd

jelmd commented Oct 11, 2021

FWIW: We switched over to LXC (Ubuntu 20.04) and it works flawlessly out of the box with systemd.unified_cgroup_hierarchy=1 (i.e. w/o all the docker junk/bloatware, and we got much more flexibility + the freedom to do what we want ;-) ).

@klueska
Contributor

klueska commented Dec 8, 2021

We now have an RC of libnvidia-container out that adds support for cgroupv2.

If you would like to try it out, make sure to add the experimental repo to your package sources and install the latest packages:

For DEBs

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update
sudo apt-get install -y libnvidia-container-tools libnvidia-container1

For RPMs

sudo yum-config-manager --enable libnvidia-container-experimental
sudo yum install -y libnvidia-container-tools libnvidia-container1
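
For anyone wary of editing files under /etc/apt blindly, here is a sketch of what the sed expression in the DEB step does, shown on a throwaway file. The sample file contents and the helper name `preview_enable_experimental` are invented for illustration (the real libnvidia-container.list will look different); only the sed expression itself is taken from the step above.

```shell
#!/bin/sh
# Hypothetical helper: print a file with the leading '#' stripped from every
# line that mentions "experimental", leaving all other lines untouched.
# Same expression as in the DEB step, minus -i, so nothing is modified in place.
preview_enable_experimental() {
    sed -e '/experimental/ s/^#//g' "$1"
}

# Made-up sample list; example.invalid is a placeholder host.
sample=$(mktemp)
cat > "$sample" <<'EOF'
deb https://example.invalid/stable/deb/ /
#deb https://example.invalid/experimental/deb/ /
# unrelated comment
EOF

preview_enable_experimental "$sample"
rm -f "$sample"
```

Only the second sample line loses its `#`; the unrelated comment stays commented because it does not contain the word "experimental". Pointing the helper at the real list file gives a preview before running the `-i` variant with sudo.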

@klueska
Contributor

klueska commented Jan 28, 2022

libnvidia-container-1.8.0-rc.2 is now live with some minor updates to fix some edge cases around cgroupv2 support.
Assuming you followed the above, a simple update --> install should give you the latest.

Note: This does not directly add debian11 support, but you can point to the debian10 repo and install from there for now.

@klueska
Contributor

klueska commented Feb 4, 2022

libnvidia-container-1.8.0 with cgroupv2 support is now GA

Release notes here:
https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.8.0

@klueska
Contributor

klueska commented Feb 4, 2022

Debian 11 support has now been added such that running the following should now work as expected:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
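
To see what the `distribution` variable resolves to before touching any apt configuration, something like the following may be useful. It is a sketch, not an official script: the function names `detect_distribution` and `nvidia_docker_list_url` are made up, and the URL pattern is simply copied from the command above.

```shell
#!/bin/sh
# Hypothetical helper: compute the distribution string the same way the
# one-liner does, but from an arbitrary os-release file so it can be inspected.
detect_distribution() {
    os_release="${1:-/etc/os-release}"
    # Source in a subshell so ID and VERSION_ID do not leak into the caller.
    ( . "$os_release" && echo "${ID}${VERSION_ID}" )
}

# Build the repo list URL used above for a given distribution string.
nvidia_docker_list_url() {
    echo "https://nvidia.github.io/nvidia-docker/$1/nvidia-docker.list"
}

if [ -r /etc/os-release ]; then
    distribution=$(detect_distribution)
    echo "distribution: $distribution"
    echo "list URL:     $(nvidia_docker_list_url "$distribution")"
fi
```

On a Debian 11 host, where os-release carries `ID=debian` and `VERSION_ID="11"`, the string comes out as `debian11`, which is the repository path this comment says is now available.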

@johnnynunez

I'm using Pop!_OS 21.10. How can I force it to install the latest version? With the experimental repo I only find 2.8.0, which is not compatible with cgroupv2.

@elezar
Member

elezar commented Feb 18, 2022

@johnnync13 specifying the distribution explicitly as either ubuntu18.04 or ubuntu20.04 should allow you to install the packages. For example:

distribution=ubuntu18.04 \
   && curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list
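
On derivative distributions the override could also be wrapped in a small fallback, roughly as sketched below. The helper name `repo_distribution` is invented, and the `pop*` to `ubuntu20.04` mapping is an assumption based on the suggestion above, not an official compatibility table.

```shell
#!/bin/sh
# Hypothetical helper: map a detected distribution string to the one to use in
# the repository URL. Unknown strings are passed through unchanged.
repo_distribution() {
    case "$1" in
        pop*) echo "ubuntu20.04" ;;   # assumption: treat Pop!_OS as ubuntu20.04
        *)    echo "$1" ;;
    esac
}

if [ -r /etc/os-release ]; then
    detected=$( . /etc/os-release; echo "${ID}${VERSION_ID}" )
    echo "detected: $detected, using: $(repo_distribution "$detected")"
fi
```

The result can then be assigned to `distribution` in place of the hard-coded value in the command above.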

@johnnynunez

I fixed it with this:
This happens because Pop!_OS's own source for the NVIDIA driver has a higher priority than NVIDIA's official source, while the nvidia-docker2 dependencies it provides lag behind those in NVIDIA's official source. To fix that, we can give the NVIDIA Docker source a higher priority as follows.

vi /etc/apt/preferences.d/nvidia-docker-pin-1002
with content:
Package: *
Pin: origin nvidia.github.io
Pin-Priority: 1002

PS: https://gist.github.com/kuang-da/2796a792ced96deaf466fdfb7651aa2e
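
For those who prefer not to edit the pin file by hand, the stanza described above could be written non-interactively, roughly like this. It is a sketch: the helper name `write_nvidia_pin` and the scratch-file workflow are illustrative choices, while the three stanza lines are copied from the comment above.

```shell
#!/bin/sh
# Hypothetical helper: write the pin stanza from the comment above to a file.
write_nvidia_pin() {
    cat > "$1" <<'EOF'
Package: *
Pin: origin nvidia.github.io
Pin-Priority: 1002
EOF
}

# Write to a scratch file first and review it before touching /etc/apt.
scratch=$(mktemp)
write_nvidia_pin "$scratch"
cat "$scratch"
rm -f "$scratch"

# To apply for real (requires root), something like:
#   write_nvidia_pin /tmp/nvidia-docker-pin-1002
#   sudo install -m 0644 /tmp/nvidia-docker-pin-1002 /etc/apt/preferences.d/nvidia-docker-pin-1002
#   sudo apt-get update && apt-cache policy nvidia-docker2
```

After applying it, `apt-cache policy nvidia-docker2` shows which source the candidate version comes from, which is a quick way to confirm the pin took effect.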
