This repository has been archived by the owner on Oct 27, 2023. It is now read-only.

rootless, subuid-less GPU support with podman #145

Closed
qhaas opened this issue Jul 21, 2021 · 3 comments

Comments


qhaas commented Jul 21, 2021

Given how Issue #85 is diverging in different directions and becoming a catch-all for all things podman, I thought I'd break the issue described in this comment out into its own issue... In certain situations (e.g. podman issue 8580), it is not practical to set up subuid / subgid for each user, so we'd like to get GPU acceleration working without having to do so, something Singularity is capable of.
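For context, rootless podman normally expects a per-user allocation in /etc/subuid and /etc/subgid, one line each in user:start:count form. The entries below are only illustrative (username and range are placeholders); this is the per-user state we'd like to avoid having to manage:

$ grep myuser: /etc/subuid /etc/subgid
/etc/subuid:myuser:100000:65536
/etc/subgid:myuser:100000:65536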

Test System (using the container-tools:3.0 appstream):

$ cat /etc/redhat-release 
CentOS Linux release 8.4.2105
$ uname -r
4.18.0-305.7.1.el8_4.x86_64
$ nvidia-smi -L
GPU 0: Tesla V100-PCIE-32GB
$ nvidia-smi | grep Version | awk '{print $3}'
470.42.01
$ nvidia-container-cli --version | head -1
version: 1.4.0
$ crun --version | grep version
crun version 0.18
$ runc --version | grep version
runc version spec: 1.0.2-dev
$ podman --version
podman version 3.0.2-dev

nvidia-container-runtime config (note that no-cgroups is now true and debug files are going to /tmp, per Issue #85):

$ cat /etc/nvidia-container-runtime/config.toml 
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
debug = "/tmp/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = true
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
debug = "/tmp/nvidia-container-runtime.log"

podman storage config (per Issue #85 and rootless podman guide):

$ cat ~/.config/containers/storage.conf
[storage]
driver = "overlay"
graphroot = "/tmp/${USER}-containers-peak"
rootless_storage_path = "${HOME}/.local/share/containers/storage"

[storage.options]
additionalimagestores = [
]

[storage.options.overlay]
ignore_chown_errors = "true"
mount_program = "/usr/bin/fuse-overlayfs"
mountopt = "nodev,metacopy=on"

[storage.options.thinpool]
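If it helps with reproduction, podman can report which storage driver and graph root it actually ended up using for the rootless user (template fields per podman 3.x):

$ podman info --format '{{.Store.GraphDriverName}} {{.Store.GraphRoot}}'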

With subuid / subgid set, things work fine; logs posted as nct_works_log.txt:

$ grep ${USER}: /etc/subuid | wc -l
1
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ docker.io/nvidia/cuda:10.2-base-centos8 nvidia-smi -L
GPU 0: Tesla V100-PCIE-32GB (UUID: GPU-0a55d110-f8ea-4209-baa7-0e5675c7e832)
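For anyone recreating that working state: the subuid / subgid entries above are typically added along these lines (the range is illustrative), followed by podman system migrate so the rootless user namespace is rebuilt:

$ echo "${USER}:100000:65536" | sudo tee -a /etc/subuid /etc/subgid
$ podman system migrate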

Without subuid / subgid set, GPU acceleration fails, but non-GPU containers still work. Logs posted as nct_fails_log.txt:

$ grep ${USER}: /etc/subuid | wc -l
0
$ podman run --rm docker.io/centos:8 cat /etc/redhat-release
CentOS Linux release 8.3.2011
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ docker.io/nvidia/cuda:10.2-base-centos8 nvidia-smi -L
Error: OCI runtime error: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request

Per suggestions online, I added the account without subuid / subgid to the video group, but that did not help. I'm also not clear on the implications of adding a user to the video group, so I asked over on the NVIDIA forums.
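In case it narrows things down, the hook's "driver error: failed to process request" can also be chased outside of podman by invoking the CLI directly as the unprivileged user; the flags below just mirror what the config above enables, and I'd expect it to hit the same initialization error, though I haven't confirmed that:

$ nvidia-container-cli --load-kmods --debug=/dev/stderr info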


qhaas commented Jul 21, 2021

The above used runc. Retried with crun: containers work fine without GPU acceleration, but still fail with it when subuid is not set. Logs attached as nct_fails_crun_log.txt

$ grep 'runtime =' /usr/share/containers/containers.conf
runtime = "crun"
#runtime = "runc"
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ docker.io/nvidia/cuda:10.2-base-centos8 nvidia-smi -L
Error: OCI runtime error: error executing hook `/usr/bin/nvidia-container-toolkit` (exit code: 1)
$ podman run --rm docker.io/centos:8 cat /etc/redhat-release
CentOS Linux release 8.3.2011
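For anyone reproducing this, the active OCI runtime can also be confirmed, or overridden per invocation, without editing containers.conf (both are standard podman options / info fields):

$ podman info --format '{{.Host.OCIRuntime.Path}}'
$ podman --runtime /usr/bin/runc run --rm docker.io/centos:8 true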


qhaas commented Aug 10, 2021

As an alternative, I created this issue over on the podman GitHub to see if Singularity's approach to GPU acceleration is applicable to podman.
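For reference, Singularity's --nv option boils down to passing the NVIDIA device nodes through and bind-mounting the user-space driver pieces into the container, rather than relying on an OCI hook. A rough, untested podman equivalent might look like the following (the device nodes and library paths are illustrative and driver-version dependent):

$ podman run --rm --security-opt=label=disable \
    --device /dev/nvidiactl --device /dev/nvidia-uvm --device /dev/nvidia0 \
    -v /usr/bin/nvidia-smi:/usr/bin/nvidia-smi:ro \
    -v /usr/lib64/libnvidia-ml.so.1:/usr/lib64/libnvidia-ml.so.1:ro \
    docker.io/centos:8 nvidia-smi -L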


elezar commented Oct 20, 2023

We have recently reworked our podman support and now suggest using CDI to request devices. Please see the updated documentation and feel free to open a new issue against https://github.com/NVIDIA/nvidia-container-toolkit if problems persist.
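For anyone landing here later, the CDI flow is roughly: generate a CDI spec once with nvidia-ctk, then request devices by CDI name on the podman command line (commands as documented for recent nvidia-container-toolkit releases; image tag reused from the runs above for illustration):

$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
$ podman run --rm --device nvidia.com/gpu=all docker.io/nvidia/cuda:10.2-base-centos8 nvidia-smi -L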
