This repository has been archived by the owner on Oct 27, 2023. It is now read-only.

rootless, subuid-less GPU support with podman #145

Closed
qhaas opened this issue Jul 21, 2021 · 3 comments

Comments


qhaas commented Jul 21, 2021

Given how Issue #85 is diverging in different directions and becoming a catch-all for all things podman, I thought I'd break the issue described in this comment out into its own issue... In certain situations (e.g. podman issue 8580), it is not practical to set up subuid / subgid for each user, so we'd like to get GPU acceleration working without having to do so, something Singularity is capable of.
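For context, rootless podman normally expects a per-user allocation in /etc/subuid and /etc/subgid, one line each in user:start:count form. The entries below are only illustrative (username and range are placeholders); this is the per-user state we'd like to avoid having to manage:

$ grep myuser: /etc/subuid /etc/subgid
/etc/subuid:myuser:100000:65536
/etc/subgid:myuser:100000:65536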

Test System (using the container-tools:3.0 appstream):

$ cat /etc/redhat-release 
CentOS Linux release 8.4.2105
$ uname -r
4.18.0-305.7.1.el8_4.x86_64
$ nvidia-smi -L
GPU 0: Tesla V100-PCIE-32GB
$ nvidia-smi | grep Version | awk '{print $3}'
470.42.01
$ nvidia-container-cli --version | head -1
version: 1.4.0
$ crun --version | grep version
crun version 0.18
$ runc --version | grep version
runc version spec: 1.0.2-dev
$ podman --version
podman version 3.0.2-dev

nvidia-container-runtime config (note that no-cgroups is now true and debug files are going to /tmp, per Issue #85):

$ cat /etc/nvidia-container-runtime/config.toml 
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
debug = "/tmp/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = true
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
debug = "/tmp/nvidia-container-runtime.log"

podman storage config (per Issue #85 and rootless podman guide):

$ cat ~/.config/containers/storage.conf
[storage]
driver = "overlay"
graphroot = "/tmp/${USER}-containers-peak"
rootless_storage_path = "${HOME}/.local/share/containers/storage"

[storage.options]
additionalimagestores = [
]

[storage.options.overlay]
ignore_chown_errors = "true"
mount_program = "/usr/bin/fuse-overlayfs"
mountopt = "nodev,metacopy=on"

[storage.options.thinpool]
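If it helps with reproduction, podman can report which storage driver and graph root it actually ended up using for the rootless user (template fields per podman 3.x):

$ podman info --format '{{.Store.GraphDriverName}} {{.Store.GraphRoot}}'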

With subuid / subgid set, things work fine; logs posted as nct_works_log.txt:

$ grep ${USER}: /etc/subuid | wc -l
1
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ docker.io/nvidia/cuda:10.2-base-centos8 nvidia-smi -L
GPU 0: Tesla V100-PCIE-32GB (UUID: GPU-0a55d110-f8ea-4209-baa7-0e5675c7e832)
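For anyone recreating that working state: the subuid / subgid entries above are typically added along these lines (the range is illustrative), followed by podman system migrate so the rootless user namespace is rebuilt:

$ echo "${USER}:100000:65536" | sudo tee -a /etc/subuid /etc/subgid
$ podman system migrate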

Without subuid / subgid set, GPU acceleration fails, but non-GPU containers still work. Logs posted as nct_fails_log.txt:

$ grep ${USER}: /etc/subuid | wc -l
0
$ podman run --rm docker.io/centos:8 cat /etc/redhat-release
CentOS Linux release 8.3.2011
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ docker.io/nvidia/cuda:10.2-base-centos8 nvidia-smi -L
Error: OCI runtime error: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request

Per suggestions online, I added the account without subuid / subgid to the video group, but that did not help. I'm also not clear on the implications of adding a user to the video group, so I asked over on the NVIDIA forums.
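In case it narrows things down, the hook's "driver error: failed to process request" can also be chased outside of podman by invoking the CLI directly as the unprivileged user; the flags below just mirror what the config above enables, and I'd expect it to hit the same initialization error, though I haven't confirmed that:

$ nvidia-container-cli --load-kmods --debug=/dev/stderr info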


qhaas commented Jul 21, 2021

The above used runc. Retried with crun: containers work fine without GPU acceleration, but still fail with it when subuid is not set. Logs attached as nct_fails_crun_log.txt

$ grep 'runtime =' /usr/share/containers/containers.conf
runtime = "crun"
#runtime = "runc"
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ docker.io/nvidia/cuda:10.2-base-centos8 nvidia-smi -L
Error: OCI runtime error: error executing hook `/usr/bin/nvidia-container-toolkit` (exit code: 1)
$ podman run --rm docker.io/centos:8 cat /etc/redhat-release
CentOS Linux release 8.3.2011
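For anyone reproducing this, the active OCI runtime can also be confirmed, or overridden per invocation, without editing containers.conf (both are standard podman options / info fields):

$ podman info --format '{{.Host.OCIRuntime.Path}}'
$ podman --runtime /usr/bin/runc run --rm docker.io/centos:8 true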


qhaas commented Aug 10, 2021

As an alternative, I created this issue over on the podman GitHub to see if Singularity's approach to GPU acceleration is applicable to podman.
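For reference, Singularity's --nv option boils down to passing the NVIDIA device nodes through and bind-mounting the user-space driver pieces into the container, rather than relying on an OCI hook. A rough, untested podman equivalent might look like the following (the device nodes and library paths are illustrative and driver-version dependent):

$ podman run --rm --security-opt=label=disable \
    --device /dev/nvidiactl --device /dev/nvidia-uvm --device /dev/nvidia0 \
    -v /usr/bin/nvidia-smi:/usr/bin/nvidia-smi:ro \
    -v /usr/lib64/libnvidia-ml.so.1:/usr/lib64/libnvidia-ml.so.1:ro \
    docker.io/centos:8 nvidia-smi -L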


elezar commented Oct 20, 2023

We have recently reworked our podman support and now suggest using CDI to request devices. Please see the updated documentation and feel free to open a new issue against https://github.com/NVIDIA/nvidia-container-toolkit if problems persist.
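For anyone landing here later, the CDI flow is roughly: generate a CDI spec once with nvidia-ctk, then request devices by CDI name on the podman command line (commands as documented for recent nvidia-container-toolkit releases; image tag reused from the runs above for illustration):

$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
$ podman run --rm --device nvidia.com/gpu=all docker.io/nvidia/cuda:10.2-base-centos8 nvidia-smi -L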
