Running nvidia-container-runtime with podman is blowing up. #85
Comments
The Podman team would like to work with you guys to get this working well in both rootful and rootless containers if possible. But we need someone to work with.
@sjug FYI
Hello! @rhatdan, do you mind filling out the following issue template: https://github.com/NVIDIA/nvidia-docker/blob/master/.github/ISSUE_TEMPLATE.md Thanks!
I can work with the podman team.
@hholst80 FYI
@nvjmayo Thanks for the suggestions. Some good news and some less good. This works rootless: in fact, rootless works with or without /usr/share/containers/oci/hooks.d/01-nvhook.json installed, using the image nvcr.io/nvidia/cuda. Running as root continues to fail when no-cgroups = true for either container, returning:
Strange, I would not expect podman to run a hook that did not have a JSON file describing it.
@eaepstein I'm still struggling to reproduce the issue you see. Using docker.io/nvidia/cuda also works for me with the hooks dir.
Without the hook I would expect to see a failure roughly like:
This is because the libraries and tools get installed by the hook in order to match the host drivers (an unfortunate limitation of tightly coupled driver+library releases). I think this is a configuration issue and not an issue with the container image (docker.io/nvidia/cuda vs nvcr.io/nvidia/cuda). Reviewing my earlier posts, I recommend changing my 01-nvhook.json and removing the
@nvjmayo We started from scratch with a new machine (CentOS Linux release 7.7.1908) and both the docker.io and nvcr.io images are working for us now too. Also, --hooks-dir must now be specified for both to work. Thanks for the help!
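A minimal invocation along these lines (illustrative only; the hook path and image are the ones mentioned in this thread, so adjust them to your system):

```sh
# Run the CUDA image with podman pointed at the directory containing the
# NVIDIA OCI hook JSON, then check GPU visibility with nvidia-smi.
podman run --rm --hooks-dir /usr/share/containers/oci/hooks.d \
    nvcr.io/nvidia/cuda nvidia-smi
```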
@rhatdan @nvjmayo It turns out that getting rootless podman working with NVIDIA on CentOS 7 is a bit more complicated, at least for us. Here is our scenario on a brand new CentOS 7.7 machine:
Conclusion: running an NVIDIA container with podman as root changes the environment so that rootless works. The environment is cleared on reboot. One other comment: podman as root and rootless podman cannot run with the same /etc/nvidia-container-runtime/config.toml; no-cgroups must be false for root and true for rootless.
If the nvidia hook is doing any privileged operations, like modifying /dev and adding device nodes, then this will not work with rootless. (In rootless mode, all processes run with the user's UID.) Probably when you run rootful, it does the privileged operations, so the next time you run rootless, those activities do not need to be done. For rootless systems, I would suggest that the /dev and nvidia operations be done in a systemd unit file, so the system is preconfigured and the rootless jobs will then work fine.
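As a rough sketch of that suggestion (not an official recipe; the unit name, paths, and choice of commands are assumptions, and nvidia-modprobe flags may vary by driver version), a oneshot unit could pre-create the device nodes at boot:

```sh
# Write a oneshot unit that creates the NVIDIA device nodes before any
# rootless container needs them, then enable it.
cat <<'EOF' | sudo tee /etc/systemd/system/nvidia-device-nodes.service
[Unit]
Description=Pre-create NVIDIA device nodes for rootless containers
After=systemd-modules-load.service

[Service]
Type=oneshot
# nvidia-smi creates /dev/nvidia0 and /dev/nvidiactl via the setuid helper;
# nvidia-modprobe -u loads nvidia-uvm and creates its device node.
ExecStart=/usr/bin/nvidia-smi
ExecStart=/usr/bin/nvidia-modprobe -u -c0

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable --now nvidia-device-nodes.service
```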
After running nvidia/cuda with rootful podman, the following exist: None of these devices exist after boot. Running nvidia-smi rootless (no podman) creates: I created the other three entries using "sudo mknod -m 666 ..." but that is not enough to run rootless. Something else is needed in the environment. Running nvidia/cuda with rootful podman at boot would work, but it's not pretty. Thanks for the suggestion.
This behavior is documented in our installation guide: From a userns you can't There is already Another option is to use
Udev rules make sense to me.
@flx42 The udev file only created the first two and was not sufficient by itself. Many thanks for your help.
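For reference, the remaining nodes can be created by hand; here is a sketch based on the snippet in the NVIDIA driver README (the nvidia-uvm-tools minor number is an assumption to verify against your driver):

```sh
# Load the UVM module, look up its dynamically assigned major number, and
# create the device nodes that the udev rule above does not cover.
sudo modprobe nvidia-uvm
major=$(awk '$2 == "nvidia-uvm" {print $1}' /proc/devices)
sudo mknod -m 666 /dev/nvidia-uvm c "$major" 0
sudo mknod -m 666 /dev/nvidia-uvm-tools c "$major" 1
```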
Thanks guys, with insight from this issue and others, I was able to get podman working with my Quadro in EL7 using Once the dust settles on how to get GPU support in rootless podman in EL7, a step-by-step guide would make for a great blog post and/or an entry in the podman and/or nvidia documentation.
Hello @nvjmayo and @rhatdan. I'm wondering if there is an update on this issue or this one for how to access NVIDIA GPUs from containers run rootless with podman. On RHEL 8.1, with the default /etc/nvidia-container-runtime/config.toml, and running containers as root, GPU access works as expected. Rootless does not work by default; it fails with cgroup-related errors (as expected). After modifying the config.toml file -- setting no-cgroups = true and changing the debug log file -- rootless works. However, these changes make GPU access fail in containers run as root, with the error "Failed to initialize NVML: Unknown Error." Please let me know if there is any recent documentation on how to do this beyond these two issues.
Steps to get it working on RHEL 8.1:
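The numbered steps did not survive the copy; as a rough sketch of the setup the rest of this thread converges on (not necessarily the exact steps from this comment, and assuming the NVIDIA driver, nvidia-container-toolkit, and its OCI hook are already installed):

```sh
# 1. Let the hook run without cgroup setup, which rootless podman requires.
#    The default config ships the key commented out as "#no-cgroups = false".
sudo sed -i 's/^#\?no-cgroups = .*/no-cgroups = true/' \
    /etc/nvidia-container-runtime/config.toml

# 2. Run the container rootless; disabling SELinux label separation avoids
#    denials when the hook bind-mounts host driver files into the container.
podman run --rm --security-opt label=disable \
    --hooks-dir /usr/share/containers/oci/hooks.d \
    docker.io/nvidia/cuda nvidia-smi
```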
/cc @dagrayvid
Thanks @jamescassell. I repeated those steps on RHEL 8.1, and nvidia-smi works as expected when running rootless. However, once those changes are made, I am unable to run nvidia-smi in a container run as root. Is this behaviour expected, or is there some change in CLI flags needed when running as root? Running as root did work before making these changes. Is there a way to configure a system so that we can utilize GPUs with podman as both root and non-root users?
I can't run podman rootless with a GPU; can someone help me?
Output:
But runc exists:
See particularly step 4. #85 (comment)
This looks like the nvidia plugin is searching for a hard-coded path to runc?
[updated] Hi @jamescassell, unfortunately it does not work for me.
For anybody who has the same issue as me: add your user to the video group if not present:
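The command itself was lost in the copy; the usual way to do this for the current user is:

```sh
# Add the current user to the video group; log out and back in afterwards
# for the new group membership to take effect.
sudo usermod -aG video "$USER"
```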
Hi, have you figured out a solution? Rootless running only works after launching the container as root at least once, and a reboot resets everything.
For those dropping into this issue, nvidia has documented getting GPU acceleration working with podman. |
That's awesome! The documentation is almost the same as my fix here in this thread :D
Any chance they can update the version of podman in the example? That one is pretty old.
@fuomag9 Are you using
Working for me with both
For RHEL 8 systems where SELinux is
For folks finding this issue, especially anyone trying to do this on RHEL 8 after following https://www.redhat.com/en/blog/how-use-gpus-containers-bare-metal-rhel-8, here's the current status and the known issues that I've encountered. Hopefully this saves someone some time. As noted in the comments above, you can run containers as
Things that need to be done ahead of time to run rootless containers are documented in https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#step-3-rootless-containers-setup but the cheat-sheet version is:
If you do that, then running:
Now for the workaround. You don't need this hook, you just need the
Here's what it looks like:
This is what the above is doing. We run
We then loop through each of those files and run
We loop through each of those files and use
This effectively does what the hook does, using the tools the hook provides, but does not require the user running the container to be
Hopefully this saves someone else a few hours.
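The commands themselves were lost in the copy above; a minimal sketch of this style of workaround, assuming the enumeration tool is `nvidia-container-cli list` and that the files are passed to podman with `--device` and `--volume` (the original comment's exact flags are not preserved):

```sh
#!/usr/bin/env bash
# Enumerate the device nodes, libraries, and binaries the NVIDIA hook would
# normally inject, and hand them to podman explicitly so that no root
# privileges are needed at container start.
devices=()
volumes=()
for f in $(nvidia-container-cli list); do
    if [ -c "$f" ]; then                 # character devices, e.g. /dev/nvidia0
        devices+=(--device "$f")
    else                                 # driver libraries and tools
        volumes+=(--volume "$f:$f:ro")
    fi
done

# Label separation is disabled so the bind-mounted host files are readable;
# running ldconfig inside the container may still be needed so the driver
# libraries are picked up.
podman run --rm --security-opt label=disable \
    "${devices[@]}" "${volumes[@]}" \
    docker.io/nvidia/cuda nvidia-smi
```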
@decandia50 Excellent information! It really deserves to be highlighted. Would you consider posting it as a blog if we connect you with some people?
Please do not write a blog post with the above information. While the procedure may work on some setups, it is not a supported use of the
The better solution is to use podman's integrated CDI support to have podman do the work that libnvidia-container would have otherwise done. The future of the NVIDIA stack (and device support in container runtimes in general) is CDI, and starting to use this method now will future-proof how you access generic devices. Please see below for details on CDI:
We have spent the last year rearchitecting the NVIDIA container stack to work together with CDI, and as part of this we have a tool coming out with the next release that will be able to generate CDI specs for NVIDIA devices for use with podman (and any other CDI-compatible runtimes). In the meantime, you can generate a CDI spec manually, or wait for @elezar to comment on a better method to get a CDI spec generated today.
Here is an example of a (fully functional) CDI spec on my DGX-A100 machine (excluding MIG devices):
Maintaining an nvidia.json CDI spec file for multiple machines with different NVIDIA drivers and other libraries is a bit painful.
@Ru13en we have a WIP Merge Request that adds an:
command to the NVIDIA Container Toolkit. The idea is that this could be run at boot or triggered by a driver installation or upgrade. We are working on getting an
@elezar Any ETA by when we can expect the |
It will be released next week. |
I tried using the WIP version of |
Thanks for the confirmation @starry91. The official release of v1.12.0-rc.1 has been delayed a little, but thanks for testing the tooling nonetheless. I will have a look at the issue you created and update the tooling before releasing the rc.
We have recently updated our Podman support and now recommend using CDI -- which is supported natively in more recent Podman versions. See https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-podman for details. |
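For a quick orientation, the CDI flow from the linked guide looks roughly like this (command names from the NVIDIA Container Toolkit and recent podman; check the guide for the exact options for your versions):

```sh
# Generate a CDI spec describing the GPUs on this host.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Request all GPUs (or a specific index/UUID) by CDI name at run time.
podman run --rm --security-opt label=disable \
    --device nvidia.com/gpu=all ubuntu nvidia-smi
```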
Issue or feature description
Rootless and rootful podman do not work with the nvidia plugin.
Steps to reproduce the issue
Install the nvidia plugin and configure it to run with podman.
Execute the podman command and check whether the devices are configured correctly.
Information to attach (optional if deemed irrelevant)
Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
Kernel version from uname -a
Fedora 30 and later
Any relevant kernel output lines from dmesg
Driver information from nvidia-smi -a
Docker version from docker version
NVIDIA packages version from dpkg -l 'nvidia' or rpm -qa 'nvidia'
NVIDIA container library version from nvidia-container-cli -V
NVIDIA container library logs (see troubleshooting)
Docker command, image and tag used
I am reporting this based on other users' complaints. This is what they said:
We discovered that the Ubuntu 18.04 machine needed a configuration change to get rootless working with nvidia:
"no-cgroups = true" was set in /etc/nvidia-container-runtime/config.toml
Unfortunately this config change did not work on CentOS 7, but it did change the rootless error to:
nvidia-container-cli: initialization error: cuda error: unknown error
This config change breaks podman running from root, with the error:
Failed to initialize NVML: Unknown Error
Interestingly, root on Ubuntu gets the same error even though rootless works.