Newer runc versions break support for NVIDIA GPUs #3708
Comments
/cc @kolyshkin |
Well, all I can say is that this is complicated; I was looking into that while reviewing PR #3568 (see its comments). The problem is that the runtime spec (https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md to be precise) defines a host device by its major:minor pair (the path is only used for the in-container device node). Similarly, both the cgroup v1 and cgroup v2 device controllers use the major:minor pair. OTOH systemd uses a path to refer to a device. Since we don't have a way to specify a path (in runtime-spec, that is), we need a way to convert from a major:minor pair to a device name (to supply that to systemd). Fortunately, we have /dev/{block,char}/MM:mm nodes so we can translate easily. Unfortunately, this is not working for NVIDIA devices. A workaround is proposed in #3568 but in essence this is a hardcoded kludge, and unfortunately I don't see a clean way to handle this (other than introducing host path to runtime-spec, that is). The non-hardcoded kludge would be to read and parse the whole /dev, trying to find the node with the needed major:minor. This is time consuming (we need to Now, all this is happening because NVIDIA doesn't follow the standard practice of device registration. So, for me, the best solution seems to be doing those /dev/char symlinks you demonstrated above in your workaround in some post module loading scripts. If you have any ideas @klueska please let us know. |
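To make the two lookup strategies in the comment above concrete, here is a minimal Python sketch (not runc's actual Go code). It first tries the /dev/{block,char}/MAJOR:MINOR symlink and, if that is missing, falls back to walking the whole device directory looking for a node with the matching numbers. The function names and the `dev_root` parameter are hypothetical and exist only for illustration.

```python
import os
import stat


def dev_symlink_path(dev_type, major, minor, dev_root="/dev"):
    """Build the conventional /dev/{block,char}/MAJOR:MINOR path."""
    subdir = {"c": "char", "b": "block"}[dev_type]
    return os.path.join(dev_root, subdir, f"{major}:{minor}")


def scan_for_node(dev_type, major, minor, dev_root="/dev"):
    """Slow fallback: walk dev_root and compare each device node's numbers."""
    is_wanted = stat.S_ISCHR if dev_type == "c" else stat.S_ISBLK
    for dirpath, _dirs, files in os.walk(dev_root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: symlinks themselves never match
            except OSError:
                continue
            if not is_wanted(st.st_mode):
                continue
            if (os.major(st.st_rdev), os.minor(st.st_rdev)) == (major, minor):
                return path
    return None


def host_path_for(dev_type, major, minor, dev_root="/dev"):
    """Try the cheap symlink lookup first, then the full scan."""
    link = dev_symlink_path(dev_type, major, minor, dev_root)
    if os.path.exists(link):
        return os.path.realpath(link)
    return scan_for_node(dev_type, major, minor, dev_root)
```

The symlink lookup is a single path check, while the fallback has to stat every entry under the device directory, which is the cost the comment refers to.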
Yeah, the hardcoded kludge in #3568 is definitely not the answer. It's not even guaranteed to work as the device majors (for some of the devices) can change, and the Two options I see that could help are:
This would allow us to insert a hook that creates the necessary Unfortunately, this won't work for the rootless case, but it would at least provide a well-defined "out" for most of our users. Regarding:
I agree the best solution is for NVIDIA to start registering its devices under /sys/dev. This problem would simply go away if this was done properly. Unfortunately, there is internal resistance to do so because of GPL compatibility issues. It could be done in the new open source version of the NVIDIA kernel module (https://github.com/NVIDIA/open-gpu-kernel-modules), but that is not production ready on all systems yet, and will likely take years for it to become so. Regarding:
Unfortunately simply running a post-module-loading script is not sufficient for two reasons:
Workarounds for both (1) and (2) are feasible (i.e. start a long-running script (on the host) that uses However, we'd much prefer a "just-in-time" solution (if possible) until the NVIDIA driver team is able to come up with a proper solution at the driver level. |
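For readers unfamiliar with the symlink workaround discussed in this thread, the following Python sketch shows what a one-shot helper might look like. It is an illustration under assumptions, not the script used by anyone here: the `nvidia*` glob pattern and all function names are hypothetical, and as the comment above points out, a one-shot pass like this would have to be re-run whenever device nodes appear later.

```python
import glob
import os
import stat


def char_link_name(rdev):
    """Format a device number as the MAJOR:MINOR name used under /dev/char."""
    return f"{os.major(rdev)}:{os.minor(rdev)}"


def plan_char_symlinks(dev_root="/dev", pattern="nvidia*"):
    """List (link, target) pairs for character devices matching pattern.

    The pattern is an assumption about how the device nodes are named.
    """
    plan = []
    for node in sorted(glob.glob(os.path.join(dev_root, pattern))):
        try:
            st = os.stat(node)
        except OSError:
            continue
        if not stat.S_ISCHR(st.st_mode):
            continue  # skip directories and regular files
        link = os.path.join(dev_root, "char", char_link_name(st.st_rdev))
        target = os.path.relpath(node, os.path.dirname(link))
        plan.append((link, target))
    return plan


def apply_plan(plan):
    """Create the planned symlinks, leaving existing entries untouched."""
    for link, target in plan:
        os.makedirs(os.path.dirname(link), exist_ok=True)
        if not os.path.lexists(link):
            os.symlink(target, link)
```

Separating planning from applying makes it possible to inspect what would be created before touching /dev, which requires root on a real host.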
Thinking about it a bit more.... Is there any particular reason you can't just use the absolute path to the device node inside the container? From what I understood in your response, |
I might finally have a fix for that: #3842 |
@elezar can you take a look while I am away. |
Sure. Will have a look tomorrow. |
With the fix I was able to give access to a device which does not have an appropriate |
I can confirm that the failure mode is no longer triggered with the fixes in place. I first started to reproduce the behaviour that @klueska described using:
that was installed on my test system. I then cloned the
{
  "runtimes": {
    "runc-dev": {
      "args": [],
      "path": "/home/ubuntu/src/runc/runc"
    }
  }
}
For the tips of
This was from a clean boot and no nvidia-specific symlinks are present in
|
I will leave it to @klueska to make that call. I do believe he's out for a couple of weeks, so it won't be immediately. |
No response from @klueska, so I consider this fixed. A comment can still be added to a closed bug, so feel free to share your findings. |
runc 1.12, Redhat9, systemd 252 still has issues. |
Please file a separate bug with a detailed reproducer. |
Another thing I am curious about is that
|
@vsxen I don't see you explicitly granting access to any nvidia devices. Also, as I said earlier, please file a separate bug report if you feel something is broken. |
Every unit created by runc needs a daemon reload since systemd v230. This breaks support for NVIDIA GPUs, see opencontainers#3708 (comment). Add a workaround for the systemd issue below: systemd/systemd#35710. Instead of filling the empty DeviceAllow array, a new array is created with the allowed devices. Remove the comment about it, since it's misleading. Closes opencontainers#4568 Signed-off-by: Jian Wen <[email protected]>
OK, the systemd restart issue is definitely not the one which was originally described here (and fixed by #3842 / #3845). Also, in #3708 (comment) the runc binary is not even runc. |
@wenjianhn I need a precompiled runc binary for testing (I haven't compiled runc myself yet).
This is a requirement from NVIDIA. The NVIDIA Container Runtime is a shim for OCI-compliant low-level runtimes such as runc. When a create command is detected, the incoming OCI runtime specification is modified in place and the command is forwarded to the low-level runtime. |
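As a rough illustration of what "modified in place" can mean for an OCI runtime specification, the Python sketch below adds one device to a parsed config.json. It is not the NVIDIA Container Runtime's implementation; the function name is hypothetical and the 195:0 numbers in the usage example are placeholder values. The field names follow the runtime-spec's `linux.devices` and `linux.resources.devices` sections, where the host device is identified only by type and major:minor.

```python
def allow_device(spec, path, dev_type, major, minor, access="rwm"):
    """Add a device node and a matching cgroup allow rule to an OCI spec dict.

    `path` is the node created inside the container; the host device is
    identified solely by dev_type and major:minor.
    """
    linux = spec.setdefault("linux", {})
    linux.setdefault("devices", []).append(
        {"path": path, "type": dev_type, "major": major, "minor": minor}
    )
    resources = linux.setdefault("resources", {})
    resources.setdefault("devices", []).append(
        {
            "allow": True,
            "type": dev_type,
            "major": major,
            "minor": minor,
            "access": access,
        }
    )
    return spec


# Usage with placeholder numbers; real majors may differ per system.
example = allow_device({"ociVersion": "1.0.2"}, "/dev/nvidia0", "c", 195, 0)
```

Note that nothing in the resulting spec carries the host path, which is why a runtime using systemd has to translate major:minor back to a path on its own.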
@vsxen it's easy to build one
|
The following logic originally introduced in libcontainer/cgroups/systemd/common.go
and then later moved / extended to the following in libcontainer/cgroups/devices/systemd.go
will never match against NVIDIA GPU devices.
The end result is that containers have access to GPUs when they first come online, but then lose access to GPUs as soon as systemd is triggered to run some reevaluation of the cgroups it manages (e.g. with something as simple as a systemctl daemon-reload).
The reason is that the NVIDIA driver does not register any of its devices with /sys/dev/char. The NVIDIA driver is closed source / non-GPL, and thus is unable to call the standard Linux helpers to register the devices it manages here. In fact, the device nodes that ultimately get created under /dev for all NVIDIA devices are all triggered from user-space code.
I have filed an issue internal to NVIDIA to investigate what can be done on the driver front here, but until something gets put in place, GPU support on newer runcs (with systemd cgroup integration) remains broken.
We are also investigating workarounds to manually create the necessary symlinks under /dev/char, but so far nothing we have come up with is "fool-proof".
Would it be possible to use the device node under /dev for the DeviceAllowList here rather than relying on the existence of a node under /dev/char or /sys/dev/char?
Here is a simple reproducer using docker on ubuntu22.04 (with a single K80 GPU on it):
Which is then fixed if I (manually) do the following:
Rerunning the above -- the issue is now resolved: