Safe Mounting of /dev/gdrdrv in a kubernetes environment - HostPath appears to fail #291
Comments
Hi @hassanbabaie,
Thanks @pakmarkthub, I'm hoping there is a documented way, as I would expect this to be a growing scenario.
Hi @hassanbabaie, mounting the device node with a We are working on adding gdrcopy support to NVIDIA Container Runtime (see https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/530), and it should make it into the next release. With this feature, you can inject cc @elezar
Thanks, this is great news @cdesiniotis. Yes, this will be much better, as relying on privileged mode is not desired. If possible, could you post here when it's released? We can then try it out.
Hi @cdesiniotis, I can't seem to access https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/530. Do you happen to have any update on this?
It looks like this is now covered in v1.15.0-rc.2 and it has worked its way through to v1.15.0-rc.4. Do we happen to know the estimated release timeline?
Hi @hassanbabaie, FYI, we have released a gdrdrv container image on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/cloud-native/containers/gdrdrv. Running that image will automatically compile and install the gdrdrv driver on your system. It will also expose
@cdesiniotis Do you have anything that you can share regarding gdrdrv support in NVIDIA Container Runtime?
@hassanbabaie apologies for the delayed response. NVIDIA Container Toolkit 1.15.0 has been released. You can set
I tried to deploy this on OpenShift, and at the beginning I was not able to have the
and noticed that gdrcopy was not installed in
I don't know if this is the right way to do it, but it works.
@stefanomaxenti yes, if you are leveraging GPU Operator to install the GDRCopy driver, the device node will be present at
While everything works fine in a privileged container, I am unable to use the environment variable NVIDIA_GDRCOPY=enabled inside a non-privileged pod with the NVIDIA GPU Operator. Without hostPath and with the variable set, gdrdrv is not visible. But with hostPath, it is not usable, since it requires R/W permission and the pod is not privileged. I think this is related to NVIDIA/gpu-operator#713 on the operator side, which was closed some days after 24.3.0 was released. I will try it when it is available to deploy. Do you have any other ideas as to why GDRCopy is not working as expected in this setup? Thank you.
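For anyone debugging the same symptoms ("not visible" versus "visible but not usable"), a small check run inside the pod can tell the two cases apart. This is only a sketch: the function name `check_device` is made up for illustration, and the default path `/dev/gdrdrv` is taken from the issue title. Note that `os.access` only looks at file permission bits, so an actual open of the device may still be refused in a non-privileged container.

```python
import os
import stat


def check_device(path="/dev/gdrdrv"):
    """Report whether `path` exists, is a character device, and has R/W permission bits.

    `check_device` is a hypothetical helper name, not part of GDRCopy.
    """
    result = {"exists": False, "char_device": False, "read_write": False}
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # The node is not visible inside this container at all.
        return result
    result["exists"] = True
    # A plain file or directory at this path means the mount is not the device node.
    result["char_device"] = stat.S_ISCHR(mode)
    # Permission-bit check only; the container runtime may still block open().
    result["read_write"] = os.access(path, os.R_OK | os.W_OK)
    return result


if __name__ == "__main__":
    print(check_device())
```

If `exists` is false, the device was never injected; if `exists` is true but `read_write` is false, the result matches the hostPath case described above.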
@stefanomaxenti ah I see you are on OpenShift. Unfortunately the
Hi @pakmarkthub, sorry for the tag but it's related to this one:
#278 (comment)
Is there a recommended way to enable gdrcopy within a Kubernetes environment? I'm trying to use the HostPath method, mounting it as a file, but this looks to be incorrect.
I'm hoping I'm not the first person to try to do this.
Thanks
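To make the HostPath attempt concrete, here is a rough sketch of such a pod spec, built in Python and printed as JSON (which `kubectl apply -f` accepts). The pod name, container name, and image are placeholders, and the function `gdrdrv_hostpath_pod` is hypothetical. It uses the Kubernetes hostPath type `CharDevice` rather than `File`, since `/dev/gdrdrv` is a device node. This only illustrates the mount itself; as reported elsewhere in this thread, a hostPath mount of the device was not usable from a non-privileged pod.

```python
import json


def gdrdrv_hostpath_pod(name="gdrcopy-test", image="example.com/my-cuda-app:latest"):
    """Build a Pod manifest that mounts /dev/gdrdrv via hostPath.

    The name and image defaults are placeholders, not values from the thread.
    """
    device = "/dev/gdrdrv"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "app",
                    "image": image,
                    # Mount the host device node at the same path in the container.
                    "volumeMounts": [{"name": "gdrdrv", "mountPath": device}],
                }
            ],
            "volumes": [
                {
                    "name": "gdrdrv",
                    # "CharDevice" requires a character device to exist at this host path.
                    "hostPath": {"path": device, "type": "CharDevice"},
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(gdrdrv_hostpath_pod(), indent=2))
```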