nvidia-smi command in container returns "Failed to initialize NVML: Unknown Error" after couple of times #1678
Comments
I've noticed the same behavior for some time on Debian 11; at least since March as that is when I started regularly checking for Example from today where I can see the auto updates happening where in this case, telegraf is being updated and then a daemon-reload occurs, or at least that is what I believe I am seeing from
|
Do the packages in the debian repositories include the NVIDIA Drivers? |
Fair point and callout - they do. I was looking back, doing a cross-check of the times I detected the issue & sending myself a notification and the packages that were upgraded at the time, I recorded the following: 2022-09-14:
2022-08-17:
2022-08-13:
2022-07-27:
2022-07-13:
2022-07-06:
2022-05-29:
2022-05-20:
2022-05-17:
2022-04-29:
2022-04-27:
2022-04-20:
While I see telegraf frequently, it's not consistent. I may just be reading into it too much based on the |
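The cross-check described above (detection times against packages upgraded around the same time) could be automated roughly as follows. This is a sketch under assumptions: it assumes upgrades are recorded in the usual apt history format (`Start-Date:` and `Upgrade:` fields, entries separated by blank lines, typically in `/var/log/apt/history.log`), and the function names and the one-hour window are hypothetical choices for illustration.

```python
import re
from datetime import datetime, timedelta


def parse_apt_history(text):
    """Return a list of (start_time, [package names]) for entries that upgraded packages."""
    entries = []
    for block in text.strip().split("\n\n"):
        # Each entry is a block of "Key: value" lines.
        fields = dict(
            line.split(": ", 1) for line in block.splitlines() if ": " in line
        )
        if "Start-Date" not in fields or "Upgrade" not in fields:
            continue
        # apt writes the date and time separated by two spaces; normalize first.
        stamp = " ".join(fields["Start-Date"].split())
        start = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        # Package names precede " (old, new)"; drop an optional ":arch" suffix.
        packages = re.findall(r"([^\s,()]+?)(?::\w+)? \(", fields["Upgrade"])
        entries.append((start, packages))
    return entries


def upgrades_before(entries, detected_at, window=timedelta(hours=1)):
    """Return packages upgraded within `window` before the failure was detected."""
    found = []
    for start, packages in entries:
        if timedelta(0) <= detected_at - start <= window:
            found.extend(packages)
    return found
```

Feeding it the contents of the history log and the time a failure notification fired would list the packages upgraded shortly before that failure, which makes it easier to see whether any single package shows up consistently.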
I'm encountering the same issue. I'm currently testing some solutions proposed in NVIDIA/nvidia-container-toolkit#251 and #1671 and I will let you know if something works for me. |
@mbentley thanks for the reminder. I will check whether any packages are being auto-upgraded in our production environments. |
At least in my case where Same for rsyslog (not sure where the Debian packaging is source code wise but here is the postinst script). So far, I haven't seen any instances where driver upgrades have impacted running containers but I've only seen one instance where the drivers were updated on 9/12 so there is only a sample size of one to go on from my logs. It would be easy enough to add the nvidia-drivers to the package blacklist if it was causing an issue but at least from the best of what I can tell, that does not seem to be the trigger. |
We recently had to solve this for runc interactive issue. E.g.:
We only just realised we're hitting this now for GPUs dropping out in containers too. |
I am closing this as a duplicate of NVIDIA/nvidia-container-toolkit#48 -- a known issue with certain runc / systemd version combinations. Please see the steps to address this there or create a new issue against https://github.com/NVIDIA/nvidia-container-toolkit if you are still having problems. |
1. Issue or feature description
The NVIDIA GPU works well right after the container has started, but after the container has been running for a while (maybe several days), the GPUs mounted by the NVIDIA container runtime become invalid. Running nvidia-smi in the container returns "Failed to initialize NVML: Unknown Error", while it still works well on the host machine.
nvidia-smi looks fine on the host, and we can see the training process information in the host's nvidia-smi output. If we stop the training process at this point, it can no longer be restarted.
Following the solution from issue #1618, we tried upgrading to cgroup v2, but it did not help.
Surprisingly, we cannot find any of the devices.list files mentioned in #1618 in the container.
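One possible explanation for the missing devices.list files is the cgroup version itself: on a pure cgroup v2 (unified) hierarchy there is no devices controller directory, and device access is enforced through eBPF programs instead, so no devices.list file exists. A minimal sketch to check which case applies inside the container, assuming cgroups are mounted at `/sys/fs/cgroup` (the function name is hypothetical):

```python
from pathlib import Path


def cgroup_device_info(root="/sys/fs/cgroup"):
    """Report the cgroup hierarchy version and the devices.list path, if any."""
    base = Path(root)
    # On a unified (v2) hierarchy, cgroup.controllers sits at the mount root.
    is_v2 = (base / "cgroup.controllers").is_file()
    # On v1, the devices controller is mounted under <root>/devices.
    devices_list = base / "devices" / "devices.list"
    return {
        "version": 2 if is_v2 else 1,
        "devices_list": str(devices_list) if devices_list.is_file() else None,
    }
```

If this reports version 2 with no devices.list, the absence of the file is expected and does not by itself indicate a misconfiguration. Hybrid setups, where v1 and v2 are mounted side by side, are reported as version 1 by this sketch.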
2. Steps to reproduce the issue
We found that this issue can be reproduced by running "systemctl daemon-reload" on the host, but we have not actually run any similar commands in our production environment.
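The reproduction step above could be scripted roughly as follows. This is only a sketch under assumptions: it assumes the container is reachable through `docker exec`, that the script runs as root on the host so `systemctl daemon-reload` is permitted, and the function names are hypothetical.

```python
import subprocess

NVML_ERROR = "Failed to initialize NVML"


def nvml_failed(output):
    """Return True if nvidia-smi output contains the NVML initialization error."""
    return NVML_ERROR in output


def gpu_visible(container, exec_prefix=("docker", "exec")):
    """Run nvidia-smi inside a running container; True means the GPU is usable."""
    result = subprocess.run(
        [*exec_prefix, container, "nvidia-smi"],
        capture_output=True,
        text=True,
    )
    combined = result.stdout + result.stderr
    return result.returncode == 0 and not nvml_failed(combined)


def reproduce(container, exec_prefix=("docker", "exec")):
    """Check GPU access before and after a systemd daemon-reload on the host."""
    before = gpu_visible(container, exec_prefix)
    # Requires root; this is the step that triggers the failure in the report.
    subprocess.run(["systemctl", "daemon-reload"], check=True)
    after = gpu_visible(container, exec_prefix)
    return before, after
```

Calling `reproduce("<container name>")` on an affected host would be expected to return `(True, False)`: the GPU is usable before the reload and fails with the NVML error afterwards.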
Can anyone suggest how to track down the cause of this problem?
3. Information to attach (optional if deemed irrelevant)
docker: 20.10.7
k8s: v1.22.5
nvidia driver version: 470.103.01
nvidia-container-runtime: 3.8.1-1
containerd: 1.5.5-0ubuntu3~20.04.2