This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

nvidia-smi command in container returns "Failed to initialize NVML: Unknown Error" after couple of times #1678

Closed
yuan6711043 opened this issue Sep 14, 2022 · 8 comments

Comments

@yuan6711043

yuan6711043 commented Sep 14, 2022

1. Issue or feature description

The NVIDIA GPU works well when the container first starts, but after the container has been running for a while (maybe several days), the GPUs mounted by the NVIDIA container runtime become invalid. The nvidia-smi command returns "Failed to initialize NVML: Unknown Error" in the container, while it works well on the host machine.


nvidia-smi looks fine on the host, and we can see the training process information in the host's nvidia-smi output. If we stop the training process now, it can no longer be restarted.


Following the solution from issue #1618, we tried upgrading cgroup to v2, but it did not help.


Surprisingly, we cannot find any of the devices.list files mentioned in #1618 in the container.
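A minimal sketch for checking which cgroup version a container sees and whether any devices.list files exist. It assumes the hierarchy is mounted at /sys/fs/cgroup (the `root` parameter is only there so another path can be tried). As far as I know, cgroup v2 has no `devices` controller files at all, because device access is enforced through eBPF programs instead, which would explain why devices.list is missing after the upgrade to v2.

```python
from pathlib import Path
from typing import List


def cgroup_version(root: str = "/sys/fs/cgroup") -> int:
    """Return 2 if `root` looks like a unified (v2) hierarchy, else 1."""
    # Every cgroup v2 directory contains a cgroup.controllers file;
    # cgroup v1 hierarchies do not have it.
    return 2 if (Path(root) / "cgroup.controllers").is_file() else 1


def find_devices_lists(root: str = "/sys/fs/cgroup") -> List[str]:
    """Return every devices.list file below `root` (expected on cgroup v1 only)."""
    return sorted(str(p) for p in Path(root).rglob("devices.list"))


if __name__ == "__main__":
    print("cgroup version:", cgroup_version())
    print("devices.list files:", find_devices_lists())
```

Running this inside the affected container and on the host shows whether the two agree on the cgroup version.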


2. Steps to reproduce the issue

We found that this issue can be reproduced by running "systemctl daemon-reload" on the host, but we have not actually run any similar commands in our production environment.
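The reproduction can be sketched as a small script: check nvidia-smi in the container, reload systemd on the host, then check again. This is only a sketch under assumptions: the container is reachable through `docker exec`, the script runs as root on the host, and the container name passed in is a placeholder. The `runner` parameter is a hypothetical hook so the logic can be exercised without a GPU.

```python
import subprocess
from typing import Callable, List

NVML_ERROR = "Failed to initialize NVML"
Runner = Callable[[List[str]], str]


def run(cmd: List[str]) -> str:
    """Run a command and return stdout plus stderr, without raising on failure."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return proc.stdout + proc.stderr


def reproduce(container: str, runner: Runner = run) -> bool:
    """Return True if nvidia-smi breaks in `container` after a daemon-reload."""
    smi = ["docker", "exec", container, "nvidia-smi"]
    if NVML_ERROR in runner(smi):
        raise RuntimeError("GPU access is already broken before the reload")
    runner(["systemctl", "daemon-reload"])  # executed on the host, needs root
    return NVML_ERROR in runner(smi)


if __name__ == "__main__":
    # "my-gpu-container" is a placeholder name, not taken from the report.
    print("broken after reload:", reproduce("my-gpu-container"))
```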


Can anyone suggest how to track down this problem?

3. Information to attach (optional if deemed irrelevant)

docker: 20.10.7

k8s: v1.22.5

nvidia driver version: 470.103.01

nvidia-container-runtime: 3.8.1-1

containerd: 1.5.5-0ubuntu3~20.04.2

@mbentley

mbentley commented Sep 14, 2022

I've noticed the same behavior for some time on Debian 11; at least since March as that is when I started regularly checking for nvidia-smi functioning in containers, and thanks for calling out systemctl daemon-reload as something that triggers it. In my case, I have automatic updates enabled in Debian using unattended upgrades and your mention of daemon-reload makes me think that the package updates may be triggering a daemon-reload event to occur. I'm only updating packages from the Debian repos automatically, applying nvidia-docker and 3rd party repo updates manually.

Here is an example from today where I can see the auto updates happening. In this case, telegraf is updated and then a daemon-reload occurs, or at least that is what I believe the "systemd[1]: Reloading." lines mean, since that is the output I get when I manually run a systemctl daemon-reload:

Sep 14 06:46:17 athena systemd[1]: Starting Daily apt upgrade and clean activities...
Sep 14 06:46:50 athena systemd[1]: Stopping The plugin-driven server agent for reporting metrics into InfluxDB...
Sep 14 06:46:52 athena telegraf[6829]: 2022-09-14T10:46:52Z I! [agent] Hang on, flushing any cached metrics before shutdown
Sep 14 06:46:52 athena telegraf[6829]: 2022-09-14T10:46:52Z I! [agent] Stopping running outputs
Sep 14 06:46:52 athena systemd[1]: telegraf.service: Succeeded.
Sep 14 06:46:52 athena systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 14 06:46:52 athena systemd[1]: telegraf.service: Consumed 14h 46min 50.974s CPU time.
Sep 14 06:46:54 athena systemd[1]: Reloading.
Sep 14 06:46:54 athena systemd[1]: /lib/systemd/system/plymouth-start.service:16: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Sep 14 06:46:54 athena systemd[1]: nut-monitor.service: Supervising process 7162 which is not our child. We'll most likely not notice when it exits.
Sep 14 06:46:54 athena systemd[1]: Reloading.
Sep 14 06:46:55 athena systemd[1]: /lib/systemd/system/plymouth-start.service:16: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Sep 14 06:46:55 athena systemd[1]: nut-monitor.service: Supervising process 7162 which is not our child. We'll most likely not notice when it exits.
Sep 14 06:46:55 athena systemd[1]: Starting The plugin-driven server agent for reporting metrics into InfluxDB...
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z W! DeprecationWarning: Option "perdevice" of plugin "inputs.docker" deprecated since version 1.18.0 and will be removed in 2.0.0: use 'perdevice_include' instead
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Starting Telegraf 1.24.0
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Available plugins: 222 inputs, 9 aggregators, 26 processors, 20 parsers, 57 outputs
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Loaded inputs: cpu disk diskio docker exec file ipmi_sensor kernel mem net netstat nvidia_smi processes smart swap system zfs
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Loaded aggregators:
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Loaded processors:
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Loaded outputs: influxdb_v2
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Tags enabled: host=athena
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z W! Deprecated inputs: 0 and 1 options
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"athena", Flush Interval:10s
Sep 14 06:46:55 athena systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 14 06:47:14 athena systemd[1]: apt-daily-upgrade.service: Succeeded.
Sep 14 06:47:14 athena systemd[1]: Finished Daily apt upgrade and clean activities.
Sep 14 06:47:14 athena systemd[1]: apt-daily-upgrade.service: Consumed 58.400s CPU time.
Sep 14 06:47:29 athena systemd[1]: Starting Cleanup of Temporary Directories...
Sep 14 06:47:29 athena systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Sep 14 06:47:29 athena systemd[1]: Finished Cleanup of Temporary Directories.
Sep 14 06:50:00 athena systemd[1]: Starting system activity accounting tool...
Sep 14 06:50:00 athena systemd[1]: sysstat-collect.service: Succeeded.
Sep 14 06:50:00 athena systemd[1]: Finished system activity accounting tool.
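To line up GPU failures with reloads, the "systemd[1]: Reloading." entries can be pulled out of a log like the one above. This is a rough sketch that assumes the classic syslog line layout shown here (month, day, time, host, then the message) and that a bare "Reloading." from PID 1 corresponds to a daemon-reload, as guessed above.

```python
import re
from typing import Iterable, List

# Matches e.g. "Sep 14 06:46:54 athena systemd[1]: Reloading."
RELOAD_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+systemd\[1\]:\s+Reloading\.\s*$"
)


def reload_timestamps(lines: Iterable[str]) -> List[str]:
    """Return the timestamp of every systemd manager reload found in `lines`."""
    stamps = []
    for line in lines:
        match = RELOAD_RE.match(line)
        if match:
            stamps.append(match.group(1))
    return stamps


if __name__ == "__main__":
    sample = [
        "Sep 14 06:46:52 athena systemd[1]: telegraf.service: Succeeded.",
        "Sep 14 06:46:54 athena systemd[1]: Reloading.",
        "Sep 14 06:46:55 athena systemd[1]: Starting The plugin-driven server agent for reporting metrics into InfluxDB...",
    ]
    print(reload_timestamps(sample))
```

Unit-level messages such as "Reloading <description>..." are deliberately not matched, because the pattern requires the message to be exactly "Reloading.".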

@elezar
Member

elezar commented Sep 14, 2022

Do the packages in the debian repositories include the NVIDIA Drivers?

@mbentley

mbentley commented Sep 14, 2022

Fair point and callout - they do. Looking back, I cross-checked the times I detected the issue (and sent myself a notification) against the packages that were upgraded at the time, and recorded the following:

2022-09-14:

google-chrome-stable:amd64
telegraf:amd64
handbrake-cli:amd64
handbrake:amd64
handbrake-gtk:amd64

2022-08-17:

epiphany-browser-data:amd64
libjavascriptcoregtk-4.0-18:amd64
libsnmp40:amd64
libsnmp-base:amd64
telegraf:amd64
google-chrome-stable:amd64
epiphany-browser:amd64

2022-08-13:

python3-samba:amd64
libldb2:amd64
samba-vfs-modules:amd64
samba:amd64
libwbclient0:amd64
libsmbclient:amd64
samba-dsdb-modules:amd64
samba-common-bin:amd64
python3-ldb:amd64
samba-libs:amd64
samba-common:amd64

2022-07-27:

linux-kbuild-5.10:amd64
linux-compiler-gcc-10-x86:amd64
telegraf:amd64
linux-libc-dev:amd64
libcpupower1:amd64

2022-07-13:

telegraf:amd64

2022-07-06:

google-chrome-stable:amd64
telegraf:amd64

2022-05-29:

rsyslog:amd64

2022-05-20:

libldap-common:amd64
ldap-utils:amd64
libldap-2.4-2:amd64
libldap-2.4-2:i386

2022-05-17:

telegraf:amd64

2022-04-29:

telegraf:amd64

2022-04-27:

telegraf:amd64

2022-04-20:

libnvpair3linux:amd64
libuutil3linux:amd64
zfs-dkms:amd64
libzpool5linux:amd64
libzfs4linux:amd64
zfsutils-linux:amd64

While I see telegraf frequently, it's not consistent. I may just be reading too much into it based on the daemon-reload behavior, but in almost every case I can see that a package with a systemd unit was upgraded, which I would expect triggers a daemon-reload to deal with the update. Unfortunately I do not have syslog logs from that far back to confirm that in all cases, but I can see that it doesn't seem to correspond to driver package updates.
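To make "frequently but not consistent" concrete, the list above can be tallied by counting on how many detection dates each package was upgraded. The data below is copied from that list with the architecture suffixes (such as :amd64) dropped for brevity; the function name is just illustrative.

```python
from collections import Counter
from typing import Dict, List

# Packages upgraded on each date the issue was detected (arch suffixes dropped).
UPGRADES: Dict[str, List[str]] = {
    "2022-09-14": ["google-chrome-stable", "telegraf", "handbrake-cli", "handbrake", "handbrake-gtk"],
    "2022-08-17": ["epiphany-browser-data", "libjavascriptcoregtk-4.0-18", "libsnmp40",
                   "libsnmp-base", "telegraf", "google-chrome-stable", "epiphany-browser"],
    "2022-08-13": ["python3-samba", "libldb2", "samba-vfs-modules", "samba", "libwbclient0",
                   "libsmbclient", "samba-dsdb-modules", "samba-common-bin", "python3-ldb",
                   "samba-libs", "samba-common"],
    "2022-07-27": ["linux-kbuild-5.10", "linux-compiler-gcc-10-x86", "telegraf",
                   "linux-libc-dev", "libcpupower1"],
    "2022-07-13": ["telegraf"],
    "2022-07-06": ["google-chrome-stable", "telegraf"],
    "2022-05-29": ["rsyslog"],
    "2022-05-20": ["libldap-common", "ldap-utils", "libldap-2.4-2"],
    "2022-05-17": ["telegraf"],
    "2022-04-29": ["telegraf"],
    "2022-04-27": ["telegraf"],
    "2022-04-20": ["libnvpair3linux", "libuutil3linux", "zfs-dkms", "libzpool5linux",
                   "libzfs4linux", "zfsutils-linux"],
}


def dates_per_package(upgrades: Dict[str, List[str]]) -> Counter:
    """Count on how many distinct dates each package appears."""
    counts: Counter = Counter()
    for packages in upgrades.values():
        counts.update(set(packages))  # set() so a package counts once per date
    return counts


if __name__ == "__main__":
    counts = dates_per_package(UPGRADES)
    print(f"telegraf: {counts['telegraf']} of {len(UPGRADES)} dates")
```

On this data telegraf appears on 8 of the 12 dates, so it cannot be the only trigger.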

@iFede94

iFede94 commented Sep 15, 2022

I'm encountering the same issue. I'm currently testing some solutions proposed in NVIDIA/nvidia-container-toolkit#251 and #1671 and I will let you know if something works for me.

@yuan6711043
Author

@mbentley thanks for the reminder, I will check whether any packages are upgraded automatically in our production environments.

@mbentley

mbentley commented Sep 20, 2022

At least in my case where telegraf is a big culprit, I can see that in the post install script it does call a systemctl daemon-reload which matches the behavior I've been seeing.

Same for rsyslog (not sure where the Debian packaging is source code wise but here is the postinst script).
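To find other packages that behave this way, the installed maintainer scripts can be scanned for the string. This is a minimal sketch assuming a Debian-style system where dpkg keeps postinst scripts under /var/lib/dpkg/info; the function name is illustrative, and a match only means the script mentions daemon-reload somewhere, not that it always runs it.

```python
from pathlib import Path
from typing import List


def postinst_with_daemon_reload(info_dir: str = "/var/lib/dpkg/info") -> List[str]:
    """Return names of packages whose postinst script mentions daemon-reload."""
    hits = []
    for script in sorted(Path(info_dir).glob("*.postinst")):
        try:
            text = script.read_text(errors="replace")
        except OSError:
            continue  # unreadable script, skip it
        if "daemon-reload" in text:
            # Strip the ".postinst" suffix to get the package name
            # (multi-arch packages keep their ":arch" qualifier).
            hits.append(script.name[: -len(".postinst")])
    return hits


if __name__ == "__main__":
    for name in postinst_with_daemon_reload():
        print(name)
```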

So far, I haven't seen any instances where driver upgrades have impacted running containers but I've only seen one instance where the drivers were updated on 9/12 so there is only a sample size of one to go on from my logs. It would be easy enough to add the nvidia-drivers to the package blacklist if it was causing an issue but at least from the best of what I can tell, that does not seem to be the trigger.

@dcarrion87

We recently had to solve this for a runc interactive issue. E.g.:

We only just realised we're hitting this now for GPUs dropping out in containers too.

@elezar
Member

elezar commented Nov 27, 2023

I am closing this as a duplicate of NVIDIA/nvidia-container-toolkit#48 -- a known issue with certain runc / systemd version combinations. Please see the steps to address this there or create a new issue against https://github.com/NVIDIA/nvidia-container-toolkit if you are still having problems.

@elezar elezar closed this as completed Nov 27, 2023