containerd restart from nvidia-container-toolkit causes other daemonsets to get stuck #991
Comments
Hi, we are seeing the same issue with the We found in the log of
Then in the middle of the creation of one of the init containers of the
This looks like systemd also decides to restart The stuck pod shows
We are seeing this issue several times per day in our infrastructure, so if you have any ideas on how to debug this further, we should be able to reproduce it and provide more information. Thanks in advance for any help :)
I am also experiencing a similar thing when attempting a test/dev deployment on K3d (which uses a K3s-cuda base image). As part of the nvidia-container-toolkit container's installation of the toolkit onto the host, it sends a signal to restart containerd, which then cycles the entire cluster since If we disable the toolkit (
Original context and journalctl logs here: containerd/containerd#10437
As we know, by default nvidia-container-toolkit sends a SIGHUP to containerd so that the patched containerd config takes effect. Unfortunately, because gpu-operator schedules its DaemonSets all at once, we have noticed that our GPU discovery and NVIDIA device plugin pods get stuck in Pending forever. This is primarily because the config-manager-init container gets stuck in Created and never transitions to Running as a result of the containerd restart.
Timeline of race condition:
Today the only way for us to recover is to manually delete the stuck daemonset pods.
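For anyone who wants to script that manual cleanup in the meantime, here is a rough sketch of how the stuck pods could be picked out of `kubectl get pods -o json` output. It is only a heuristic under stated assumptions: the helper names `parse_time` and `find_stuck_pods` and the 5-minute threshold are made up for illustration, and it assumes the Created-but-never-Running init container shows up in the pod status as an init container in a `waiting` state while the pod stays in `Pending`.

```python
from datetime import datetime, timedelta, timezone


def parse_time(ts):
    # Kubernetes status timestamps look like "2024-07-01T12:00:00Z" (UTC).
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def find_stuck_pods(pod_list, now, min_age=timedelta(minutes=5)):
    """Return (namespace, name) for pods that look stuck in init.

    Heuristic (an assumption, not gpu-operator behavior): the pod is
    Pending, it started more than `min_age` ago, and at least one init
    container is still reported in a `waiting` state.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        start = status.get("startTime")
        if start is None or now - parse_time(start) < min_age:
            # Not scheduled yet, or too young to call stuck.
            continue
        init_statuses = status.get("initContainerStatuses", [])
        if any("waiting" in s.get("state", {}) for s in init_statuses):
            meta = pod["metadata"]
            stuck.append((meta["namespace"], meta["name"]))
    return stuck
```

One way to use it would be to feed it `json.load(sys.stdin)` from `kubectl get pods -n <namespace> -o json` together with `datetime.now(timezone.utc)`, and then run `kubectl delete pod` on each result. The threshold is deliberately conservative so that pods which are simply still initializing are not deleted.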
While I understand that at its core this is a containerd issue, it has become so troublesome that we are looking at entrypoint and node label hacks. We are willing to take a solution that allows us to modify the entrypoint ConfigMaps of the DaemonSets managed by ClusterPolicy.
I think something similar, but with a different effect, was discovered in 963b8dc and was fixed with a sleep.
P.S. I am aware container-toolkit has an option to not restart containerd, but we need a restart for correct toolkit injection behavior
cc: @klueska