
containerd restart from nvidia-container-toolkit causes other daemonsets to get stuck #991

Open
chiragjn opened this issue Sep 13, 2024 · 5 comments


@chiragjn

chiragjn commented Sep 13, 2024

Original context and journalctl logs here: containerd/containerd#10437

By default, nvidia-container-toolkit sends a SIGHUP to containerd so the patched containerd config takes effect. Unfortunately, because gpu-operator schedules its DaemonSets all at once, we have noticed our GPU feature discovery and nvidia-device-plugin pods get stuck in Pending forever. This is primarily due to the config-manager-init container getting stuck in the Created state and never transitioning to Running because of the containerd restart.

Timeline of race condition:

  • nvidia-container-toolkit and nvidia-device-plugin pods are scheduled
  • nvidia-device-plugin waits for the toolkit-ready file via its init container
  • nvidia-container-toolkit patches the containerd config to register the nvidia runtime
  • It sends SIGHUP to containerd and writes the toolkit-ready file
  • The config-manager-init container of the nvidia-device-plugin pod enters the Created state
  • containerd restarts
  • config-manager-init is stuck in Created forever, hence the device plugin never gets to start

Today the only way for us to recover is to manually delete the stuck daemonset pods.
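
For anyone hitting this, our manual cleanup boils down to something like the following (the namespace and label selectors are assumptions from our setup and may differ per chart version):

# Delete gpu-operator pods stuck in Pending so the DaemonSet controller
# recreates them once containerd is back up. Adjust the namespace and
# label selector to match your deployment.
kubectl -n gpu-operator get pods \
  -l 'app in (nvidia-device-plugin-daemonset,gpu-feature-discovery)' \
  --field-selector=status.phase=Pending -o name |
  xargs -r kubectl -n gpu-operator delete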

While I understand that at its core this is a containerd issue, it has become so troublesome that we are resorting to entrypoint and node-label hacks. We are willing to take a solution that allows us to modify the entrypoint ConfigMaps of the DaemonSets managed by ClusterPolicy.

I think something similar was discovered here, with a different effect, and was fixed with a sleep:
963b8dc

P.S. I am aware the container-toolkit has an option to not restart containerd, but we need the restart for correct toolkit injection behavior.

cc: @klueska

@ekeih

ekeih commented Oct 10, 2024

Hi,

we are seeing the same issue with the gpu-operator-validator daemonset.

We found in the log of the nvidia-container-toolkit-daemonset that it modified /etc/containerd/config.toml and then sent a SIGHUP to containerd:

nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Sending SIGHUP signal to containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Successfully signaled containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Completed 'setup' for containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Waiting for signal"

Then, in the middle of the creation of one of the init containers of the gpu-operator-validator daemonset, the kubelet fails to communicate with the containerd socket because containerd restarts.
After a series of transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused errors from the kubelet, we see the following in our journald log:

Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: containerd.service holdoff time over, scheduling restart.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopping Kubernetes Kubelet...
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopped Kubernetes Kubelet.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopped containerd container runtime.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Starting Load NVIDIA kernel modules...

This looks like systemd also decides to restart containerd after it should already have been restarted by the SIGHUP. We are unsure why this happens.
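
One thing worth checking on the node is how systemd itself is configured to restart containerd; for illustration (the exact unit name and any drop-ins can differ per distro):

# Show the restart policy and holdoff time of the containerd unit.
systemctl show containerd.service -p Restart -p RestartUSec
# Print the unit file including any drop-ins that may override it.
systemctl cat containerd.service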

The stuck pod shows Warning Failed 24m kubelet Error: error reading from server: EOF in its events, and the pod status shows the following for the plugin-validation init container:

    State:          Waiting
    Ready:          False
    Restart Count:  0

We are seeing this issue several times per day in our infrastructure, so if you have any ideas on how to debug this further, we should be able to reproduce it and provide more information.

Thanks in advance for any help :)

@justinthelaw

justinthelaw commented Nov 7, 2024

I am also experiencing a similar thing when attempting a test/dev deployment on K3d (which uses a K3s CUDA base image).

As part of the nvidia-container-toolkit container's installation of the toolkit onto the host, it sends a signal to restart containerd, which then cycles the entire cluster, since containerd.service is restarted at the node's system level.

If we disable the toolkit (toolkit.enabled: false) in the deployment and instead install the toolkit directly on the node, it no longer cycles the entire cluster and everything works fine.
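
For reference, in our case that just means installing the chart with the toolkit disabled, roughly like this (release name, namespace and the helm repo alias are placeholders from our setup):

# Install/upgrade gpu-operator with the toolkit container disabled,
# relying on a toolkit installed directly on the node instead.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set toolkit.enabled=false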

@ramesius

ramesius commented Jan 17, 2025

Same issue here, with pretty much the same information.
The nvidia-container-toolkit-daemonset hangs on "Waiting for signal".
In our case, nodes become Unschedulable and cause more to spin up; the cycle repeats, and not long after we have many excess nodes.

@chiragjn

chiragjn commented Jan 20, 2025

Sadly, until a solution is made available, the only way for us to reduce the probability of this happening is to remove any device plugin configmap that the chart can create (https://github.com/truefoundry/infra-charts/blob/04a5627734b33c486e5293281b4b2cd0e6936173/charts/tfy-gpu-operator/values.yaml#L95-L97), thereby eliminating the config-manager-init init container and reducing the likelihood of getting stuck.

We are even considering writing a controller loop that restarts any device plugin pods stuck in Pending for too long.
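
Roughly, such a loop would be little more than the following sketch (not something we run today; the namespace, the 300-second threshold and the interval are arbitrary, and date -d assumes GNU date):

# Periodically delete gpu-operator pods that have been Pending for too long,
# so the DaemonSet controller recreates them.
while true; do
  kubectl -n gpu-operator get pods --field-selector=status.phase=Pending \
    -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}' |
  while read -r name created; do
    # Age of the pod in seconds since its creationTimestamp.
    age=$(( $(date -u +%s) - $(date -d "$created" +%s) ))
    if [ "$age" -gt 300 ]; then
      kubectl -n gpu-operator delete pod "$name"
    fi
  done
  sleep 60
done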

@stefanandres

We are even considering writing a controller loop that restarts any device plugin pods stuck in pending for too long

We actually have a workaround in place for that: we're using the descheduler.

deschedulerPolicy:
  profiles:
    - name: RemoveFailedNvidiaInitPods
      # There is a race condition when the gpu-operator modifies the containerd configuration
      # and restarts it while other pods are created. This descheduler policy deletes the
      # stuck pods to force a restart.
      # Hopefully we can remove this when https://github.com/NVIDIA/gpu-operator/issues/991 is fixed.
      pluginConfig:
        - name: DefaultEvictor
          args:
            evictSystemCriticalPods: true
            evictDaemonSetPods: true
            evictLocalStoragePods: true
        - name: PodLifeTime
          args:
            maxPodLifeTimeSeconds: 300
            includingInitContainers: true
            states:
              - "PodInitializing"
              - "Pending"
            namespaces:
              include:
                - "gpu-operator"
            labelSelector:
              matchLabels:
                app: nvidia-operator-validator
      plugins:
        deschedule:
          enabled:
            - "PodLifeTime"
