
containerd restart from nvidia-container-toolkit causes other daemonsets to get stuck #991

Open
chiragjn opened this issue Sep 13, 2024 · 5 comments


@chiragjn

chiragjn commented Sep 13, 2024

Original context and journalctl logs here: containerd/containerd#10437

By default, nvidia-container-toolkit sends a SIGHUP to containerd so the patched containerd config takes effect. Unfortunately, because gpu-operator schedules its DaemonSets all at once, we have noticed our GPU feature discovery and nvidia-device-plugin pods get stuck in Pending forever. This is primarily due to the config-manager-init container getting stuck in the Created state and never transitioning to Running because of the containerd restart.

Timeline of race condition:

  • nvidia-container-toolkit and nvidia-device-plugin pods are scheduled
  • nvidia-device-plugin waits for the toolkit-ready file via its init container
  • nvidia-container-toolkit patches the containerd config to register the nvidia runtime
  • It sends SIGHUP to containerd and writes the toolkit-ready file
  • The config-manager-init container of the nvidia-device-plugin pod enters the Created state
  • containerd restarts
  • config-manager-init is stuck in Created forever, hence the device plugin never gets to start

Today the only way for us to recover is to manually delete the stuck daemonset pods.
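
For anyone hitting this, our manual cleanup boils down to something like the following (the namespace and label selectors are assumptions from our setup and may differ per chart version):

# Delete gpu-operator pods stuck in Pending so the DaemonSet controller
# recreates them once containerd is back up. Adjust the namespace and
# label selector to match your deployment.
kubectl -n gpu-operator get pods \
  -l 'app in (nvidia-device-plugin-daemonset,gpu-feature-discovery)' \
  --field-selector=status.phase=Pending -o name |
  xargs -r kubectl -n gpu-operator delete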

While I understand that at its core this is a containerd issue, it has become so troublesome that we are resorting to entrypoint and node-label hacks. We are willing to take a solution that allows us to modify the entrypoint ConfigMaps of the DaemonSets managed by ClusterPolicy.

I think something similar was discovered here, with a different effect, and was fixed with a sleep:
963b8dc

P.S. I am aware the container-toolkit has an option to not restart containerd, but we need the restart for correct toolkit injection behavior.

cc: @klueska

@ekeih

ekeih commented Oct 10, 2024

Hi,

we are seeing the same issue with the gpu-operator-validator daemonset.

We found in the log of the nvidia-container-toolkit-daemonset that it modified /etc/containerd/config.toml and then sent a SIGHUP to containerd:

nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Sending SIGHUP signal to containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Successfully signaled containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Completed 'setup' for containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Waiting for signal"

Then, in the middle of the creation of one of the init containers of the gpu-operator-validator daemonset, the kubelet fails to communicate with the containerd socket because containerd restarts.
After a series of transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused errors from the kubelet, we see the following in our journald log:

Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: containerd.service holdoff time over, scheduling restart.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopping Kubernetes Kubelet...
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopped Kubernetes Kubelet.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopped containerd container runtime.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Starting Load NVIDIA kernel modules...

This looks like systemd also decides to restart containerd after it should already have been restarted by the SIGHUP. We are unsure why this happens.
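
One thing worth checking on the node is how systemd itself is configured to restart containerd; for illustration (the exact unit name and any drop-ins can differ per distro):

# Show the restart policy and holdoff time of the containerd unit.
systemctl show containerd.service -p Restart -p RestartUSec
# Print the unit file including any drop-ins that may override it.
systemctl cat containerd.service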

The stuck pod shows Warning Failed 24m kubelet Error: error reading from server: EOF in its events, and the pod status shows the following for the plugin-validation init container:

    State:          Waiting
    Ready:          False
    Restart Count:  0

We are seeing this issue several times per day in our infrastructure, so if you have any ideas on how to debug this further, we should be able to reproduce it and provide more information.

Thanks in advance for any help :)

@justinthelaw

justinthelaw commented Nov 7, 2024

I am also experiencing a similar thing when attempting a test/dev deployment on K3d (which uses a K3s CUDA base image).

As part of the nvidia-container-toolkit container's installation of the toolkit onto the host, it sends a signal to restart containerd, which then cycles the entire cluster, since containerd.service is restarted at the node's system level.

If we disable the toolkit (toolkit.enabled: false) in the deployment and instead install the toolkit directly on the node, it no longer cycles the entire cluster and everything works fine.
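
For reference, in our case that just means installing the chart with the toolkit disabled, roughly like this (release name, namespace and the helm repo alias are placeholders from our setup):

# Install/upgrade gpu-operator with the toolkit container disabled,
# relying on a toolkit installed directly on the node instead.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set toolkit.enabled=false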

@ramesius

ramesius commented Jan 17, 2025

Same issue here, with pretty much the same information.
The nvidia-container-toolkit-daemonset hangs on "Waiting for signal".
In our case, nodes become Unschedulable and cause more to spin up; the cycle repeats, and not long after we have many excess nodes.

@chiragjn

chiragjn commented Jan 20, 2025

Sadly, until a solution is made available, the only way for us to reduce the probability of this happening is to remove any device plugin configmap that the chart can create (https://github.com/truefoundry/infra-charts/blob/04a5627734b33c486e5293281b4b2cd0e6936173/charts/tfy-gpu-operator/values.yaml#L95-L97), thereby eliminating the config-manager-init init container and reducing the likelihood of getting stuck.

We are even considering writing a controller loop that restarts any device plugin pods stuck in Pending for too long.
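
Roughly, such a loop would be little more than the following sketch (not something we run today; the namespace, the 300-second threshold and the interval are arbitrary, and date -d assumes GNU date):

# Periodically delete gpu-operator pods that have been Pending for too long,
# so the DaemonSet controller recreates them.
while true; do
  kubectl -n gpu-operator get pods --field-selector=status.phase=Pending \
    -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}' |
  while read -r name created; do
    # Age of the pod in seconds since its creationTimestamp.
    age=$(( $(date -u +%s) - $(date -d "$created" +%s) ))
    if [ "$age" -gt 300 ]; then
      kubectl -n gpu-operator delete pod "$name"
    fi
  done
  sleep 60
done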

@stefanandres

We are even considering writing a controller loop that restarts any device plugin pods stuck in pending for too long

We actually have a workaround in place for that: we're using the descheduler.

deschedulerPolicy:
  profiles:
    - name: RemoveFailedNvidiaInitPods
      # There is a race condition when the gpu-operator modifies the containerd configuration
      # and restarts it while other pods are created. This descheduler policy deletes the
      # stuck pods to force a restart.
      # Hopefully we can remove this when https://github.com/NVIDIA/gpu-operator/issues/991 is fixed.
      pluginConfig:
        - name: DefaultEvictor
          args:
            evictSystemCriticalPods: true
            evictDaemonSetPods: true
            evictLocalStoragePods: true
        - name: PodLifeTime
          args:
            maxPodLifeTimeSeconds: 300
            includingInitContainers: true
            states:
              - "PodInitializing"
              - "Pending"
            namespaces:
              include:
                - "gpu-operator"
            labelSelector:
              matchLabels:
                app: nvidia-operator-validator
      plugins:
        deschedule:
          enabled:
            - "PodLifeTime"
