intermittent container networking errors when backed by containerd #2762
Comments
These errors are observed across Kubernetes versions, not restricted to v1.18.0.
Confirmed that increasing the timeout to 20m does not remove the flakes. Next step: soak tests.
From containerd logs on a vm that is running a pod/container exhibiting this symptom:
Can confirm that all pods/containers running on the affected vm have a common symptom of no networking.
Also can confirm that the vm itself doesn't appear to have any networking issues.
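For completeness, a rough sketch of how those two checks could be scripted; the pod name, vm address, ssh user, and probe target are placeholders, not values from this thread:

```go
package main

import (
	"log"
	"os/exec"
)

// Contrast pod-level vs. vm-level connectivity on the affected node.
// "<pod-on-affected-vm>", "<affected-vm>", and "<reachable-ip>" are placeholders.
func main() {
	// From inside a pod on the affected vm: expected to fail while the bug is active.
	// Probing an IP directly keeps the check independent of (possibly hung) coredns.
	podProbe := exec.Command("kubectl", "exec", "<pod-on-affected-vm>", "--",
		"wget", "-q", "-O", "-", "http://<reachable-ip>")
	if err := podProbe.Run(); err != nil {
		log.Printf("pod-level networking looks broken: %v", err)
	} else {
		log.Printf("pod-level networking is fine")
	}

	// From the vm itself: expected to succeed, matching the observation above.
	vmProbe := exec.Command("ssh", "azureuser@<affected-vm>",
		"curl", "-sSf", "-o", "/dev/null", "http://<reachable-ip>")
	if err := vmProbe.Run(); err != nil {
		log.Printf("vm-level networking also looks broken: %v", err)
	} else {
		log.Printf("vm-level networking is fine")
	}
}
```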
Ugh, this appears to be as simple as the coredns pod getting into a hung state.
One bit of data: I am unable to reproduce this on a cluster w/ a node-scheduled coredns (i.e., not scheduled onto a master vm). If I manually edit the coredns deployment to add a scheduling constraint that lands it on a master vm, the symptom comes back. Something about containerd-backed containers running on master nodes yields a container w/ a non-working network stack.
I've pretty confidently triaged the failure vector to whether or not a container (in this instance coredns, the container that's triggering the E2E failures) is scheduled to a master node (w/ hostNetwork: false). Neither Azure CNI nor kubenet is suspected.
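A minimal sketch of how that scheduling could be forced for reproduction, assuming kubectl access to the cluster; the nodeSelector label and toleration key below are generic assumptions, not values taken from this thread:

```go
package main

import (
	"log"
	"os/exec"
)

// Pin kube-system/coredns onto a master node so the symptom can be
// reproduced on demand. The label and taint keys are assumptions.
func main() {
	patch := `{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/role":"master"},"tolerations":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]}}}}`

	out, err := exec.Command("kubectl", "-n", "kube-system",
		"patch", "deployment", "coredns", "-p", patch).CombinedOutput()
	if err != nil {
		log.Fatalf("patch failed: %v\n%s", err, out)
	}
	log.Printf("coredns pinned to a master vm:\n%s", out)
}
```

Per the comment above, the same deployment left to schedule onto a regular node does not reproduce the failure.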
Fixed in #2865
E2E tests regularly run this basic container networking DNS test after building a cluster:
We are getting intermittent failures for the above to reach a terminal, zero exit code state on clusters running w/ Azure-built containerd:
The errors:
We wait up to 2 minutes before throwing an error in E2E.
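As a rough sketch only, not the actual E2E code: assuming the check boils down to running a one-shot busybox pod that does an in-cluster DNS lookup and then polling for up to 2 minutes for it to finish with exit code zero (the pod name, image, and lookup target are assumptions):

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
	"time"
)

// Launch a one-shot DNS lookup pod, then poll for up to 2 minutes for it
// to reach a terminal, successful state. Pod name, image, and lookup
// target are assumptions, not taken from the issue.
func main() {
	run := exec.Command("kubectl", "run", "dns-probe", "--restart=Never",
		"--image=busybox", "--", "nslookup", "kubernetes.default.svc.cluster.local")
	if out, err := run.CombinedOutput(); err != nil {
		log.Fatalf("failed to launch probe pod: %v\n%s", err, out)
	}

	deadline := time.Now().Add(2 * time.Minute) // mirrors the 2-minute E2E wait
	for time.Now().Before(deadline) {
		out, err := exec.Command("kubectl", "get", "pod", "dns-probe",
			"-o", "jsonpath={.status.phase}").Output()
		if err == nil {
			switch phase := strings.TrimSpace(string(out)); phase {
			case "Succeeded": // container exited 0: DNS inside the pod works
				fmt.Println("DNS probe succeeded")
				return
			case "Failed": // non-zero exit: the flake described in this issue
				log.Fatal("DNS probe failed")
			}
		}
		time.Sleep(5 * time.Second)
	}
	log.Fatal("timed out after 2 minutes waiting for the DNS probe to complete")
}
```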