intermittent container networking errors when backed by containerd #2762

Closed · jackfrancis opened this issue Feb 21, 2020 · 10 comments
Labels: bug Something isn't working

@jackfrancis (Member)

E2E tests regularly run this basic container networking DNS test after building a cluster:

$ kubectl describe pod validate-dns-linux-4p75n -n default completed in 913.265905ms
 2020/02/21 16:22:35 
 Name:         validate-dns-linux-4p75n
 Namespace:    default
 Priority:     0
 Node:         k8s-agentpool1-13396981-vmss000000/10.240.0.34
 Start Time:   Fri, 21 Feb 2020 16:20:31 +0000
 Labels:       controller-uid=74e04685-aed4-4943-91d4-17eb49e6cd5d
               job-name=validate-dns-linux
 Annotations:  kubernetes.io/psp: privileged
 Status:       Running
 IP:           10.240.0.52
 IPs:
   IP:           10.240.0.52
 Controlled By:  Job/validate-dns-linux
 Containers:
   validate-bing-google:
     Container ID:  containerd://9ea0e6c78af111ff70224d4722d9ce6f0f8303e819bddffad3ebdfe3c73ac61d
     Image:         library/busybox
     Image ID:      docker.io/library/busybox@sha256:6915be4043561d64e0ab0f8f098dc2ac48e077fe23f488ac24b665166898115a
     Port:          <none>
     Host Port:     <none>
     Command:
       sh
       -c
       until nslookup www.bing.com || nslookup google.com; do echo waiting for DNS resolution; sleep 1; done;
     State:          Running
       Started:      Fri, 21 Feb 2020 16:20:35 +0000
     Ready:          True
     Restart Count:  0
     Environment:    <none>
     Mounts:
       /var/run/secrets/kubernetes.io/serviceaccount from default-token-rnh6k (ro)
 Conditions:
   Type              Status
   Initialized       True 
   Ready             True 
   ContainersReady   True 
   PodScheduled      True 
 Volumes:
   default-token-rnh6k:
     Type:        Secret (a volume populated by a Secret)
     SecretName:  default-token-rnh6k
     Optional:    false
 QoS Class:       BestEffort
 Node-Selectors:  beta.kubernetes.io/os=linux
 Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                  node.kubernetes.io/unreachable:NoExecute for 300s
 Events:
   Type    Reason     Age        From                                         Message
   ----    ------     ----       ----                                         -------
   Normal  Scheduled  <unknown>  default-scheduler                            Successfully assigned default/validate-dns-linux-4p75n to k8s-agentpool1-13396981-vmss000000
   Normal  Pulling    2m3s       kubelet, k8s-agentpool1-13396981-vmss000000  Pulling image "library/busybox"
   Normal  Pulled     2m         kubelet, k8s-agentpool1-13396981-vmss000000  Successfully pulled image "library/busybox"
   Normal  Created    2m         kubelet, k8s-agentpool1-13396981-vmss000000  Created container validate-bing-google
   Normal  Started    2m         kubelet, k8s-agentpool1-13396981-vmss000000  Started container validate-bing-google
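
For reference, a minimal Job sketch reconstructed from the describe output above; the actual E2E manifest may differ in details (labels, restartPolicy, etc.):

apiVersion: batch/v1
kind: Job
metadata:
  name: validate-dns-linux
  namespace: default
spec:
  template:
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
      containers:
      - name: validate-bing-google
        image: library/busybox
        # Loop until in-cluster DNS can resolve an external name.
        command:
        - sh
        - -c
        - until nslookup www.bing.com || nslookup google.com; do echo waiting for DNS resolution; sleep 1; done;
      # restartPolicy is assumed; a Job pod template requires Never or OnFailure.
      restartPolicy: Never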

On clusters running the Azure-built containerd, this test intermittently never reaches a terminal, zero-exit-code state:

$ k get nodes -o json
 2020/02/21 16:14:45 NAME                                 STATUS   ROLES    AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
 k8s-agentpool1-13396981-vmss000000   Ready    <none>   46s   v1.18.0-alpha.5   10.240.0.34    <none>        Ubuntu 16.04.6 LTS   4.15.0-1069-azure   containerd://1.3.2+azure
 k8s-agentpool1-13396981-vmss000001   Ready    <none>   46s   v1.18.0-alpha.5   10.240.0.65    <none>        Ubuntu 16.04.6 LTS   4.15.0-1069-azure   containerd://1.3.2+azure
 k8s-master-13396981-0                Ready    <none>   46s   v1.18.0-alpha.5   10.255.255.5   <none>        Ubuntu 16.04.6 LTS   4.15.0-1069-azure   containerd://1.3.2+azure

The errors:

$ k logs validate-dns-linux-4p75n -c validate-bing-google -n default
;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached

We wait up to 2 minutes before throwing an error in E2E.
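
That gate can be approximated with kubectl (a hedged sketch; the E2E harness's actual polling code may differ, and the job name/namespace are taken from the output above):

$ kubectl wait --for=condition=complete --timeout=120s job/validate-dns-linux -n default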

jackfrancis added the bug label on Feb 21, 2020

@jackfrancis (Member Author)

These errors are observed across Kubernetes versions; they are not restricted to v1.18.0.

@jackfrancis (Member Author)

@cpuguy83 @ritazh FYI

@jackfrancis (Member Author)

Confirmed that increasing the timeout to 20m does not remove the flakes. Next step: soak tests.

@jackfrancis (Member Author)

From the containerd logs on a VM running a pod/container exhibiting this symptom:

Feb 24 23:45:08 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:08.011012305Z" level=error msg="ExecSync for \"501e2f632c219866ed094eb9b4eae2254da1986bdb8d7a8062d7740972ad1b3a\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 1s exceeded: context deadline exceeded"
Feb 24 23:45:08 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:08.012069213Z" level=info msg="ExecSync for \"501e2f632c219866ed094eb9b4eae2254da1986bdb8d7a8062d7740972ad1b3a\" with command [sh -c nslookup www.bing.com || nslookup google.com] and timeout 1 (s)"
Feb 24 23:45:09 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:09.064354380Z" level=info msg="Timeout received while waiting for exec process kill \"d064f1a2043daf787b9c5a0414b2d0d1bd56050392424ad50d3517d21b38ffdc\" code 137 and error <nil>"
Feb 24 23:45:13 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:13.066558597Z" level=info msg="Finish piping \"stdout\" of container exec \"d064f1a2043daf787b9c5a0414b2d0d1bd56050392424ad50d3517d21b38ffdc\""
Feb 24 23:45:13 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:13.066706498Z" level=info msg="Finish piping \"stderr\" of container exec \"d064f1a2043daf787b9c5a0414b2d0d1bd56050392424ad50d3517d21b38ffdc\""
Feb 24 23:45:13 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:13.068765214Z" level=error msg="ExecSync for \"501e2f632c219866ed094eb9b4eae2254da1986bdb8d7a8062d7740972ad1b3a\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 1s exceeded: context deadline exceeded"
Feb 24 23:45:13 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:13.069690421Z" level=info msg="ExecSync for \"501e2f632c219866ed094eb9b4eae2254da1986bdb8d7a8062d7740972ad1b3a\" with command [sh -c nslookup www.bing.com || nslookup google.com] and timeout 1 (s)"
Feb 24 23:45:14 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:14.118517061Z" level=info msg="Timeout received while waiting for exec process kill \"f9c32d88f4e1d59ccd282306e59bf8b021ba696090fa88b30d3e3b94fbc70b0f\" code 137 and error <nil>"
Feb 24 23:45:18 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:18.126240120Z" level=info msg="Finish piping \"stderr\" of container exec \"f9c32d88f4e1d59ccd282306e59bf8b021ba696090fa88b30d3e3b94fbc70b0f\""
Feb 24 23:45:18 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:18.127006926Z" level=info msg="Finish piping \"stdout\" of container exec \"f9c32d88f4e1d59ccd282306e59bf8b021ba696090fa88b30d3e3b94fbc70b0f\""
Feb 24 23:45:18 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:18.130647653Z" level=error msg="ExecSync for \"501e2f632c219866ed094eb9b4eae2254da1986bdb8d7a8062d7740972ad1b3a\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 1s exceeded: context deadline exceeded"
Feb 24 23:45:18 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:18.132838469Z" level=info msg="ExecSync for \"501e2f632c219866ed094eb9b4eae2254da1986bdb8d7a8062d7740972ad1b3a\" with command [sh -c nslookup www.bing.com || nslookup google.com] and timeout 1 (s)"
Feb 24 23:45:19 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:19.204182178Z" level=info msg="Timeout received while waiting for exec process kill \"1a5f0bdd934116b2196f29cdf576d4c0d3dddbf49c44d86493fb515b8a1bd329\" code 137 and error <nil>"
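
For anyone chasing the same symptom, these lines can be pulled on the node with something like the following (assuming containerd runs under its standard systemd unit):

$ journalctl -u containerd --since "2020-02-24 23:45:00" --no-pager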

@jackfrancis (Member Author)

Can confirm that all pods/containers running on the affected VM share a common symptom: no networking.

@jackfrancis (Member Author)

Can also confirm that the VM itself doesn't appear to have any networking issues.
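
A quick spot check from the node itself, for example:

$ nslookup www.bing.com   # run via SSH on the affected VM, outside any container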

@jackfrancis (Member Author)

Ugh, this appears to be as simple as the coredns pod getting into a hung state.

[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:51966->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:54578->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:38348->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:55543->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:59305->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:51830->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:37616->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:47401->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:56034->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:46424->168.63.129.16:53: i/o timeout
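
The output above is from the coredns pod's logs; something like the following retrieves it (label selector assumed to match the standard coredns deployment):

$ kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50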

jackfrancis changed the title from "intermittent container networking DNS errors when backed by containerd" to "intermittent container networking errors when backed by containerd" on Feb 26, 2020

@jackfrancis (Member Author)

One bit of data: I am unable to reproduce this on a cluster w/ a node-scheduled coredns (i.e., not scheduled onto a master VM). If I manually edit the coredns deployment to add a kubernetes.io/role: master nodeSelector, I can reproduce very easily; it seems to hit roughly 50% of the time.

Something about containerd-backed containers running on master nodes yields a container w/ a non-working network stack.
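
A sketch of that manual edit expressed as a patch (deployment name and field path assumed; kubectl edit works just as well):

$ kubectl -n kube-system patch deployment coredns --type merge \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/role":"master"}}}}}'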

@jackfrancis (Member Author)

I've fairly confidently triaged the failure vector down to whether or not a container (in this instance coredns, the container triggering the E2E failures) is scheduled to a master node (w/ hostNetwork: false).
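
A quick way to confirm which node coredns landed on (label selector assumed):

$ kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide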

Neither Azure CNI nor kubenet is suspected.

@jackfrancis (Member Author)

Fixed in #2865
