intermittent container networking errors when backed by containerd #2762

Closed · jackfrancis opened this issue Feb 21, 2020 · 10 comments
Labels: bug Something isn't working

@jackfrancis (Member)

E2E tests regularly run this basic container networking DNS test after building a cluster:

$ kubectl describe pod validate-dns-linux-4p75n -n default completed in 913.265905ms
 2020/02/21 16:22:35 
 Name:         validate-dns-linux-4p75n
 Namespace:    default
 Priority:     0
 Node:         k8s-agentpool1-13396981-vmss000000/10.240.0.34
 Start Time:   Fri, 21 Feb 2020 16:20:31 +0000
 Labels:       controller-uid=74e04685-aed4-4943-91d4-17eb49e6cd5d
               job-name=validate-dns-linux
 Annotations:  kubernetes.io/psp: privileged
 Status:       Running
 IP:           10.240.0.52
 IPs:
   IP:           10.240.0.52
 Controlled By:  Job/validate-dns-linux
 Containers:
   validate-bing-google:
     Container ID:  containerd://9ea0e6c78af111ff70224d4722d9ce6f0f8303e819bddffad3ebdfe3c73ac61d
     Image:         library/busybox
     Image ID:      docker.io/library/busybox@sha256:6915be4043561d64e0ab0f8f098dc2ac48e077fe23f488ac24b665166898115a
     Port:          <none>
     Host Port:     <none>
     Command:
       sh
       -c
       until nslookup www.bing.com || nslookup google.com; do echo waiting for DNS resolution; sleep 1; done;
     State:          Running
       Started:      Fri, 21 Feb 2020 16:20:35 +0000
     Ready:          True
     Restart Count:  0
     Environment:    <none>
     Mounts:
       /var/run/secrets/kubernetes.io/serviceaccount from default-token-rnh6k (ro)
 Conditions:
   Type              Status
   Initialized       True 
   Ready             True 
   ContainersReady   True 
   PodScheduled      True 
 Volumes:
   default-token-rnh6k:
     Type:        Secret (a volume populated by a Secret)
     SecretName:  default-token-rnh6k
     Optional:    false
 QoS Class:       BestEffort
 Node-Selectors:  beta.kubernetes.io/os=linux
 Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                  node.kubernetes.io/unreachable:NoExecute for 300s
 Events:
   Type    Reason     Age        From                                         Message
   ----    ------     ----       ----                                         -------
   Normal  Scheduled  <unknown>  default-scheduler                            Successfully assigned default/validate-dns-linux-4p75n to k8s-agentpool1-13396981-vmss000000
   Normal  Pulling    2m3s       kubelet, k8s-agentpool1-13396981-vmss000000  Pulling image "library/busybox"
   Normal  Pulled     2m         kubelet, k8s-agentpool1-13396981-vmss000000  Successfully pulled image "library/busybox"
   Normal  Created    2m         kubelet, k8s-agentpool1-13396981-vmss000000  Created container validate-bing-google
   Normal  Started    2m         kubelet, k8s-agentpool1-13396981-vmss000000  Started container validate-bing-google
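
For reference, a minimal Job sketch reconstructed from the describe output above; the actual E2E manifest may differ in details (labels, restartPolicy, etc.):

apiVersion: batch/v1
kind: Job
metadata:
  name: validate-dns-linux
  namespace: default
spec:
  template:
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
      containers:
      - name: validate-bing-google
        image: library/busybox
        # Loop until in-cluster DNS can resolve an external name.
        command:
        - sh
        - -c
        - until nslookup www.bing.com || nslookup google.com; do echo waiting for DNS resolution; sleep 1; done;
      # restartPolicy is assumed; a Job pod template requires Never or OnFailure.
      restartPolicy: Never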

On clusters running the Azure-built containerd, this test intermittently never reaches a terminal, zero-exit-code state:

$ k get nodes -o json
 2020/02/21 16:14:45 NAME                                 STATUS   ROLES    AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
 k8s-agentpool1-13396981-vmss000000   Ready    <none>   46s   v1.18.0-alpha.5   10.240.0.34    <none>        Ubuntu 16.04.6 LTS   4.15.0-1069-azure   containerd://1.3.2+azure
 k8s-agentpool1-13396981-vmss000001   Ready    <none>   46s   v1.18.0-alpha.5   10.240.0.65    <none>        Ubuntu 16.04.6 LTS   4.15.0-1069-azure   containerd://1.3.2+azure
 k8s-master-13396981-0                Ready    <none>   46s   v1.18.0-alpha.5   10.255.255.5   <none>        Ubuntu 16.04.6 LTS   4.15.0-1069-azure   containerd://1.3.2+azure

The errors:

$ k logs validate-dns-linux-4p75n -c validate-bing-google -n default
;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached
 
 ;; connection timed out; no servers could be reached
 
 waiting for DNS resolution
 ;; connection timed out; no servers could be reached

We wait up to 2 minutes before throwing an error in E2E.
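
That gate can be approximated with kubectl (a hedged sketch; the E2E harness's actual polling code may differ, and the job name/namespace are taken from the output above):

$ kubectl wait --for=condition=complete --timeout=120s job/validate-dns-linux -n default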

jackfrancis added the bug label on Feb 21, 2020

@jackfrancis (Member Author)

These errors are observed across Kubernetes versions; they are not restricted to v1.18.0.

@jackfrancis (Member Author)

@cpuguy83 @ritazh FYI

@jackfrancis (Member Author)

Confirmed that increasing the timeout to 20m does not remove the flakes. Next step: soak tests.

@jackfrancis (Member Author)

From the containerd logs on a VM running a pod/container exhibiting this symptom:

Feb 24 23:45:08 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:08.011012305Z" level=error msg="ExecSync for \"501e2f632c219866ed094eb9b4eae2254da1986bdb8d7a8062d7740972ad1b3a\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 1s exceeded: context deadline exceeded"
Feb 24 23:45:08 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:08.012069213Z" level=info msg="ExecSync for \"501e2f632c219866ed094eb9b4eae2254da1986bdb8d7a8062d7740972ad1b3a\" with command [sh -c nslookup www.bing.com || nslookup google.com] and timeout 1 (s)"
Feb 24 23:45:09 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:09.064354380Z" level=info msg="Timeout received while waiting for exec process kill \"d064f1a2043daf787b9c5a0414b2d0d1bd56050392424ad50d3517d21b38ffdc\" code 137 and error <nil>"
Feb 24 23:45:13 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:13.066558597Z" level=info msg="Finish piping \"stdout\" of container exec \"d064f1a2043daf787b9c5a0414b2d0d1bd56050392424ad50d3517d21b38ffdc\""
Feb 24 23:45:13 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:13.066706498Z" level=info msg="Finish piping \"stderr\" of container exec \"d064f1a2043daf787b9c5a0414b2d0d1bd56050392424ad50d3517d21b38ffdc\""
Feb 24 23:45:13 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:13.068765214Z" level=error msg="ExecSync for \"501e2f632c219866ed094eb9b4eae2254da1986bdb8d7a8062d7740972ad1b3a\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 1s exceeded: context deadline exceeded"
Feb 24 23:45:13 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:13.069690421Z" level=info msg="ExecSync for \"501e2f632c219866ed094eb9b4eae2254da1986bdb8d7a8062d7740972ad1b3a\" with command [sh -c nslookup www.bing.com || nslookup google.com] and timeout 1 (s)"
Feb 24 23:45:14 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:14.118517061Z" level=info msg="Timeout received while waiting for exec process kill \"f9c32d88f4e1d59ccd282306e59bf8b021ba696090fa88b30d3e3b94fbc70b0f\" code 137 and error <nil>"
Feb 24 23:45:18 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:18.126240120Z" level=info msg="Finish piping \"stderr\" of container exec \"f9c32d88f4e1d59ccd282306e59bf8b021ba696090fa88b30d3e3b94fbc70b0f\""
Feb 24 23:45:18 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:18.127006926Z" level=info msg="Finish piping \"stdout\" of container exec \"f9c32d88f4e1d59ccd282306e59bf8b021ba696090fa88b30d3e3b94fbc70b0f\""
Feb 24 23:45:18 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:18.130647653Z" level=error msg="ExecSync for \"501e2f632c219866ed094eb9b4eae2254da1986bdb8d7a8062d7740972ad1b3a\" failed" error="rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 1s exceeded: context deadline exceeded"
Feb 24 23:45:18 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:18.132838469Z" level=info msg="ExecSync for \"501e2f632c219866ed094eb9b4eae2254da1986bdb8d7a8062d7740972ad1b3a\" with command [sh -c nslookup www.bing.com || nslookup google.com] and timeout 1 (s)"
Feb 24 23:45:19 k8s-agentpool1-15685329-vmss000000 containerd[12589]: time="2020-02-24T23:45:19.204182178Z" level=info msg="Timeout received while waiting for exec process kill \"1a5f0bdd934116b2196f29cdf576d4c0d3dddbf49c44d86493fb515b8a1bd329\" code 137 and error <nil>"
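
For anyone chasing the same symptom, these lines can be pulled on the node with something like the following (assuming containerd runs under its standard systemd unit):

$ journalctl -u containerd --since "2020-02-24 23:45:00" --no-pager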

@jackfrancis (Member Author)

Can confirm that all pods/containers running on the affected VM share a common symptom: no networking.

@jackfrancis (Member Author)

Can also confirm that the VM itself doesn't appear to have any networking issues.
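
A quick spot check from the node itself, for example:

$ nslookup www.bing.com   # run via SSH on the affected VM, outside any container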

@jackfrancis (Member Author)

Ugh, this appears to be as simple as the coredns pod getting into a hung state.

[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:51966->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:54578->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:38348->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:55543->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:59305->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:51830->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:37616->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:47401->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:56034->168.63.129.16:53: i/o timeout
[ERROR] plugin/errors: 2 4609656293346902366.1441486296488557915. HINFO: read udp 10.240.0.15:46424->168.63.129.16:53: i/o timeout
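
The output above is from the coredns pod's logs; something like the following retrieves it (label selector assumed to match the standard coredns deployment):

$ kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50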

jackfrancis changed the title from "intermittent container networking DNS errors when backed by containerd" to "intermittent container networking errors when backed by containerd" on Feb 26, 2020

@jackfrancis (Member Author)

One bit of data: I am unable to reproduce this on a cluster w/ a node-scheduled coredns (i.e., not scheduled onto a master VM). If I manually edit the coredns deployment to add a kubernetes.io/role: master nodeSelector, I can reproduce very easily; it seems to hit roughly 50% of the time.

Something about containerd-backed containers running on master nodes yields a container w/ a non-working network stack.
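
A sketch of that manual edit expressed as a patch (deployment name and field path assumed; kubectl edit works just as well):

$ kubectl -n kube-system patch deployment coredns --type merge \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/role":"master"}}}}}'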

@jackfrancis (Member Author)

I've fairly confidently triaged the failure vector down to whether or not a container (in this instance coredns, the container triggering the E2E failures) is scheduled to a master node (w/ hostNetwork: false).
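
A quick way to confirm which node coredns landed on (label selector assumed):

$ kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide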

Neither Azure CNI nor kubenet is suspected.

@jackfrancis (Member Author)

Fixed in #2865
