
DNS resolution failing after cluster runs for a while #2100

Closed
hatzhang opened this issue Mar 3, 2021 · 3 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments


hatzhang commented Mar 3, 2021

My cluster runs into DNS resolution failures after it has been up for a while.
I am testing fluxcd using kind. Inside the cluster I have my application, a RabbitMQ cluster defined via the RabbitMQ cluster operator, and fluxcd's CRDs; I run several kind clusters on the same Arch Linux host. When a cluster first comes up, everything works fine. But when I checked again the next day, the RabbitMQ pod had crashed hundreds of times because of a DNS issue resolving rabbitmq-nodes.

$ k get crd
NAME                                             CREATED AT
alerts.notification.toolkit.fluxcd.io            2021-03-01T10:13:52Z
buckets.source.toolkit.fluxcd.io                 2021-03-01T10:13:52Z
gitrepositories.source.toolkit.fluxcd.io         2021-03-01T10:13:52Z
helmcharts.source.toolkit.fluxcd.io              2021-03-01T10:13:52Z
helmreleases.helm.toolkit.fluxcd.io              2021-03-01T10:13:52Z
helmrepositories.source.toolkit.fluxcd.io        2021-03-01T10:13:52Z
imagepolicies.image.toolkit.fluxcd.io            2021-03-01T10:13:52Z
imagerepositories.image.toolkit.fluxcd.io        2021-03-01T10:13:52Z
imageupdateautomations.image.toolkit.fluxcd.io   2021-03-01T10:13:52Z
kustomizations.kustomize.toolkit.fluxcd.io       2021-03-01T10:13:52Z
providers.notification.toolkit.fluxcd.io         2021-03-01T10:13:52Z
rabbitmqclusters.rabbitmq.com                    2021-03-01T10:15:51Z
receivers.notification.toolkit.fluxcd.io         2021-03-01T10:13:52Z

If I try nslookup now, it sometimes succeeds, but most attempts fail.

$ k get svc
NAME                                                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
kubernetes                                                    ClusterIP   10.96.0.1       <none>        443/TCP                      44h
rabbitmq                                                      ClusterIP   10.96.227.42    <none>        5672/TCP,15672/TCP           43h
rabbitmq-nodes                                                ClusterIP   None            <none>        4369/TCP,25672/TCP           43h

# nslookup rabbitmq
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   rabbitmq.default.svc.cluster.local
Address: 10.96.227.42

# nslookup rabbitmq
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   rabbitmq.default.svc.cluster.local
Address: 10.96.227.42
;; connection timed out; no servers could be reached
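
To get a rough failure rate rather than one-off attempts, a short loop from inside any pod that has nslookup (as in the outputs above) can count successes and failures. This is only a sketch; the 20-try count is arbitrary and it assumes a shell with seq in the pod image:

# for i in $(seq 1 20); do nslookup rabbitmq-nodes >/dev/null 2>&1 && echo ok || echo fail; done | sort | uniq -c

Running the same loop against both rabbitmq-nodes (the headless service the RabbitMQ pods depend on) and kubernetes.default can show whether the failures are specific to one record or affect all lookups.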

The following output just follows the Kubernetes "Debugging DNS Resolution" guide; I'm not quite sure what exactly the problem is.

#  nslookup kubernetes.default
;; connection timed out; no servers could be reached

# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5

$ kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-7h5lm   1/1     Running   3          44h
coredns-f9fd979d6-qrgsb   1/1     Running   3          44h
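
The 3 restarts on each CoreDNS pod may themselves be a clue. A sketch of how to see why they restarted, using the pod names from the output above (they will differ on another cluster):

$ kubectl -n kube-system describe pod coredns-f9fd979d6-7h5lm | grep -A5 'Last State'
$ kubectl -n kube-system logs coredns-f9fd979d6-7h5lm --previous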

$ kubectl get svc --namespace=kube-system
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   44h

$ kubectl get endpoints kube-dns --namespace=kube-system
NAME       ENDPOINTS                                                 AGE
kube-dns   10.244.0.2:53,10.244.0.4:53,10.244.0.2:9153 + 3 more...   44h
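
A useful next step is to query the CoreDNS pods directly by IP, which bypasses the 10.96.0.10 Service VIP and therefore the kube-proxy/iptables path. A sketch, run from inside a pod, using the endpoint IPs from the output above (they may differ on your cluster):

# nslookup kubernetes.default 10.244.0.2
# nslookup kubernetes.default 10.244.0.4

If direct queries to the pod IPs work reliably while queries via 10.96.0.10 time out, the problem is more likely in the service/kube-proxy path than in CoreDNS itself. The debugging guide also suggests adding the log plugin to the CoreDNS Corefile (kubectl -n kube-system edit configmap coredns) to see whether queries are reaching CoreDNS at all.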

$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns
[INFO] plugin/ready: Still waiting on: "kubernetes"
I0302 00:27:05.332935       1 trace.go:116] Trace[1427131847]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125 (started: 2021-03-02 00:26:35.332163608 +0000 UTC m=+0.046317506) (total time: 30.000676s):
Trace[1427131847]: [30.000676s] [30.000676s] END
E0302 00:27:05.332955       1 reflector.go:178] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125: Failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I0302 00:27:05.332974       1 trace.go:116] Trace[939984059]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125 (started: 2021-03-02 00:26:35.332282908 +0000 UTC m=+0.046436806) (total time: 30.0005644s):
Trace[939984059]: [30.0005644s] [30.0005644s] END
E0302 00:27:05.332979       1 reflector.go:178] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125: Failed to list *v1.Endpoints: Get "https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I0302 00:27:05.332988       1 trace.go:116] Trace[911902081]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125 (started: 2021-03-02 00:26:35.332296408 +0000 UTC m=+0.046450306) (total time: 30.0005879s):
Trace[911902081]: [30.0005879s] [30.0005879s] END
E0302 00:27:05.332991       1 reflector.go:178] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125: Failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[ERROR] plugin/errors: 2 notification-controller. A: read udp 10.244.0.2:58912->172.21.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 notification-controller. A: read udp 10.244.0.2:34836->172.21.0.1:53: i/o timeout
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] 127.0.0.1:49268 - 16177 "HINFO IN 680491325768953042.3677157172776353241. udp 56 false 512" NXDOMAIN qr,rd,ra 131 0.009510801s
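
The errors above show CoreDNS timing out in two directions: towards the API server Service VIP (10.96.0.1:443) and towards its upstream resolver at 172.21.0.1:53, which looks like the Docker network gateway that the kind node's own resolver points at. A hedged way to inspect both ends, assuming the default cluster name "kind" (node container names may differ; kind get nodes lists them):

$ kind get nodes
$ docker exec kind-control-plane cat /etc/resolv.conf
$ kubectl -n kube-system get configmap coredns -o yaml | grep -A3 forward

The forward line in the Corefile shows where CoreDNS sends non-cluster queries; if that target (or the node's own resolver) became unreachable, external lookups would time out exactly as in the plugin/errors lines above.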

So what could be the reason, and what should be the next move to resolve this?

Thanks.

hatzhang added the kind/support label Mar 3, 2021
@BenTheElder
Member

Does it look like #1975?
We haven't quite root-caused that one yet based on what logs etc. have been shared, and I don't have a reproducer.
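
For sharing logs on issues like this, kind can bundle the full node and cluster logs in one step. A minimal sketch, assuming the default cluster name (add --name <cluster> otherwise):

$ kind export logs /tmp/kind-logs

The output directory contains the node container logs, kubelet and containerd logs, and pod logs, which is usually what's needed to make progress on root-causing.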


hatzhang commented Mar 4, 2021

Not exactly. The CoreDNS pods are running well.
My Arch Linux host is provisioned on Windows Hyper-V, and the physical machine was rebooted accidentally; I'm not sure if that is related.
There are some errors showing in kube-controller-manager:

E0303 19:57:04.214172       1 leaderelection.go:321] error retrieving resource lock kube-system/kube-controller-manager: Get "https://172.21.0.3:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
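
The x509 error suggests the API server's serving certificate no longer verifies against the cluster CA that the controller-manager's kubeconfig trusts. One hedged way to check this from the host, assuming openssl is installed and the node IP 172.21.0.3 from the log above is reachable from the host (otherwise use the 127.0.0.1:<port> server address from the kind kubeconfig), and that the admin kubeconfig carries the same cluster CA:

$ kubectl config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d > /tmp/kind-ca.crt
$ echo | openssl s_client -connect 172.21.0.3:6443 -CAfile /tmp/kind-ca.crt 2>/dev/null | grep 'Verify return code'

A non-zero verify return code would confirm that the serving certificate and the CA in the kubeconfig no longer match.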

Is there any way to recover from this situation?

@BenTheElder
Member

That's not broken DNS resolution, but it is a different existing issue: #2045

hatzhang closed this as completed Mar 5, 2021