
DNS resolution failing after cluster runs for a while #2100

Closed
hatzhang opened this issue Mar 3, 2021 · 3 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments


hatzhang commented Mar 3, 2021

My cluster runs into DNS resolution failures after it has been up for a while.
I am testing fluxcd using kind. Inside the cluster I have my application, a RabbitMQ cluster defined via the RabbitMQ cluster operator, and fluxcd's CRDs; I run several kind clusters on the same Arch Linux host. When a cluster first comes up, everything works fine. But when I checked again the next day, the RabbitMQ pod had crashed hundreds of times because of a DNS issue resolving rabbitmq-nodes.

$ k get crd
NAME                                             CREATED AT
alerts.notification.toolkit.fluxcd.io            2021-03-01T10:13:52Z
buckets.source.toolkit.fluxcd.io                 2021-03-01T10:13:52Z
gitrepositories.source.toolkit.fluxcd.io         2021-03-01T10:13:52Z
helmcharts.source.toolkit.fluxcd.io              2021-03-01T10:13:52Z
helmreleases.helm.toolkit.fluxcd.io              2021-03-01T10:13:52Z
helmrepositories.source.toolkit.fluxcd.io        2021-03-01T10:13:52Z
imagepolicies.image.toolkit.fluxcd.io            2021-03-01T10:13:52Z
imagerepositories.image.toolkit.fluxcd.io        2021-03-01T10:13:52Z
imageupdateautomations.image.toolkit.fluxcd.io   2021-03-01T10:13:52Z
kustomizations.kustomize.toolkit.fluxcd.io       2021-03-01T10:13:52Z
providers.notification.toolkit.fluxcd.io         2021-03-01T10:13:52Z
rabbitmqclusters.rabbitmq.com                    2021-03-01T10:15:51Z
receivers.notification.toolkit.fluxcd.io         2021-03-01T10:13:52Z

If I try nslookup now, it sometimes succeeds, but most attempts fail.

$ k get svc
NAME                                                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
kubernetes                                                    ClusterIP   10.96.0.1       <none>        443/TCP                      44h
rabbitmq                                                      ClusterIP   10.96.227.42    <none>        5672/TCP,15672/TCP           43h
rabbitmq-nodes                                                ClusterIP   None            <none>        4369/TCP,25672/TCP           43h

# nslookup rabbitmq
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   rabbitmq.default.svc.cluster.local
Address: 10.96.227.42

# nslookup rabbitmq
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   rabbitmq.default.svc.cluster.local
Address: 10.96.227.42
;; connection timed out; no servers could be reached
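
To get a rough failure rate rather than one-off attempts, a short loop from inside any pod that has nslookup (as in the outputs above) can count successes and failures. This is only a sketch; the 20-try count is arbitrary and it assumes a shell with seq in the pod image:

# for i in $(seq 1 20); do nslookup rabbitmq-nodes >/dev/null 2>&1 && echo ok || echo fail; done | sort | uniq -c

Running the same loop against both rabbitmq-nodes (the headless service the RabbitMQ pods depend on) and kubernetes.default can show whether the failures are specific to one record or affect all lookups.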

The following output just follows the Kubernetes "Debugging DNS Resolution" guide; I'm not quite sure what exactly the problem is.

#  nslookup kubernetes.default
;; connection timed out; no servers could be reached

# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5

$ kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-7h5lm   1/1     Running   3          44h
coredns-f9fd979d6-qrgsb   1/1     Running   3          44h
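
The 3 restarts on each CoreDNS pod may themselves be a clue. A sketch of how to see why they restarted, using the pod names from the output above (they will differ on another cluster):

$ kubectl -n kube-system describe pod coredns-f9fd979d6-7h5lm | grep -A5 'Last State'
$ kubectl -n kube-system logs coredns-f9fd979d6-7h5lm --previous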

$ kubectl get svc --namespace=kube-system
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   44h

$ kubectl get endpoints kube-dns --namespace=kube-system
NAME       ENDPOINTS                                                 AGE
kube-dns   10.244.0.2:53,10.244.0.4:53,10.244.0.2:9153 + 3 more...   44h
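
A useful next step is to query the CoreDNS pods directly by IP, which bypasses the 10.96.0.10 Service VIP and therefore the kube-proxy/iptables path. A sketch, run from inside a pod, using the endpoint IPs from the output above (they may differ on your cluster):

# nslookup kubernetes.default 10.244.0.2
# nslookup kubernetes.default 10.244.0.4

If direct queries to the pod IPs work reliably while queries via 10.96.0.10 time out, the problem is more likely in the service/kube-proxy path than in CoreDNS itself. The debugging guide also suggests adding the log plugin to the CoreDNS Corefile (kubectl -n kube-system edit configmap coredns) to see whether queries are reaching CoreDNS at all.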

$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns
[INFO] plugin/ready: Still waiting on: "kubernetes"
I0302 00:27:05.332935       1 trace.go:116] Trace[1427131847]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125 (started: 2021-03-02 00:26:35.332163608 +0000 UTC m=+0.046317506) (total time: 30.000676s):
Trace[1427131847]: [30.000676s] [30.000676s] END
E0302 00:27:05.332955       1 reflector.go:178] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125: Failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I0302 00:27:05.332974       1 trace.go:116] Trace[939984059]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125 (started: 2021-03-02 00:26:35.332282908 +0000 UTC m=+0.046436806) (total time: 30.0005644s):
Trace[939984059]: [30.0005644s] [30.0005644s] END
E0302 00:27:05.332979       1 reflector.go:178] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125: Failed to list *v1.Endpoints: Get "https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I0302 00:27:05.332988       1 trace.go:116] Trace[911902081]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125 (started: 2021-03-02 00:26:35.332296408 +0000 UTC m=+0.046450306) (total time: 30.0005879s):
Trace[911902081]: [30.0005879s] [30.0005879s] END
E0302 00:27:05.332991       1 reflector.go:178] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125: Failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[ERROR] plugin/errors: 2 notification-controller. A: read udp 10.244.0.2:58912->172.21.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 notification-controller. A: read udp 10.244.0.2:34836->172.21.0.1:53: i/o timeout
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] 127.0.0.1:49268 - 16177 "HINFO IN 680491325768953042.3677157172776353241. udp 56 false 512" NXDOMAIN qr,rd,ra 131 0.009510801s
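
The errors above show CoreDNS timing out in two directions: towards the API server Service VIP (10.96.0.1:443) and towards its upstream resolver at 172.21.0.1:53, which looks like the Docker network gateway that the kind node's own resolver points at. A hedged way to inspect both ends, assuming the default cluster name "kind" (node container names may differ; kind get nodes lists them):

$ kind get nodes
$ docker exec kind-control-plane cat /etc/resolv.conf
$ kubectl -n kube-system get configmap coredns -o yaml | grep -A3 forward

The forward line in the Corefile shows where CoreDNS sends non-cluster queries; if that target (or the node's own resolver) became unreachable, external lookups would time out exactly as in the plugin/errors lines above.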

So what could be the reason, and what should be the next move to resolve this?

Thanks.

hatzhang added the kind/support label Mar 3, 2021
@BenTheElder
Member

Does it look like #1975?
We haven't quite root-caused that one yet based on what logs etc. have been shared, and I don't have a reproducer.
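
For sharing logs on issues like this, kind can bundle the full node and cluster logs in one step. A minimal sketch, assuming the default cluster name (add --name <cluster> otherwise):

$ kind export logs /tmp/kind-logs

The output directory contains the node container logs, kubelet and containerd logs, and pod logs, which is usually what's needed to make progress on root-causing.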


hatzhang commented Mar 4, 2021

Not exactly. The CoreDNS pods are running well.
My Arch Linux host is provisioned on Windows Hyper-V, and the physical machine was rebooted accidentally; I'm not sure if that is related.
There are some errors showing in kube-controller-manager:

E0303 19:57:04.214172       1 leaderelection.go:321] error retrieving resource lock kube-system/kube-controller-manager: Get "https://172.21.0.3:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
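
The x509 error suggests the API server's serving certificate no longer verifies against the cluster CA that the controller-manager's kubeconfig trusts. One hedged way to check this from the host, assuming openssl is installed and the node IP 172.21.0.3 from the log above is reachable from the host (otherwise use the 127.0.0.1:<port> server address from the kind kubeconfig), and that the admin kubeconfig carries the same cluster CA:

$ kubectl config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d > /tmp/kind-ca.crt
$ echo | openssl s_client -connect 172.21.0.3:6443 -CAfile /tmp/kind-ca.crt 2>/dev/null | grep 'Verify return code'

A non-zero verify return code would confirm that the serving certificate and the CA in the kubeconfig no longer match.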

Is there any way to recover from this situation?

@BenTheElder
Member

That's not broken DNS resolution, but it is a different existing issue: #2045

hatzhang closed this as completed Mar 5, 2021