My cluster runs into DNS resolution failures after it has been up for a while.
I am testing fluxcd with Kind. Inside the cluster I have my application, a RabbitMQ cluster defined via the RabbitMQ cluster operator, and fluxcd's CRDs; I run several Kind clusters on the same Arch Linux host. When a cluster first comes up, everything works fine. But checking again a day later, the rabbitmq pods have crashed hundreds of times due to DNS failures resolving rabbitmq-nodes.
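For context, each cluster is set up roughly like this (a sketch; the cluster name is illustrative and the exact flux/operator install method may differ from what I actually ran):
$ kind create cluster --name dev
$ flux install
$ kubectl apply -f "https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml"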
$ k get crd
NAME CREATED AT
alerts.notification.toolkit.fluxcd.io 2021-03-01T10:13:52Z
buckets.source.toolkit.fluxcd.io 2021-03-01T10:13:52Z
gitrepositories.source.toolkit.fluxcd.io 2021-03-01T10:13:52Z
helmcharts.source.toolkit.fluxcd.io 2021-03-01T10:13:52Z
helmreleases.helm.toolkit.fluxcd.io 2021-03-01T10:13:52Z
helmrepositories.source.toolkit.fluxcd.io 2021-03-01T10:13:52Z
imagepolicies.image.toolkit.fluxcd.io 2021-03-01T10:13:52Z
imagerepositories.image.toolkit.fluxcd.io 2021-03-01T10:13:52Z
imageupdateautomations.image.toolkit.fluxcd.io 2021-03-01T10:13:52Z
kustomizations.kustomize.toolkit.fluxcd.io 2021-03-01T10:13:52Z
providers.notification.toolkit.fluxcd.io 2021-03-01T10:13:52Z
rabbitmqclusters.rabbitmq.com 2021-03-01T10:15:51Z
receivers.notification.toolkit.fluxcd.io 2021-03-01T10:13:52Z
Running nslookup now, a lookup occasionally succeeds, but most of the time it fails.
$ k get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 44h
rabbitmq ClusterIP 10.96.227.42 <none> 5672/TCP,15672/TCP 43h
rabbitmq-nodes ClusterIP None <none> 4369/TCP,25672/TCP 43h
# nslookup rabbitmq
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: rabbitmq.default.svc.cluster.local
Address: 10.96.227.42
# nslookup rabbitmq
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: rabbitmq.default.svc.cluster.local
Address: 10.96.227.42
;; connection timed out; no servers could be reached
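(The lookups above are run from inside a pod in the default namespace; roughly the same checks can be reproduced with the dnsutils pod from the Kubernetes DNS debugging guide, e.g.:
$ kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
$ kubectl exec -i -t dnsutils -- nslookup kubernetes.default
)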
The following output just follows the dns-debugging-resolution guide; I am not quite sure what exactly the problem is.
# nslookup kubernetes.default
;; connection timed out; no servers could be reached
# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
$ kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
coredns-f9fd979d6-7h5lm 1/1 Running 3 44h
coredns-f9fd979d6-qrgsb 1/1 Running 3 44h
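Both coredns pods show 3 restarts; something like the following should reveal why they restarted (using one of the pod names above):
$ kubectl -n kube-system logs coredns-f9fd979d6-7h5lm --previous
$ kubectl -n kube-system describe pod coredns-f9fd979d6-7h5lm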
$ kubectl get svc --namespace=kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 44h
$ kubectl get endpoints kube-dns --namespace=kube-system
NAME ENDPOINTS AGE
kube-dns 10.244.0.2:53,10.244.0.4:53,10.244.0.2:9153 + 3 more... 44h
$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns
[INFO] plugin/ready: Still waiting on: "kubernetes"
I0302 00:27:05.332935 1 trace.go:116] Trace[1427131847]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125 (started: 2021-03-02 00:26:35.332163608 +0000 UTC m=+0.046317506) (total time: 30.000676s):
Trace[1427131847]: [30.000676s] [30.000676s] END
E0302 00:27:05.332955 1 reflector.go:178] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125: Failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I0302 00:27:05.332974 1 trace.go:116] Trace[939984059]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125 (started: 2021-03-02 00:26:35.332282908 +0000 UTC m=+0.046436806) (total time: 30.0005644s):
Trace[939984059]: [30.0005644s] [30.0005644s] END
E0302 00:27:05.332979 1 reflector.go:178] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125: Failed to list *v1.Endpoints: Get "https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
I0302 00:27:05.332988 1 trace.go:116] Trace[911902081]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125 (started: 2021-03-02 00:26:35.332296408 +0000 UTC m=+0.046450306) (total time: 30.0005879s):
Trace[911902081]: [30.0005879s] [30.0005879s] END
E0302 00:27:05.332991 1 reflector.go:178] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125: Failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[ERROR] plugin/errors: 2 notification-controller. A: read udp 10.244.0.2:58912->172.21.0.1:53: i/o timeout
[ERROR] plugin/errors: 2 notification-controller. A: read udp 10.244.0.2:34836->172.21.0.1:53: i/o timeout
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] 127.0.0.1:49268 - 16177 "HINFO IN 680491325768953042.3677157172776353241. udp 56 false 512" NXDOMAIN qr,rd,ra 131 0.009510801s
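Given the i/o timeouts towards both the API server (10.96.0.1:443) and the upstream resolver (172.21.0.1:53), these are the checks I am considering next (a sketch; <kind-control-plane> is a placeholder for the control-plane node container name):
$ kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50       # rule out broken service NAT
$ docker exec <kind-control-plane> cat /etc/resolv.conf             # check the node's upstream resolver (the logs show coredns forwarding to 172.21.0.1)
$ kubectl -n kube-system rollout restart deployment coredns         # see whether fresh coredns pods recover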
So what could be the reason, and what should be my next move to resolve this?
Thanks.
Not exactly. The coredns pods are running well.
My Arch Linux host is provisioned on Windows Hyper-V, and the physical machine was rebooted accidentally; I am not sure if this is related.
There are also some errors showing up in kube-controller-manager:
E0303 19:57:04.214172 1 leaderelection.go:321] error retrieving resource lock kube-system/kube-controller-manager: Get "https://172.21.0.3:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
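To check whether the certificates are really at fault, I could compare the CA in my kubeconfig with the one on the control-plane node, roughly like this (<kind-control-plane> is a placeholder for the node container name):
$ kubectl config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d | openssl x509 -noout -fingerprint
$ docker cp <kind-control-plane>:/etc/kubernetes/pki/ca.crt /tmp/ca.crt && openssl x509 -in /tmp/ca.crt -noout -fingerprint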