flake: timeout reached waiting for service (echo-same-node or echo-other-node) #342
Comments
I've hit this in #336 as well: https://github.com/cilium/cilium-cli/actions/runs/950610632 |
Print the number of times the validation command got executed before timeout was reached. I just want to confirm that the command is not getting stuck for a long time. Ref: #342 Signed-off-by: Michi Mutsuzaki <[email protected]>
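For reference, here is a minimal sketch (Go) of the kind of wait loop and attempt counter the commit above describes. The function names, signature, and polling interval are illustrative assumptions, not the actual cilium-cli implementation:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForService polls a validation probe (e.g. an nslookup run inside the
// client pod) until it succeeds or the context expires. It counts attempts so
// the timeout error reports how many times the probe actually ran.
// Hypothetical sketch: names and signature do not match the real CLI code.
func waitForService(ctx context.Context, probe func(context.Context) error) error {
	attempts := 0
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		attempts++
		if err := probe(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timeout reached waiting for service after %d attempts: %w", attempts, ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// A probe that always fails, just to show the attempt count in the error.
	err := waitForService(ctx, func(context.Context) error { return errors.New("not ready yet") })
	fmt.Println(err)
}
```

The point is simply that the attempt count ends up in the timeout error, so we can tell whether the probe ran many times and kept failing, or got stuck on a single slow invocation.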
Looking at the code that checks for the service, it seems to perform an `nslookup` against it. A few thoughts about this:
|
for some reason
sure, should this be a part of sysdump?
yeah that's possible. |
Ah, I was actually thinking about running
Probably. But I am a bit unsure about making this the default. While clearly useful in many contexts, CoreDNS is not really Cilium related. Maybe add a |
I think it should already be the case that we run connectivity tests with |
Ah! Yes, you are correct. I was hitting and debugging this issue in
Not sure where that |
So turns out this is the context cancellation (i.e. the timeout being hit, and we send a CTRL+C to terminate nslookup). So apparently nslookup timed out because it never received an answer (i.e. no NXDOMAIN, which I would expect). From the client pod, we see a bunch of unanswered DNS queries leave:

```
$ cat hubble-flows-cilium-4rl2f-20210914-145554.json | hubble observe --port 53
Sep 14 14:55:23.443: cilium-test/client-6488dcf5d4-tb96w:50026 <> kube-system/kube-dns:53 from-endpoint FORWARDED (UDP)
Sep 14 14:55:23.443: cilium-test/client-6488dcf5d4-tb96w:50026 -> kube-system/kube-dns-b4f5c58c7-jkl9s:53 to-stack FORWARDED (UDP)
Sep 14 14:55:28.549: cilium-test/client-6488dcf5d4-tb96w:60098 <> kube-system/kube-dns:53 from-endpoint FORWARDED (UDP)
Sep 14 14:55:28.549: cilium-test/client-6488dcf5d4-tb96w:60098 -> kube-system/kube-dns-b4f5c58c7-jkl9s:53 to-stack FORWARDED (UDP)
Sep 14 14:55:33.549: cilium-test/client-6488dcf5d4-tb96w:60098 <> kube-system/kube-dns:53 from-endpoint FORWARDED (UDP)
Sep 14 14:55:33.549: cilium-test/client-6488dcf5d4-tb96w:60098 -> kube-system/kube-dns-b4f5c58c7-jkl9s:53 to-stack FORWARDED (UDP)
Sep 14 14:55:38.549: cilium-test/client-6488dcf5d4-tb96w:60098 <> kube-system/kube-dns:53 from-endpoint FORWARDED (UDP)
Sep 14 14:55:38.549: cilium-test/client-6488dcf5d4-tb96w:60098 -> kube-system/kube-dns-b4f5c58c7-jkl9s:53 to-stack FORWARDED (UDP)
```

I don't see any of these DNS packets arrive on the other node (where |
We've just re-enabled the AKS workflow over at |
One thing to point out is that this bug also seems to occur without a re-installation. Here it occurs just after |
I tried to restart CoreDNS pods right after the second |
I tried to mention this during the last community meeting, but my mic quality was bad. Given that we don't even see the DNS requests hitting the target node, I think the CoreDNS hypothesis does not apply to this flake here. I think the discussion related to restarting CoreDNS only applies to cilium/cilium#17401 - there DNS requests are hitting the target CoreDNS pod, but CoreDNS does not know about the service yet. These are two separate issues, and even though the symptoms (K8s service not found) are very similar, we should not mix them up. Symptom here in #342: DNS lookup for the service fails due to timeout (no answer). |
cilium/cilium-cli#342 is hit almost consistently on AKS when running the second `cilium connectivity test` with encryption enabled. We disable AKS testing with encryption enabled until it is fixed. Signed-off-by: Nicolas Busseneau <[email protected]>
It is possible for the tuples of node IP and port to be mismatched in the case of NodePort services, causing the connectivity test to try to establish a connection to a non-existent tuple. For example, see the following output:

```
⌛ [gke_cilium-dev_us-west2-a_chris] Waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node) to become ready...
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 130:
Connectivity test failed: timeout reached waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node)
```

Nodes:

```
$ k get nodes -o wide
NAME                                   INTERNAL-IP
gke-chris-default-pool-1602ae11-bn2n   10.168.0.14
gke-chris-default-pool-1602ae11-ffsh   10.168.0.3
```

Cilium pods:

```
$ ks get pods -o wide | rg cilium
cilium-7sq59   10.168.0.14   gke-chris-default-pool-1602ae11-bn2n
cilium-mbvxl   10.168.0.3    gke-chris-default-pool-1602ae11-ffsh
```

Services:

```
$ k -n cilium-test get svc
NAME              TYPE       CLUSTER-IP    PORT(S)
echo-other-node   NodePort   10.28.29.66   8080:30774/TCP
echo-same-node    NodePort   10.28.23.18   8080:32186/TCP
```

Echo pods:

```
$ k -n cilium-test get pods -o wide
NAME                              READY   STATUS    IP            NODE
client-6488dcf5d4-bxlcp           1/1     Running   10.32.1.176   gke-chris-default-pool-1602ae11-bn2n
client2-5998d566b4-lgxrt          1/1     Running   10.32.1.191   gke-chris-default-pool-1602ae11-bn2n
echo-other-node-f4d46f75b-rgzbk   1/1     Running   10.32.0.11    gke-chris-default-pool-1602ae11-ffsh
echo-same-node-745bd5c77-mxwp7    1/1     Running   10.32.1.63    gke-chris-default-pool-1602ae11-bn2n
```

If we take the pod "echo-other-node-f4d46f75b-rgzbk", it resides on node "gke-chris-default-pool-1602ae11-ffsh", which has a node IP of 10.168.0.3. However, if we look at the CLI output, it is trying to establish a connection to the other node's IP, 10.168.0.14, which is obviously wrong.

Fix this by checking if the echo pod resides on the same node as the node for the service.

Fixes: #342

Signed-off-by: Chris Tarazi <[email protected]>
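To illustrate the fix described in the commit message above, here is a rough sketch (Go) of picking the NodePort target from the node that actually hosts the echo pod. The types and helper here are simplified stand-ins for the client-go objects the real CLI uses, so treat this as an assumption-laden sketch rather than the actual patch:

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes objects involved; in the real CLI
// these would come from client-go (corev1.Pod, corev1.Node, corev1.Service).
type Pod struct {
	Name     string
	NodeName string
}

type Node struct {
	Name       string
	InternalIP string
}

// nodePortTarget returns the host IP and NodePort to probe for a given echo
// pod, choosing the node the pod is actually scheduled on rather than an
// arbitrary node, so the wait loop does not probe a mismatched tuple.
func nodePortTarget(pod Pod, nodes []Node, nodePort int32) (string, error) {
	for _, n := range nodes {
		if n.Name == pod.NodeName {
			return fmt.Sprintf("%s:%d", n.InternalIP, nodePort), nil
		}
	}
	return "", fmt.Errorf("no node found for pod %s (node %s)", pod.Name, pod.NodeName)
}

func main() {
	nodes := []Node{
		{Name: "gke-chris-default-pool-1602ae11-bn2n", InternalIP: "10.168.0.14"},
		{Name: "gke-chris-default-pool-1602ae11-ffsh", InternalIP: "10.168.0.3"},
	}
	echoOtherNode := Pod{Name: "echo-other-node-f4d46f75b-rgzbk", NodeName: "gke-chris-default-pool-1602ae11-ffsh"}

	target, err := nodePortTarget(echoOtherNode, nodes, 30774)
	if err != nil {
		panic(err)
	}
	fmt.Println(target) // 10.168.0.3:30774, not 10.168.0.14:30774
}
```

With the pod and node data from the output above, this resolves echo-other-node to 10.168.0.3:30774 instead of the mismatched 10.168.0.14:30774.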
I incorrectly thought #695 would fix this flake. I don't see us hitting this flake much anymore though. Could we close this issue? cc @nbusseneau |
This flake was happening a lot on AKS, but AKS testing has been disabled due to probable changes on AKS' side and nobody had time to look at it yet 😬 |
+1 on this, we hit these issues a lot when we run the connectivity tests. Maybe we could standardize the tests and refine the outputs/indicators to show what the problem is, or suggest potential fixes. Thanks. |
Hiya, I face the same issue on Cilium 1.14.4 running on Kubernetes 1.28. Installed via Helm, and VPC peered. |
This issue has been automatically marked as stale because it has not had recent activity.
This issue has not seen any activity since it was marked stale. |
flake instances
symptoms
`cilium connectivity test` times out waiting for the echo-same-node or echo-other-node service.