
flake: timeout reached waiting for service (echo-same-node or echo-other-node) #342

Closed
michi-covalent opened this issue Jun 18, 2021 · 18 comments
Labels
area/CI (Continuous Integration testing issue or flake) · ci/flake (tracking failing integration or unit tests) · stale

Comments

@michi-covalent
Contributor

michi-covalent commented Jun 18, 2021

flake instances

symptoms

cilium connectivity test times out waiting for the echo-same-node or echo-other-node service.

⌛ [cilium-cilium-cli-950610632] Waiting for deployments [client client2 echo-same-node] to become ready...
⌛ [cilium-cilium-cli-950610632] Waiting for deployments [echo-other-node] to become ready...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/client-7b7bf54b85-blm9q to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/client2-666976c95b-6zhqj to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/echo-other-node-697d5d69b7-2nnln to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/echo-same-node-7967996674-rt6hg to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for Service cilium-test/echo-other-node to become ready...

Error: Connectivity test failed: timeout reached waiting for &Service{ObjectMeta:{echo-other-node  cilium-test /api/v1/namespaces/cilium-test/services/echo-other-node fd2c3c95-c8b4-48be-b159-1bdfb2041050 1725 0 2021-06-18 17:38:45 +0000 UTC <nil> <nil> map[kind:echo] map[] [] []  [{cilium Update v1 2021-06-18 17:38:45 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:kind":{}}},"f:spec":{"f:externalTrafficPolicy":{},"f:ports":{".":{},"k:{\"port\":8080,\"protocol\":\"TCP\"}":{".":{},"f:name":{},"f:port":{},"f:protocol":{},"f:targetPort":{}}},"f:selector":{".":{},"f:name":{}},"f:sessionAffinity":{},"f:type":{}}}}]},Spec:ServiceSpec{Ports:[]ServicePort{ServicePort{Name:echo-other-node,Protocol:TCP,Port:8080,TargetPort:{0 8080 },NodePort:30597,AppProtocol:nil,},},Selector:map[string]string{name: echo-other-node,},ClusterIP:10.0.171.109,Type:NodePort,ExternalIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,ExternalTrafficPolicy:Cluster,HealthCheckNodePort:0,PublishNotReadyAddresses:false,SessionAffinityConfig:nil,IPFamily:nil,TopologyKeys:[],},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[]LoadBalancerIngress{},},},}
@nbusseneau
Member

michi-covalent added a commit that referenced this issue Jun 22, 2021
Print the number of times the validation command got executed before
timeout was reached. I just want to confirm that the command is not
getting stuck for a long time.

Ref: #342

Signed-off-by: Michi Mutsuzaki <[email protected]>
nbusseneau added the ci/flake label Sep 8, 2021
@gandro
Member

gandro commented Sep 15, 2021

Looking at the code that checks for the service, it seems to perform an nslookup on the service name. For some reason, that nslookup does not succeed.

A few thoughts about this:

  • Should we enable --debug to log the error returned by nslookup? Can we do that in a test PR or do we need to do this globally?
  • Can we collect coredns logs to check if coredns ever set up the service correctly?
  • Given that this is actually a DNS lookup failing, are we maybe hitting flake: allow-all/pod-to-world/http(s)-to-cilium-io fails #367 ?
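
For reference, the readiness check described above amounts to repeatedly exec'ing nslookup in a test pod until the name resolves or the overall context times out. A minimal Go sketch of that loop, assuming a hypothetical execInPod helper (this is not the actual cilium-cli code):

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// execInPod stands in for "kubectl exec <client-pod> -- <cmd...>"; in the
// real CLI this would go through the Kubernetes exec API.
type execInPod func(ctx context.Context, cmd ...string) error

// waitForServiceDNS retries "nslookup <service>" inside the client pod until
// the lookup succeeds or ctx expires, which would surface as a "timeout
// reached waiting for ..." error like the one reported above.
func waitForServiceDNS(ctx context.Context, exec execInPod, service string) error {
	var lastErr error
	for {
		if lastErr = exec(ctx, "nslookup", service); lastErr == nil {
			return nil // name resolves: service considered ready
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timeout reached waiting for %s: %w", service, lastErr)
		case <-time.After(time.Second):
			// retry until the service name resolves or we time out
		}
	}
}
```

In other words, a successful nslookup is taken as the signal that kube-dns can resolve the service, which is why a DNS problem surfaces here as a service-readiness timeout.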

@michi-covalent
Contributor Author

Should we enable --debug to log the error returned by nslookup?

for some reason nslookup -debug always exits with a non-zero return code 😐

Can we collect coredns logs to check if coredns ever set up the service correctly?

sure, should this be a part of sysdump?

Given that this is actually a DNS lookup failing, are we maybe hitting flake: allow-all/pod-to-world/http(s)-to-cilium-io fails #367 ?

yeah that's possible.

@gandro
Member

gandro commented Sep 15, 2021

Should we enable --debug to log the error returned by nslookup?

for some reason nslookup -debug always exits with a non-zero return code 😐

Ah, I was actually thinking about running cilium connectivity test --debug, as it will give us the contents of stderr of nslookup. But it will make the test output a bit more noisy.

sure, should this be a part of sysdump?

Probably. But I am a bit unsure about making this the default. While clearly useful in many contexts, coredns is not really Cilium related. Maybe add a --capture-logs option that collects logs for pods that match a certain label selector?
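
For illustration, a --capture-logs option along these lines could list pods by label selector and stream their logs via client-go. A minimal sketch under that assumption; captureLogs and its signature are hypothetical, not an existing cilium-cli API:

```go
package sketch

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// captureLogs streams the logs of every pod in namespace matching selector
// (e.g. "k8s-app=kube-dns" for coredns) into out.
func captureLogs(ctx context.Context, client kubernetes.Interface, namespace, selector string, out io.Writer) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		// Fetch the log stream for each matching pod and copy it to out.
		stream, err := client.CoreV1().Pods(namespace).GetLogs(pod.Name, &corev1.PodLogOptions{}).Stream(ctx)
		if err != nil {
			return err
		}
		_, copyErr := io.Copy(out, stream)
		stream.Close()
		if copyErr != nil {
			return copyErr
		}
	}
	return nil
}
```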

@tklauser
Member

Should we enable --debug to log the error returned by nslookup?

for some reason nslookup -debug always exits with a non-zero return code 😐

Ah, I was actually thinking about running cilium connectivity test --debug, as it will give us the contents of stderr of nslookup. But it will make the test output a bit more noisy.

I think it should already be the case that we run connectivity tests with --debug, see #409. Might have missed some jobs though 🤔

@gandro
Member

gandro commented Sep 15, 2021

I think it should already be the case that we run connectivity tests with --debug, see #409. Might have missed some jobs though 🤔

Ah! Yes, you are correct. I was hitting and debugging this issue in cilium/cilium, which does not have debug logs, but we do have some traces from cilium/cilium-cli with the error message. In that case, no action is needed regarding that point; we already have the stderr (from #342 (comment)):

2021-09-14T14:55:43.345571900Z 🐛 Error waiting for service cilium-test/echo-other-node: command terminated with exit code 1: ^C

Not sure where that ^C comes from 🤔

@gandro
Member

gandro commented Sep 16, 2021

Not sure where that ^C comes from

So it turns out this is the context cancellation (i.e. the timeout being hit, and we send a Ctrl+C to terminate nslookup). So apparently nslookup timed out because it never received an answer (i.e. no NXDOMAIN, which is what I would expect).
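
For illustration, the mechanism described here (cancelling the context and injecting a Ctrl+C into the exec session) might look roughly like the following sketch; sendCtrlCOnTimeout is a hypothetical helper, not the actual cilium-cli code:

```go
package sketch

import (
	"context"
	"io"
)

// sendCtrlCOnTimeout writes the ETX byte (0x03, what a terminal sends for
// Ctrl+C) to the stdin of a "kubectl exec"-style session once ctx is
// cancelled. With a TTY attached, the remote side interprets this as an
// interrupt, which would explain why the captured nslookup output above
// ends in "^C".
func sendCtrlCOnTimeout(ctx context.Context, stdin io.Writer) {
	<-ctx.Done()
	stdin.Write([]byte{0x03}) // shows up as "^C" in the captured output
}
```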

From the client pod, we see a bunch of unanswered DNS queries leave:

$ cat hubble-flows-cilium-4rl2f-20210914-145554.json| hubble observe --port 53                                 
Sep 14 14:55:23.443: cilium-test/client-6488dcf5d4-tb96w:50026 <> kube-system/kube-dns:53 from-endpoint FORWARDED (UDP)
Sep 14 14:55:23.443: cilium-test/client-6488dcf5d4-tb96w:50026 -> kube-system/kube-dns-b4f5c58c7-jkl9s:53 to-stack FORWARDED (UDP)
Sep 14 14:55:28.549: cilium-test/client-6488dcf5d4-tb96w:60098 <> kube-system/kube-dns:53 from-endpoint FORWARDED (UDP)
Sep 14 14:55:28.549: cilium-test/client-6488dcf5d4-tb96w:60098 -> kube-system/kube-dns-b4f5c58c7-jkl9s:53 to-stack FORWARDED (UDP)
Sep 14 14:55:33.549: cilium-test/client-6488dcf5d4-tb96w:60098 <> kube-system/kube-dns:53 from-endpoint FORWARDED (UDP)
Sep 14 14:55:33.549: cilium-test/client-6488dcf5d4-tb96w:60098 -> kube-system/kube-dns-b4f5c58c7-jkl9s:53 to-stack FORWARDED (UDP)
Sep 14 14:55:38.549: cilium-test/client-6488dcf5d4-tb96w:60098 <> kube-system/kube-dns:53 from-endpoint FORWARDED (UDP)
Sep 14 14:55:38.549: cilium-test/client-6488dcf5d4-tb96w:60098 -> kube-system/kube-dns-b4f5c58c7-jkl9s:53 to-stack FORWARDED (UDP)

However, I don't see any of these DNS packets arrive on the other node (where kube-system/kube-dns-b4f5c58c7-jkl9s is hosted). So it seems like the DNS packets are getting dropped by the stack? Or we are missing some flow logs.

@nbusseneau
Member

We've just re-enabled the AKS workflow over at cilium/cilium, and this flake seems to be hit very frequently when starting a second connectivity test after a first test followed by an uninstall/reinstall with encryption enabled:

@gandro
Member

gandro commented Oct 14, 2021

One thing to point out is that this bug also seems to occur without a re-installation. Here it occurs just after cilium hubble enable, which does not reconfigure the datapath, but does restart the Cilium pods: #342 (comment)

@nbusseneau
Member

nbusseneau commented Oct 18, 2021

I tried to restart CoreDNS pods right after the second cilium install (with encryption enabled) as a mitigation test: cilium/cilium#17619
But still hitting the issue almost consistently: https://github.com/cilium/cilium/actions/runs/1346424261
Am I doing something wrong, or was our hypothesis that CoreDNS pods needed restarting incorrect?

@gandro
Member

gandro commented Oct 18, 2021

I tried to mention this during the last community meeting, but my mic quality was bad: given that we don't even see the DNS requests hitting the target node, I think the CoreDNS hypothesis does not apply to this flake.

I think the discussion related to restarting CoreDNS only applies to cilium/cilium#17401 - there DNS requests are hitting the target CoreDNS pod, but CoreDNS does not know about the service yet.

I think these are two separate issues; even though the symptoms (K8s service not found) are very similar, we should not mix them up:

  • Symptom here in #342: DNS lookup for the service fails due to a timeout (no answer)
  • Symptom in cilium/cilium#17401: DNS lookup for the service fails with NXDOMAIN

nbusseneau added a commit to cilium/cilium that referenced this issue Oct 19, 2021
cilium/cilium-cli#342 is hit almost
consistently on AKS when running the second `cilium connectivity test`
with encryption enabled.

We disable AKS testing with encryption enabled until it is fixed.

Signed-off-by: Nicolas Busseneau <[email protected]>
christarazi added a commit that referenced this issue Jan 28, 2022
It is possible for the tuples of node IP and port to be mismatched in
the case of NodePort services, causing the connectivity test to try to
establish a connection to a non-existent tuple.

For example, see the following output:

```
⌛ [gke_cilium-dev_us-west2-a_chris] Waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node) to become ready...
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 130:
Connectivity test failed: timeout reached waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node)
```

Nodes:

```
$ k get nodes -o wide
NAME                                   INTERNAL-IP
gke-chris-default-pool-1602ae11-bn2n   10.168.0.14
gke-chris-default-pool-1602ae11-ffsh   10.168.0.3
```

Cilium pods:

```
$ ks get pods -o wide | rg cilium
cilium-7sq59   10.168.0.14   gke-chris-default-pool-1602ae11-bn2n
cilium-mbvxl   10.168.0.3    gke-chris-default-pool-1602ae11-ffsh
```

Services:

```
$ k -n cilium-test get svc
NAME              TYPE       CLUSTER-IP    PORT(S)
echo-other-node   NodePort   10.28.29.66   8080:30774/TCP
echo-same-node    NodePort   10.28.23.18   8080:32186/TCP
```

Echo pods:

```
$ k -n cilium-test get pods -o wide
NAME                              READY   STATUS    IP            NODE
client-6488dcf5d4-bxlcp           1/1     Running   10.32.1.176   gke-chris-default-pool-1602ae11-bn2n
client2-5998d566b4-lgxrt          1/1     Running   10.32.1.191   gke-chris-default-pool-1602ae11-bn2n
echo-other-node-f4d46f75b-rgzbk   1/1     Running   10.32.0.11    gke-chris-default-pool-1602ae11-ffsh
echo-same-node-745bd5c77-mxwp7    1/1     Running   10.32.1.63    gke-chris-default-pool-1602ae11-bn2n
```

If we take the pod "echo-other-node-f4d46f75b-rgzbk", it resides on node
"gke-chris-default-pool-1602ae11-ffsh", which has a node IP of 10.168.0.3.
However, if we look at the CLI output, it is trying to establish a
connection to the other node's IP, 10.168.0.14, which is obviously wrong.

Fix this by checking if the echo pod resides on the same node as the
node for the service.

Fixes: #342

Signed-off-by: Chris Tarazi <[email protected]>
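
For illustration, the fix described above boils down to deriving the NodePort address from the node that actually hosts the echo pod, rather than from an arbitrary node. A minimal sketch using client-go types; nodePortAddrForPod is a hypothetical helper, not the actual cilium-cli code:

```go
package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// nodePortAddrForPod returns the <nodeIP>:<nodePort> tuple to probe for a
// NodePort service, using the node that actually hosts the echo pod instead
// of an arbitrary node IP.
func nodePortAddrForPod(pod corev1.Pod, nodes []corev1.Node, nodePort int32) (string, error) {
	for _, node := range nodes {
		if node.Name != pod.Spec.NodeName {
			continue // skip nodes that don't host the echo pod
		}
		for _, addr := range node.Status.Addresses {
			if addr.Type == corev1.NodeInternalIP {
				return fmt.Sprintf("%s:%d", addr.Address, nodePort), nil
			}
		}
	}
	return "", fmt.Errorf("no internal IP found for node %q of pod %s", pod.Spec.NodeName, pod.Name)
}
```

With the example cluster above, this kind of matching would yield 10.168.0.3:30774 for echo-other-node-f4d46f75b-rgzbk instead of 10.168.0.14:30774.
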
@christarazi
Member

I incorrectly thought #695 would fix this flake. I don't see us hitting this flake much anymore though. Could we close this issue?

cc @nbusseneau

@nbusseneau
Member

This flake was happening a lot on AKS, but AKS testing has been disabled due to probable changes on AKS's side, and nobody has had time to look at it yet 😬

@jinsiang2008

+1 on this. We hit these issues a lot when we run the connectivity tests. Maybe we could standardize the tests and refine the outputs/indicators to show what the problem is, or suggest potential fixes. Thanks.

@Rammurthy5

Hiya, I face the same issue on Cilium 1.14.4 running on Kubernetes 1.28, installed via Helm and VPC peered.
Has this been fixed? Does anyone know of a workaround?


This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

github-actions bot added the stale label Sep 28, 2024

This issue has not seen any activity since it was marked stale.
Closing.

github-actions bot closed this as not planned Oct 13, 2024