
flake: timeout reached waiting for service (echo-same-node or echo-other-node) #342

Closed
michi-covalent opened this issue Jun 18, 2021 · 18 comments
Labels
area/CI (Continuous Integration testing issue or flake) · ci/flake (tracking failing integration or unit tests) · stale

Comments

@michi-covalent
Contributor

michi-covalent commented Jun 18, 2021

flake instances

symptoms

cilium connectivity test times out waiting for the echo-same-node or echo-other-node service.

⌛ [cilium-cilium-cli-950610632] Waiting for deployments [client client2 echo-same-node] to become ready...
⌛ [cilium-cilium-cli-950610632] Waiting for deployments [echo-other-node] to become ready...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/client-7b7bf54b85-blm9q to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/client2-666976c95b-6zhqj to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/echo-other-node-697d5d69b7-2nnln to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for CiliumEndpoint for pod cilium-test/echo-same-node-7967996674-rt6hg to appear...
⌛ [cilium-cilium-cli-950610632] Waiting for Service cilium-test/echo-other-node to become ready...

Error: Connectivity test failed: timeout reached waiting for &Service{ObjectMeta:{echo-other-node  cilium-test /api/v1/namespaces/cilium-test/services/echo-other-node fd2c3c95-c8b4-48be-b159-1bdfb2041050 1725 0 2021-06-18 17:38:45 +0000 UTC <nil> <nil> map[kind:echo] map[] [] []  [{cilium Update v1 2021-06-18 17:38:45 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:kind":{}}},"f:spec":{"f:externalTrafficPolicy":{},"f:ports":{".":{},"k:{\"port\":8080,\"protocol\":\"TCP\"}":{".":{},"f:name":{},"f:port":{},"f:protocol":{},"f:targetPort":{}}},"f:selector":{".":{},"f:name":{}},"f:sessionAffinity":{},"f:type":{}}}}]},Spec:ServiceSpec{Ports:[]ServicePort{ServicePort{Name:echo-other-node,Protocol:TCP,Port:8080,TargetPort:{0 8080 },NodePort:30597,AppProtocol:nil,},},Selector:map[string]string{name: echo-other-node,},ClusterIP:10.0.171.109,Type:NodePort,ExternalIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,ExternalTrafficPolicy:Cluster,HealthCheckNodePort:0,PublishNotReadyAddresses:false,SessionAffinityConfig:nil,IPFamily:nil,TopologyKeys:[],},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[]LoadBalancerIngress{},},},}
@nbusseneau
Member

michi-covalent added a commit that referenced this issue Jun 22, 2021
Print the number of times the validation command got executed before
timeout was reached. I just want to confirm that the command is not
getting stuck for a long time.

Ref: #342

Signed-off-by: Michi Mutsuzaki <[email protected]>
nbusseneau added the ci/flake label Sep 8, 2021
@gandro
Member

gandro commented Sep 15, 2021

Looking at the code that checks for the service, it seems to perform an nslookup on the service name. For some reason, that nslookup does not succeed.

A few thoughts about this:

  • Should we enable --debug to log the error returned by nslookup? Can we do that in a test PR or do we need to do this globally?
  • Can we collect coredns logs to check if coredns ever set up the service correctly?
  • Given that this is actually a DNS lookup failing, are we maybe hitting flake: allow-all/pod-to-world/http(s)-to-cilium-io fails #367 ?
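
For reference, the readiness check described above amounts to repeatedly exec'ing nslookup in a test pod until the name resolves or the overall context times out. A minimal Go sketch of that loop, assuming a hypothetical execInPod helper (this is not the actual cilium-cli code):

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// execInPod stands in for "kubectl exec <client-pod> -- <cmd...>"; in the
// real CLI this would go through the Kubernetes exec API.
type execInPod func(ctx context.Context, cmd ...string) error

// waitForServiceDNS retries "nslookup <service>" inside the client pod until
// the lookup succeeds or ctx expires, which would surface as a "timeout
// reached waiting for ..." error like the one reported above.
func waitForServiceDNS(ctx context.Context, exec execInPod, service string) error {
	var lastErr error
	for {
		if lastErr = exec(ctx, "nslookup", service); lastErr == nil {
			return nil // name resolves: service considered ready
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timeout reached waiting for %s: %w", service, lastErr)
		case <-time.After(time.Second):
			// retry until the service name resolves or we time out
		}
	}
}
```

In other words, a successful nslookup is taken as the signal that kube-dns can resolve the service, which is why a DNS problem surfaces here as a service-readiness timeout.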

@michi-covalent
Contributor Author

Should we enable --debug to log the error returned by nslookup?

for some reason nslookup -debug always exits with a non-zero return code 😐

Can we collect coredns logs to check if coredns ever set up the service correctly?

sure, should this be a part of sysdump?

Given that this is actually a DNS lookup failing, are we maybe hitting flake: allow-all/pod-to-world/http(s)-to-cilium-io fails #367 ?

yeah that's possible.

@gandro
Member

gandro commented Sep 15, 2021

Should we enable --debug to log the error returned by nslookup?

for some reason nslookup -debug always exits with a non-zero return code 😐

Ah, I was actually thinking about running cilium connectivity test --debug, as it will give us the contents of stderr of nslookup. But it will make the test output a bit more noisy.

sure, should this be a part of sysdump?

Probably. But I am a bit unsure about making this the default. While clearly useful in many contexts, coredns is not really Cilium related. Maybe add a --capture-logs option that collects logs for pods that match a certain label selector?
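
For illustration, a --capture-logs option along these lines could list pods by label selector and stream their logs via client-go. A minimal sketch under that assumption; captureLogs and its signature are hypothetical, not an existing cilium-cli API:

```go
package sketch

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// captureLogs streams the logs of every pod in namespace matching selector
// (e.g. "k8s-app=kube-dns" for coredns) into out.
func captureLogs(ctx context.Context, client kubernetes.Interface, namespace, selector string, out io.Writer) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		// Fetch the log stream for each matching pod and copy it to out.
		stream, err := client.CoreV1().Pods(namespace).GetLogs(pod.Name, &corev1.PodLogOptions{}).Stream(ctx)
		if err != nil {
			return err
		}
		_, copyErr := io.Copy(out, stream)
		stream.Close()
		if copyErr != nil {
			return copyErr
		}
	}
	return nil
}
```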

@tklauser
Member

Should we enable --debug to log the error returned by nslookup?

for some reason nslookup -debug always exits with a non-zero return code 😐

Ah, I was actually thinking about running cilium connectivity test --debug, as it will give us the contents of stderr of nslookup. But it will make the test output a bit more noisy.

I think it should already be the case that we run connectivity tests with --debug, see #409. Might have missed some jobs though 🤔

@gandro
Member

gandro commented Sep 15, 2021

I think it should already be the case that we run connectivity tests with --debug, see #409. Might have missed some jobs though 🤔

Ah! Yes, you are correct. I was hitting and debugging this issue in cilium/cilium, which does not have debug logs, but we do have some traces from cilium/cilium-cli with the error message. In that case, no action is needed regarding that point; we already have the stderr (from #342 (comment)):

2021-09-14T14:55:43.345571900Z 🐛 Error waiting for service cilium-test/echo-other-node: command terminated with exit code 1: ^C

Not sure where that ^C comes from 🤔

@gandro
Member

gandro commented Sep 16, 2021

Not sure where that ^C comes from

So it turns out this is the context cancellation (i.e. the timeout being hit, and we send a Ctrl+C to terminate nslookup). So apparently nslookup timed out because it never received an answer (i.e. no NXDOMAIN, which is what I would expect).
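
For illustration, the mechanism described here (cancelling the context and injecting a Ctrl+C into the exec session) might look roughly like the following sketch; sendCtrlCOnTimeout is a hypothetical helper, not the actual cilium-cli code:

```go
package sketch

import (
	"context"
	"io"
)

// sendCtrlCOnTimeout writes the ETX byte (0x03, what a terminal sends for
// Ctrl+C) to the stdin of a "kubectl exec"-style session once ctx is
// cancelled. With a TTY attached, the remote side interprets this as an
// interrupt, which would explain why the captured nslookup output above
// ends in "^C".
func sendCtrlCOnTimeout(ctx context.Context, stdin io.Writer) {
	<-ctx.Done()
	stdin.Write([]byte{0x03}) // shows up as "^C" in the captured output
}
```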

From the client pod, we see a bunch of unanswered DNS queries leave:

$ cat hubble-flows-cilium-4rl2f-20210914-145554.json| hubble observe --port 53                                 
Sep 14 14:55:23.443: cilium-test/client-6488dcf5d4-tb96w:50026 <> kube-system/kube-dns:53 from-endpoint FORWARDED (UDP)
Sep 14 14:55:23.443: cilium-test/client-6488dcf5d4-tb96w:50026 -> kube-system/kube-dns-b4f5c58c7-jkl9s:53 to-stack FORWARDED (UDP)
Sep 14 14:55:28.549: cilium-test/client-6488dcf5d4-tb96w:60098 <> kube-system/kube-dns:53 from-endpoint FORWARDED (UDP)
Sep 14 14:55:28.549: cilium-test/client-6488dcf5d4-tb96w:60098 -> kube-system/kube-dns-b4f5c58c7-jkl9s:53 to-stack FORWARDED (UDP)
Sep 14 14:55:33.549: cilium-test/client-6488dcf5d4-tb96w:60098 <> kube-system/kube-dns:53 from-endpoint FORWARDED (UDP)
Sep 14 14:55:33.549: cilium-test/client-6488dcf5d4-tb96w:60098 -> kube-system/kube-dns-b4f5c58c7-jkl9s:53 to-stack FORWARDED (UDP)
Sep 14 14:55:38.549: cilium-test/client-6488dcf5d4-tb96w:60098 <> kube-system/kube-dns:53 from-endpoint FORWARDED (UDP)
Sep 14 14:55:38.549: cilium-test/client-6488dcf5d4-tb96w:60098 -> kube-system/kube-dns-b4f5c58c7-jkl9s:53 to-stack FORWARDED (UDP)

However, I don't see any of these DNS packets arrive on the other node (where kube-system/kube-dns-b4f5c58c7-jkl9s is hosted). So it seems like the DNS packets are getting dropped by the stack? Or we are missing some flow logs.

@nbusseneau
Member

We've just re-enabled the AKS workflow over at cilium/cilium, and this flake seems to be hit very frequently when starting a second connectivity test after a first test followed by an uninstall/reinstall with encryption enabled:

@gandro
Member

gandro commented Oct 14, 2021

One thing to point out is that this bug also seems to occur without a re-installation. Here it occurs just after cilium hubble enable, which does not reconfigure the datapath, but does restart the Cilium pods: #342 (comment)

@nbusseneau
Member

nbusseneau commented Oct 18, 2021

I tried to restart CoreDNS pods right after the second cilium install (with encryption enabled) as a mitigation test: cilium/cilium#17619
But still hitting the issue almost consistently: https://github.com/cilium/cilium/actions/runs/1346424261
Am I doing something wrong, or was our hypothesis that CoreDNS pods needed restarting incorrect?

@gandro
Member

gandro commented Oct 18, 2021

I tried to mention this during the last community meeting, but my mic quality was bad: given that we don't even see the DNS requests hitting the target node, I think the CoreDNS hypothesis does not apply to this flake.

I think the discussion related to restarting CoreDNS only applies to cilium/cilium#17401 - there DNS requests are hitting the target CoreDNS pod, but CoreDNS does not know about the service yet.

I think these are two separate issues; even though the symptoms (K8s service not found) are very similar, we should not mix them up:

  • Symptom here in #342: DNS lookup for the service fails due to a timeout (no answer)
  • Symptom in cilium/cilium#17401: DNS lookup for the service fails with NXDOMAIN

nbusseneau added a commit to cilium/cilium that referenced this issue Oct 19, 2021
cilium/cilium-cli#342 is hit almost
consistently on AKS when running the second `cilium connectivity test`
with encryption enabled.

We disable AKS testing with encryption enabled until it is fixed.

Signed-off-by: Nicolas Busseneau <[email protected]>
christarazi added a commit that referenced this issue Jan 28, 2022
It is possible for the tuples of node IP and port to be mismatched in
the case of NodePort services, causing the connectivity test to try to
establish a connection to a non-existent tuple.

For example, see the following output:

```
⌛ [gke_cilium-dev_us-west2-a_chris] Waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node) to become ready...
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 130:
Connectivity test failed: timeout reached waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node)
```

Nodes:

```
$ k get nodes -o wide
NAME                                   INTERNAL-IP
gke-chris-default-pool-1602ae11-bn2n   10.168.0.14
gke-chris-default-pool-1602ae11-ffsh   10.168.0.3
```

Cilium pods:

```
$ ks get pods -o wide | rg cilium
cilium-7sq59   10.168.0.14   gke-chris-default-pool-1602ae11-bn2n
cilium-mbvxl   10.168.0.3    gke-chris-default-pool-1602ae11-ffsh
```

Services:

```
$ k -n cilium-test get svc
NAME              TYPE       CLUSTER-IP    PORT(S)
echo-other-node   NodePort   10.28.29.66   8080:30774/TCP
echo-same-node    NodePort   10.28.23.18   8080:32186/TCP
```

Echo pods:

```
$ k -n cilium-test get pods -o wide
NAME                              READY   STATUS    IP            NODE
client-6488dcf5d4-bxlcp           1/1     Running   10.32.1.176   gke-chris-default-pool-1602ae11-bn2n
client2-5998d566b4-lgxrt          1/1     Running   10.32.1.191   gke-chris-default-pool-1602ae11-bn2n
echo-other-node-f4d46f75b-rgzbk   1/1     Running   10.32.0.11    gke-chris-default-pool-1602ae11-ffsh
echo-same-node-745bd5c77-mxwp7    1/1     Running   10.32.1.63    gke-chris-default-pool-1602ae11-bn2n
```

If we take the pod "echo-other-node-f4d46f75b-rgzbk", it resides on node
"gke-chris-default-pool-1602ae11-ffsh", which has a node IP of 10.168.0.3.
However, if we look at the CLI output, it is trying to establish a
connection to the other node's IP, 10.168.0.14, which is obviously wrong.

Fix this by checking if the echo pod resides on the same node as the
node for the service.

Fixes: #342

Signed-off-by: Chris Tarazi <[email protected]>
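
For illustration, the fix described above boils down to deriving the NodePort address from the node that actually hosts the echo pod, rather than from an arbitrary node. A minimal sketch using client-go types; nodePortAddrForPod is a hypothetical helper, not the actual cilium-cli code:

```go
package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// nodePortAddrForPod returns the <nodeIP>:<nodePort> tuple to probe for a
// NodePort service, using the node that actually hosts the echo pod instead
// of an arbitrary node IP.
func nodePortAddrForPod(pod corev1.Pod, nodes []corev1.Node, nodePort int32) (string, error) {
	for _, node := range nodes {
		if node.Name != pod.Spec.NodeName {
			continue // skip nodes that don't host the echo pod
		}
		for _, addr := range node.Status.Addresses {
			if addr.Type == corev1.NodeInternalIP {
				return fmt.Sprintf("%s:%d", addr.Address, nodePort), nil
			}
		}
	}
	return "", fmt.Errorf("no internal IP found for node %q of pod %s", pod.Spec.NodeName, pod.Name)
}
```

With the example cluster above, this kind of matching would yield 10.168.0.3:30774 for echo-other-node-f4d46f75b-rgzbk instead of 10.168.0.14:30774.
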
@christarazi
Member

I incorrectly thought #695 would fix this flake. I don't see us hitting this flake much anymore though. Could we close this issue?

cc @nbusseneau

@nbusseneau
Member

This flake was happening a lot on AKS, but AKS testing has been disabled due to probable changes on AKS's side, and nobody has had time to look at it yet 😬

@jinsiang2008

+1 on this. We hit these issues a lot when we run the connectivity tests. Maybe we could standardize the tests and refine the outputs/indicators to show what the problem is, or suggest potential fixes. Thanks.

@Rammurthy5

Hiya, I face the same issue on Cilium 1.14.4 running on Kubernetes 1.28, installed via Helm and VPC peered.
Has this been fixed? Does anyone know of a workaround?


This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

github-actions bot added the stale label Sep 28, 2024

This issue has not seen any activity since it was marked stale.
Closing.

github-actions bot closed this as not planned Oct 13, 2024