connectivity/check: Fix wrong NodePort service selection on validation
It is possible for the tuples of node IP and port to be mismatched in
the case of NodePort services, causing the connectivity test to try to
establish a connection to a non-existent tuple.

For example, see the following output:

```
⌛ [gke_cilium-dev_us-west2-a_chris] Waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node) to become ready...
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 1:
🐛 Error waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node): command terminated with exit code 130:
Connectivity test failed: timeout reached waiting for NodePort 10.168.0.14:30774 (cilium-test/echo-other-node)
```

Nodes:

```
$ k get nodes -o wide
NAME                                   INTERNAL-IP
gke-chris-default-pool-1602ae11-bn2n   10.168.0.14
gke-chris-default-pool-1602ae11-ffsh   10.168.0.3
```

Cilium pods:

```
$ ks get pods -o wide | rg cilium
cilium-7sq59   10.168.0.14   gke-chris-default-pool-1602ae11-bn2n
cilium-mbvxl   10.168.0.3    gke-chris-default-pool-1602ae11-ffsh
```

Services:

```
$ k -n cilium-test get svc
NAME              TYPE       CLUSTER-IP    PORT(S)
echo-other-node   NodePort   10.28.29.66   8080:30774/TCP
echo-same-node    NodePort   10.28.23.18   8080:32186/TCP
```

Echo pods:

```
$ k -n cilium-test get pods -o wide
NAME                              READY   STATUS    IP            NODE
client-6488dcf5d4-bxlcp           1/1     Running   10.32.1.176   gke-chris-default-pool-1602ae11-bn2n
client2-5998d566b4-lgxrt          1/1     Running   10.32.1.191   gke-chris-default-pool-1602ae11-bn2n
echo-other-node-f4d46f75b-rgzbk   1/1     Running   10.32.0.11    gke-chris-default-pool-1602ae11-ffsh
echo-same-node-745bd5c77-mxwp7    1/1     Running   10.32.1.63    gke-chris-default-pool-1602ae11-bn2n
```

The pod "echo-other-node-f4d46f75b-rgzbk" resides on node
"gke-chris-default-pool-1602ae11-ffsh", which has the node IP 10.168.0.3.
However, the CLI output shows the test trying to establish a connection
to the other node's IP, 10.168.0.14, where no echo-other-node pod is
running.
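
The mismatch can be reproduced from the listings above with a small lookup. This is only an illustrative sketch: the maps are copied by hand from the `get nodes` and `get pods` output, and `hostIPOf` is a hypothetical helper, not part of the CLI.

```go
package main

import "fmt"

// nodeIPs maps node name to INTERNAL-IP, copied from the node listing above.
var nodeIPs = map[string]string{
	"gke-chris-default-pool-1602ae11-bn2n": "10.168.0.14",
	"gke-chris-default-pool-1602ae11-ffsh": "10.168.0.3",
}

// podNodes maps echo pod name to the node it runs on, copied from the pod
// listing above.
var podNodes = map[string]string{
	"echo-other-node-f4d46f75b-rgzbk": "gke-chris-default-pool-1602ae11-ffsh",
	"echo-same-node-745bd5c77-mxwp7":  "gke-chris-default-pool-1602ae11-bn2n",
}

// hostIPOf returns the IP of the node hosting the given pod.
// Hypothetical helper for illustration only.
func hostIPOf(pod string) (string, bool) {
	node, ok := podNodes[pod]
	if !ok {
		return "", false
	}
	ip, ok := nodeIPs[node]
	return ip, ok
}

func main() {
	// Node IP taken from the failing "Waiting for NodePort" log line.
	const attempted = "10.168.0.14"

	ip, _ := hostIPOf("echo-other-node-f4d46f75b-rgzbk")
	fmt.Printf("pod host IP: %s, attempted: %s, match: %v\n", ip, attempted, ip == attempted)
	// prints: pod host IP: 10.168.0.3, attempted: 10.168.0.14, match: false
}
```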

Fix this by skipping the NodePort check for a node unless an echo pod
backing the service resides on that node.

Fixes: #342

Signed-off-by: Chris Tarazi <[email protected]>
christarazi committed Jan 28, 2022
1 parent 79e9ae2 commit 29a3f05
Showing 1 changed file with 12 additions and 0 deletions.
12 changes: 12 additions & 0 deletions connectivity/check/deployment.go
@@ -678,6 +678,18 @@ func (ct *ConnectivityTest) waitForNodePorts(ctx context.Context, nodeIP string,
 	ctx, cancel := context.WithTimeout(ctx, ct.params.serviceReadyTimeout())
 	defer cancel()
 
+	found := false
+	for name, pod := range ct.echoPods {
+		if pod.Pod.Status.HostIP == nodeIP && strings.HasPrefix(name, service.Service.Name) {
+			found = true
+			break
+		}
+	}
+	if !found {
+		ct.Debugf("Skipping NodePort %s as it doesn't reside on node with IP %s", service.Name(), nodeIP)
+		return nil
+	}
+
 	for _, port := range service.Service.Spec.Ports {
 		nodePort := port.NodePort
 		if nodePort == 0 {
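
The check added in this hunk can be exercised in isolation. The following is a minimal sketch, not the CLI's code: `echoPod` and `serviceRunsOnNode` are hypothetical names, and it assumes the pod map is keyed by pod name and that echo pod names begin with the name of the service that fronts them, as they do in the listings in the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// echoPod is a hypothetical stand-in for the test's pod wrapper; only the
// host IP is kept here.
type echoPod struct {
	HostIP string
}

// serviceRunsOnNode reports whether any echo pod whose name starts with
// serviceName is scheduled on the node with the given IP. It mirrors the
// found/break loop in the hunk above.
func serviceRunsOnNode(echoPods map[string]echoPod, serviceName, nodeIP string) bool {
	for name, pod := range echoPods {
		if pod.HostIP == nodeIP && strings.HasPrefix(name, serviceName) {
			return true
		}
	}
	return false
}

func main() {
	// Pod names and host IPs taken from the commit message.
	pods := map[string]echoPod{
		"echo-other-node-f4d46f75b-rgzbk": {HostIP: "10.168.0.3"},
		"echo-same-node-745bd5c77-mxwp7":  {HostIP: "10.168.0.14"},
	}

	for _, nodeIP := range []string{"10.168.0.14", "10.168.0.3"} {
		for _, svc := range []string{"echo-other-node", "echo-same-node"} {
			fmt.Printf("%s via %s: wait=%v\n", svc, nodeIP, serviceRunsOnNode(pods, svc, nodeIP))
		}
	}
}
```

With the data from the commit message, only the pairs (echo-other-node, 10.168.0.3) and (echo-same-node, 10.168.0.14) pass the check, so the failing tuple 10.168.0.14:30774 would be skipped. Matching by name prefix is simple, but it depends on the naming convention holding; a service whose name is a prefix of another service's name would also match that service's pods.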