Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workflows: check status of Relay port-forward #390

Merged
merged 1 commit into from
Jul 19, 2021

Conversation

nbusseneau
Copy link
Member

Since we start a background process for port-forwarding Relay, we must check if it actually succeeded or failed before allowing workflows to continue.

We also standardize a 10s allowance time for port-forward to start.

This is a follow-up to the discussion here: #336 (comment)

@nbusseneau nbusseneau added the area/CI Continuous Integration testing issue or flake label Jul 1, 2021
@nbusseneau nbusseneau temporarily deployed to ci July 1, 2021 17:13 Inactive
@nbusseneau
Copy link
Member Author

Will remove draft mode when #336 is merged.

@nbusseneau
Copy link
Member Author

GKE https://github.com/cilium/cilium-cli/actions/runs/990923464 hit

 2021-07-01T17:20:32.707384582Z   [.] Action [no-policies/pod-to-service/curl-1: cilium-test/client-7b7bf54b85-gccqx (10.0.0.184) -> cilium-test/echo-same-node (echo-same-node:8080)]
2021-07-01T17:20:32.707392925Z   ❌ command "curl -w %***local_ip***:%***local_port*** -> %***remote_ip***:%***remote_port*** = %***response_code*** --silent --fail --show-error --connect-timeout 5 --output /dev/null http://echo-same-node:8080" failed: command terminated with exit code 28
2021-07-01T17:20:32.707403413Z   📄 Matching flows for pod cilium-test/client-7b7bf54b85-gccqx
2021-07-01T17:20:32.711239758Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:33.218208995Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:33.723960128Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:34.232832620Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:34.739280672Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:35.248487198Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:35.754880236Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:36.262535861Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:36.768820220Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:37.279013368Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:37.786437197Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:37.786602033Z   ⌛ Waiting (5s) for flows: Required flows not found yet
2021-07-01T17:20:38.296525875Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:38.806922199Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:39.315693588Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:39.822888920Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:40.342147733Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:40.850006580Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:41.356918238Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:41.866366287Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:42.380893717Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:42.703910124Z   ✅ DNS request found at 0
2021-07-01T17:20:42.703987313Z   ✅ DNS response found at 4
2021-07-01T17:20:42.703998068Z   ❌ Flow validation failed for pod cilium-test/client-7b7bf54b85-gccqx: 1 failures (first: 0, last: 4, matched: 2)

The flow logs look like #367 (DNS request made but no TCP connection) but this time it's from an internal pod-to-service curl 🤔
Does this ring a bell to anyone?

EKS https://github.com/cilium/cilium-cli/actions/runs/990923461 failed with:

Run [[ $(kubectl -n kube-system get ds/aws-node -o jsonpath='{.status.currentNumberScheduled}') == 0 ]]
Error from server (NotFound): daemonsets.apps "aws-node" not found
Error: Process completed with exit code 1.

This is because we are now not removing the ds/aws-node anymore following #208. After #336 is merged, I will rebase on master to address this.

@pchaigno pchaigno added the needs-rebase This PR needs to be rebased because it has merge conflicts. label Jul 1, 2021
@nbusseneau nbusseneau force-pushed the pr/workflows-check-port-forward-exit-code branch from 97c9285 to e56619b Compare July 2, 2021 16:14
@nbusseneau nbusseneau marked this pull request as ready for review July 2, 2021 16:14
@nbusseneau nbusseneau requested review from a team as code owners July 2, 2021 16:14
@nbusseneau nbusseneau temporarily deployed to ci July 2, 2021 16:14 Inactive
@nbusseneau nbusseneau requested review from nebril and aanm July 2, 2021 16:14
@nbusseneau nbusseneau removed the needs-rebase This PR needs to be rebased because it has merge conflicts. label Jul 2, 2021
@tklauser tklauser force-pushed the pr/workflows-check-port-forward-exit-code branch from e56619b to 3d252a3 Compare July 5, 2021 11:59
@tklauser tklauser temporarily deployed to ci July 5, 2021 12:00 Inactive
@tklauser
Copy link
Member

tklauser commented Jul 5, 2021

Rebased to pick up EKS workflow fixes in #394.

@nbusseneau
Copy link
Member Author

nbusseneau commented Jul 5, 2021

GKE https://github.com/cilium/cilium-cli/actions/runs/1001017290 hit:

2021-07-05T12:33:47.221179076Z   [.] Action [allow-all/pod-to-service/curl-2: cilium-test/client2-666976c95b-mc9mq (10.4.1.77) -> cilium-test/echo-other-node (echo-other-node:8080)]
2021-07-05T12:33:47.221185112Z   📄 Following flows...
2021-07-05T12:33:47.221856284Z   ❌ command "curl -w %***local_ip***:%***local_port*** -> %***remote_ip***:%***remote_port*** = %***response_code*** --silent --fail --show-error --connect-timeout 5 --output /dev/null http://echo-other-node:8080" failed: Timeout occurred
2021-07-05T12:33:47.221975647Z   📄 Validating flows for peer cilium-test/client2-666976c95b-mc9mq
2021-07-05T12:33:57.222499529Z   ❌ Aborting flow matching: context deadline exceeded
2021-07-05T12:33:57.222546717Z   ❌ Flow validation failed for peer cilium-test/client2-666976c95b-mc9mq: 0 failures (first: -1, last: -1, matched: 0)

Never seen this one before (why am I hitting all the new flakes 🤔).

EDIT: created new issue in #399.

PR is blocked.

@nbusseneau
Copy link
Member Author

Added missing check for multicluster testing.

@nbusseneau nbusseneau force-pushed the pr/workflows-check-port-forward-exit-code branch from 79bbdfa to 8bc435c Compare July 7, 2021 17:31
@nbusseneau nbusseneau temporarily deployed to ci July 7, 2021 17:31 Inactive
Since we start a background process for port-forwarding Relay, we must
check if it actually succeeded or failed before allowing workflows to
continue.

When passing additional flags to `cilium hubble port-forward`, both the
original command and `kubectl` child process might be conflated with
additional flags. We add wildcards for matching the additional flags to
both `pkill` cleanups and `pgrep` port-forward success checks.

We also standardize a 10s allowance time for port-forward to start.

Signed-off-by: Nicolas Busseneau <[email protected]>
@nbusseneau nbusseneau force-pushed the pr/workflows-check-port-forward-exit-code branch from 8bc435c to 821af71 Compare July 8, 2021 19:52
@nbusseneau nbusseneau temporarily deployed to ci July 8, 2021 19:53 Inactive
@nbusseneau
Copy link
Member Author

It seems nobody else has been hitting #399.

Considering that:

It seems we are now in the "unreproducible flake" scenario, which is an exception to our zero-flake strategy.
After validating offline with @tklauser, I am removing the block and marking as ready-to-merge.

Issue will be left open in case someone does end up hitting the same flake 🤔

@nbusseneau nbusseneau added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed dont-merge/blocked Another PR must be merged before this one. labels Jul 19, 2021
@tklauser tklauser merged commit abcf4d5 into master Jul 19, 2021
@tklauser tklauser deleted the pr/workflows-check-port-forward-exit-code branch July 19, 2021 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants