workflows: check status of Relay port-forward #390

nbusseneau · 2021-07-01T17:13:17Z

Since we start a background process for port-forwarding Relay, we must check if it actually succeeded or failed before allowing workflows to continue.

We also standardize a 10s allowance time for port-forward to start.

This is a follow-up to the discussion here: #336 (comment)

nbusseneau · 2021-07-01T17:14:26Z

Will remove draft mode when #336 is merged.

nbusseneau · 2021-07-01T18:31:04Z

GKE https://github.com/cilium/cilium-cli/actions/runs/990923464 hit

 2021-07-01T17:20:32.707384582Z   [.] Action [no-policies/pod-to-service/curl-1: cilium-test/client-7b7bf54b85-gccqx (10.0.0.184) -> cilium-test/echo-same-node (echo-same-node:8080)]
2021-07-01T17:20:32.707392925Z   ❌ command "curl -w %***local_ip***:%***local_port*** -> %***remote_ip***:%***remote_port*** = %***response_code*** --silent --fail --show-error --connect-timeout 5 --output /dev/null http://echo-same-node:8080" failed: command terminated with exit code 28
2021-07-01T17:20:32.707403413Z   📄 Matching flows for pod cilium-test/client-7b7bf54b85-gccqx
2021-07-01T17:20:32.711239758Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:33.218208995Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:33.723960128Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:34.232832620Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:34.739280672Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:35.248487198Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:35.754880236Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:36.262535861Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:36.768820220Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:37.279013368Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:37.786437197Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:37.786602033Z   ⌛ Waiting (5s) for flows: Required flows not found yet
2021-07-01T17:20:38.296525875Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:38.806922199Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:39.315693588Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:39.822888920Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:40.342147733Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:40.850006580Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:41.356918238Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:41.866366287Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:42.380893717Z   ℹ️  SYN and(ip(src=10.0.0.184),or(tcp(dstPort=8080),tcp(dstPort=8080)),tcpflags(syn)) not found
2021-07-01T17:20:42.703910124Z   ✅ DNS request found at 0
2021-07-01T17:20:42.703987313Z   ✅ DNS response found at 4
2021-07-01T17:20:42.703998068Z   ❌ Flow validation failed for pod cilium-test/client-7b7bf54b85-gccqx: 1 failures (first: 0, last: 4, matched: 2)

The flow logs look like #367 (DNS request made but no TCP connection) but this time it's from an internal pod-to-service curl 🤔
Does this ring a bell to anyone?

EKS https://github.com/cilium/cilium-cli/actions/runs/990923461 failed with:

Run [[ $(kubectl -n kube-system get ds/aws-node -o jsonpath='{.status.currentNumberScheduled}') == 0 ]]
Error from server (NotFound): daemonsets.apps "aws-node" not found
Error: Process completed with exit code 1.

This is because we are now not removing the ds/aws-node anymore following #208. After #336 is merged, I will rebase on master to address this.

tklauser · 2021-07-05T12:00:14Z

Rebased to pick up EKS workflow fixes in #394.

nbusseneau · 2021-07-05T13:11:25Z

GKE https://github.com/cilium/cilium-cli/actions/runs/1001017290 hit:

2021-07-05T12:33:47.221179076Z   [.] Action [allow-all/pod-to-service/curl-2: cilium-test/client2-666976c95b-mc9mq (10.4.1.77) -> cilium-test/echo-other-node (echo-other-node:8080)]
2021-07-05T12:33:47.221185112Z   📄 Following flows...
2021-07-05T12:33:47.221856284Z   ❌ command "curl -w %***local_ip***:%***local_port*** -> %***remote_ip***:%***remote_port*** = %***response_code*** --silent --fail --show-error --connect-timeout 5 --output /dev/null http://echo-other-node:8080" failed: Timeout occurred
2021-07-05T12:33:47.221975647Z   📄 Validating flows for peer cilium-test/client2-666976c95b-mc9mq
2021-07-05T12:33:57.222499529Z   ❌ Aborting flow matching: context deadline exceeded
2021-07-05T12:33:57.222546717Z   ❌ Flow validation failed for peer cilium-test/client2-666976c95b-mc9mq: 0 failures (first: -1, last: -1, matched: 0)

Never seen this one before (why am I hitting all the new flakes 🤔).

EDIT: created new issue in #399.

PR is blocked.

nbusseneau · 2021-07-07T17:30:12Z

Added missing check for multicluster testing.

Since we start a background process for port-forwarding Relay, we must check if it actually succeeded or failed before allowing workflows to continue. When passing additional flags to `cilium hubble port-forward`, both the original command and `kubectl` child process might be conflated with additional flags. We add wildcards for matching the additional flags to both `pkill` cleanups and `pgrep` port-forward success checks. We also standardize a 10s allowance time for port-forward to start. Signed-off-by: Nicolas Busseneau <[email protected]>

nbusseneau · 2021-07-19T14:05:42Z

It seems nobody else has been hitting #399.

Considering that:

It's been 2 weeks.
The changes are unrelated to the flake.
The changes have been thoroughly tested and validated (merged in cilium/cilium in workflows: various fixes & consistency passes cilium#16787, smooth sailing ever since).

It seems we are now in the "unreproducible flake" scenario, which is an exception to our zero-flake strategy.
After validating offline with @tklauser, I am removing the block and marking as ready-to-merge.

Issue will be left open in case someone does end up hitting the same flake 🤔

nbusseneau added the area/CI Continuous Integration testing issue or flake label Jul 1, 2021

nbusseneau temporarily deployed to ci July 1, 2021 17:13 Inactive

nbusseneau mentioned this pull request Jul 1, 2021

misc: use hubble command rather than port-forwards #336

Merged

pchaigno added the needs-rebase This PR needs to be rebased because it has merge conflicts. label Jul 1, 2021

nbusseneau force-pushed the pr/workflows-check-port-forward-exit-code branch from 97c9285 to e56619b Compare July 2, 2021 16:14

nbusseneau marked this pull request as ready for review July 2, 2021 16:14

nbusseneau requested review from a team as code owners July 2, 2021 16:14

nbusseneau temporarily deployed to ci July 2, 2021 16:14 Inactive

nbusseneau requested review from nebril and aanm July 2, 2021 16:14

maintainer-s-little-helper bot assigned nebril and aanm Jul 2, 2021

nbusseneau removed the needs-rebase This PR needs to be rebased because it has merge conflicts. label Jul 2, 2021

nebril approved these changes Jul 5, 2021

View reviewed changes

maintainer-s-little-helper bot unassigned nebril Jul 5, 2021

aanm approved these changes Jul 5, 2021

View reviewed changes

maintainer-s-little-helper bot unassigned aanm Jul 5, 2021

tklauser force-pushed the pr/workflows-check-port-forward-exit-code branch from e56619b to 3d252a3 Compare July 5, 2021 11:59

tklauser temporarily deployed to ci July 5, 2021 12:00 Inactive

nbusseneau mentioned this pull request Jul 5, 2021

flake: allow-all/pod-to-service/curl-2 timeout during curl exec, no flows retrieved #399

Closed

nbusseneau added the dont-merge/blocked Another PR must be merged before this one. label Jul 5, 2021

pchaigno mentioned this pull request Jul 6, 2021

workflows: various fixes & consistency passes cilium/cilium#16787

Merged

nbusseneau force-pushed the pr/workflows-check-port-forward-exit-code branch from 3d252a3 to 79bbdfa Compare July 7, 2021 17:29

nbusseneau had a problem deploying to ci July 7, 2021 17:29 Error

nbusseneau force-pushed the pr/workflows-check-port-forward-exit-code branch from 79bbdfa to 8bc435c Compare July 7, 2021 17:31

nbusseneau temporarily deployed to ci July 7, 2021 17:31 Inactive

nbusseneau force-pushed the pr/workflows-check-port-forward-exit-code branch from 8bc435c to 821af71 Compare July 8, 2021 19:52

nbusseneau temporarily deployed to ci July 8, 2021 19:53 Inactive

nbusseneau added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed dont-merge/blocked Another PR must be merged before this one. labels Jul 19, 2021

tklauser merged commit abcf4d5 into master Jul 19, 2021

tklauser deleted the pr/workflows-check-port-forward-exit-code branch July 19, 2021 14:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

workflows: check status of Relay port-forward #390

workflows: check status of Relay port-forward #390

nbusseneau commented Jul 1, 2021

nbusseneau commented Jul 1, 2021

nbusseneau commented Jul 1, 2021

tklauser commented Jul 5, 2021

nbusseneau commented Jul 5, 2021 •

edited

Loading

nbusseneau commented Jul 7, 2021

nbusseneau commented Jul 19, 2021

workflows: check status of Relay port-forward #390

workflows: check status of Relay port-forward #390

Conversation

nbusseneau commented Jul 1, 2021

nbusseneau commented Jul 1, 2021

nbusseneau commented Jul 1, 2021

tklauser commented Jul 5, 2021

nbusseneau commented Jul 5, 2021 • edited Loading

PR is blocked.

nbusseneau commented Jul 7, 2021

nbusseneau commented Jul 19, 2021

nbusseneau commented Jul 5, 2021 •

edited

Loading