workflows: fix in-cluster job kubectl wait #451

nbusseneau · 2021-07-21T18:48:11Z

`kubectl wait --for=condition=complete --timeout=X` behaviour is a bit counterintuitive: it waits until either the job succeeds or the timeout is hit. When the job fails, it does not stop waiting: it keeps waiting until the timeout is hit.

To watch for failures, `--for=condition=failed` should be used instead. However, this likewise waits until either the job fails or the timeout is hit, and does not stop waiting if the job succeeds.

Unfortunately, `kubectl wait` does not allow waiting for multiple conditions. To work around this, we set up two concurrent background waits, one for each condition, and actively wait for the first one to end.

This ensures we do not wait for the whole allocated timeout every time there is an error during the in-cluster script execution.
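For readers who want to see the shape of the workaround, here is a minimal Bash sketch of the two-background-waits idea. It is not a copy of the workflow changes in this PR: the helper name `wait_for_job` and its arguments are hypothetical, and it assumes Bash 4.3 or newer for `wait -n`.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: wait_for_job is a hypothetical helper, not part of this PR.

# Usage: wait_for_job NAMESPACE JOB_NAME TIMEOUT
# Returns 0 if the job completes, non-zero if it fails or the timeout is hit.
wait_for_job() {
  local namespace="$1" job="$2" timeout="$3"

  # Background wait 1: exits 0 as soon as the job reports condition=complete.
  kubectl -n "$namespace" wait "job/$job" \
    --for=condition=complete --timeout="$timeout" &
  local complete_pid=$!

  # Background wait 2: kubectl exits 0 when the job reports condition=failed,
  # so turn that into a non-zero status for the caller.
  (kubectl -n "$namespace" wait "job/$job" \
    --for=condition=failed --timeout="$timeout" && exit 1) &
  local failed_pid=$!

  # Block until whichever of the two background waits ends first.
  local rc=0
  wait -n "$complete_pid" "$failed_pid" || rc=$?

  # Best-effort cleanup of the wait that is still running. A kubectl process
  # started inside the subshell may outlive it until its own timeout expires.
  kill "$complete_pid" "$failed_pid" 2>/dev/null || true
  wait "$complete_pid" "$failed_pid" 2>/dev/null || true

  return "$rc"
}
```

A caller could run something like `wait_for_job kube-system my-job 5m`, save the exit code, fetch the job logs with `kubectl -n kube-system logs job/my-job`, and then exit with the saved code. The namespace and job name here are placeholders.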

nbusseneau · 2021-07-21T20:56:35Z

Links to test runs of workflow changes:

EKS: https://github.com/cilium/cilium-cli/actions/runs/1053975796
EKS (tunnel): https://github.com/cilium/cilium-cli/actions/runs/1053975803
GKE: https://github.com/cilium/cilium-cli/actions/runs/1053975793
Multicluster: https://github.com/cilium/cilium-cli/actions/runs/1053975794
External workloads: https://github.com/cilium/cilium-cli/actions/runs/1053975802

christarazi

Nice find

`kubectl wait --for=condition=complete --timeout=X` behaviour is a bit counterintuitive: it waits until either the job succeeds or timeout is hit. When the job fails, it does not stop waiting: it will continue waiting until timeout is hit.

For watching for failures, `--for=condition=failed` should be used. However, this will likewise wait until either the job fails or timeout is hit, and will not stop waiting if the job succeeds.

`kubectl wait` unfortunately does not allow waiting for multiple conditions. To work around this, we set up two concurrent background waits for both conditions, and actively wait for the first one to end.

This will ensure we do not wait for the whole allocated timeout every time there is an error during the in-cluster script execution.

Signed-off-by: Nicolas Busseneau <[email protected]>

nbusseneau · 2021-07-22T08:09:11Z

All test runs passed except EKS (tunnel), but that is because of the consistent failure discussed on Slack here.

Since:

- The changes in this PR have been validated against other workflows and are exactly the same in EKS (tunnel).
- The PR improves CI testing on cilium-cli (faster and more reliable runs).
- The PR will help to debug the EKS (tunnel) issue itself, since the workflow is now more reliable and does not hit the "hard job timeout" (as seen in other runs of the workflow) which causes unreliable information gathering / cluster cleanup.

=> We are in one of the exception cases to the zero-flakes strategy and do not need to wait to rebase on top of the future (hopefully soon) fix for the EKS (tunnel) workflow.

This PR can be marked as ready-to-merge once reviews are in.

aanm

Needs similar changes in the *-v1.10 GH workflows

nbusseneau · 2021-07-22T14:37:46Z

> Needs similar changes in the *-v1.10 GH workflows

Sir, this is cilium-cli. We don't do that here.

nbusseneau added the area/CI label Jul 21, 2021

nbusseneau temporarily deployed to ci July 21, 2021 18:48 Inactive

nbusseneau force-pushed the pr/workflows-fix-wait branch from dc04715 to 34f9c3f July 21, 2021 19:05

nbusseneau temporarily deployed to ci July 21, 2021 19:05 Inactive

nbusseneau force-pushed the pr/workflows-fix-wait branch from 34f9c3f to 2dd1b37 July 21, 2021 20:20

nbusseneau temporarily deployed to ci July 21, 2021 20:20 Inactive

nbusseneau force-pushed the pr/workflows-fix-wait branch from 2dd1b37 to 4597588 July 21, 2021 20:26

nbusseneau temporarily deployed to ci July 21, 2021 20:26 Inactive

nbusseneau force-pushed the pr/workflows-fix-wait branch from 4597588 to ff82cc9 July 21, 2021 20:51

nbusseneau temporarily deployed to ci July 21, 2021 20:51 Inactive

nbusseneau marked this pull request as ready for review July 21, 2021 21:02

nbusseneau requested review from a team as code owners July 21, 2021 21:02

nbusseneau requested a review from christarazi July 21, 2021 21:02

maintainer-s-little-helper bot assigned christarazi Jul 21, 2021

nbusseneau requested a review from aanm July 21, 2021 21:02

maintainer-s-little-helper bot assigned aanm Jul 21, 2021

michi-covalent approved these changes Jul 21, 2021

christarazi approved these changes Jul 21, 2021

maintainer-s-little-helper bot unassigned christarazi Jul 21, 2021

nbusseneau force-pushed the pr/workflows-fix-wait branch from ff82cc9 to 3ef26b8 July 22, 2021 08:09

nbusseneau temporarily deployed to ci July 22, 2021 08:09 Inactive

aanm requested changes Jul 22, 2021

nbusseneau requested a review from aanm July 22, 2021 14:38

aanm approved these changes Jul 22, 2021

maintainer-s-little-helper bot unassigned aanm Jul 22, 2021

nbusseneau added the ready-to-merge label Jul 22, 2021

michi-covalent merged commit 68282c3 into master Jul 22, 2021

michi-covalent deleted the pr/workflows-fix-wait branch July 22, 2021 17:29
