Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workflows: fix in-cluster job kubectl wait #451

Merged
merged 1 commit into from
Jul 22, 2021

Conversation

nbusseneau
Copy link
Member

@nbusseneau nbusseneau commented Jul 21, 2021

kubectl wait --for=condition=complete --timeout=X behaviour is a bit counterintuitive: it waits until either the job succeeds or timeout is hit. When the job fails, it does not stop waiting: it will continue waiting until timeout is hit.

For watching for failures, --for=condition=failed should be used. However, this will likewise wait until either the job fails or timeout is hit, and will not stop waiting if the job succeeds.

kubectl wait unfortunately does not allow waiting for multiple conditions. To work around this, we set up two concurrent background waits for both conditions, and actively wait for the first one to end.

This will ensure we do not wait for the whole allocated timeout everytime there is an error during the in-cluster script execution.

@nbusseneau nbusseneau added the area/CI Continuous Integration testing issue or flake label Jul 21, 2021
@nbusseneau nbusseneau temporarily deployed to ci July 21, 2021 18:48 Inactive
@nbusseneau nbusseneau force-pushed the pr/workflows-fix-wait branch from dc04715 to 34f9c3f Compare July 21, 2021 19:05
@nbusseneau nbusseneau temporarily deployed to ci July 21, 2021 19:05 Inactive
@nbusseneau nbusseneau force-pushed the pr/workflows-fix-wait branch from 34f9c3f to 2dd1b37 Compare July 21, 2021 20:20
@nbusseneau nbusseneau temporarily deployed to ci July 21, 2021 20:20 Inactive
@nbusseneau nbusseneau force-pushed the pr/workflows-fix-wait branch from 2dd1b37 to 4597588 Compare July 21, 2021 20:26
@nbusseneau nbusseneau temporarily deployed to ci July 21, 2021 20:26 Inactive
@nbusseneau nbusseneau force-pushed the pr/workflows-fix-wait branch from 4597588 to ff82cc9 Compare July 21, 2021 20:51
@nbusseneau nbusseneau temporarily deployed to ci July 21, 2021 20:51 Inactive
@nbusseneau nbusseneau marked this pull request as ready for review July 21, 2021 21:02
@nbusseneau nbusseneau requested review from a team as code owners July 21, 2021 21:02
@nbusseneau nbusseneau requested a review from christarazi July 21, 2021 21:02
@nbusseneau nbusseneau requested a review from aanm July 21, 2021 21:02
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find

`kubectl wait --for=condition=complete --timeout=X` behaviour is a bit
counterintuitive: it waits until either the job succeeds or timeout is
hit. When the job fails, it does not stop waiting: it will continue
waiting until timeout is hit.

For watching for failures, `--for=condition=failed` should be used.
However, this will likewise wait until either the job fails or timeout
is hit, and will not stop waiting if the job succeeds.

`kubectl wait` unfortunately does not allow waiting for multiple
conditions. To work around this, we set up two concurrent background
waits for both conditions, and actively wait for the first one to end.

This will ensure we do not wait for the whole allocated timeout
everytime there is an error during the in-cluster script execution.

Signed-off-by: Nicolas Busseneau <[email protected]>
@nbusseneau
Copy link
Member Author

nbusseneau commented Jul 22, 2021

All test runs passed except EKS (tunnel), but that is because of the consistent failure discussed on Slack here.

Since:

  • The changes in this PR have been validated against other workflows and are exactly the same in EKS (tunnel).
  • The PR improves CI testing on cilium-cli (faster and more reliable runs).
  • The PR will help to debug the EKS (tunnel) issue itself, since the workflow is now more reliable and does not hit the "hard job timeout" (as seen in other runs of the workflow) which causes unreliable information gathering / cluster cleanup.

=> We are in one of the exception cases to the zero-flakes strategy and do not need to wait to rebase on top of the future (hopefully soon) fix for the EKS (tunnel) workflow.

This PR can be marked as ready-to-merge once reviews are in.

@nbusseneau nbusseneau force-pushed the pr/workflows-fix-wait branch from ff82cc9 to 3ef26b8 Compare July 22, 2021 08:09
@nbusseneau nbusseneau temporarily deployed to ci July 22, 2021 08:09 Inactive
Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs similar changes in the *-v1.10 GH workflows

@nbusseneau
Copy link
Member Author

nbusseneau commented Jul 22, 2021

Needs similar changes in the *-v1.10 GH workflows

Sir, this is cilium-cli. We don't do that here.

@nbusseneau nbusseneau requested a review from aanm July 22, 2021 14:38
@nbusseneau nbusseneau added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 22, 2021
@michi-covalent michi-covalent merged commit 68282c3 into master Jul 22, 2021
@michi-covalent michi-covalent deleted the pr/workflows-fix-wait branch July 22, 2021 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants