Long metric analysis cycle when confirm promotion gate is open #1141

Closed
cdlliuy opened this issue Mar 15, 2022 · 0 comments · Fixed by #1139

cdlliuy commented Mar 15, 2022

Describe the feature

Per the Flagger documentation (https://docs.flagger.app/usage/webhooks):

confirm-promotion hooks are executed before the promotion step. The canary promotion is paused until the hooks return HTTP 200. While the promotion is paused, Flagger will continue to run the metrics checks and rollout hooks.
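To make the gate contract concrete, here is a minimal sketch of a confirm-promotion endpoint in Go. The `gate` type, the `approve` and `checkGate` helpers, and the `/confirm-promotion` path are hypothetical names chosen for illustration and are not part of Flagger. The quoted documentation only says that promotion stays paused until the hook returns HTTP 200, so answering 403 while the gate is closed is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// gate is a hypothetical in-memory approval flag (not a Flagger type).
type gate struct{ open int32 }

// approve opens the gate, e.g. after a manual sign-off.
func (g *gate) approve() { atomic.StoreInt32(&g.open, 1) }

// ServeHTTP answers 200 once the gate is open. While it is closed it answers
// 403, which is an assumed "not approved" status for this sketch.
func (g *gate) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt32(&g.open) == 1 {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusForbidden)
}

// checkGate calls the handler in-process and returns the HTTP status code.
func checkGate(g *gate) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/confirm-promotion", nil)
	g.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	g := &gate{}
	fmt.Println("before approval:", checkGate(g))
	g.approve()
	fmt.Println("after approval:", checkGate(g))
}
```

In a real deployment the handler would be served over the network and referenced from the canary's webhook configuration; the in-process call here only keeps the sketch self-contained.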

This is useful: the metric checks and load tests keep running while the promotion gate is closed, so any further errors can still be detected.

However, each time the gate check fails, Flagger restarts a complete evaluation cycle from the very beginning. Even if the promotion gate is opened in the middle of that cycle, the Flagger controller must still finish all remaining metric analysis iterations before it checks the gate again, as shown in the timeline below:

pre-rollout check
metric analysis iteration 1/3
metric analysis iteration 2/3
metric analysis iteration 3/3  
check promotion gate --> false, not approved yet
pre-rollout check
metric analysis iteration 1/3
<approve the promotion gate here> 
metric analysis iteration 2/3
metric analysis iteration 3/3
check promotion gate --> true,  approved 
....
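The cost of the timeline above can be illustrated with a small toy model in Go. This is not Flagger code: `runRestartingCycle` and its `approveAfter` parameter (the number of metric checks that have run when the gate is approved) are hypothetical names used only to count how many checks run before promotion can proceed.

```go
package main

import "fmt"

// runRestartingCycle is a toy model of the behaviour described in this issue.
// The gate is only checked at the end of a full cycle, and a failed gate check
// restarts the cycle from iteration 1. The gate counts as approved once
// approveAfter metric checks have run in total.
// It returns the number of metric checks run before the gate check passes.
func runRestartingCycle(iterations, approveAfter int) int {
	if iterations < 1 {
		iterations = 1 // guard against an endless loop on bad input
	}
	checks := 0
	for {
		fmt.Println("pre-rollout check")
		for i := 1; i <= iterations; i++ {
			checks++
			fmt.Printf("metric analysis iteration %d/%d\n", i, iterations)
		}
		approved := checks >= approveAfter
		fmt.Printf("check promotion gate --> %t\n", approved)
		if approved {
			return checks
		}
	}
}

func main() {
	// Approval arrives after the 4th check, i.e. after iteration 1/3 of the
	// second cycle, as in the timeline above.
	const approveAfter = 4
	total := runRestartingCycle(3, approveAfter)
	fmt.Printf("checks run: %d, of which %d ran after approval\n", total, total-approveAfter)
}
```

With three iterations per cycle and approval after the fourth check, the model runs six checks, so two analysis intervals pass between approval and promotion.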

Proposed solution

To shorten the wait after the promotion step is approved, I have raised PR #1139 to achieve the timeline below:

pre-rollout check
metric analysis iteration 1/3
metric analysis iteration 2/3
metric analysis iteration 3/3  
check promotion gate --> false, not approved yet
metric analysis iteration 3/3
metric analysis iteration 3/3
metric analysis iteration 3/3
<approve the promotion gate here> 
check promotion gate --> true,  approved 
....
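The proposed timeline can be sketched as a toy model in Go. This is not the code from the PR: `runRepeatLastIteration` and `approveAfter` (the number of metric checks that have run when the gate is approved) are hypothetical names, and checking the gate after every repeated iteration is an assumption about how the timeline above is meant to work.

```go
package main

import "fmt"

// runRepeatLastIteration is a toy model of the proposed behaviour: the full
// analysis cycle runs once, and while the gate stays closed only the last
// iteration is repeated, with a gate check after each repeat (assumed).
// The gate counts as approved once approveAfter metric checks have run.
// It returns the number of metric checks run before the gate check passes.
func runRepeatLastIteration(iterations, approveAfter int) int {
	if iterations < 1 {
		iterations = 1 // keep the model well defined on bad input
	}
	checks := 0
	fmt.Println("pre-rollout check")
	for i := 1; i <= iterations; i++ {
		checks++
		fmt.Printf("metric analysis iteration %d/%d\n", i, iterations)
	}
	for {
		approved := checks >= approveAfter
		fmt.Printf("check promotion gate --> %t\n", approved)
		if approved {
			return checks
		}
		// Gate still closed: repeat only the final iteration.
		checks++
		fmt.Printf("metric analysis iteration %d/%d\n", iterations, iterations)
	}
}

func main() {
	// Approval arrives after the 6th check, i.e. after three repeats of
	// iteration 3/3, as in the timeline above.
	total := runRepeatLastIteration(3, 6)
	fmt.Printf("checks run before promotion: %d\n", total)
}
```

In this model the gate check that follows approval passes immediately, so once the first full cycle is complete no further analysis intervals are spent waiting.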

Also, I noticed that during this additional metric analysis period, per https://github.com/fluxcd/flagger/blob/main/pkg/controller/scheduler.go#L350-L351, once the canary status has changed to "WaitingPromotion", the canary cannot be marked as failed and rolled back even if errors are detected. How should this conflict be handled?

So I would like to propose triggering a rollback when metric analysis fails while phase == WaitingPromotion.
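The proposed rule can be written as a small decision function. This is a minimal sketch, not the scheduler code linked above: the `Phase` type and `nextPhase` function are hypothetical, the phase names only mirror the ones used in this issue, and Flagger's failed-check threshold is left out for brevity.

```go
package main

import "fmt"

// Phase is a local stand-in for the canary phase; it is not Flagger's own type.
type Phase string

const (
	Progressing      Phase = "Progressing"
	WaitingPromotion Phase = "WaitingPromotion"
	Promoting        Phase = "Promoting"
	Failed           Phase = "Failed"
)

// nextPhase sketches the proposed rule for a canary waiting on the gate:
// a failed metric analysis wins over everything else and leads to rollback,
// otherwise an open gate lets the promotion proceed.
func nextPhase(current Phase, analysisOK, gateOpen bool) Phase {
	if current != WaitingPromotion {
		return current // other phases are out of scope for this sketch
	}
	if !analysisOK {
		return Failed // proposed: roll back even while waiting for approval
	}
	if gateOpen {
		return Promoting
	}
	return WaitingPromotion
}

func main() {
	fmt.Println(nextPhase(WaitingPromotion, false, false))
	fmt.Println(nextPhase(WaitingPromotion, true, false))
	fmt.Println(nextPhase(WaitingPromotion, true, true))
}
```

Evaluating the analysis result before the gate means an approval that arrives together with a failing check does not promote a bad canary.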

Any alternatives you've considered?

No, I could not find an alternative with the current code base.

I raised PR #1139 to resolve the issue.
I am open to comments and to other ideas for achieving the target scenario.
