Canary deployment receives traffic even if it has zero pods when using the confirm-rollout webhook #874
The traffic split is created regardless of the number of pods; while the canary is at zero pods, 100% of the traffic is routed to the primary.
The issue is that the traffic split is routing 50% of the traffic to the canary even though no canary pods are running.
Who is scaling down the pods?
No one. The confirm-rollout webhook returns a failing response, hence no canary pods should be running.
@stefanprodan We are also using FluxCD to support GitOps. Could there be some sort of timeout issue, given that the confirm-rollout gate might take up to 12h to pass?
@stefanprodan Here is some more data regarding this issue:
Based on these logs, when using the confirm-rollout webhook, traffic is routed to the canary deployment even though the scale-up didn't occur:
{"level":"debug","ts":"2021-04-07T06:59:55.830Z","logger":"event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:\"Canary\", Namespace:\"default\", Name:\"app\", UID:\"3128fc56-1784-462b-b2bb-78843bc2370c\", APIVersion:\"flagger.app/v1beta1\", ResourceVersion:\"244944904\", FieldPath:\"\"}): type: 'Warning' reason: 'Synced' Halt app.default advancement waiting for approval confirm"}
{"level":"debug","ts":"2021-04-07T07:00:55.711Z","logger":"event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:"Canary", Namespace:"default", Name:"app", UID:"3128fc56-1784-462b-b2bb-78843bc2370c", APIVersion:"flagger.app/v1beta1", ResourceVersion:"245418230", FieldPath:""}): type: 'Normal' reason: 'Synced' Confirm-rollout check confirm passed"} {"level":"debug","ts":"2021-04-07T07:01:55.783Z","logger":"event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:"Canary", Namespace:"default", Name:"app", UID:"3128fc56-1784-462b-b2bb-78843bc2370c", APIVersion:"flagger.app/v1beta1", ResourceVersion:"245418863", FieldPath:""}): type: 'Normal' reason: 'Synced' New revision detected! Restarting analysis for app.default"} {"level":"debug","ts":"2021-04-07T07:02:55.786Z","logger":"event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:"Canary", Namespace:"default", Name:"app", UID:"3128fc56-1784-462b-b2bb-78843bc2370c", APIVersion:"flagger.app/v1beta1", ResourceVersion:"245419538", FieldPath:""}): type: 'Normal' reason: 'Synced' Starting canary analysis for app.default"} {"level":"debug","ts":"2021-04-07T07:02:58.818Z","logger":"event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:"Canary", Namespace:"default", Name:"app", UID:"3128fc56-1784-462b-b2bb-78843bc2370c", APIVersion:"flagger.app/v1beta1", ResourceVersion:"245419538", FieldPath:""}): type: 'Normal' reason: 'Synced' Advance app.default canary iteration 1/1"} {"level":"debug","ts":"2021-04-07T07:10:26.844Z","logger":"event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:"Canary", Namespace:"default", Name:"app", UID:"3128fc56-1784-462b-b2bb-78843bc2370c", APIVersion:"flagger.app/v1beta1", ResourceVersion:"245420138", FieldPath:""}): type: 'Normal' reason: 'Synced' Routing all traffic to canary"} Log entry that describes failed e2e on canary: {"level":"info","ts":"2021-04-07T07:25:29.984Z","caller":"controller/events.go:45","msg":"Halt app.default advancement external check e2e failed Post "http://puerta.sre:8080/test\": context deadline exceeded","canary":"app.default"} Log entry that describes scale down to canary: {"level":"info","ts":"2021-04-07T07:25:33.361Z","caller":"controller/events.go:45","msg":"Canary failed! Scaling down app.default","canary":"app.default"} Stack trace: In order for canary deployment to succeed this should be the stack trace |
TL;DR: It appears that when the canary rollout is approved by the confirm-rollout webhook (a 2xx response) after having previously been declined (an error response), no pods are scaled up for the canary.

func (c *Controller) runConfirmRolloutHooks(canary *flaggerv1.Canary, canaryController canary.Controller) bool {
	for _, webhook := range canary.GetAnalysis().Webhooks {
		if webhook.Type == flaggerv1.ConfirmRolloutHook {
			err := CallWebhook(canary.Name, canary.Namespace, flaggerv1.CanaryPhaseProgressing, webhook)
			if err != nil {
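				// webhook returned an error: mark the canary as Waiting (if it is not already) and halt advancement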
				if canary.Status.Phase != flaggerv1.CanaryPhaseWaiting {
					if err := canaryController.SetStatusPhase(canary, flaggerv1.CanaryPhaseWaiting); err != nil {
						c.logger.With("canary", fmt.Sprintf("%s.%s", canary.Name, canary.Namespace)).Errorf("%v", err)
					}
					c.recordEventWarningf(canary, "Halt %s.%s advancement waiting for approval %s",
						canary.Name, canary.Namespace, webhook.Name)
					c.alert(canary, "Canary is waiting for approval.", false, flaggerv1.SeverityWarn)
				}
				return false
			} else {
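				// webhook passed: if the canary was previously waiting for approval, move it back to Initializing;
				// note that this branch still returns false for the current reconciliation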
				if canary.Status.Phase == flaggerv1.CanaryPhaseWaiting {
					if err := canaryController.SetStatusPhase(canary, flaggerv1.CanaryPhaseInitializing); err != nil {
						c.logger.With("canary", fmt.Sprintf("%s.%s", canary.Name, canary.Namespace)).Errorf("%v", err)
						return false
					}
					c.recordEventInfof(canary, "Confirm-rollout check %s passed", webhook.Name)
					return false
				}
			}
		}
	}
	return true
}
Hi, we have had a recurring issue with Flagger since version 1.6.4. Our flow uses the confirm-rollout webhook so that we can promote new versions only during working hours, e.g. we query an internal service that returns true during working hours (very simple; a rough sketch of such a gate is shown at the end of this comment). Recently we have been seeing the canary deployment get promoted without any actual pods in the canary deployment (there is a traffic split).
Has anyone stumbled into this issue?
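For context, the working-hours gate mentioned above can be a very small HTTP service. The sketch below is illustrative only, not our actual service: the listen port, handler path, and working-hours window are made up, and the request fields are inferred from the CallWebhook call in the Flagger snippet above (canary name, namespace, and phase). Flagger treats any 2xx response from a confirm-rollout webhook as approval and any other status as "keep waiting".

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// gatePayload mirrors the fields Flagger appears to POST to webhooks
// (inferred from the CallWebhook call above); extra fields are ignored.
type gatePayload struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
	Phase     string `json:"phase"`
}

func main() {
	// Hypothetical path and port; point the Canary webhook URL at whatever you deploy.
	http.HandleFunc("/gate/confirm-rollout", func(w http.ResponseWriter, r *http.Request) {
		var p gatePayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}

		// Approve rollouts only between 09:00 and 17:00 UTC (illustrative window).
		// Any 2xx tells Flagger to proceed; any other status keeps the canary waiting.
		hour := time.Now().UTC().Hour()
		if hour >= 9 && hour < 17 {
			w.WriteHeader(http.StatusOK)
			_, _ = w.Write([]byte("approved: " + p.Name + "." + p.Namespace))
			return
		}
		http.Error(w, "outside working hours", http.StatusForbidden)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With a gate like this, the canary can legitimately sit in the Waiting phase for many hours (which is where the 12h question above comes from); on approval, the expectation is that Flagger scales the canary deployment back up before shifting any traffic, which is exactly the part that does not seem to happen here.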