
Canary deployment receives traffic even if it has zero pods when using the confirm-rollout webhook #874

Closed
AlonGluz opened this issue Apr 5, 2021 · 7 comments · Fixed by #878

Comments

@AlonGluz

AlonGluz commented Apr 5, 2021

Hi, we have had a recurring issue with Flagger since version 1.6.4. Our flow uses the confirm-rollout webhook so that we only promote new versions during working hours: we query an internal service that returns success during working hours (very simple). Recently we have seen the canary deployment being promoted without any actual pods in the canary deployment (there is a traffic split).
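For reference, the gate is nothing more than an HTTP endpoint. Below is a minimal sketch of such a working-hours gate; the path, port and hours are made up for illustration, and it assumes (as the Flagger docs describe) that a confirm-rollout webhook keeps the canary waiting as long as the endpoint returns a non-2xx status:

package main

import (
	"log"
	"net/http"
	"time"
)

// confirmRollout approves the rollout only during working hours: a 2xx
// response lets Flagger proceed, anything else keeps the canary waiting.
// The hours (09:00-17:00 server time) and path are illustrative.
func confirmRollout(w http.ResponseWriter, r *http.Request) {
	hour := time.Now().Hour()
	if hour >= 9 && hour < 17 {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("approved"))
		return
	}
	w.WriteHeader(http.StatusForbidden)
	_, _ = w.Write([]byte("outside working hours"))
}

func main() {
	http.HandleFunc("/gate/confirm-rollout", confirmRollout)
	log.Fatal(http.ListenAndServe(":8080", nil))
}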
Has anyone stumbled into this issue?

@stefanprodan
Member

The traffic split is created no matter the number of pods; while the canary is at zero pods, 100% of the traffic is routed to the primary.

@AlonGluz
Author

AlonGluz commented Apr 5, 2021

The issue is that the traffic split is routing traffic (50%) to the canary even though no pods are running.
This happens for both blue/green and canary.

@stefanprodan
Member

Who is scaling down the pods?

@AlonGluz
Author

AlonGluz commented Apr 5, 2021

No one. The confirm-rollout webhook returns a failing response, hence no canary should be running.

@AlonGluz
Author

AlonGluz commented Apr 6, 2021

@stefanprodan We are also using FluxCD to support GitOps. Could there be some sort of timeout if the confirm-rollout gate takes up to 12h to approve?

@AlonGluz
Author

AlonGluz commented Apr 7, 2021

@stefanprodan Here is some more data regarding this issue:
According to the documentation, the Blue/Green flow for a service mesh should be:

  • detect new revision (deployment spec, secrets or configmaps changes)

  • scale up the canary (green)

  • run conformance tests for the canary pods

  • run load tests and metric checks for the canary pods every minute

  • abort the canary release if the failure threshold is reached

  • route traffic to canary

  • promote canary spec over primary (blue)

  • wait for primary rollout

  • route traffic to primary

  • scale down canary


Based on the logs below, when using the confirm-rollout webhook, traffic is routed to the canary deployment even though the scale-up didn't occur:
Events log:

{"level":"debug","ts":"2021-04-07T06:59:55.830Z","logger":"event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:\"Canary\", Namespace:\"default\", Name:\"app\", UID:\"3128fc56-1784-462b-b2bb-78843bc2370c\", APIVersion:\"flagger.app/v1beta1\", ResourceVersion:\"244944904\", FieldPath:\"\"}): type: 'Warning' reason: 'Synced' Halt app.default advancement waiting for approval confirm"}

{"level":"debug","ts":"2021-04-07T07:00:55.711Z","logger":"event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:"Canary", Namespace:"default", Name:"app", UID:"3128fc56-1784-462b-b2bb-78843bc2370c", APIVersion:"flagger.app/v1beta1", ResourceVersion:"245418230", FieldPath:""}): type: 'Normal' reason: 'Synced' Confirm-rollout check confirm passed"}

{"level":"debug","ts":"2021-04-07T07:01:55.783Z","logger":"event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:"Canary", Namespace:"default", Name:"app", UID:"3128fc56-1784-462b-b2bb-78843bc2370c", APIVersion:"flagger.app/v1beta1", ResourceVersion:"245418863", FieldPath:""}): type: 'Normal' reason: 'Synced' New revision detected! Restarting analysis for app.default"}

{"level":"debug","ts":"2021-04-07T07:02:55.786Z","logger":"event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:"Canary", Namespace:"default", Name:"app", UID:"3128fc56-1784-462b-b2bb-78843bc2370c", APIVersion:"flagger.app/v1beta1", ResourceVersion:"245419538", FieldPath:""}): type: 'Normal' reason: 'Synced' Starting canary analysis for app.default"}

{"level":"debug","ts":"2021-04-07T07:02:58.818Z","logger":"event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:"Canary", Namespace:"default", Name:"app", UID:"3128fc56-1784-462b-b2bb-78843bc2370c", APIVersion:"flagger.app/v1beta1", ResourceVersion:"245419538", FieldPath:""}): type: 'Normal' reason: 'Synced' Advance app.default canary iteration 1/1"}

{"level":"debug","ts":"2021-04-07T07:10:26.844Z","logger":"event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:"Canary", Namespace:"default", Name:"app", UID:"3128fc56-1784-462b-b2bb-78843bc2370c", APIVersion:"flagger.app/v1beta1", ResourceVersion:"245420138", FieldPath:""}): type: 'Normal' reason: 'Synced' Routing all traffic to canary"}

Log entry that describes failed e2e on canary:

{"level":"info","ts":"2021-04-07T07:25:29.984Z","caller":"controller/events.go:45","msg":"Halt app.default advancement external check e2e failed Post "http://puerta.sre:8080/test\": context deadline exceeded","canary":"app.default"}

Log entry that describes scale down to canary:

{"level":"info","ts":"2021-04-07T07:25:33.361Z","caller":"controller/events.go:45","msg":"Canary failed! Scaling down app.default","canary":"app.default"}

Stack trace:

In order for the canary deployment to succeed, this should be the call path:
Run -> scheduleCanaries -> advanceCanary -> shouldAdvance -> advanceCanary -> runConfirmRolloutHooks -> advanceCanary -> checkCanaryStatus -> scale canary deployment

After a 2xx response from the confirm-rollout server, canary.Status.Phase == flaggerv1.CanaryPhaseProgressing.
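To make the ordering problem concrete, here is a simplified, hypothetical sketch (not the actual Flagger scheduler code; the phase names mirror the ones above, everything else is made up for illustration): the canary is only scaled up while its status is still in a pre-analysis phase, so approving the rollout straight into Progressing skips the scale-up while traffic shifting still happens.

package main

import "fmt"

// Hypothetical, simplified sketch of the gating implied by the call path
// above -- not the real Flagger scheduler code.
type phase string

const (
	phaseInitializing phase = "Initializing"
	phaseProgressing  phase = "Progressing"
)

// advanceSketch scales the canary up only from a pre-analysis phase and
// shifts traffic once the phase is Progressing.
func advanceSketch(p phase, scaleUpCanary, routeTraffic func()) {
	switch p {
	case phaseInitializing:
		// The canary workload is only scaled up here.
		scaleUpCanary()
	case phaseProgressing:
		// If the confirm-rollout approval moves the phase straight to
		// Progressing, the scale-up case above never runs for this canary,
		// yet traffic is still shifted to the (zero-pod) canary.
		routeTraffic()
	}
}

func main() {
	// Approving straight into Progressing: traffic is routed,
	// but the scale-up never happened.
	advanceSketch(phaseProgressing,
		func() { fmt.Println("scaling up canary") },
		func() { fmt.Println("routing traffic to canary") })
}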

@AlonGluz
Author

AlonGluz commented Apr 7, 2021

TL;DR: It appears that when the canary deployment is approved (the confirm-rollout webhook returns 2xx) after it was previously declined (the webhook returned an error), no pods are scaled up for the canary.
The default ProgressDeadlineSeconds then kills the rollout.
I think this might be a solution for it:

func (c *Controller) runConfirmRolloutHooks(canary *flaggerv1.Canary, canaryController canary.Controller) bool {
	for _, webhook := range canary.GetAnalysis().Webhooks {
		if webhook.Type == flaggerv1.ConfirmRolloutHook {
			err := CallWebhook(canary.Name, canary.Namespace, flaggerv1.CanaryPhaseProgressing, webhook)
			if err != nil {
				// Gate declined: park the canary in the Waiting phase and halt this reconciliation.
				if canary.Status.Phase != flaggerv1.CanaryPhaseWaiting {
					if err := canaryController.SetStatusPhase(canary, flaggerv1.CanaryPhaseWaiting); err != nil {
						c.logger.With("canary", fmt.Sprintf("%s.%s", canary.Name, canary.Namespace)).Errorf("%v", err)
					}
					c.recordEventWarningf(canary, "Halt %s.%s advancement waiting for approval %s",
						canary.Name, canary.Namespace, webhook.Name)
					c.alert(canary, "Canary is waiting for approval.", false, flaggerv1.SeverityWarn)
				}
				return false
			} else {
				// Gate approved after a previous decline: reset the phase to Initializing
				// (instead of jumping straight to Progressing) and return early so the next
				// loop re-runs the scale-up before any traffic is shifted.
				if canary.Status.Phase == flaggerv1.CanaryPhaseWaiting {
					if err := canaryController.SetStatusPhase(canary, flaggerv1.CanaryPhaseInitializing); err != nil {
						c.logger.With("canary", fmt.Sprintf("%s.%s", canary.Name, canary.Namespace)).Errorf("%v", err)
						return false
					}
					c.recordEventInfof(canary, "Confirm-rollout check %s passed", webhook.Name)
					return false
				}
			}
		}
	}
	return true
}

