
Do not promote when not ready/not live and skipping analysis #362

Closed
giggio opened this issue Nov 7, 2019 · 5 comments · Fixed by #695
Labels
kind/bug Something isn't working

Comments

@giggio

giggio commented Nov 7, 2019

In situations where analysis is skipped and the pods are not live and are being restarted during a canary, the canary is incorrectly promoted. It should roll back.

This seems to happen exactly when pods are restarting. Somehow the canary is considered successful.

These are the canary events:

Events:
  Type     Reason  Age                    From     Message
  ----     ------  ----                   ----     -------
  Warning  Synced  9m37s (x4 over 10m)    flagger  Halt advancement temp-worker-primary.temp waiting for rollout to finish: 0 of 1 updated replicas are available
  Normal   Synced  9m16s                  flagger  Initialization done! temp-worker.temp
  Normal   Synced  7m17s                  flagger  New revision detected! Scaling up temp-worker.temp
  Warning  Synced  6m16s (x3 over 6m56s)  flagger  Halt advancement temp-worker.temp waiting for rollout to finish: 0 of 1 updated replicas are available
  Normal   Synced  5m56s                  flagger  Copying temp-worker.temp template spec to temp-worker-primary.temp
  Normal   Synced  5m54s                  flagger  Promotion completed! Canary analysis was skipped for temp-worker.temp

Notice the 4th event stating the replicas were not available.

The docs say:

In emergency cases, you may want to skip the analysis phase and ship changes directly to production. At any time you can set the spec.skipAnalysis: true. When skip analysis is enabled, Flagger checks if the canary deployment is healthy and promotes it without analysing it. If an analysis is underway, Flagger cancels it and runs the promotion.

But it is not clear what healthy means here. Is it ready? Live?
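For context, skipping analysis is a single field on the Canary spec. A minimal sketch of where it sits (only `spec.skipAnalysis: true` is confirmed by the docs quoted above; the surrounding fields are illustrative, use the names from this issue, and the `apiVersion` may differ by Flagger version):

```yaml
apiVersion: flagger.app/v1beta1   # may be v1alpha3 on older Flagger releases
kind: Canary
metadata:
  name: temp-worker
  namespace: temp
spec:
  # Workload being canaried (illustrative)
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: temp-worker
  # Promote without running the analysis phase (per the docs quoted above)
  skipAnalysis: true
```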

Edit: I removed the livenessProbe and left only the readinessProbe, and even when the pods are not restarted, the rollout continues and promotion happens.

@stefanprodan
Member

Can you please post the logs instead of the Kubernetes events? Event compaction drops some entries. Flagger logs can be fetched with kubectl logs deploy/flagger | jq .msg. The check Flagger does relies on the MinimumReplicasAvailable condition; here is the check: https://github.com/weaveworks/flagger/blob/master/pkg/canary/ready.go#L67
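For reference, the condition that check reads is part of the standard apps/v1 Deployment status. A deployment that passes it looks roughly like this (as seen in `kubectl get deploy -o yaml`; replica counts are illustrative):

```yaml
status:
  availableReplicas: 1
  updatedReplicas: 1
  conditions:
  # The check keys off this condition: Available must be True,
  # with reason MinimumReplicasAvailable.
  - type: Available
    status: "True"
    reason: MinimumReplicasAvailable
    message: Deployment has minimum availability.
```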

@giggio
Author

giggio commented Nov 22, 2019

I checked: the pods are available but not Ready. So the check is reporting correctly, but for all practical purposes the release is not working. This is the event from kubectl describe pod:

Events:
Type     Reason     Age                   From                                      Message
----     ------     ----                  ----                                      -------
Warning  Unhealthy  99s (x161 over 131m)  kubelet, aks-pool001-10943143-vmss000000  Readiness probe failed: cat: can't open '/tmp/healthy': No such file or directory     

Is there a way to configure that? To check for readiness instead of availability?

@stefanprodan
Member

Thanks for digging into this. Going to try to replicate the bug and work on a fix.

@stefanprodan stefanprodan added the kind/bug Something isn't working label Nov 24, 2019
@stefanprodan
Member

stefanprodan commented Nov 26, 2019

I can't reproduce this with podinfo without a livenessProbe and with a failing readinessProbe. The deployment's Available condition is false and Flagger doesn't promote the canary even when skipAnalysis is true.

My guess is that your app's readinessProbe is flapping: maybe it passes at startup and fails afterwards.

@giggio
Author

giggio commented Nov 26, 2019

The probe, which is a simple cat /tmp/healthy, starts with the file not existing; the file is only created later. So I don't think that is the case. The probe does have an initial delay, could that be it?

          readinessProbe:
            exec:
              command:
              - cat
              - /tmp/healthy
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 1
