
Do not promote when not ready/not live and skipping analysis #362

Closed
giggio opened this issue Nov 7, 2019 · 5 comments · Fixed by #695
Labels
kind/bug Something isn't working

Comments

@giggio

giggio commented Nov 7, 2019

In situations where analysis is skipped and the pods are not live and are being restarted during a canary, the canary is incorrectly promoted. It should roll back.

This seems to happen exactly when pods are restarting. Somehow the canary is considered successful.

These are the canary events:

Events:
  Type     Reason  Age                    From     Message
  ----     ------  ----                   ----     -------
  Warning  Synced  9m37s (x4 over 10m)    flagger  Halt advancement temp-worker-primary.temp waiting for rollout to finish: 0 of 1 updated replicas are available
  Normal   Synced  9m16s                  flagger  Initialization done! temp-worker.temp
  Normal   Synced  7m17s                  flagger  New revision detected! Scaling up temp-worker.temp
  Warning  Synced  6m16s (x3 over 6m56s)  flagger  Halt advancement temp-worker.temp waiting for rollout to finish: 0 of 1 updated replicas are available
  Normal   Synced  5m56s                  flagger  Copying temp-worker.temp template spec to temp-worker-primary.temp
  Normal   Synced  5m54s                  flagger  Promotion completed! Canary analysis was skipped for temp-worker.temp

Notice the 4th event stating the replicas were not available.

The docs say:

In emergency cases, you may want to skip the analysis phase and ship changes directly to production. At any time you can set the spec.skipAnalysis: true. When skip analysis is enabled, Flagger checks if the canary deployment is healthy and promotes it without analysing it. If an analysis is underway, Flagger cancels it and runs the promotion.

But it is not clear what healthy means here. Is it ready? Live?
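For context, skipping analysis is a single field on the Canary spec. A minimal sketch of where it sits (only `spec.skipAnalysis: true` is confirmed by the docs quoted above; the surrounding fields are illustrative, use the names from this issue, and the `apiVersion` may differ by Flagger version):

```yaml
apiVersion: flagger.app/v1beta1   # may be v1alpha3 on older Flagger releases
kind: Canary
metadata:
  name: temp-worker
  namespace: temp
spec:
  # Workload being canaried (illustrative)
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: temp-worker
  # Promote without running the analysis phase (per the docs quoted above)
  skipAnalysis: true
```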

Edit: I removed the livenessProbe and left only the readinessProbe, and even when the pods are not restarted, the rollout continues and promotion happens.

@stefanprodan
Member

Can you please post the logs instead of the Kubernetes events? Event compaction drops some entries. Flagger logs can be fetched with kubectl logs deploy/flagger | jq .msg. The check Flagger does relies on the MinimumReplicasAvailable condition; here is the check: https://github.com/weaveworks/flagger/blob/master/pkg/canary/ready.go#L67
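For reference, the condition that check reads is part of the standard apps/v1 Deployment status. A deployment that passes it looks roughly like this (as seen in `kubectl get deploy -o yaml`; replica counts are illustrative):

```yaml
status:
  availableReplicas: 1
  updatedReplicas: 1
  conditions:
  # The check keys off this condition: Available must be True,
  # with reason MinimumReplicasAvailable.
  - type: Available
    status: "True"
    reason: MinimumReplicasAvailable
    message: Deployment has minimum availability.
```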

@giggio
Author

giggio commented Nov 22, 2019

I checked: the pods are available but not Ready. So the check is reporting correctly, but for all practical purposes the release is not working. This is the event from kubectl describe pod:

Events:
Type     Reason     Age                   From                                      Message
----     ------     ----                  ----                                      -------
Warning  Unhealthy  99s (x161 over 131m)  kubelet, aks-pool001-10943143-vmss000000  Readiness probe failed: cat: can't open '/tmp/healthy': No such file or directory     

Is there a way to configure that? To check for readiness instead of availability?

@stefanprodan
Member

Thanks for digging into this. Going to try to replicate the bug and work on a fix.

@stefanprodan stefanprodan added the kind/bug Something isn't working label Nov 24, 2019
@stefanprodan
Member

stefanprodan commented Nov 26, 2019

I can't reproduce this with podinfo without a livenessProbe and with a failing readinessProbe. The deployment's Available condition is false and Flagger doesn't promote the canary even when skipAnalysis is true.

My guess is that your app's readinessProbe is flapping: maybe it passes at startup and fails afterwards.

@giggio
Author

giggio commented Nov 26, 2019

The probe, which is a simple cat /tmp/healthy, starts with the file not existing; the file is only created later. So I don't think that is the case. The probe does have an initial delay, could that be it?

          readinessProbe:
            exec:
              command:
              - cat
              - /tmp/healthy
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 1
