This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Fluxcd suddenly deletes all resources though git is unchanged #3148

Closed
acm-073 opened this issue Jun 23, 2020 · 11 comments
Labels: blocked-needs-validation (Issue is waiting to be validated before we can proceed), bug

Comments


acm-073 commented Jun 23, 2020

Describe the bug

Flux suddenly deleted all resources it managed, even though no change had been pushed to git.

This is a severe issue - has anybody else observed this?

To Reproduce

Since this is a sporadic issue we have not seen before, I can't describe how to reproduce it.

Expected behavior

If git remains unchanged, Flux should not delete any resources.

Logs

The log below shows the last successful sync and then the start of the delete action. Please observe that the git commit id is unchanged between the last apply and the delete action.

Jun 22, 2020 @ 21:24:27.690	ts=2020-06-22T19:24:27.689816419Z caller=loop.go:133 component=sync-loop event=refreshed url=ssh://[email protected]/mycompany/force-flux.git branch=preprod HEAD=1aaf36045c357db33d83d3c6970da40d28788924
Jun 22, 2020 @ 21:25:48.544	ts=2020-06-22T19:25:48.544604693Z caller=sync.go:73 component=daemon info="trying to sync git changes to the cluster" old=1aaf36045c357db33d83d3c6970da40d28788924 new=1aaf36045c357db33d83d3c6970da40d28788924
Jun 22, 2020 @ 21:25:54.343	ts=2020-06-22T19:25:54.343208824Z caller=sync.go:539 method=Sync cmd=apply args= count=27
Jun 22, 2020 @ 21:25:55.071	ts=2020-06-22T19:25:55.070844159Z caller=sync.go:605 method=Sync cmd="kubectl apply -f -" took=727.543635ms err=null output="namespace/flux-system unchanged\nnamespace/flux-tiller unchanged\nnamespace/storage-operator unchanged\nnamespace/vault unchanged\nclusterrole.rbac.authorization.k8s.io/azure-storage-operator configured\nserviceaccount/azure-storage-operator unchanged\ncustomresourcedefinition.apiextensions.k8s.io/azurestorages.k8s.craft.supply unchanged\nserviceaccount/helm-operator unchanged\nclusterrole.rbac.authorization.k8s.io/helm-operator unchanged\ncustomresourcedefinition.apiextensions.k8s.io/helmreleases.helm.fluxcd.io configured\nserviceaccount/tiller unchanged\nservice/tiller-deploy unchanged\nclusterrolebinding.rbac.authorization.k8s.io/azure-storage-operator unchanged\nsecret/ca-secret unchanged\nclusterrolebinding.rbac.authorization.k8s.io/flux-tiller unchanged\nclusterrolebinding.rbac.authorization.k8s.io/helm-operator unchanged\nsecret/helm-repositories-992k4745f2 unchanged\ndeployment.apps/azure-storage-operator configured\ndeployment.apps/helm-operator unchanged\ndeployment.apps/tiller-deploy configured\nexternalsecret.kubernetes-client.io/azure-service-principal unchanged\nexternalsecret.kubernetes-client.io/azure-service-principal unchanged\nhelmrelease.helm.fluxcd.io/external-secrets unchanged\nhelmrelease.helm.fluxcd.io/prometheus-blackbox-exporter unchanged\npoddisruptionbudget.policy/tiller-deploy unchanged\nhelmrelease.helm.fluxcd.io/vault unchanged\nazurestorage.k8s.craft.supply/vault-azurestorage unchanged"
Jun 22, 2020 @ 21:29:29.074	ts=2020-06-22T19:29:29.074707482Z caller=loop.go:133 component=sync-loop event=refreshed url=ssh://[email protected]/mycompany/force-flux.git branch=preprod HEAD=1aaf36045c357db33d83d3c6970da40d28788924
Jun 22, 2020 @ 21:30:55.997	ts=2020-06-22T19:30:55.997407382Z caller=sync.go:73 component=daemon info="trying to sync git changes to the cluster" old=1aaf36045c357db33d83d3c6970da40d28788924 new=1aaf36045c357db33d83d3c6970da40d28788924
Jun 22, 2020 @ 21:31:00.187	ts=2020-06-22T19:31:00.17682842Z caller=sync.go:159 info="cluster resource not in resources to be synced; deleting" dry-run=false resource=<cluster>:clusterrolebinding/flux-tiller
Jun 22, 2020 @ 21:31:00.187	ts=2020-06-22T19:31:00.187009366Z caller=sync.go:159 info="cluster resource not in resources to be synced; deleting" dry-run=false resource=flux-system:helmrelease/external-secrets
Jun 22, 2020 @ 21:31:00.187	ts=2020-06-22T19:31:00.187034366Z caller=sync.go:159 info="cluster resource not in resources to be synced; deleting" dry-run=false resource=<cluster>:customresourcedefinition/helmreleases.helm.fluxcd.io
Jun 22, 2020 @ 21:31:00.187	ts=2020-06-22T19:31:00.187052367Z caller=sync.go:159 info="cluster resource not in resources to be synced; deleting" dry-run=false resource=vault:azurestorage/vault-azurestorage
Jun 22, 2020 @ 21:31:00.187	ts=2020-06-22T19:31:00.187070567Z caller=sync.go:159 info="cluster resource not in resources to be synced; deleting" dry-run=false resource=<cluster>:customresourcedefinition/azurestorages.k8s.craft.supply
Jun 22, 2020 @ 21:31:00.187	ts=2020-06-22T19:31:00.187092067Z caller=sync.go:159 info="cluster resource not in resources to be synced; deleting" dry-run=false resource=vault:externalsecret/azure-service-principal
[...]
Jun 22, 2020 @ 21:31:00.187	ts=2020-06-22T19:31:00.187523469Z caller=sync.go:539 method=Sync cmd=delete args= count=27

Additional context

  • Flux version: 1.19.0
  • Kubernetes version: 1.15.7 (Azure AKS)
  • Git provider: bitbucket
  • Container registry provider: n/a
acm-073 added the blocked-needs-validation and bug labels on Jun 23, 2020

acm-073 commented Jun 30, 2020

The issue just became a bit clearer (though no less scary). The very same thing happened on a different cluster at almost the same point in time. The one thing the two occurrences had in common was that they synced from the same git repository.
We suspect that at that time the Bitbucket Cloud git repository was (partly) unavailable, resulting in a broken checkout and consequently empty output from the command in .flux.yaml. fluxd then just did what it was supposed to do and deleted everything.

The question is, how can we safeguard against issues like this in the future? Any hints are welcome.


acm-073 commented Jul 8, 2020

OK... we tracked it down, and since it was such a pain for us, I'd like to share our findings here.

The root cause was in the generator command in .flux.yaml:

   generators:
     # use kustomize as manifest generator and replace ENV variables (i.e. ${VAR}) as post processing step via envsubst
     - command: kustomize build . | envsubst '${CLUSTER} ${STAGE} ${LINE}'

The problem with pipelines is that the overall exit code is, by default, the exit code of the last command. Even if kustomize fails due to invalid input or a failed/interrupted git checkout, envsubst will still succeed, so the pipeline exits with status 0 while producing empty output. With Flux garbage collection enabled, that empty output leads to the deletion of all managed resources.
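
A quick way to see the difference (a generic shell illustration, not from our cluster): without pipefail the pipeline's exit status hides the failure of the first command; with pipefail it does not.

    # Without pipefail: the pipeline's status is that of the last command,
    # so a failing first command is silently masked.
    bash -c 'false | cat; echo "exit=$?"'                    # prints exit=0

    # With pipefail: any failing command fails the whole pipeline.
    bash -c 'set -o pipefail; false | cat; echo "exit=$?"'   # prints exit=1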

In order to fix it, we changed our generator command as follows:

   /bin/bash -c 'set -o pipefail; kustomize build . | envsubst \${CLUSTER},\${STAGE},\${LINE}'
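
Slotted back into the generators list from the snippet above, the corrected entry looks roughly like this (quoting/escaping of the envsubst variable list may need adjusting for your shell and YAML setup):

   generators:
     # run the pipeline under bash with pipefail so a kustomize failure
     # fails the whole generator instead of producing empty output
     - command: /bin/bash -c 'set -o pipefail; kustomize build . | envsubst \${CLUSTER},\${STAGE},\${LINE}'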

kingdonb (Member) commented

Ouch.

I'm going to keep this open for a while, since envsubst is a current-events issue in both Flux v1 and v2 these days (#3407 is planned for inclusion in the next release of Flux v1, which will be 1.22.0).

I don't want to lose track of that resolution note, @acm-073 (thanks for boiling it down!). Maybe there is a doc mention we can include prominently in the release notes, so new adopters of envsubst will not have to suffer through the same issue.

set -o pipefail is a good solution for this.

kingdonb self-assigned this on Feb 22, 2021
adusumillipraveen (Contributor) commented

We hit this issue over the weekend, but we do not use pipes / envsubst: https://github.com/hmcts/cnp-flux-config/blob/master/k8s/preview/cluster-00-overlay/.flux.yaml#L4. Thankfully it occurred on our development clusters, but all our admin apps, such as the ingress controller, were deleted as well. Any thoughts on what could have caused this and what can be done to avoid it?


pierluigilenoci commented Apr 26, 2021

We had the same issue.

kingdonb (Member) commented

What version of Flux are you both running?

@pierluigilenoci @adusumillipraveen

adusumillipraveen (Contributor) commented

> What version of Flux are you both running?
>
> @pierluigilenoci @adusumillipraveen

@kingdonb We are using Flux 1.20.2.

pierluigilenoci commented

For me it was 1.21.0, and it happened with the syncGarbageCollection option set to true.
I disabled the option and it has not happened since.
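
For reference, garbage collection in Flux v1 is opt-in via the fluxd --sync-garbage-collection flag; with the Helm chart it is toggled through values roughly like the following (exact key names may differ between chart versions, so treat this as a sketch):

   # values.yaml for the Flux v1 Helm chart (assumed layout)
   syncGarbageCollection:
     enabled: false   # disabling GC stops mass deletions, at the cost of no pruning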


kingdonb commented Apr 27, 2021

I don't have any concrete reason to suggest this has been fixed in the latest version of Flux v1 (at this time, 1.22.2).

However, I would still suggest upgrading, as we cannot very easily support older versions of Flux than the latest maintained release.

I can't imagine a higher-priority issue, and I'm not certain that all affected users have been able to isolate and resolve it on their own clusters. I would love to hear a suggestion that pinpoints the source, since differing reports are coming in about which features are enabled on the affected deployments.

I have not yet seen this issue in action on any of my own repos, nor any clear suggestion of how to reliably reproduce it.

Are all affected users using kustomize with generators? If not envsubst, are you using kustomize build with a pipe where the pipefail option hasn't already been enabled in the command, as explained by an earlier commenter?

adusumillipraveen (Contributor) commented

> However, I would still suggest upgrading, as we cannot very easily support older versions of Flux than the latest maintained release.

We have now upgraded to the latest version. We will report back if we hit any such issues again.

kingdonb (Member) commented

At this time, we hope you have already begun migrating away from Flux v1 / legacy to the current version of Flux.

Note that Flux v1 remains in maintenance mode; bugs in Flux v1 can be addressed as long as it remains in maintenance, according to the plan laid out in the Flux Migration Timetable. 👍

Thanks for using Flux.
