This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Concurrent sync attempt conflicts #437

Closed
seaneagan opened this issue May 29, 2020 · 5 comments · Fixed by #439
Labels
bug Something isn't working

Comments

@seaneagan
Contributor

seaneagan commented May 29, 2020

Describe the bug

If a resync occurs (i.e. --charts-sync-interval is triggered) while the operator is still handling a prior sync of a given HelmRelease (before observedGeneration is set), and the current state of the Helm release allows upgrades, then an upgrade of the Helm release is attempted, due to this:

// If the current state of the release does not allow us to safely
// upgrade, we skip.
if s := curRel.Info.Status; !s.AllowsUpgrade() {
    return SkipAction, nil, fmt.Errorf("status '%s' of release does not allow a safe upgrade", s.String())
}
// If this revision of the `HelmRelease` has not been synchronized
// yet, we attempt an upgrade.
if !status.HasSynced(hr) {
    return UpgradeAction, curRel, nil
}

Currently the bulk of the syncing process time is spent installing or upgrading releases, and if a resync occurs during that window it will detect that an upgrade is not currently allowed and skip it. But if the resync occurs during e.g. chart fetching or dry-run upgrades, then the unwanted upgrade would occur. With #415 this is triggered more easily, as helm tests take a significant amount of time.

To resolve this, there should be some (ideally atomic) status update made at the very beginning of a sync attempt which locks the HelmRelease, and a corresponding status update at the end of the sync attempt to unlock it. This could be done by moving the observedGeneration update to before the sync and simultaneously setting the Released condition to Unknown. Setting Released to True or False would then unlock it, and lastUpdateTime could be used to eventually expire the lock in case the operator crashed before releasing it. While changing the observedGeneration semantics, we may want to consider making it per-condition as per kubernetes/enhancements#1624. We could also consider aligning with the kstatus standardized conditions, although those may change soon based on the results of kubernetes/community#4521.
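
A rough sketch of that idea, in case it helps: the type shapes, helper names (tryLock/unlock), and the expiry duration below are all hypothetical and only illustrate the lock/unlock flow around a sync attempt, not the operator's actual API.

// Sketch only: a status-based lock around a sync attempt.
package main

import (
    "fmt"
    "time"
)

type ConditionStatus string

const (
    ConditionUnknown ConditionStatus = "Unknown"
    ConditionTrue    ConditionStatus = "True"
    ConditionFalse   ConditionStatus = "False"
)

type ReleasedCondition struct {
    Status         ConditionStatus
    LastUpdateTime time.Time
}

type HelmReleaseStatus struct {
    ObservedGeneration int64
    Released           ReleasedCondition
}

type HelmRelease struct {
    Generation int64
    Status     HelmReleaseStatus
}

// lockExpiry (hypothetical value) guards against an operator crash
// leaving the lock held forever.
const lockExpiry = 10 * time.Minute

// tryLock marks the HelmRelease as "sync in progress" by recording the
// observed generation up front and setting Released to Unknown. It refuses
// to proceed if an unexpired sync attempt already holds the lock.
func tryLock(hr *HelmRelease) error {
    c := hr.Status.Released
    if c.Status == ConditionUnknown && time.Since(c.LastUpdateTime) < lockExpiry {
        return fmt.Errorf("sync already in progress")
    }
    hr.Status.ObservedGeneration = hr.Generation
    hr.Status.Released = ReleasedCondition{Status: ConditionUnknown, LastUpdateTime: time.Now()}
    return nil
}

// unlock records the outcome of the sync attempt, releasing the lock.
func unlock(hr *HelmRelease, succeeded bool) {
    s := ConditionFalse
    if succeeded {
        s = ConditionTrue
    }
    hr.Status.Released = ReleasedCondition{Status: s, LastUpdateTime: time.Now()}
}

func main() {
    hr := &HelmRelease{Generation: 2}
    if err := tryLock(hr); err != nil {
        fmt.Println("skip:", err)
        return
    }
    // ... fetch chart, dry-run, install/upgrade ...
    unlock(hr, true)
    fmt.Printf("synced generation %d\n", hr.Status.ObservedGeneration)
}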

There could alternatively be an in-memory locking mechanism, but that assumes only one replica of the helm operator is ever running against a given HelmRelease. That is the recommendation, but it wouldn't fail gracefully if someone accidentally runs multiple replicas.
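
For comparison, a minimal sketch of such an in-memory lock, keyed by namespace/name; the names and structure are made up for illustration and only protect a single running replica.

// Sketch only: an in-memory per-HelmRelease lock.
package main

import (
    "fmt"
    "sync"
)

// releaseLocks tracks which releases currently have a sync in progress.
// A second operator replica would not see these locks.
type releaseLocks struct {
    mu    sync.Mutex
    locks map[string]bool
}

func newReleaseLocks() *releaseLocks {
    return &releaseLocks{locks: make(map[string]bool)}
}

// TryLock reports whether the caller acquired the lock for key.
func (r *releaseLocks) TryLock(key string) bool {
    r.mu.Lock()
    defer r.mu.Unlock()
    if r.locks[key] {
        return false
    }
    r.locks[key] = true
    return true
}

// Unlock releases the lock for key.
func (r *releaseLocks) Unlock(key string) {
    r.mu.Lock()
    defer r.mu.Unlock()
    delete(r.locks, key)
}

func main() {
    locks := newReleaseLocks()
    key := "default/podinfo" // hypothetical release
    if !locks.TryLock(key) {
        fmt.Println("skip: sync already in progress")
        return
    }
    defer locks.Unlock(key)
    // ... perform the sync ...
    fmt.Println("sync complete")
}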

To Reproduce

This race condition is difficult to reproduce, but something like the following should work:

Steps to reproduce the behaviour:

  1. Set --charts-sync-interval to e.g. 1s
  2. Trigger a chart version update of a release where the chart takes longer than 1s to download.

Expected behavior

Only one release update should occur.

Logs

Additional context

  • Helm Operator version: 1.0.1
  • Kubernetes version:
  • Git provider:
  • Helm repository provider:
seaneagan added the blocked needs validation and bug labels on May 29, 2020
@stefanprodan
Member

I think locking based on status conditions is error-prone, as we need to account for helm-op restarts and would have to delay reconciliation until the lock expires. I would opt for an in-process file lock, similar to https://github.com/fluxcd/kustomize-controller/blob/master/controllers/kustomization_controller.go#L175
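
Something like this, perhaps (a simplified, Unix-only sketch using flock(2) directly; the function and path naming is hypothetical and stands in for the filelock code used by kustomize-controller):

// Sketch only: an advisory per-release file lock.
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "syscall"
)

// lockRelease takes an exclusive, non-blocking lock on a per-release
// lock file and returns an unlock function, or an error if another
// sync already holds the lock.
func lockRelease(dir, namespace, name string) (func(), error) {
    path := filepath.Join(dir, fmt.Sprintf("%s-%s.lock", namespace, name))
    f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
    if err != nil {
        return nil, err
    }
    if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
        f.Close()
        return nil, fmt.Errorf("sync of %s/%s already in progress: %w", namespace, name, err)
    }
    return func() {
        syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
        f.Close()
    }, nil
}

func main() {
    unlock, err := lockRelease(os.TempDir(), "default", "podinfo")
    if err != nil {
        fmt.Println("skip:", err)
        return
    }
    defer unlock()
    // ... perform the sync ...
    fmt.Println("sync complete")
}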

stefanprodan removed the blocked needs validation label on May 29, 2020
@seaneagan
Contributor Author

seaneagan commented May 29, 2020

Makes sense. I assume this documentation is prominent enough to warn users against running multiple replicas, which an in-memory lock would not account for?

@stefanprodan can you assign me (unless you want to take this)?

edit: Looks like there's a decent amount of code behind that file lock; is there a common location it could be moved to, such as https://github.com/fluxcd/toolkit?

@seaneagan
Contributor Author

@stefanprodan
Member

stefanprodan commented May 29, 2020

Makes sense. I assume this documentation is prominent enough to warn users against running multiple replicas, which an in-memory lock would not account for?

I've removed the replicas setting from values.yaml; the docs need an update: https://github.com/fluxcd/helm-operator/blob/master/chart/helm-operator/templates/deployment.yaml#L11

Looks like there's a decent amount of code behind that file lock

The filelock code is taken from Go's internal packages; I would copy/paste it instead of making helm-op depend on the experimental toolkit.

Also I think this error message was accidentally copy/pasted from elsewhere

Good catch 💯

seaneagan added a commit to seaneagan/helm-operator that referenced this issue May 29, 2020
seaneagan added a commit to seaneagan/helm-operator that referenced this issue May 29, 2020
seaneagan added a commit to seaneagan/helm-operator that referenced this issue May 29, 2020
seaneagan added a commit to seaneagan/helm-operator that referenced this issue Jun 1, 2020
seaneagan added a commit to seaneagan/helm-operator that referenced this issue Jun 1, 2020
seaneagan added a commit to seaneagan/helm-operator that referenced this issue Jun 1, 2020
hiddeco pushed a commit to seaneagan/helm-operator that referenced this issue Jun 2, 2020
@stevehipwell

@seaneagan did your work on this cover the scenario where the operator is updated during a sync and the pod is terminated before the sync finishes, leaving the charts that were being updated in a pending state? If not, I'll open a separate issue.

hiddeco added a commit to fluxcd/helm-controller that referenced this issue Sep 25, 2020
As the observed generation is now pushed before syncing the resource
(fluxcd/helm-operator#437), and the controller runtime queue guarantees
there are no consistency issues (see: https://openkruise.io/en-us/blog/blog2.html).