Concurrent sync attempt conflicts #437
Comments
I think locking based on status conditions is error-prone, as we need to account for helm-op restarts and we would have to delay the reconciliation until the lock expires. I would opt for an in-process file lock, similar to https://github.com/fluxcd/kustomize-controller/blob/master/controllers/kustomization_controller.go#L175
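For illustration, here is a minimal sketch of what such a per-release file lock could look like, assuming a Unix target and a writable lock directory. The kustomize-controller code linked above instead vendors a filelock package copied from Go's internals rather than using raw syscalls, and every name below (`acquireSyncLock`, the lock-file layout) is hypothetical:

```go
// Minimal sketch of a per-HelmRelease file lock, assuming a Unix target.
// acquireSyncLock and the lock-file naming are hypothetical, not the actual
// helm-operator or kustomize-controller implementation.
package release

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// acquireSyncLock takes an exclusive, non-blocking lock on a per-release file
// and returns a release func, or an error if another sync already holds it.
func acquireSyncLock(lockDir, namespace, name string) (func(), error) {
	path := filepath.Join(lockDir, fmt.Sprintf("%s-%s.lock", namespace, name))
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, fmt.Errorf("sync of %s/%s already in progress: %w", namespace, name, err)
	}
	return func() {
		syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
		f.Close()
	}, nil
}
```

In the sync path this would be used as `unlock, err := acquireSyncLock(dir, hr.Namespace, hr.Name)`, skipping or requeueing the HelmRelease when the error indicates a sync is already in flight, and calling `unlock()` once the sync finishes.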
Makes sense. I assume this documentation is prominent enough to warn users against running multiple replicas, which an in-memory lock would not account for? @stefanprodan can you assign me (unless you want to take this)? edit: Looks like there's a decent amount of code behind that file lock; is there a common location it could be moved to, such as https://github.com/fluxcd/toolkit?
Also, I think this error message was accidentally copy/pasted from elsewhere:
I've removed the replicas from values.yaml; the docs need an update: https://github.com/fluxcd/helm-operator/blob/master/chart/helm-operator/templates/deployment.yaml#L11
The filelock is taken from Go's internal packages; I would copy/paste it instead of making helm-op depend on the experimental toolkit.
Good catch 💯
@seaneagan did your work on this cover the scenario where the operator is updated during a sync and the pod is terminated before it has finished, leaving the charts that were being updated in a pending state? If not, I'll open a separate issue.
As the observed generation is now pushed before syncing the resource (fluxcd/helm-operator#437), and the controller-runtime queue guarantees there are no consistency issues (see: https://openkruise.io/en-us/blog/blog2.html).
Describe the bug
If a resync occurs (i.e. --charts-sync-interval is triggered) while the operator is still handling a prior sync of a given HelmRelease (before observedGeneration is set), and the current state of the helm release allows for upgrades, then an attempt to upgrade the helm release is made, due to this:
helm-operator/pkg/release/release.go, lines 208 to 218 at 389bd47
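For readers without the code at hand, here is a heavily simplified, hypothetical sketch of the decision described in this issue; the types and names are invented for illustration and are not the actual `release.go` logic:

```go
// Hypothetical, heavily simplified model; not the actual helm-operator types.
package release

type HelmRelease struct {
	Generation         int64
	ObservedGeneration int64
	ChartChanged       bool
}

type releaseState int

const (
	stateDeployed releaseState = iota
	statePendingInstallOrUpgrade
)

// shouldUpgrade sketches the check described in this issue: it looks only at
// the deployed Helm release state and the resource itself, so a resync that
// starts while a previous sync of the same HelmRelease is still running
// (e.g. fetching the chart or dry-running) can still decide to upgrade.
func shouldUpgrade(hr HelmRelease, state releaseState) bool {
	if state != stateDeployed {
		// A resync landing mid-install/upgrade is detected and skipped...
		return false
	}
	// ...but outside that window nothing records an in-flight sync.
	return hr.Generation != hr.ObservedGeneration || hr.ChartChanged
}
```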
Currently the bulk of the syncing process time is spent installing or upgrading releases, and if a resync occurs during this phase it will detect that an upgrade is not currently allowed and skip it. But if the resync occurs during e.g. chart fetching or dry-run upgrades, then the unwanted upgrade would occur. With #415 this is triggered more easily, as helm tests take a significant amount of time.
To resolve, there should be some (ideally atomic) status update made at the very beginning of a sync attempt which locks a HelmRelease, and a corresponding status update at the end of the sync attempt to unlock it. This could be moving the `observedGeneration` update to before a sync, and simultaneously setting the `Released` condition to unknown. Setting `Released` to true or false would unlock it, and the `lastUpdateTime` could be used to eventually expire the lock in case the operator crashed before releasing the lock or similar. While changing the `observedGeneration` semantics, we may want to consider making it per-condition as per kubernetes/enhancements#1624. We could also consider aligning with the kstatus standardized conditions, although this may change soon based on the results of kubernetes/community#4521.

There could alternatively be an in-memory locking mechanism, but that assumes only one replica of the helm-operator is ever running against a HelmRelease; that is the recommendation, but it wouldn't fail gracefully if someone accidentally runs multiple.
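A rough sketch of the status-based lock proposed above, with invented types and helpers standing in for the real HelmRelease status API; in practice both functions would need to be atomic status patches against the API server:

```go
package release

import (
	"fmt"
	"time"
)

// Hypothetical condition model; the real HelmRelease status types differ.
type ReleasedCondition struct {
	Status         string // "True", "False" or "Unknown"
	LastUpdateTime time.Time
}

type HelmReleaseStatus struct {
	ObservedGeneration int64
	Released           ReleasedCondition
}

// lockExpiry bounds how long a sync may appear in flight, so the lock
// eventually expires if the operator crashed before releasing it.
const lockExpiry = 10 * time.Minute

// tryLock follows the proposal above: push observedGeneration and flip the
// Released condition to Unknown before syncing. A concurrent attempt sees
// Unknown and backs off until lastUpdateTime is older than lockExpiry.
func tryLock(st *HelmReleaseStatus, generation int64, now time.Time) error {
	if st.Released.Status == "Unknown" && now.Sub(st.Released.LastUpdateTime) < lockExpiry {
		return fmt.Errorf("sync already in progress for generation %d", st.ObservedGeneration)
	}
	st.ObservedGeneration = generation
	st.Released = ReleasedCondition{Status: "Unknown", LastUpdateTime: now}
	return nil // in practice: an atomic status update against the API server
}

// unlock records the outcome, releasing the lock by setting Released to
// True or False.
func unlock(st *HelmReleaseStatus, succeeded bool, now time.Time) {
	status := "False"
	if succeeded {
		status = "True"
	}
	st.Released = ReleasedCondition{Status: status, LastUpdateTime: now}
}
```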
To Reproduce
Difficult to reproduce this race condition, but something like this should work:
Steps to reproduce the behaviour:
Expected behavior
Only one release update should occur.
Logs
Additional context