This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Concurrent sync attempt conflicts #437

Closed
seaneagan opened this issue May 29, 2020 · 5 comments · Fixed by #439
Labels
bug Something isn't working

Comments

@seaneagan
Contributor

seaneagan commented May 29, 2020

Describe the bug

If a resync occurs (i.e. --charts-sync-interval is triggered) while the operator is still handling a prior sync of a given HelmRelease (before observedGeneration is set), and the current state of the Helm release allows upgrades, then an upgrade of the Helm release is attempted, due to this:

// If the current state of the release does not allow us to safely
// upgrade, we skip.
if s := curRel.Info.Status; !s.AllowsUpgrade() {
    return SkipAction, nil, fmt.Errorf("status '%s' of release does not allow a safe upgrade", s.String())
}
// If this revision of the `HelmRelease` has not been synchronized
// yet, we attempt an upgrade.
if !status.HasSynced(hr) {
    return UpgradeAction, curRel, nil
}

Currently the bulk of the syncing process time is spent installing or upgrading releases, and if a resync occurs during that window it will detect that an upgrade is not currently allowed and skip it. But if the resync occurs during e.g. chart fetching or dry-run upgrades, then the unwanted upgrade would occur. With #415 this is triggered more easily, as helm tests take a significant amount of time.

To resolve this, there should be some (ideally atomic) status update made at the very beginning of a sync attempt which locks the HelmRelease, and a corresponding status update at the end of the sync attempt to unlock it. This could be done by moving the observedGeneration update to before the sync and simultaneously setting the Released condition to Unknown. Setting Released to True or False would then unlock it, and lastUpdateTime could be used to eventually expire the lock in case the operator crashed before releasing it. While changing the observedGeneration semantics, we may want to consider making it per-condition as per kubernetes/enhancements#1624. We could also consider aligning with the kstatus standardized conditions, although those may change soon based on the results of kubernetes/community#4521.
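
A rough sketch of that idea, in case it helps: the type shapes, helper names (tryLock/unlock), and the expiry duration below are all hypothetical and only illustrate the lock/unlock flow around a sync attempt, not the operator's actual API.

// Sketch only: a status-based lock around a sync attempt.
package main

import (
    "fmt"
    "time"
)

type ConditionStatus string

const (
    ConditionUnknown ConditionStatus = "Unknown"
    ConditionTrue    ConditionStatus = "True"
    ConditionFalse   ConditionStatus = "False"
)

type ReleasedCondition struct {
    Status         ConditionStatus
    LastUpdateTime time.Time
}

type HelmReleaseStatus struct {
    ObservedGeneration int64
    Released           ReleasedCondition
}

type HelmRelease struct {
    Generation int64
    Status     HelmReleaseStatus
}

// lockExpiry (hypothetical value) guards against an operator crash
// leaving the lock held forever.
const lockExpiry = 10 * time.Minute

// tryLock marks the HelmRelease as "sync in progress" by recording the
// observed generation up front and setting Released to Unknown. It refuses
// to proceed if an unexpired sync attempt already holds the lock.
func tryLock(hr *HelmRelease) error {
    c := hr.Status.Released
    if c.Status == ConditionUnknown && time.Since(c.LastUpdateTime) < lockExpiry {
        return fmt.Errorf("sync already in progress")
    }
    hr.Status.ObservedGeneration = hr.Generation
    hr.Status.Released = ReleasedCondition{Status: ConditionUnknown, LastUpdateTime: time.Now()}
    return nil
}

// unlock records the outcome of the sync attempt, releasing the lock.
func unlock(hr *HelmRelease, succeeded bool) {
    s := ConditionFalse
    if succeeded {
        s = ConditionTrue
    }
    hr.Status.Released = ReleasedCondition{Status: s, LastUpdateTime: time.Now()}
}

func main() {
    hr := &HelmRelease{Generation: 2}
    if err := tryLock(hr); err != nil {
        fmt.Println("skip:", err)
        return
    }
    // ... fetch chart, dry-run, install/upgrade ...
    unlock(hr, true)
    fmt.Printf("synced generation %d\n", hr.Status.ObservedGeneration)
}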

There could alternatively be an in-memory locking mechanism, but that assumes only one replica of the helm operator is ever running against a given HelmRelease. That is the recommendation, but it wouldn't fail gracefully if someone accidentally runs multiple replicas.
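
For comparison, a minimal sketch of such an in-memory lock, keyed by namespace/name; the names and structure are made up for illustration and only protect a single running replica.

// Sketch only: an in-memory per-HelmRelease lock.
package main

import (
    "fmt"
    "sync"
)

// releaseLocks tracks which releases currently have a sync in progress.
// A second operator replica would not see these locks.
type releaseLocks struct {
    mu    sync.Mutex
    locks map[string]bool
}

func newReleaseLocks() *releaseLocks {
    return &releaseLocks{locks: make(map[string]bool)}
}

// TryLock reports whether the caller acquired the lock for key.
func (r *releaseLocks) TryLock(key string) bool {
    r.mu.Lock()
    defer r.mu.Unlock()
    if r.locks[key] {
        return false
    }
    r.locks[key] = true
    return true
}

// Unlock releases the lock for key.
func (r *releaseLocks) Unlock(key string) {
    r.mu.Lock()
    defer r.mu.Unlock()
    delete(r.locks, key)
}

func main() {
    locks := newReleaseLocks()
    key := "default/podinfo" // hypothetical release
    if !locks.TryLock(key) {
        fmt.Println("skip: sync already in progress")
        return
    }
    defer locks.Unlock(key)
    // ... perform the sync ...
    fmt.Println("sync complete")
}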

To Reproduce

This race condition is difficult to reproduce, but something like the following should work:

Steps to reproduce the behaviour:

  1. Set --charts-sync-interval to e.g. 1s
  2. Trigger a chart version update of a release where the chart takes longer than 1s to download.

Expected behavior

Only one release update should occur.

Logs

Additional context

  • Helm Operator version: 1.0.1
  • Kubernetes version:
  • Git provider:
  • Helm repository provider:
seaneagan added the blocked needs validation and bug labels on May 29, 2020
@stefanprodan
Member

I think locking based on status conditions is error-prone, as we need to account for helm-op restarts and would have to delay reconciliation until the lock expires. I would opt for an in-process file lock, similar to https://github.com/fluxcd/kustomize-controller/blob/master/controllers/kustomization_controller.go#L175
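
Something like this, perhaps (a simplified, Unix-only sketch using flock(2) directly; the function and path naming is hypothetical and stands in for the filelock code used by kustomize-controller):

// Sketch only: an advisory per-release file lock.
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "syscall"
)

// lockRelease takes an exclusive, non-blocking lock on a per-release
// lock file and returns an unlock function, or an error if another
// sync already holds the lock.
func lockRelease(dir, namespace, name string) (func(), error) {
    path := filepath.Join(dir, fmt.Sprintf("%s-%s.lock", namespace, name))
    f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
    if err != nil {
        return nil, err
    }
    if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
        f.Close()
        return nil, fmt.Errorf("sync of %s/%s already in progress: %w", namespace, name, err)
    }
    return func() {
        syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
        f.Close()
    }, nil
}

func main() {
    unlock, err := lockRelease(os.TempDir(), "default", "podinfo")
    if err != nil {
        fmt.Println("skip:", err)
        return
    }
    defer unlock()
    // ... perform the sync ...
    fmt.Println("sync complete")
}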

stefanprodan removed the blocked needs validation label on May 29, 2020
@seaneagan
Contributor Author

seaneagan commented May 29, 2020

Makes sense. I assume this documentation is prominent enough to warn users against running multiple replicas, which an in-memory lock would not account for?

@stefanprodan can you assign me (unless you want to take this)?

edit: Looks like there's a decent amount of code behind that file lock; is there a common location it could be moved to, such as https://github.com/fluxcd/toolkit?

@seaneagan
Contributor Author

@stefanprodan
Member

stefanprodan commented May 29, 2020

Makes sense. I assume this documentation is prominent enough to warn users against running multiple replicas, which an in-memory lock would not account for?

I've removed the replicas setting from values.yaml; the docs need an update: https://github.com/fluxcd/helm-operator/blob/master/chart/helm-operator/templates/deployment.yaml#L11

Looks like there's a decent amount of code behind that file lock

The filelock code is taken from Go's internal packages; I would copy/paste it instead of making helm-op depend on the experimental toolkit.

Also I think this error message was accidentally copy/pasted from elsewhere

Good catch 💯

seaneagan added a commit to seaneagan/helm-operator that referenced this issue May 29, 2020
seaneagan added a commit to seaneagan/helm-operator that referenced this issue May 29, 2020
seaneagan added a commit to seaneagan/helm-operator that referenced this issue May 29, 2020
seaneagan added a commit to seaneagan/helm-operator that referenced this issue Jun 1, 2020
seaneagan added a commit to seaneagan/helm-operator that referenced this issue Jun 1, 2020
seaneagan added a commit to seaneagan/helm-operator that referenced this issue Jun 1, 2020
hiddeco pushed a commit to seaneagan/helm-operator that referenced this issue Jun 2, 2020
@stevehipwell

@seaneagan did your work on this cover the scenario where the operator is updated during a sync and the pod is terminated before the sync finishes, leaving the charts that were being updated in a pending state? If not, I'll open a separate issue.

hiddeco added a commit to fluxcd/helm-controller that referenced this issue Sep 25, 2020
As the observed generation is now pushed before syncing the resource
(fluxcd/helm-operator#437), and the controller runtime queue guarantees
there are no consistency issues (see: https://openkruise.io/en-us/blog/blog2.html).