Bug fix: helm operator uninstall is not properly checking for existing release #3431

Merged
merged 4 commits into from
Feb 1, 2021

mikeshng
Contributor

@mikeshng mikeshng commented Jul 15, 2020

Signed-off-by: Mike Ng [email protected]

Description of the change:
The Helm History() function already returns an ErrReleaseNotFound error if no release with the given name exists.
See https://github.com/helm/helm/blob/v3.2.4/pkg/storage/storage.go#L148-L154

The current additional check, if len(h) == 0, can cause a problem: the release may actually exist, but if decoding it fails, Helm does not append it to the returned list.
See https://github.com/helm/helm/blob/v3.2.4/pkg/storage/driver/secrets.go#L125-L138
As a result, the Helm operator incorrectly determines that the release doesn't exist and skips the actual helm uninstall call.

Motivation for the change:
In some cases, the Helm operator doesn't call the equivalent of the helm uninstall command before deleting the CR.
Some resources are still removed by Kubernetes garbage collection, but this leaves behind (at least) cluster-scoped resources that don't have the CR as their owner and can't be garbage collected.

Checklist

If the pull request includes user-facing changes, extra documentation is required:

Member

@joelanford joelanford left a comment

In the past (with Helm 2 at least) different drivers returned different errors (or not) so the len(h) check was necessary to handle all the nuances of the different drivers.

I haven't checked recently, but it could be that all drivers now actually do use the same semantics when a release is not found.

Either way, I'm having trouble seeing how this would solve the problem. If there is an error decoding the release, won't History() return that error, thus meaning we'll bubble that error up before getting to the if len(h) == 0 check (which only happens when err is nil)?

/hold

Comment on lines 357 to 356
	if errors.Is(err, driver.ErrReleaseNotFound) {
		return nil, err
	}
	return nil, fmt.Errorf("failed to get release history: %w", err)
Member

I don't think this should be necessary since we're wrapping the error and using errors.Is() on the calling side to see if it was a driver.ErrReleaseNotFound error.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2020
@mikeshng
Contributor Author

@joelanford thanks for your quick reply.

> Either way, I'm having trouble seeing how this would solve the problem. If there is an error decoding the release, won't History() return that error, thus meaning we'll bubble that error up before getting to the if len(h) == 0 check (which only happens when err is nil)?

https://github.com/helm/helm/blob/v3.2.4/pkg/storage/driver/secrets.go#L133-L138
For the secret driver, it seems possible that Helm ignores the decode error and returns an empty array. Maybe this is intentional or maybe it's a bug? I will open a Helm issue and ask them.

Other drivers like memory https://github.com/helm/helm/blob/v3.2.4/pkg/storage/driver/memory.go#L145-L146
and sql https://github.com/helm/helm/blob/v3.2.4/pkg/storage/driver/sql.go#L323-L324 behave as you described by checking the length.

@joelanford
Member

> helm/[email protected]/pkg/storage/driver/secrets.go#L133-L138
> For the secret driver, it seems possible that Helm ignores the decode error and returns an empty array. Maybe this is intentional or maybe it's a bug? I will open a Helm issue and ask them.

Interesting. Of course the one I looked at (helm/[email protected]/pkg/storage/driver/cfgmaps.go#L79) returns an error on decode rather than just logging and continuing.

@mikeshng
Contributor Author

Opened the Helm issue helm/helm#8458

> Interesting. Of course the one I looked at (helm/[email protected]/pkg/storage/driver/cfgmaps.go#L79) returns an error on decode rather than just logging and continuing.

I think that's the Get function. The Query function that History() calls just logs the decode error, so it's possible to get an empty result.
https://github.com/helm/helm/blob/0ad800ef43d3b826f31a5ad8dfbb4fe05d143688/pkg/storage/driver/cfgmaps.go#L138-L147

So it seems that Get and Query behave differently as well. I am going to mention that in the Helm issue. Thanks.

@openshift-ci-robot openshift-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 3, 2020
@mikeshng
Contributor Author

@joelanford Hi Joe, could you please take a second look at this PR?

I've checked all the drivers (cfgmaps, sql, secret, memory) under https://github.com/helm/helm/tree/0ad800ef43d3b826f31a5ad8dfbb4fe05d143688/pkg/storage/driver
and they all already return ErrReleaseNotFound. Thanks.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 12, 2020
@camilamacedo86
Contributor

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 12, 2020
@camilamacedo86
Contributor

camilamacedo86 commented Nov 12, 2020

Hi @mikeshng,

I understand that this is still a valid PR. Am I right? If so, could you please add a changelog entry for it, rebase it onto master, and push again? @joelanford, since you were doing this review, wdyt?

@joelanford
Member

As far as I can tell, the current state of things is that:

  1. If there are no release objects (e.g. ConfigMaps or Secrets), Query() will return ErrReleaseNotFound
  2. If there are release objects, but they fail to decode, Query() will log error messages and return an empty slice with a nil error.

We're handling case 1 correctly right now.

If case 2 happens we fall into our if len(h) == 0 check, and return ErrReleaseNotFound, which then causes us to remove the finalizer without actually doing an uninstall, I think.

One thing I'm still confused on though. Are there any outstanding PRs that will have case 2 return the decode error rather than logging and ignoring? I got the impression that was happening in some of the comments.

So having said all that, I'm in agreement that this should fix it.

@mikeshng Before we merge, could you post the logs of runs before and after this fix. If I'm understanding this correctly, before should show successful finalizer removal and CR deletion (without an actual uninstall), and after should show an Uninstall attempt (but it isn't clear to me if uninstall will succeed or fail if the release can't be decoded).
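
The two cases listed above can be sketched as a small decision function. This is an illustrative model only: `errReleaseNotFound`, `treatedAsMissingOld`, and `treatedAsMissingNew` are hypothetical names, and the inputs stand for whatever History() returned. It shows why case 2 (an empty slice with a nil error) is misread as a missing release by the len(h) == 0 check.

```go
package main

import (
	"errors"
	"fmt"
)

// errReleaseNotFound is a hypothetical stand-in for driver.ErrReleaseNotFound.
var errReleaseNotFound = errors.New("release: not found")

// treatedAsMissingOld models the check under discussion: an empty history
// with a nil error is handled the same way as a not-found error.
func treatedAsMissingOld(h []string, err error) bool {
	return errors.Is(err, errReleaseNotFound) || (err == nil && len(h) == 0)
}

// treatedAsMissingNew models the proposed behaviour: only the not-found
// error means the release is missing.
func treatedAsMissingNew(h []string, err error) bool {
	return errors.Is(err, errReleaseNotFound)
}

func main() {
	// Case 1: no release objects exist at all.
	fmt.Println("case 1:", treatedAsMissingOld(nil, errReleaseNotFound),
		treatedAsMissingNew(nil, errReleaseNotFound))
	// Case 2: release objects exist but none could be decoded.
	fmt.Println("case 2:", treatedAsMissingOld([]string{}, nil),
		treatedAsMissingNew([]string{}, nil))
}
```

In this model, "treated as missing" corresponds to removing the finalizer without attempting an uninstall.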

@joelanford joelanford added this to the v1.3.0 milestone Nov 12, 2020
@joelanford joelanford self-assigned this Nov 12, 2020
@joelanford joelanford added kind/bug Categorizes issue or PR as related to a bug. language/helm Issue is related to a Helm operator project and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 12, 2020
@mikeshng
Contributor Author

@camilamacedo86 I rebased, added a change log and made an additional fix commit.

@joelanford nice call on insisting that I double-check the log output: I found a problem with the fix and added another commit. This was hard to reproduce, so I had to hack around the code a bit to ensure everything was fine up to the point where it performs the History() call. I added a sleep, which gave me enough time to modify the release field of the storage secret with some bad data before the uninstall call takes place. I hope you are OK with that and don't consider it invalid testing. Here are my logs:

before the fix:

{"level":"info","ts":1605298740.1398098,"logger":"helm.manager","msg":"ABOUT TO UNINSTALL SLEEPING 60 SECONDS"} # this is what I manually added to give me some time to edit
{"level":"info","ts":1605298800.1435435,"logger":"helm.controller","msg":"Release not found, removing finalizer","namespace":"default","name":"nginx-sample","apiVersion":"example.com/v1alpha1","kind":"Nginx","release":"nginx-sample"}

after the fix:

{"level":"error","ts":1605300603.5514286,"logger":"helm.controller","msg":"Failed to uninstall release","namespace":"default","name":"nginx-sample","apiVersion":"example.com/v1alpha1","kind":"Nginx","release":"nginx-sample","error":"no release provided","errorVerbose":"no release provided\nhelm.sh/helm/v3/pkg/action.init\n\tgo/pkg/mod/helm.sh/helm/[email protected]/pkg/action/action.go:58\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5414\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5409\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5409\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5409\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5409\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:190\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tgo/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/operator-framework/operator-sdk/internal/helm/controller.HelmOperatorReconciler.Reconcile\n\toperator-sdk/internal/helm/controller/reconcile.go:107\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tgo/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:244\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tgo/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\tgo/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tgo/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tgo/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tgo/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\tgo/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90"}
{"level":"error","ts":1605300603.5566306,"logger":"controller","msg":"Reconciler error","controller":"nginx-controller","name":"nginx-sample","namespace":"default","error":"no release provided","errorVerbose":"no release provided\nhelm.sh/helm/v3/pkg/action.init\n\tgo/pkg/mod/helm.sh/helm/[email protected]/pkg/action/action.go:58\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5414\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5409\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5409\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5409\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:5409\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:190\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tgo/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tgo/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:246\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tgo/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\tgo/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tgo/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tgo/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tgo/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\tgo/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90"}

The above is logged endlessly, so I am not sure if this is the desired behaviour. FYI, I was running via make run.
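
For readers who want to see why writing bad data into the stored release makes decoding fail, here is a rough sketch. It assumes a base64 + gzip + JSON encoding, which is only loosely modelled on how Helm stores releases; the `release` type and the `encode`/`decode` functions are hypothetical and the real Helm decoder may differ in its details.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
)

// release is a hypothetical, trimmed-down stand-in for a stored release record.
type release struct {
	Name    string `json:"name"`
	Version int    `json:"version"`
}

// encode serialises a release as JSON, gzips it, and base64-encodes the result.
func encode(r release) (string, error) {
	raw, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// decode reverses encode and reports which stage rejected the data.
func decode(data string) (release, error) {
	var r release
	zipped, err := base64.StdEncoding.DecodeString(data)
	if err != nil {
		return r, fmt.Errorf("base64: %w", err)
	}
	zr, err := gzip.NewReader(bytes.NewReader(zipped))
	if err != nil {
		return r, fmt.Errorf("gzip: %w", err)
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return r, fmt.Errorf("gunzip: %w", err)
	}
	if err := json.Unmarshal(raw, &r); err != nil {
		return r, fmt.Errorf("json: %w", err)
	}
	return r, nil
}

func main() {
	good, err := encode(release{Name: "nginx-sample", Version: 1})
	if err != nil {
		panic(err)
	}
	_, err = decode(good)
	fmt.Println("intact payload error:", err)
	_, err = decode("bad data") // simulates a manually edited field
	fmt.Println("corrupted payload error:", err)
}
```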

@mikeshng
Contributor Author

mikeshng commented Dec 7, 2020

I created a new PR for the possible nil pointer in uninstall: #4288

@mikeshng
Contributor Author

mikeshng commented Dec 7, 2020

rebased now that #4288 is merged.

@estroz estroz modified the milestones: v1.3.0, v1.5.0 Dec 18, 2020
Member

@joelanford joelanford left a comment

Just one more suggestion, and I think we can get this merged. Thanks for the patience on this!

@@ -352,17 +352,10 @@ func createJSONMergePatch(existingJSON, expectedJSON []byte) ([]byte, error) {
 // UninstallRelease performs a Helm release uninstall.
 func (m manager) UninstallRelease(ctx context.Context, opts ...UninstallOption) (*rpb.Release, error) {
 	// Get history of this release
-	h, err := m.storageBackend.History(m.releaseName)
-	if err != nil {
+	if _, err := m.storageBackend.History(m.releaseName); err != nil {
Member

I think we can remove this check as well. When we call uninstall.Run(), Helm will check history. If no release exists, it will return a wrapped driver.ErrReleaseNotFound, which the caller will be able to detect with errors.Is().

TL;DR: This check is duplicative since uninstall.Run() calls this already.

Contributor Author

Rebased. Done as suggested. Tested locally with a normal delete and it seems fine. Tested a delete with zero releases and it seems fine as well:

{"level":"info","ts":1611587596.7093048,"logger":"helm.controller","msg":"Release not found, removing finalizer","namespace":"default","name":"nginx-sample","apiVersion":"example.com/v1alpha1","kind":"Nginx","release":"nginx-sample"}

…dy taken care of by helm uninstall library call

Signed-off-by: Mike Ng <[email protected]>
Member

@joelanford joelanford left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 25, 2021
@estroz estroz merged commit 133beb2 into operator-framework:master Feb 1, 2021
@estroz
Member

estroz commented Feb 1, 2021

/cherry-pick v1.3.x

@openshift-cherrypick-robot

@estroz: new pull request created: #4457

In response to this:

/cherry-pick v1.3.x

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mikeshng mikeshng deleted the helm-uninstall-fix branch February 1, 2021 18:22
reinvantveer pushed a commit to reinvantveer/operator-sdk that referenced this pull request Feb 4, 2021
reinvantveer pushed a commit to reinvantveer/operator-sdk that referenced this pull request Feb 4, 2021
reinvantveer pushed a commit to reinvantveer/operator-sdk that referenced this pull request Feb 5, 2021
reinvantveer pushed a commit to reinvantveer/operator-sdk that referenced this pull request Feb 5, 2021
reinvantveer pushed a commit to reinvantveer/operator-sdk that referenced this pull request Feb 5, 2021
rearl-scwx pushed a commit to rearl-scwx/operator-sdk that referenced this pull request Feb 5, 2021
Labels
kind/bug Categorizes issue or PR as related to a bug. language/helm Issue is related to a Helm operator project lgtm Indicates that a PR is ready to be merged.
7 participants