fix: remove digest check to never ignore helm uninstall errors #1024
b830e08 to dbf6ede
@@ -138,10 +138,7 @@ func (r *Uninstall) Reconcile(ctx context.Context, req *Request) error {
// Handle any error.
if err != nil {
r.failure(req, logBuf, err)
if req.Object.Status.History.Latest().Digest == cur.Digest { |
@cwrau the tests are failing, please run `make test` before opening a PR.
Ok, I thought I didn't understand the code well enough 😅
I don't understand why one wouldn't want to return an error in case of an error, even in this test case. The error is `timed out waiting for the condition`, why not return an error to be retried?
Or rather, why is this working in this case, but not in the case of #1021?
Hi, I shared my research about this in a separate comment below. I hope that makes this clear. We can still discuss any other doubts about it.
Closes fluxcd#1021 Signed-off-by: Chris Werner Rau <[email protected]>
Signed-off-by: Sunny <[email protected]>
Signed-off-by: Sunny <[email protected]>
Hi, I'll help take this forward so that we can include this in the upcoming release. As mentioned in #1021 (comment), the change seems harmless and we want to go ahead with it. I did some research and testing to understand the issue, and privately discussed the consequences of this change and how to handle them with other maintainers. I would like to share the details below.

A major concern about this change is that it can leave HelmReleases stuck in an uninstall retry loop if the cause of the uninstall failure never gets resolved on its own. To understand this better, and also to find out how the Helm CLI behaves in such scenarios, I did some manual testing. For a simple chart to test with, I added a delete hook in the podinfo chart, see the diff. With this, I was able to observe the difference in behavior of pre-delete and post-delete hooks.

In case of post-delete hooks, helm first deletes the resources and the helm storage and then runs the post-delete hook. If the hook fails, it results in the following error: [...]

As evident from the error, the uninstallation completed, and the release has been deleted completely. Uninstall can't be re-run anymore. Depending on the delete policy of the hook, if the hook resources are not deleted automatically, they have to be deleted manually.

In case of pre-delete hooks, helm first runs the hooks. If a hook fails, it blocks the whole uninstallation with the following error: [...]

The release enters uninstalling state, but is stuck due to the failing hook. Uninstallation can be re-run to re-run the hook. If the hook doesn't succeed, [...]

I tried the same scenarios with helm-controller along with the patch proposed in this change. It behaves the same as the CLI now. Uninstall error is not ignored, and the HelmRelease object remains until uninstall succeeds. For post-delete hook failure, uninstall fails initially, but because the helm storage gets deleted, the uninstall retry cannot find the release and results in a successful uninstall, deleting the HelmRelease object. An equivalent of [...]

Based on the above observations, I believe we understand the consequences of the change and what to tell the users to do if uninstall gets into a retry loop. I'll add a docs section about it under the "Working with HelmRelease" section in our spec docs.

Regarding the code change, it made me wonder about the [...]

I have also added a controller test for uninstall failure due to a failing delete hook, in addition to fixing the Uninstall reconciler unit tests. I'll push the changes as separate commits.

We can discuss further if there's anything incorrect in my analysis and reasoning above.
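The ordering described above can be illustrated with a small, self-contained Go model. This is only a sketch under stated assumptions, not helm-controller or Helm SDK code: `cluster`, `hookKind`, `uninstall`, and `reconcileDelete` are hypothetical names invented for illustration, and the "storage" is just a map that records whether a release exists.

```go
package main

import (
	"errors"
	"fmt"
)

// hookKind is a hypothetical label for when the delete hook runs.
type hookKind int

const (
	preDelete hookKind = iota
	postDelete
)

// cluster is a toy stand-in for the Helm storage: it only tracks
// whether a release record exists.
type cluster struct {
	releases map[string]bool
}

var (
	errHookFailed = errors.New("delete hook failed")
	errNotFound   = errors.New("release not found")
)

// uninstall mimics the ordering described in the comment above:
// a pre-delete hook runs before anything is removed, while a
// post-delete hook runs after the release record is already gone.
func (c *cluster) uninstall(name string, kind hookKind, hookOK bool) error {
	if !c.releases[name] {
		return errNotFound
	}
	if kind == preDelete && !hookOK {
		// The failing hook blocks the uninstall; the record stays.
		return errHookFailed
	}
	delete(c.releases, name)
	if kind == postDelete && !hookOK {
		// The record is gone, but the hook failure is still reported.
		return errHookFailed
	}
	return nil
}

// reconcileDelete models one deletion attempt by a controller that
// never ignores uninstall errors. It reports whether the owning
// object may now be removed. A missing release counts as success.
func reconcileDelete(c *cluster, name string, kind hookKind, hookOK bool) (bool, error) {
	err := c.uninstall(name, kind, hookOK)
	if errors.Is(err, errNotFound) {
		return true, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	for _, kind := range []hookKind{preDelete, postDelete} {
		c := &cluster{releases: map[string]bool{"podinfo": true}}
		for attempt := 1; attempt <= 2; attempt++ {
			done, err := reconcileDelete(c, "podinfo", kind, false)
			fmt.Printf("kind=%d attempt=%d done=%v err=%v\n", kind, attempt, done, err)
		}
	}
}
```

In this model a failing pre-delete hook keeps returning an error on every attempt, which corresponds to the retry loop discussed above, while a failing post-delete hook errors once and then succeeds on retry because the release can no longer be found.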
dbf6ede to 7fee60e
Amazing, I'm currently struggling with the tests 🙏
Uninstall is called outside of the AtomicRelease reconciliation. Whenever an object is marked to be deleted or the release target configuration has changed, Uninstall is called directly and the result of Uninstall is taken as it is by the caller, which is mostly the main reconciliation loop. Any error returned by Uninstall is critical to determine the result of it.
In case of UninstallRemediation, it is always called from the atomic reconciliation, which is not strict about the result of the action reconciler. After running an action reconciler, the atomic reconciler separately determines the state of the release to make a decision on the next action. It checks for any failure in the release due to the previous action by analyzing the state of the release in the storage. It doesn't need to depend on the returned error from the UninstallRemediation.
@darkowlzz I can understand (and am OK with) everything before the paragraph above. If you have time, could you please elaborate this part a bit more? Thanks! 🙏
In any case, LGTM!
Thanks @cwrau and @darkowlzz for working on this!
LGTM
Thanks @darkowlzz and @cwrau
Amazing! Thanks for the help @darkowlzz!
@matheuscscp HelmRelease reconciler has multiple sub-reconcilers for different helm actions that are managed by the atomic release reconciler. In [...] UninstallRemediation is one of the actions that is run by atomic reconciler. Uninstall is not run from the atomic reconciler. Both of these are used in different ways. If the release is marked to be deleted, the atomic reconciliation is not called, only Uninstall is called directly from the main reconciliation loop, refer [...]
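To make the distinction between the two call paths concrete, here is a minimal Go sketch. It is an illustration under assumptions, not the helm-controller implementation: `releaseState`, `action`, `runStrict`, `runAtomic`, and the returned step names are all hypothetical. Only the idea comes from the explanation above: the direct caller takes the returned error as it is, while the atomic reconciler decides from the state it observes afterwards.

```go
package main

import (
	"errors"
	"fmt"
)

// releaseState is a hypothetical summary of what storage says
// about a release after an action has run.
type releaseState string

const (
	stateDeployed    releaseState = "deployed"
	stateFailed      releaseState = "failed"
	stateUninstalled releaseState = "uninstalled"
)

// action is a toy action reconciler: it may change the observed
// state and may return an error.
type action func(state *releaseState) error

// runStrict models the direct Uninstall path: the caller takes the
// returned error as it is, so it must never be swallowed.
func runStrict(a action, state *releaseState) error {
	return a(state)
}

// runAtomic models the atomic reconciler: it runs the action, does
// not rely on the returned error, and picks the next step purely
// from the state it observes afterwards. Step names are invented.
func runAtomic(a action, state *releaseState) string {
	_ = a(state)
	switch *state {
	case stateFailed:
		return "remediate"
	case stateUninstalled:
		return "install"
	default:
		return "none"
	}
}

// failingUninstall is a toy action that records a failure in the
// observed state and also returns an error.
func failingUninstall(state *releaseState) error {
	*state = stateFailed
	return errors.New("timed out waiting for the condition")
}

func main() {
	s1 := stateDeployed
	fmt.Println("strict caller sees error:", runStrict(failingUninstall, &s1))

	s2 := stateDeployed
	fmt.Println("atomic caller next step:", runAtomic(failingUninstall, &s2))
}
```

The strict path only works if the action reconciler reports every failure through its return value, which is why ignoring an error in Uninstall hides the failure from the main reconciliation loop.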
Closes #1021