Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change behavior when deleting default SriovNetworkNodePolicy and SriovOperatorConfig #566

Conversation

vasrem
Copy link
Contributor

@vasrem vasrem commented Dec 27, 2023

Opening this PR for discussion as it looks to me like the simplest way forward. I'm not sure how bad is it to delete the default CRs and if warning the user instead of blocking the user is sufficient.


This commit changes the behavior of the webhook when deleting the default SriovNetworkNodePolicy and SriovOperatorConfig. This change is needed so that we can smoothly uninstall Helm releases that have the webhooks enabled.

Without this change, when running helm uninstall, it fails because the webhook doesn't allow deletion of such resources mentioned above.

Since these Webhook Configurations resources are deployed via the controller itself and not Helm, it's much more difficult to handle the lifecycle of them via Helm in the current state. Instead, it's easier to send a warning that these resources should not be deleted.

Testing

Used Case 5 of #561 and managed to install and uninstall without any issue. helm uninstall log looks like this:

...
client.go:310: [debug] Starting delete for "default" SriovNetworkNodePolicy
W1227 13:20:14.885788  123736 warnings.go:70] default SriovNetworkNodePolicy shouldn't be deleted
client.go:310: [debug] Starting delete for "default" SriovOperatorConfig
W1227 13:20:15.078219  123736 warnings.go:70] default SriovOperatorConfig shouldn't be deleted
...

Verified

This commit was signed with the committer’s verified signature.
vasrem Vasilis Remmas
This commit changes the behavior of the webhook when deleting the
default SriovNetworkNodePolicy and SriovOperatorConfig. This change is
needed so that we can smoothly uninstall Helm releases that have the
webhooks enabled.

Without this change, when running `helm uninstall`, it fails because the
webhook doesn't allow deletion of such resources mentioned above.

Since these Webhook Configurations resources are deployed via the
controller itself and not Helm, it's much more difficult to handle the
lifecycle of them via Helm in the current state. Instead, it's easier to
send a warning that these resources should not be deleted.

Signed-off-by: Vasilis Remmas <[email protected]>
Copy link

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@coveralls
Copy link

Pull Request Test Coverage Report for Build 7338056271

  • 3 of 3 (100.0%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.03%) to 25.681%

Files with Coverage Reduction New Missed Lines %
pkg/webhook/validate.go 1 60.32%
Totals Coverage Status
Change from base Build 7280777027: 0.03%
Covered Lines: 2356
Relevant Lines: 9174

💛 - Coveralls

@adrianchiris
Copy link
Collaborator

adrianchiris commented Dec 31, 2023

in helm i see we only create the default policy so during un-install i expect just that to fail. cluster would remain with sriovOperatorConfig obj.

is that not the case ?

generally for both obj's the relevant controller will first re-create the default obj before proceeding with reconcile.
so i think its ok to just return warning and not block deletion. (thus avoiding the problem in helm)

as an alternative, we can create default obj via helm as post install hook and remove in pre delete hooks.

@adrianchiris adrianchiris requested a review from zeeke December 31, 2023 15:54
@SchSeba
Copy link
Collaborator

SchSeba commented Jan 1, 2024

Adding @zeeke as he is also working on some issues with the namespace been removed and objects gets stuck because the webhook pod is already deleted

Copy link
Member

@zeeke zeeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an helm expert, but I'm ok with deleting the default policy, as it gets recreated by the controller.

In theory, this problem (and others I'm having by deleting the operator's namespace) should have been addressed by the shutdown procedure:

https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/pkg/utils/shutdown.go

But it seems to not work for reasons that aren't very clear to me yet.

LGTM

@vasrem
Copy link
Contributor Author

vasrem commented Jan 2, 2024

@adrianchiris

in helm i see we only create the default policy so during un-install i expect just that to fail. cluster would remain with sriovOperatorConfig obj.

is that not the case ?

Yes that's the case. If Helm faces an error when deleting an object it manages, the helm release will keep lingering in the cluster in errored state. So, the uninstall is not clean and may leave objects lingering in the cluster.

generally for both obj's the relevant controller will first re-create the default obj before proceeding with reconcile. so i think its ok to just return warning and not block deletion. (thus avoiding the problem in helm)

👍

as an alternative, we can create default obj via helm as post install hook and remove in pre delete hooks.

I think the removal won't work in that case. We will need to bypass the webhook to do so, and this means either of the following (and there might be more):

  • Delete webhook configuration on helm uninstall (not so easy because the webhook config is deployed by the operator itself, it's not managed by helm)
  • Set failurePolicy: Ignore on the webhook configuration (again not so easy because the configuration is deployed by the operator itself)
  • Extend logic to bypass webhook when custom annotation is set (too complicated IMO)

I think what this PR does is the simplest we can do today without too much of an impact. Especially if the operator itself will recreate the objects anyway.

@adrianchiris
Copy link
Collaborator

I think what this PR does is the simplest we can do today without too much of an impact. Especially if the operator itself will recreate the objects anyway.

I agree, i think the creation of these default objects is something we would need to discuss, but for now lets not tie between the two.

Im OK with this change.

@adrianchiris adrianchiris merged commit 51946b0 into k8snetworkplumbingwg:master Jan 8, 2024
11 checks passed
@vasrem vasrem deleted the feature/change-default-config-delete-behavior branch January 8, 2024 11:57
rollandf added a commit to Mellanox/network-operator that referenced this pull request Jan 10, 2024

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Requires and is based on:
#709

This PR adds support for auto generated cert-manager certificates when
user enables the SR-IOV Network Operator Admission Controllers via the
Helm value
`sriov-network-operator.operator.admissionControllers.enabled`.

This PR won't work until:
* a new image of SR-IOV Network Operator is published and the following
value is updated
https://github.com/vasrem/network-operator/blob/f125b8a67772fc31af6ced0b24c9249531e5e542/deployment/network-operator/values.yaml#L146
* a new image of SR-IOV Network Operator Webhook is published and the
following value is updated
https://github.com/vasrem/network-operator/blob/f125b8a67772fc31af6ced0b24c9249531e5e542/deployment/network-operator/values.yaml#L152
* This is needed to ensure smooth `helm uninstall` operation. Depends on
k8snetworkplumbingwg/sriov-network-operator#566.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants