OCPBUGS-37668: [release-4.15] operator: LeaderElectionReleaseOnCancel #974

Closed
wants to merge 3 commits into from


@zeeke zeeke commented Jul 29, 2024

backport of:

This ensures the lease is released by the operator as soon as it gracefully stops.

@openshift-ci openshift-ci bot requested review from fedepaol and wizhaoredhat July 29, 2024 16:35

zeeke commented Jul 29, 2024

/jira cherrypick OCPBUGS-23795

@openshift-ci-robot

@zeeke: Jira Issue OCPBUGS-23795 has been cloned as Jira Issue OCPBUGS-37668. Will retitle bug to link to clone.
/retitle OCPBUGS-37668: [release-4.15] operator: LeaderElectionReleaseOnCancel

In response to this:

/jira cherrypick OCPBUGS-23795

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot changed the title [release-4.15] operator: LeaderElectionReleaseOnCancel OCPBUGS-37668: [release-4.15] operator: LeaderElectionReleaseOnCancel Jul 29, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Jul 29, 2024
@openshift-ci-robot

@zeeke: This pull request references Jira Issue OCPBUGS-37668, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required"
  • expected dependent Jira Issue OCPBUGS-23795 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is ON_QA instead
  • expected dependent Jira Issue OCPBUGS-23795 to target a version in 4.16.0, 4.16.z, but it targets "4.16" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jul 29, 2024
zeeke added 3 commits July 29, 2024 18:47
When manually restarting the operator, the leader election may
take 5+ minutes to acquire the lease on startup:

```
I1205 16:06:02.101302       1 leaderelection.go:245] attempting to acquire leader lease openshift-sriov-network-operator/a56def2a.openshift.io...
...
I1205 16:08:40.133558       1 leaderelection.go:255] successfully acquired lease openshift-sriov-network-operator/a56def2a.openshift.io
```

The manager's option `LeaderElectionReleaseOnCancel` would solve this
problem, but it's not safe as the shutdown cleanup procedures
(inhibiting webhooks and removing finalizers) would run without any
leader guard.

This commit moves the LeaderElection mechanism from the namespaced
manager to a dedicated, no-op controller manager. This approach was
preferred over dealing with the LeaderElection API directly because:
- It leverages library code that has proven to be stable
- It records k8s Events about the Lease process
- The election process must come after setting up the health probe.
  Doing it manually would involve handling the healthz endpoint as well.

Signed-off-by: Andrea Panattoni <[email protected]>
Add CoordinationV1 to `test/util/clients.go` to make
assertions on `coordination.k8s.io/Lease` objects.

Add `OPERATOR_LEADER_ELECTION_ENABLE` environment variable
to `deploy/operator.yaml` to let users enable leader election
on the operator.

Signed-off-by: Andrea Panattoni <[email protected]>
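
As a rough illustration of how the `--leader-elect` argument from the manifest might be consumed, the sketch below parses it with Go's standard `flag` package. The flag name comes from the manifest diff in this PR; the `parseLeaderElect` helper and the default of `false` are assumptions made for this example, not the operator's actual code.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseLeaderElect is a hypothetical helper that reads the --leader-elect
// boolean flag from args. The default value (false) is an assumption.
func parseLeaderElect(args []string) (bool, error) {
	fs := flag.NewFlagSet("sriov-network-operator", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	enabled := fs.Bool("leader-elect", false, "enable leader election")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *enabled, nil
}

func main() {
	cases := [][]string{
		{"--leader-elect=true"},
		{"--leader-elect=false"},
		{}, // flag omitted: falls back to the assumed default
		{"--leader-elect=$OPERATOR_LEADER_ELECTION_ENABLE"}, // variable not substituted
	}
	for _, args := range cases {
		enabled, err := parseLeaderElect(args)
		fmt.Printf("args=%v enabled=%v err=%v\n", args, enabled, err)
	}
}
```

The last case shows why the variable has to be substituted before the manifest is applied: a literal `$OPERATOR_LEADER_ELECTION_ENABLE` is not a valid boolean, so parsing fails.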
@zeeke zeeke force-pushed the 415-OCPBUGS-23795 branch from 8ea344a to 20d88f4 Compare July 29, 2024 16:48

zeeke commented Jul 29, 2024

/jira backport release-4.14

@openshift-ci-robot

@zeeke: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.14

In response to this:

/jira backport release-4.14


@openshift-cherrypick-robot

@openshift-ci-robot: once the present PR merges, I will cherry-pick it on top of release-4.14 in a new PR and assign it to you.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


openshift-ci bot commented Jul 29, 2024

@zeeke: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-openstack-nfv | 20d88f4 | link | false | `/test e2e-openstack-nfv` |
| ci/prow/e2e-openstack-nfv-hwoffload | 20d88f4 | link | false | `/test e2e-openstack-nfv-hwoffload` |

Full PR test history. Your PR dashboard.


@@ -41,6 +41,8 @@ spec:
image: $SRIOV_NETWORK_OPERATOR_IMAGE
command:
- sriov-network-operator
args:
- --leader-elect=$OPERATOR_LEADER_ELECTION_ENABLE

do we need this in the bundle also?


great thanks!

btw I am almost sure I already asked that :P


SchSeba commented Aug 21, 2024

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 21, 2024

openshift-ci bot commented Aug 21, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SchSeba, zeeke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 21, 2024

zeeke commented Aug 21, 2024

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 21, 2024
@zeeke zeeke closed this Sep 4, 2024
@openshift-ci-robot

@zeeke: This pull request references Jira Issue OCPBUGS-37668. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.