Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-4355: Coordinated Leader Election #4356

Merged
merged 14 commits into from
Jun 7, 2024

Conversation

jpbetz
Copy link
Contributor

@jpbetz jpbetz commented Dec 5, 2023

  • One-line PR description: Coordinated Leader Election KEP

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 5, 2023
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 5, 2023
@jpbetz jpbetz force-pushed the coordinated-leader-election branch from 228a301 to b34941b Compare December 5, 2023 23:20
@jpbetz
Copy link
Contributor Author

jpbetz commented Dec 5, 2023

@logicalhan

@jpbetz jpbetz mentioned this pull request Dec 5, 2023
8 tasks
@jpbetz jpbetz force-pushed the coordinated-leader-election branch from b34941b to 942fced Compare December 5, 2023 23:59
@jpbetz jpbetz force-pushed the coordinated-leader-election branch from 942fced to 8b0d0c7 Compare December 6, 2023 00:37
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 6, 2023
@wojtek-t wojtek-t self-assigned this Dec 6, 2023
upgrade or rollback resulting in the version of the controller flip-flopping
between old and new.

### Goals

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this KEP because it also facilitates other potential optimizations like graceful and performant failover, which a component can utilize to warm up (e.g. start informers to pull k8s objects) before actually becoming the leader. I don't want to make this KEP more complex than it needs to be but if you think there is anything that can be covered for future extensions, it might be great.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. I was hoping this would be a good building block for potential future work around leadership transitions and controller scaling (breaking controllers into smaller groups or sharding controllers..). I'll add some of this to the future work section

@kikisdeliveryservice kikisdeliveryservice changed the title KEP: Coordinated Leader Election KEP-4355: Coordinated Leader Election Dec 7, 2023
@pacoxu
Copy link
Member

pacoxu commented Dec 22, 2023

/cc @neolit123

Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems pretty clean to me, thanks for the write up @jpbetz !

technically, sig-cluster-lifecycle does not own any of the changes that will happen in core, but we do have upgrades in the charter, so it makes sense to add us in stakeholders. +1

also, i pinged #cluster-api if anyone wishes to TAL.

metadata:
annotations:
coordination.k8s.io/binary-version: "1.29"
coordination.k8s.io/can-lead-leases: kube-system/sample-controller
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sample-controller could be replaced with something like some-custom-controller, for the sake of a better example here and in other YAML sections.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 21, 2024
Comment on lines 408 to 414
We will allow the Coordinated Leader Election controller to create a Lease
without a holder. This will have a short duration of 5s. During this time, the
`Lease` may be updated by a third party to the desired `spec.Strategy`. The
strategy will always default to `MinimumCompatibilityVersion`, but may be
changed during the initial step of a "leader-less" `Lease`. After the initial 5
seconds, the coordinated leader election controller will proceed with the
election using the data from `spec.Strategy`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This confuses me again. Why is the CLE special? Can't any party just set the stragegy, either through create or through update (like a human wanting to canary a component) ? Why such complicated 5s logic?

Copy link
Contributor

@sttts sttts Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e. these semantics:

  • if there are no candidate objects, strategy is empty and stays empty (legacy mode). Components do election themselves.
  • if there are candidates, CLE will default the strategy to MinimumCompatibilityVersion.
  • if there are candidates objects and they have opinions (or a human has), make them (or the human) change the strategy on create or via update. It will be sticky. CLE won't touch it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sttts You're list looks right to me. This is along the lines of what I was expecting. The key thing I was hoping to see was support for both create and update in the third bullet point (exactly as you have it), so that it's clear that any authorized writer in the system has the opportunity to set or change the strategy when desired, including before the controller is started.

Copy link
Member

@Jefftree Jefftree Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's possible and have updated it, but I want to just justify the 5s logic a bit.

  • if there are no candidate objects, strategy is empty and stays empty (legacy mode). Components do >election themselves.

Strategy is a purely CLE concept. If the Lease is not managed by CLE (in the case that no candidate objects are present), then yes it should not be set. However, setting it is essentially a noop since CLE will not operate on Lease objects that have no candidates. Furthermore, we have additional edge cases during upgrades where candidate objects do exist, but the holder is not elected by CLE and will not give up the Lease regardless.

  • if there are candidates, CLE will default the strategy to MinimumCompatibilityVersion.

Yes

if there are candidates objects and they have opinions (or a human has), make them (or the human) change the strategy on create or via update. It will be sticky. CLE won't touch it.

  • To make it "sticky", we may need to rethink how releasing the lease works. Currently, releasing a lease involves deleting the lease object and letting leader election determine the next lease. The strategy information will disappear since the resource is deleted. To solve this, we can persist leases such that releasing does not involve deletion, but may run into things like unable to determine orphaned leases. Alternatively, would we require users to update the Lease every time the leader changes? In theory, a lot of user initiated strategy changes (like canary) are temporary, and it could be good to reset to default after a certain period.
  • candidates are not responsible for creating the Lease, the CLE is responsible. It doesn't make sense for candidates to create the lease because they do not have the full view of other candidates, only the CLE controller maintains the full picture. Is this a point we need to revisit? Candidates will only be able to change the strategy via update, and CLE may have already gone through a round of leader selection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the deletion problem. Clearly deleting the lease is an anti-pattern then.

I don't understand the last point. If there is a way to create a leaderless lease, and an actor has an opinion beyond the default behaviour, that's the way to state that.

However, setting it is essentially a noop since CLE will not operate on Lease objects that have no candidates.

exactly. So won't the candidates beyond creating/updating the leaderless lease. They will just ensure their opinion is stated and CLE will pick it up when it is time.

Copy link
Contributor

@sttts sttts Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be on the same page: how does a leaderless lease look like in detail? From the former discussion I assumed we have a canonical, validating shape for that.

Giving up leadership would mean to update to that leaderless state, instead of deletion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, let's define it clearly. I was intuitively expecting it to look very close to an expired lease to minimize the number of distinct states..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giving up leadership would mean to update to that leaderless state, instead of deletion.

ack, yes this part is what I imagined a leaderless lease would look like and solves the deletion problem. I think the shape of the leaderless lease is not a point of contention and is exactly as described (looks like an expired lease, has all the fields of a Lease except the holder and renewTime). The main thing we're trying to resolve is the creator of the leaderless lease.

Correct me if I'm wrong but I think what you're saying is that a leaderless lease can be created by anyone (thus solving both create and update case for setting the strategy). I'm still under the impression that it may only be created by the CLE controller and feels like a mixup of responsibilities allowing for instance candidates to create a leaderless Lease that they do not claim.

My 5s idea was based on the idea that if only the CLE controller was able to create an empty lease, then specifically for "strategy on create" cases, components need time some time to set the default before the CLE controller starts assuming a default strategy and selecting leaders. This does get resolved if an interested candidate chooses to create the leaderless lease themselves to set the strategy. I'm curious if that was the suggestion or if there's more regarding the point "if there are candidates objects and they have opinions (or a human has), make them (or the human) change the strategy on create or via update. It will be sticky. CLE won't touch it."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5s or any other timing dependent mechanism feels like both a Kube anti-pattern and it seems to be brittle. 5s in pod startup time terms is super small.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spent some more time thinking about it, and yeah I think the approach that allows anyone to create an empty lease is sound. It should still be recommended to stick with the default "MinimumCompatibilityVersion" strategy, but if we already see that a lease has been created, the strategy used will be respected and used.

@Jefftree Jefftree force-pushed the coordinated-leader-election branch from 3627270 to dbd726b Compare June 5, 2024 07:36
Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Jefftree
Copy link
Member

Jefftree commented Jun 6, 2024

This is missing the PRR doc in https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-api-machinery Drop my name there :)

Thanks, added!

@jpbetz
Copy link
Contributor Author

jpbetz commented Jun 6, 2024

My main concerns with the strategy have been addressed. I'll leave LGTM to @sttts since he's done a TON of review on this one (huge thanks!)

@sttts
Copy link
Contributor

sttts commented Jun 7, 2024

Same. For alpha we are good here I think. Some details around API design will be discussed during API review (the bool comes to mind).

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 7, 2024
@k8s-ci-robot k8s-ci-robot merged commit 6269550 into kubernetes:master Jun 7, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jun 7, 2024
@Jefftree
Copy link
Member

Jefftree commented Jun 7, 2024

Thanks for the review Stefan!

I didn't expect this to merge, it looks like Joe is also a prod-readiness approver so this got automatically merged.
@soltysh, for prod readiness would it be fine if you commented here and I can open a follow up PR to address any concerns? Thanks!

@soltysh
Copy link
Contributor

soltysh commented Jun 7, 2024

I didn't expect this to merge, it looks like Joe is also a prod-readiness approver so this got automatically merged.
@soltysh, for prod readiness would it be fine if you commented here and I can open a follow up PR to address any concerns? Thanks!

Joe is PRR approver, so I guess that's how it auto-approved. I went through the doc, and in general it's ok. The two missing bits (non-blocking) are:

  1. At least beta requirements, I've noticed Stefan requested adding additional strategies during that phase, so it would be nice to call it out, so it doesn't get forgotten. Ideally, it would be also call out stable graduation requirements.
  2. Optional, but highly useful would be to add metrics, especially that I'm guessing you'll be building on the existing (leader election metrics)[https://github.com/kubernetes/client-go/blob/master/tools/leaderelection/metrics.go], are you planning to add more?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.