-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-4355: Coordinated Leader Election #4356
KEP-4355: Coordinated Leader Election #4356
Conversation
jpbetz
commented
Dec 5, 2023
- One-line PR description: Coordinated Leader Election KEP
- Issue link: Coordinated Leader Election #4355
- Other comments:
- Prototype: https://github.com/jpbetz/kubernetes/tree/leader-election-controller/demo
Skipping CI for Draft Pull Request. |
228a301
to
b34941b
Compare
b34941b
to
942fced
Compare
942fced
to
8b0d0c7
Compare
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
upgrade or rollback resulting in the version of the controller flip-flopping | ||
between old and new. | ||
|
||
### Goals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this KEP because it also facilitates other potential optimizations like graceful and performant failover, which a component can utilize to warm up (e.g. start informers to pull k8s objects) before actually becoming the leader. I don't want to make this KEP more complex than it needs to be but if you think there is anything that can be covered for future extensions, it might be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point. I was hoping this would be a good building block for potential future work around leadership transitions and controller scaling (breaking controllers into smaller groups or sharding controllers..). I'll add some of this to the future work section
/cc @neolit123 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems pretty clean to me, thanks for the write up @jpbetz !
technically, sig-cluster-lifecycle does not own any of the changes that will happen in core, but we do have upgrades in the charter, so it makes sense to add us in stakeholders. +1
also, i pinged #cluster-api if anyone wishes to TAL.
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
metadata: | ||
annotations: | ||
coordination.k8s.io/binary-version: "1.29" | ||
coordination.k8s.io/can-lead-leases: kube-system/sample-controller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sample-controller
could be replaced with something like some-custom-controller
, for the sake of a better example here and in other YAML sections.
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
keps/sig-api-machinery/4355-coordinated-leader-election/README.md
Outdated
Show resolved
Hide resolved
We will allow the Coordinated Leader Election controller to create a Lease | ||
without a holder. This will have a short duration of 5s. During this time, the | ||
`Lease` may be updated by a third party to the desired `spec.Strategy`. The | ||
strategy will always default to `MinimumCompatibilityVersion`, but may be | ||
changed during the initial step of a "leader-less" `Lease`. After the initial 5 | ||
seconds, the coordinated leader election controller will proceed with the | ||
election using the data from `spec.Strategy`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This confuses me again. Why is the CLE special? Can't any party just set the stragegy, either through create or through update (like a human wanting to canary a component) ? Why such complicated 5s logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I.e. these semantics:
- if there are no candidate objects,
strategy
is empty and stays empty (legacy mode). Components do election themselves. - if there are candidates, CLE will default the strategy to
MinimumCompatibilityVersion
. - if there are candidates objects and they have opinions (or a human has), make them (or the human) change the strategy on create or via update. It will be sticky. CLE won't touch it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sttts You're list looks right to me. This is along the lines of what I was expecting. The key thing I was hoping to see was support for both create and update in the third bullet point (exactly as you have it), so that it's clear that any authorized writer in the system has the opportunity to set or change the strategy when desired, including before the controller is started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's possible and have updated it, but I want to just justify the 5s logic a bit.
- if there are no candidate objects, strategy is empty and stays empty (legacy mode). Components do >election themselves.
Strategy is a purely CLE concept. If the Lease is not managed by CLE (in the case that no candidate objects are present), then yes it should not be set. However, setting it is essentially a noop since CLE will not operate on Lease
objects that have no candidates. Furthermore, we have additional edge cases during upgrades where candidate objects do exist, but the holder is not elected by CLE and will not give up the Lease regardless.
- if there are candidates, CLE will default the strategy to MinimumCompatibilityVersion.
Yes
if there are candidates objects and they have opinions (or a human has), make them (or the human) change the strategy on create or via update. It will be sticky. CLE won't touch it.
- To make it "sticky", we may need to rethink how releasing the lease works. Currently, releasing a lease involves deleting the lease object and letting leader election determine the next lease. The strategy information will disappear since the resource is deleted. To solve this, we can persist leases such that releasing does not involve deletion, but may run into things like unable to determine orphaned leases. Alternatively, would we require users to update the Lease every time the leader changes? In theory, a lot of user initiated strategy changes (like canary) are temporary, and it could be good to reset to default after a certain period.
- candidates are not responsible for creating the Lease, the CLE is responsible. It doesn't make sense for candidates to create the lease because they do not have the full view of other candidates, only the CLE controller maintains the full picture. Is this a point we need to revisit? Candidates will only be able to change the strategy via update, and CLE may have already gone through a round of leader selection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the deletion problem. Clearly deleting the lease is an anti-pattern then.
I don't understand the last point. If there is a way to create a leaderless lease, and an actor has an opinion beyond the default behaviour, that's the way to state that.
However, setting it is essentially a noop since CLE will not operate on Lease objects that have no candidates.
exactly. So won't the candidates beyond creating/updating the leaderless lease. They will just ensure their opinion is stated and CLE will pick it up when it is time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be on the same page: how does a leaderless lease look like in detail? From the former discussion I assumed we have a canonical, validating shape for that.
Giving up leadership would mean to update to that leaderless state, instead of deletion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, let's define it clearly. I was intuitively expecting it to look very close to an expired lease to minimize the number of distinct states..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Giving up leadership would mean to update to that leaderless state, instead of deletion.
ack, yes this part is what I imagined a leaderless lease would look like and solves the deletion problem. I think the shape of the leaderless lease is not a point of contention and is exactly as described (looks like an expired lease, has all the fields of a Lease except the holder and renewTime). The main thing we're trying to resolve is the creator of the leaderless lease.
Correct me if I'm wrong but I think what you're saying is that a leaderless lease can be created by anyone (thus solving both create and update case for setting the strategy). I'm still under the impression that it may only be created by the CLE controller and feels like a mixup of responsibilities allowing for instance candidates to create a leaderless Lease that they do not claim.
My 5s idea was based on the idea that if only the CLE controller was able to create an empty lease, then specifically for "strategy on create" cases, components need time some time to set the default before the CLE controller starts assuming a default strategy and selecting leaders. This does get resolved if an interested candidate chooses to create the leaderless lease themselves to set the strategy. I'm curious if that was the suggestion or if there's more regarding the point "if there are candidates objects and they have opinions (or a human has), make them (or the human) change the strategy on create or via update. It will be sticky. CLE won't touch it."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5s or any other timing dependent mechanism feels like both a Kube anti-pattern and it seems to be brittle. 5s in pod startup time terms is super small.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spent some more time thinking about it, and yeah I think the approach that allows anyone to create an empty lease is sound. It should still be recommended to stick with the default "MinimumCompatibilityVersion" strategy, but if we already see that a lease has been created, the strategy used will be respected and used.
3627270
to
dbd726b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is missing the PRR doc in https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-api-machinery Drop my name there :)
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jpbetz The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks, added! |
My main concerns with the strategy have been addressed. I'll leave LGTM to @sttts since he's done a TON of review on this one (huge thanks!) |
Same. For alpha we are good here I think. Some details around API design will be discussed during API review (the bool comes to mind). /lgtm |
Thanks for the review Stefan! I didn't expect this to merge, it looks like Joe is also a prod-readiness approver so this got automatically merged. |
Joe is PRR approver, so I guess that's how it auto-approved. I went through the doc, and in general it's ok. The two missing bits (non-blocking) are:
|