-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-4444: Routing Preference for Services #4445
KEP-4444: Routing Preference for Services #4445
Conversation
I will soon fill in the remainder of the sections but wanted to get the conversation started :) |
/cc @robscott |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work on this @gauravkghildiyal!
strategy. | ||
* `PreferEqualSpread`: Encourages an equal distribution of traffic across | ||
endpoints, potentially spanning multiple zones (or regions). | ||
* `PreferZone`: Encourages routing traffic to endpoints within the same zone as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this particular approach is aiming to be generic, I'd aim for something more like PreferClose
, so that the same value could eventually be expanded to included region or other considerations as necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Locality" or "ByLocality" ?
"Topology" or ByTopology or UseTopology ?
"Performance" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that different impls might be better or worse at this, do we need to convey that this is "strict" or "risky" ?
Is a load-aware implementation allowed to interpret this as "same zone unless overloaded" while a less intelligent impl would follow the letter of this description and do "same zone if any endpoint exists" ?
* **Limitations:** However, it lacked flexibility for implementations to make smart | ||
decisions on its own, like incorporating feedback-based routing optimizations | ||
(e.g., avoiding overloaded endpoints). The name `topologyKeys` might be | ||
perceived as overly restrictive, given the desire to incorporate non-topology | ||
factors into routing decisions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another key thing to mention here is the arbitrary ordering, lengths, and keys here made it very difficult to integrate with systems that might have a concept of more standard settings like preferring same zone routing. With topologyKeys you could theoretically request that an entirely arbitrary topology key be given preference over zone or region.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Incorporated.
* **Limitation:** While designed for intelligent routing, this mode offers less | ||
user control and can be less predictable. Some users prioritize | ||
predictability for performance optimization, even when accepting potential | ||
endpoint overload in certain situations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd mention that multiple users have filed issues asking why hints weren't applied or working for them. More recent versions added events to help explain what was happening, but there was still a lot of confusion around the feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Added.
Note that while the initial proposal of InternalTrafficPolicy proposed a | ||
PreferLocal policy, it was dropped later on. This meant that now | ||
TopologyAwareRouting in conjunction with InternalTrafficPolicy didn’t exactly | ||
allow users to express a much desired use case from topologyKeys which is | ||
"prefer node-local, failover to same zone, then route anywhere" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this is all 100% correct, I don't think this KEP really addresses that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True - it should be a non-goal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh yes I should have been more clear that this is still a non-goal (for now). Added that statement. Hope that's enough.
* **Requirement:** As a developer deploying applications across multiple | ||
Kubernetes environments, I want a consistent way to express my routing | ||
preferences. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you're going for here is to describe the maintainer of what I'd call a cluster addon. For example, cert-manager, prometheus or really anything on https://artifacthub.io/. Maybe rephrase this to the following:
* **Requirement:** As a developer deploying applications across multiple | |
Kubernetes environments, I want a consistent way to express my routing | |
preferences. | |
* **Requirement:** As a developer of a widely deployed cluster addon, I want to be | |
able to provide users an easy way to configure my Helm chart and/or deployment | |
configuration to enable same-zone routing behavior in a portable way that works | |
across many different environments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a bunch! This makes more sense.
|
||
kube-proxy based dataplane (along with EndpointSlice controller, within | ||
kube-controller-manager as the control plane) will support the three standard | ||
routing preferences (`Default`, `PreferEqualSpread`, `PreferZone`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe worth starting a thread specifically to talk about the names of the values here. Although each of these make sense, I think I'd prefer names that were even a bit more generic, maybe:
Default
Even
Close
This assumes that the name of the field already includes Preference
. My main goal here would be to move away from PreferZone
towards something that just allowed for Close
routing, something that could include region or other keys in the future.
|
||
### Standard Heuristic Implementation (kube-proxy dataplane) | ||
|
||
kube-proxy based dataplane (along with EndpointSlice controller, within |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kube-proxy based dataplane (along with EndpointSlice controller, within | |
kube-proxy (along with EndpointSlice controller, within |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
EndpointSlice controller (which is acting as the control plane) will update the | ||
Service status with the following conditions (inspired by Gateway API conditions) | ||
|
||
* RoutingPreferenceAccepted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the only relevant component that could reliably set status on a Service here is the EndpointSlice controller. Given that, I think these are the states we'd want to be able to represent:
- The EndpointSlice controller sees a value that is not domain-prefixed on a Service and does not know how to handle it
- The EndpointSlice controller sees a value that it recognizes but can't populate hints due to some other factor (common with the current proportional approach)
It is likely worth representing the positive polarity versions of those as well:
- The EndpointSlice controller sees a value that it recognizes and has populated hints or determined that it does not need to for the heuristic (Default and PreferEqualSpread would fall into the latter category as proposed)
I think the first item in both of those lists could be handled by this RoutingPreferenceAccepted
condition that was true or false. Maybe it should be set to "Unknown" for unrecognized/domain-prefixed values?
I think the second item (error populating hints) should probably only appear in the error case. I think your suggestion of RoutingPreferenceProgrammed
is reasonable here, but since this is very specific to hints, we could instead use something more specific like HintsPopulated
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for giving this a detailed thought, Rob! I mostly have the same thoughts as you on the matter, with the exception of the topic regarding RoutingPreferenceProgrammed
v/s HintsPopulated
as the condition name. I was thinking that "HintsPopulated" would be a reason under the RoutingPreferenceProgrammed
condition. Here's what I had in mind:
-
From the user's perspective, the top priority is whether their routing preference has been "somewhat" sent to the data plane.
RoutingPreferenceProgrammed
reflects this directly, while the use of hints may be thought of as an implementation detail (which we will convey through the condition's reason) -
Another reason might spur the debate of being generic vs YAGNI :). But given the previous advantage, in this case I was leaning towards being generic which allows non-hint based implementations to reuse the same condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that makes sense, I think that's a reasonable approach here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the primary reason we needed status on the previous iteration of topology was that there was so much confusion about when it was or was not applied. I'd recommend scrapping status from this stage of the KEP. We can choose to add it in the future if feedback suggests it would be helpful, but I'm hopeful that this will end up being closer to any other Service field where it's not really necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recognize that this is a first draft, but it looks pretty good (though I got a sneak peak, so most of my concerns are addressed).
Please put some energy into the operational and PRR parts, next.
- "@danwinship" | ||
- "@thockin" | ||
approvers: | ||
- TBD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danwinship - are you OK to approve on this one? Or co-approve with me?
Note that while the initial proposal of InternalTrafficPolicy proposed a | ||
PreferLocal policy, it was dropped later on. This meant that now | ||
TopologyAwareRouting in conjunction with InternalTrafficPolicy didn’t exactly | ||
allow users to express a much desired use case from topologyKeys which is | ||
"prefer node-local, failover to same zone, then route anywhere" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True - it should be a non-goal
* `Default`: Indicates no specific routing preference. The user delegates the | ||
routing decision to the implementation, allowing it to apply its best-effort | ||
strategy. | ||
* `PreferEqualSpread`: Encourages an equal distribution of traffic across |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming:
"Safe" - implies that the other options are less safe
"Spread" or "Spray" - short and to the point
"Random" - over-prescriptive?
other ideas?
strategy. | ||
* `PreferEqualSpread`: Encourages an equal distribution of traffic across | ||
endpoints, potentially spanning multiple zones (or regions). | ||
* `PreferZone`: Encourages routing traffic to endpoints within the same zone as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Locality" or "ByLocality" ?
"Topology" or ByTopology or UseTopology ?
"Performance" ?
strategy. | ||
* `PreferEqualSpread`: Encourages an equal distribution of traffic across | ||
endpoints, potentially spanning multiple zones (or regions). | ||
* `PreferZone`: Encourages routing traffic to endpoints within the same zone as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that different impls might be better or worse at this, do we need to convey that this is "strict" or "risky" ?
Is a load-aware implementation allowed to interpret this as "same zone unless overloaded" while a less intelligent impl would follow the letter of this description and do "same zone if any endpoint exists" ?
### Risks and Mitigations | ||
|
||
<!-- | ||
What are the risks of this proposal, and how do we mitigate? Think broadly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you think a bit about the confusion as users of the old adopt the new?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added something, hoping that is what you had in mind.
prevent conflicts when multiple control planes (like the default EndpointSlice | ||
controller) are present. | ||
|
||
### Choice of field name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming debate thread:
I don't love it. "routing" feels really low-level. We know this is about locality in practice, but maybe that is too concrete? I guess, in theory, someone could use different criteria.
readers: ideas welcome
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am very unconvinced by the name here. This is about how a client selects which peer to connect to, or perhaps how a proxy makes that choice on behalf of a client. As API clients, we don't know that the thing making that choice is kube-proxy
.
I expect lots of people to be misled by the name and for them to think it's about the choice of next hops for packets, because that is what routing very often means, and it's something that is relevant to the problem of getting traffic to backend pods.
You could even use (eg) segment routing to implement the routing preference this KEP describes.
For that reason, I urge a rethink on the naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How I'd start off: pick a good descriptive name for the feature gate, and see if that helps shape what we call the API field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do folks feel about trafficDistribution
with Prefer
keyword in the field values? (Assuming trafficDistributionPreference
might be too long) Still worried about possible confusion with "traffic" internal/ExternalTrafficPolicy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree that "routing" implies the wrong thing
with
Prefer
keyword in the field values
moving "prefer" from the field name to the values opens up the possibility of having Require
(or at least "stronger-than-Prefer") values in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trafficDistribution
is not horrible.
If iTP
and/or eTP
are "Local", then they take precedence, so this only kicks in when they are "Cluster". Does that help with naming?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it seems like an intersection of everyones ideas (including ideas from other comments) is that we have:
Field name: trafficDistribution
Values:
Default
Spread
(Don't think this one requiresPrefer
)PreferZone
(Also see discussion on this comment of why notPreferClose
)
Shall we seal the deal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final names can be argued in impl, but we will need to come back and update this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My tip: come up with really clear name for the feature gate, one that communicates what the feature is and does. Then see if that helps us name the API fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final names can be argued in impl, but we will need to come back and update this.
Sounds good.
My tip: come up with really clear name for the feature gate, one that communicates what the feature is and does. Then see if that helps us name the API fields.
Thanks for the suggestion, Tim! I've given it a shot but sadly haven't come up with anything better yet. I'll try to resolve the rest of the conversations first and get to this during the implementation -- will keep this in mind.
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md | ||
--> | ||
|
||
[ ] I/we understand the owners of the involved components may require updates to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotta check the box
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done :)
Hi @johnbelamaric -- would you be able to help with the PRR review for this KEP? The target is alpha release for the coming 1.30 version (Yes the freeze is just around the corner :) ) |
6a97ac3
to
02a82a4
Compare
/assign @johnbelamaric for PRR |
@gauravkghildiyal: GitHub didn't allow me to assign the following users: for, PRR. Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
merge and iterate - names are the big outstanding
* `Default`: Indicates no specific routing preference. The user delegates the | ||
routing decision to the implementation, allowing it to apply its best-effort | ||
strategy. | ||
* `PreferEqualSpread`: Encourages an equal distribution of traffic across |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also agree to drop "Prefer" from the values if it is in the field name
Go in to as much detail as necessary here. | ||
This might be a good place to talk about core concepts and how they relate. | ||
--> | ||
None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caveat: This being our third attempt at an API for this, there's a non-zero chance that we STILL didn't get it right, and will have to revisit again.
Caveat: Converting from the annotation to the field is not quite a mechanical change, which means users will have to think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added 1.
2 is part of Risks and Mitigations.
annotation might encounter differences in exact routing behavior: | ||
* The new field doesn't support a routing preference that is exactly similar | ||
to using `service.kubernetes.io/topology-mode=Auto` from the old annotation. | ||
* If both field and the annotation are set, the annotation will take precedence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...for some number of releases, until it is deprecated and removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
**New behaviour:** Irrespective of what the annotation | ||
`service.kubernetes.io/topology-aware-hints` or field `routingPreference` are | ||
set to (or even if they are not set at all), kube-proxy will always consider | ||
EndpointSlice hints. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"... kube-proxy will always consider EndpointSlice hints (assuming this feature gate is enabled)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added!
|
||
### Topology Aware Routing and InternalTrafficPolicy | ||
|
||
TopologyAwareRouting together with InternalTrafficPolicy were meant to be the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TopologyAwareRouting together with InternalTrafficPolicy were meant to be the | |
Topology aware routing together with `internalTrafficPolicy` for Services were meant to be the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed. Thanks!
successors of `topologyKeys` and allow implementations to be more flexible. | ||
|
||
* TopologyAwareRouting: | ||
* Exposes the annotation service.kubernetes.io/topology-mode. When this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Exposes the annotation service.kubernetes.io/topology-mode. When this | |
* Responds to the annotation `service.kubernetes.io/topology-mode`. When this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed! Thanks.
# The following PRR answers are required at alpha release | ||
# List the feature gate name and the components for which it must be enabled | ||
feature-gates: | ||
- name: ServiceRoutingPreference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't quite seem right. Can we find a better name?
* Verify that if both the annotation `service.kubernetes.io/topology-mode=Auto` | ||
and field `routingPreference=PreferZone` are configured, precedence is given |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Verify that if both the annotation `service.kubernetes.io/topology-mode=Auto` | |
and field `routingPreference=PreferZone` are configured, precedence is given | |
* Verify that if both the annotation `service.kubernetes.io/topology-mode: Auto` is set | |
and the field `routingPreference` is set to PreferZone, precedence is given |
prevent conflicts when multiple control planes (like the default EndpointSlice | ||
controller) are present. | ||
|
||
### Choice of field name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am very unconvinced by the name here. This is about how a client selects which peer to connect to, or perhaps how a proxy makes that choice on behalf of a client. As API clients, we don't know that the thing making that choice is kube-proxy
.
I expect lots of people to be misled by the name and for them to think it's about the choice of next hops for packets, because that is what routing very often means, and it's something that is relevant to the problem of getting traffic to backend pods.
You could even use (eg) segment routing to implement the routing preference this KEP describes.
For that reason, I urge a rethink on the naming.
prevent conflicts when multiple control planes (like the default EndpointSlice | ||
controller) are present. | ||
|
||
### Choice of field name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How I'd start off: pick a good descriptive name for the feature gate, and see if that helps shape what we call the API field.
--> | ||
|
||
## Alternatives | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure that folks are going to like it, but: I'd like to see an alternative discussed where the appropriate heuristic to use is captured in a new API kind.
Let's call this TargetPodHeuristic for now.
So a Service might look like:
apiVersion: v1
kind: Service
metadata:
name: example-service
namespace: default
spec:
heuristics:
- example-heuristic
selector:
app.kubernetes.io/name: example
ports:
- protocol: TCP
port: 24
targetPort: 24
and the TargetPodHeuristic for that:
apiVersion: example/v1alpha1
kind: TargetPodHeuristic
metadata:
name: example-heuristic
namespace: default
spec:
preferredTargets:
- weight: 23
matchLabelKeys:
- topology.kubernetes.io/zone # prefer a target in the same zone
- topology.kubernetes.io/region # prefer a target in the same region
# maybe zone names are only regionally unique
- weight: 11
matchLabelKeys:
- topology.kubernetes.io/region # prefer a target in the same region
avoidedTargets:
- weight: 42
matchTaints:
- key: "node.kubernetes.io/unreachable"
operator: "Exists"
effect: "NoSchedule"
- key: "node.kubernetes.io/unreachable"
operator: "Exists"
effect: "NoExecute"
To be clear: we can rule this out if we want to. That's absolutely fine. What I'm calling for is enough of a write-up about why it wouldn't fit / couldn't work / doesn't scale / etc. Let's not leave readers to guess this themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sftim -- That's a reasonable ask. Thanks. I've tried to add something. Hope that conveys the right ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This skews back to topologyKeys
which gave users a ton of control but made alternative implementations very difficult because it was over-precise.
It's a fine line we are walking between defining a full service mesh API and defining too little to be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added section on Pod Topology Spread Constraints. Fixed some typos.
|
||
The field will support the following initial values: | ||
|
||
* `Default`: Indicates no specific routing preference. The user delegates the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was/am in favour of keeping a non-nil Default (I believe kinda like we have for several other fields, including internalTrafficPolicy
).
As suggested, lets differ this to the implementation?
* `Spread`: Encourages an equal distribution of traffic across | ||
endpoints, potentially spanning multiple zones (or regions). | ||
* `Zone`: Encourages routing traffic to endpoints within the same zone as | ||
the client. If no endpoints are available within the zone, traffic should be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is completely true. I believe the "looose" definition extends to all heuristics, hence I've kept a separate common note (right towards the end of this section) describing this expectation. I hope that is sufficient.
* While it may seem redundant to populate the hints here since kube-proxy can | ||
already derive the zone hint from the endpoints zone (as they would be the | ||
same), we will still use this for implementation simply because of the reason | ||
that it’s easier to implement and provides a better design. Consider an | ||
alternative implementation where kube-proxy reads | ||
`routingPreference=Zone` field and then constructs the route rules | ||
accordingly. This means some extra logic needs to be baked into the kube-proxy | ||
which could have just as easily been implemented by an already existing | ||
extensibility mechanism (i.e. EndpointSlice hints) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tim and Rob have clarified the exact intent -- which was to prevent "special-casing" this logic in kube-proxy for PreferZone
when we already had a way to avoid special-casing.
As mentioned, in the future, we can discuss possibilities of what the right way to implement PreferNode
would be.
**New behaviour:** Irrespective of what the annotation | ||
`service.kubernetes.io/topology-aware-hints` or field `routingPreference` are | ||
set to (or even if they are not set at all), kube-proxy will always consider | ||
EndpointSlice hints (assuming this feature-gate is enabled). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dan, in addition to what Tim already said, if I'm understanding your emphasis on "regardless", I believe the following piece of text (copied from this KEP) might help clarify things.
Although this is not an explicit design goal, an implication of the above
implementation choice means that:
* The control plane (kube-controller-manager in this case) is only concerned
with the field `routingPreference` and populates EndpointSlice hints based
on its value
* The data plane (kube-proxy in this case) is only concerned with the Hints
populated in the EndpointSlice and the fields `internal/externalTrafficPolicy`
to make routing decisions.
* Neither the control plane nor the data plane looks at the others field.
strategy. | ||
* `Spread`: Encourages an equal distribution of traffic across | ||
endpoints, potentially spanning multiple zones (or regions). | ||
* `Zone`: Encourages routing traffic to endpoints within the same zone as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do understand what the intent is with your comment, though I was consciously striving for precision with the word Zone
at this point in time. Here's what I was thinking, hope it makes sense:
We have the following three things in play:
- At the moment, we don't really have a good feedback mechanism to detect overload (and this maybe the case for some time in the future)
- This in some sense means that the smaller the locality (i.e. Node < Zone < Region), the higher the risk of pods getting overloaded in that locality.
- In the very near future, we may be asked to introduce a
Node
level preference and most likely even then we may not have a mechanism to consider load based feedback.
Now, If we were to choose the heuristic name of Close
right now (which at the moment would mean "route traffic within the same zone") and then later on we want to introduce an equivalent to Local
(still without any feedback mechanism available to us), I don't think we would be able to say that, henceforth, Close
means "route traffic within the same node" because that is increasing the risk. Hence, we would end up with offering two distinct options of Close
(i.e. =Zone) and Local
which together tend to get confusing -- why did we have a dedicated keyword for Local
but not for Zone
? Even though the KEP explicitly allows implementors some freedom to change what a heuristic means, I don't think a change like this (which increases the risk) would be a proper interpretation of the following:
NOTE: Implementations reserve the right to refine the behavior associated with
any heuristic, including standard heuristics. This means the behavior enabled
by values such as `Default` or `Zone` might evolve over time, and some
evolutions might interpret the heuristic goals slightly differently. For
example, in the case of `Zone`, an implementation might initially route
traffic within the zone without considering endpoint overload, while a future
refinement could introduce feedback mechanisms to detect overload and route
traffic outside the zone when necessary, optimizing overall performance. The
decision of what constitutes an "improvement" remains at the discretion of the
implementation.
Having precise words like Zone
right now means that if/when we can get some load based feedback, that new topology might have the name Close
.
Goes without saying, we would not be having this discussion if we could somehow get traffic based feedback -- Close
would be the best and maybe only choice then :)
prevent conflicts when multiple control planes (like the default EndpointSlice | ||
controller) are present. | ||
|
||
### Choice of field name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it seems like an intersection of everyones ideas (including ideas from other comments) is that we have:
Field name: trafficDistribution
Values:
Default
Spread
(Don't think this one requiresPrefer
)PreferZone
(Also see discussion on this comment of why notPreferClose
)
Shall we seal the deal?
3627463
to
9a3afeb
Compare
9a3afeb
to
17f18ed
Compare
the client. If no endpoints are available within the zone, traffic should be | ||
routed to other zones. | ||
|
||
Implementations are strongly encouraged to support the standard values. While |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Strongly encoraged" != required.
Is it OK for an implementation to not support "Zone" at all? If so, should it accept the value and ignore it, or should it fail to program endpoints at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it OK for an implementation to not support "Zone" at all?
(Ignore this if the real focus was the follow up question and skip to the next answer)
Yes that should have been okay. But if we keep field names like Close
and Spread
, it feels like all implementations would be able to offer both, since in the worst case an implementation could interpret Close
as just routing anywhere since that is the "closest" for them (all this assuming that implementations can always implement Spread
).
If so, should it accept the value and ignore it, or should it fail to program endpoints at all?
My take here is that we're trying to position this new field as a "preference" rather than being "strict". If we were being "strict" and the implementation didn't respect a choice like Zone
that could mean semantic incorrectness in which case it would make total sense to fail fast and not program the endpoints at all. Because we aren't offering the strictness and rather positioning the field as a "preference", maybe we should not define this as part of an API contract whether the endpoints should be programmed or not -- i.e. this is an implementation detail.
As part of the API, we're suggesting that APIs "SHOULD" (meaning equivalent to rfc2119) use the defined status conditions to inform users about the level of acceptance of the preference. (Ref. "Condition usage by other implementations" section of the KEP)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK to merge and iterate on names. I think the biggest question is how optional the "strong suggestions" are and what implementations should do when they don't support the spirit of a standard name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even more feedback. I hope it's useful.
* `Zone`: Encourages routing traffic to endpoints within the same zone as | ||
the client. If no endpoints are available within the zone, traffic should be | ||
routed to other zones. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd expect to see topology.kubernetes.io/zone
here, not Zone
.
- it will help people making custom extensions infer the pattern
- it matches the topology label key
- we might one day support something else, eg
topology.kubernetes.io/planet
, and this more obviously leaves room for that. - it makes it more obvious that
Default
andSpread
are special cases.
We can use admission time checks to ensure that a Kubernetes namespaced value isn't used unless recognised - by this mechanism, not just as a registered label. Further checks might either reject or warn about private values (ie: nudge people to use a namespace).
That does mean adding inactive support for one minor release before we allow using a new value to allow for downgrades. Eg, Kubernetes v1.41 adds support for topology.kubernetes.io/planet
if set, but doesn't allow admitting it. Kubernetes v1.41 adds support for topology.kubernetes.io/planet
if set, and lets you actually set it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for giving this some thought, Tim! I feel we're coming close to a name with a choice like Close
(as per the discussion in #4445 (comment)) and maybe we won't have some of the problems then.
* **Strict Routing Guarantees:** The `routingPreference` field will not | ||
enforce deterministic routing paths. It serves as a mechanism for expressing | ||
hints and preferences that implementations can consider when making routing | ||
decisions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the guarantees aren't there, are we as a side effect making it harder for Kubernetes to offer those guarantees one day?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we're completely robbing away the possibility of defining stricter guarantees -- we could always define more stricter fields -- but yes any decision has it's cost.
Upon downgrade of EndpointSlice controller (within kube-controller-manager) and | ||
kube-proxy, the `routingPreference` will simply not be considered in any | ||
decisions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For downgrade, how would the API server handle the routingPreference
field that it finds on some persisted objects? Any concerns there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am hoping that following the standard guidelines from https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md#adding-unstable-features-to-stable-versions about adding a new field should cover all such choices -- hence haven't tried to redefine the details here. I will not try to hide the fact that I do need to re-read the guide to refresh my memory, but I'm planning to do it while we implement.
pod distribution across zones, maximizing the likelihood that the | ||
`routingPreference` can be satisfied and reduce (although not completely | ||
eliminate) chances of overload for a single zone. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an alternative available where we put the complexity into Gateway to ensure that there is a SomethingWeNeedToName (ie, not Service) per zone, and it does, and we're happy?
In this story:
- maybe a SomethingWeNeedToName controller manages the EndpointSlices, rather than kube-controller-manager
- maybe you can still refer to the Service if you want to, for example using environment variables - and the Cluster IP for that Service sends traffic somewhere reasonable.
- maybe we annotate Services when we want SomethingWeNeedToNames set up per zone.
That's not fleshed out; the key ask is: did we already consider putting the complexity [mostly] out of tree into Gateway, with kube-proxy learning how to integrate and other in-tree code remaining unaware?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like an interesting idea -- I really do appreciate you giving this some thought (including the other suggestion on documenting the alternative regarding "Granular Routing Controls"). Sadly, I may not have immediate capacity to fully think through what this would mean on a broader picture. Lets get to this separately from this KEP? I will try giving it some though later (unless there's more people who feel this needs immediate attention)
bb1819f
to
7a18a06
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[PRR Shadow]
PRR LGTM for Alpha.
This section must be completed when targeting beta to a release. | ||
--> | ||
|
||
_This section will be completed when targeting beta to a release._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's encouraged to fill this out even for alpha (consider that we hope users will try this out and will need to consider upgrade/rollout/rollback), even if it is only required for beta.
(this applies to all "required for beta" sections)
Ok, I see Tim has approved for the SIG. The PRR looks fine for alpha. Note Ben's comment though. /approve |
@gauravkghildiyal @danwinship @robscott and I chatted briefly yesterday - is this KEP updated to reflect that? |
I will update in an hour and mark those things unresolved. Will revert back. Thanks for the discussion. |
…rence for Services
@thockin, @danwinship, @robscott -- Ready for final review/approvals. Cleaned things up as per our latest discussion. Summarizing our discussion and the accompanying deltas to this doc:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gauravkghildiyal! A couple non-blocking suggestions for where we can lessen the scope of this KEP but otherwise LGTM.
EndpointSlice controller (which is acting as the control plane) will update the | ||
Service status with the following conditions (inspired by Gateway API conditions) | ||
|
||
* RoutingPreferenceAccepted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the primary reason we needed status on the previous iteration of topology was that there was so much confusion about when it was or was not applied. I'd recommend scrapping status from this stage of the KEP. We can choose to add it in the future if feedback suggests it would be helpful, but I'm hopeful that this will end up being closer to any other Service field where it's not really necessary.
Implementations may support additional routing heuristics using values of the | ||
form `<domain>/<heuristicName>`. Heuristics without a domain prefix will be | ||
reserved for potential future standardization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd personally recommend leaving this as an optional future enhancement to limit the scope of this KEP for now. We can always loosen validation to allow this in the future if there's demand for it.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gauravkghildiyal, johnbelamaric, robscott, thockin The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks @robscott -- incorporated the suggestions by adding an optional future expansions section. |
…g Preference for Services
7a45a43
to
4ef427b
Compare
Thanks @gauravkghildiyal! /lgtm |
/sig network
/assign @thockin
/cc @robscott, @aojea, @danwinship