Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-4444: Routing Preference for Services #4445

Conversation

gauravkghildiyal
Copy link
Member

  • One-line PR description: Adding new KEP-4444 proposing Routing Preference for Services
  • Other comments: None

/sig network
/assign @thockin
/cc @robscott, @aojea, @danwinship

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 26, 2024
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jan 26, 2024
@k8s-ci-robot k8s-ci-robot requested a review from shaneutt January 26, 2024 23:05
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 26, 2024
@gauravkghildiyal
Copy link
Member Author

I will soon fill in the remainder of the sections but wanted to get the conversation started :)

@gauravkghildiyal
Copy link
Member Author

/cc @robscott
/cc @aojea
/cc @danwinship

Copy link
Member

@robscott robscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work on this @gauravkghildiyal!

strategy.
* `PreferEqualSpread`: Encourages an equal distribution of traffic across
endpoints, potentially spanning multiple zones (or regions).
* `PreferZone`: Encourages routing traffic to endpoints within the same zone as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this particular approach is aiming to be generic, I'd aim for something more like PreferClose, so that the same value could eventually be expanded to included region or other considerations as necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Locality" or "ByLocality" ?

"Topology" or ByTopology or UseTopology ?

"Performance" ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that different impls might be better or worse at this, do we need to convey that this is "strict" or "risky" ?

Is a load-aware implementation allowed to interpret this as "same zone unless overloaded" while a less intelligent impl would follow the letter of this description and do "same zone if any endpoint exists" ?

Comment on lines 173 to 177
* **Limitations:** However, it lacked flexibility for implementations to make smart
decisions on its own, like incorporating feedback-based routing optimizations
(e.g., avoiding overloaded endpoints). The name `topologyKeys` might be
perceived as overly restrictive, given the desire to incorporate non-topology
factors into routing decisions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another key thing to mention here is the arbitrary ordering, lengths, and keys here made it very difficult to integrate with systems that might have a concept of more standard settings like preferring same zone routing. With topologyKeys you could theoretically request that an entirely arbitrary topology key be given preference over zone or region.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Incorporated.

Comment on lines 192 to 195
* **Limitation:** While designed for intelligent routing, this mode offers less
user control and can be less predictable. Some users prioritize
predictability for performance optimization, even when accepting potential
endpoint overload in certain situations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd mention that multiple users have filed issues asking why hints weren't applied or working for them. More recent versions added events to help explain what was happening, but there was still a lot of confusion around the feature.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Added.

Comment on lines 205 to 209
Note that while the initial proposal of InternalTrafficPolicy proposed a
PreferLocal policy, it was dropped later on. This meant that now
TopologyAwareRouting in conjunction with InternalTrafficPolicy didn’t exactly
allow users to express a much desired use case from topologyKeys which is
"prefer node-local, failover to same zone, then route anywhere"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this is all 100% correct, I don't think this KEP really addresses that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True - it should be a non-goal

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yes I should have been more clear that this is still a non-goal (for now). Added that statement. Hope that's enough.

Comment on lines 321 to 323
* **Requirement:** As a developer deploying applications across multiple
Kubernetes environments, I want a consistent way to express my routing
preferences.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you're going for here is to describe the maintainer of what I'd call a cluster addon. For example, cert-manager, prometheus or really anything on https://artifacthub.io/. Maybe rephrase this to the following:

Suggested change
* **Requirement:** As a developer deploying applications across multiple
Kubernetes environments, I want a consistent way to express my routing
preferences.
* **Requirement:** As a developer of a widely deployed cluster addon, I want to be
able to provide users an easy way to configure my Helm chart and/or deployment
configuration to enable same-zone routing behavior in a portable way that works
across many different environments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a bunch! This makes more sense.


kube-proxy based dataplane (along with EndpointSlice controller, within
kube-controller-manager as the control plane) will support the three standard
routing preferences (`Default`, `PreferEqualSpread`, `PreferZone`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth starting a thread specifically to talk about the names of the values here. Although each of these make sense, I think I'd prefer names that were even a bit more generic, maybe:

  • Default
  • Even
  • Close

This assumes that the name of the field already includes Preference. My main goal here would be to move away from PreferZone towards something that just allowed for Close routing, something that could include region or other keys in the future.

/cc @thockin @danwinship @aojea


### Standard Heuristic Implementation (kube-proxy dataplane)

kube-proxy based dataplane (along with EndpointSlice controller, within
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kube-proxy based dataplane (along with EndpointSlice controller, within
kube-proxy (along with EndpointSlice controller, within

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

EndpointSlice controller (which is acting as the control plane) will update the
Service status with the following conditions (inspired by Gateway API conditions)

* RoutingPreferenceAccepted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only relevant component that could reliably set status on a Service here is the EndpointSlice controller. Given that, I think these are the states we'd want to be able to represent:

  1. The EndpointSlice controller sees a value that is not domain-prefixed on a Service and does not know how to handle it
  2. The EndpointSlice controller sees a value that it recognizes but can't populate hints due to some other factor (common with the current proportional approach)

It is likely worth representing the positive polarity versions of those as well:

  1. The EndpointSlice controller sees a value that it recognizes and has populated hints or determined that it does not need to for the heuristic (Default and PreferEqualSpread would fall into the latter category as proposed)

I think the first item in both of those lists could be handled by this RoutingPreferenceAccepted condition that was true or false. Maybe it should be set to "Unknown" for unrecognized/domain-prefixed values?

I think the second item (error populating hints) should probably only appear in the error case. I think your suggestion of RoutingPreferenceProgrammed is reasonable here, but since this is very specific to hints, we could instead use something more specific like HintsPopulated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for giving this a detailed thought, Rob! I mostly have the same thoughts as you on the matter, with the exception of the topic regarding RoutingPreferenceProgrammed v/s HintsPopulated as the condition name. I was thinking that "HintsPopulated" would be a reason under the RoutingPreferenceProgrammed condition. Here's what I had in mind:

  • From the user's perspective, the top priority is whether their routing preference has been "somewhat" sent to the data plane. RoutingPreferenceProgrammed reflects this directly, while the use of hints may be thought of as an implementation detail (which we will convey through the condition's reason)

  • Another reason might spur the debate of being generic vs YAGNI :). But given the previous advantage, in this case I was leaning towards being generic which allows non-hint based implementations to reuse the same condition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that makes sense, I think that's a reasonable approach here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the primary reason we needed status on the previous iteration of topology was that there was so much confusion about when it was or was not applied. I'd recommend scrapping status from this stage of the KEP. We can choose to add it in the future if feedback suggests it would be helpful, but I'm hopeful that this will end up being closer to any other Service field where it's not really necessary.

@k8s-ci-robot k8s-ci-robot requested a review from thockin January 26, 2024 23:52
Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recognize that this is a first draft, but it looks pretty good (though I got a sneak peak, so most of my concerns are addressed).

Please put some energy into the operational and PRR parts, next.

- "@danwinship"
- "@thockin"
approvers:
- TBD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danwinship - are you OK to approve on this one? Or co-approve with me?

Comment on lines 205 to 209
Note that while the initial proposal of InternalTrafficPolicy proposed a
PreferLocal policy, it was dropped later on. This meant that now
TopologyAwareRouting in conjunction with InternalTrafficPolicy didn’t exactly
allow users to express a much desired use case from topologyKeys which is
"prefer node-local, failover to same zone, then route anywhere"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True - it should be a non-goal

* `Default`: Indicates no specific routing preference. The user delegates the
routing decision to the implementation, allowing it to apply its best-effort
strategy.
* `PreferEqualSpread`: Encourages an equal distribution of traffic across
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming:

"Safe" - implies that the other options are less safe

"Spread" or "Spray" - short and to the point

"Random" - over-prescriptive?

other ideas?

strategy.
* `PreferEqualSpread`: Encourages an equal distribution of traffic across
endpoints, potentially spanning multiple zones (or regions).
* `PreferZone`: Encourages routing traffic to endpoints within the same zone as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Locality" or "ByLocality" ?

"Topology" or ByTopology or UseTopology ?

"Performance" ?

strategy.
* `PreferEqualSpread`: Encourages an equal distribution of traffic across
endpoints, potentially spanning multiple zones (or regions).
* `PreferZone`: Encourages routing traffic to endpoints within the same zone as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that different impls might be better or worse at this, do we need to convey that this is "strict" or "risky" ?

Is a load-aware implementation allowed to interpret this as "same zone unless overloaded" while a less intelligent impl would follow the letter of this description and do "same zone if any endpoint exists" ?

### Risks and Mitigations

<!--
What are the risks of this proposal, and how do we mitigate? Think broadly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you think a bit about the confusion as users of the old adopt the new?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added something, hoping that is what you had in mind.

prevent conflicts when multiple control planes (like the default EndpointSlice
controller) are present.

### Choice of field name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming debate thread:

I don't love it. "routing" feels really low-level. We know this is about locality in practice, but maybe that is too concrete? I guess, in theory, someone could use different criteria.

readers: ideas welcome

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very unconvinced by the name here. This is about how a client selects which peer to connect to, or perhaps how a proxy makes that choice on behalf of a client. As API clients, we don't know that the thing making that choice is kube-proxy.

I expect lots of people to be misled by the name and for them to think it's about the choice of next hops for packets, because that is what routing very often means, and it's something that is relevant to the problem of getting traffic to backend pods.

You could even use (eg) segment routing to implement the routing preference this KEP describes.

For that reason, I urge a rethink on the naming.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How I'd start off: pick a good descriptive name for the feature gate, and see if that helps shape what we call the API field.

Copy link
Member Author

@gauravkghildiyal gauravkghildiyal Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do folks feel about trafficDistribution with Prefer keyword in the field values? (Assuming trafficDistributionPreference might be too long) Still worried about possible confusion with "traffic" internal/ExternalTrafficPolicy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree that "routing" implies the wrong thing

with Prefer keyword in the field values

moving "prefer" from the field name to the values opens up the possibility of having Require (or at least "stronger-than-Prefer") values in the future

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trafficDistribution is not horrible.

If iTP and/or eTP are "Local", then they take precedence, so this only kicks in when they are "Cluster". Does that help with naming?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it seems like an intersection of everyones ideas (including ideas from other comments) is that we have:

Field name: trafficDistribution
Values:

  • Default
  • Spread (Don't think this one requires Prefer)
  • PreferZone (Also see discussion on this comment of why not PreferClose )

Shall we seal the deal?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final names can be argued in impl, but we will need to come back and update this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My tip: come up with really clear name for the feature gate, one that communicates what the feature is and does. Then see if that helps us name the API fields.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final names can be argued in impl, but we will need to come back and update this.

Sounds good.

My tip: come up with really clear name for the feature gate, one that communicates what the feature is and does. Then see if that helps us name the API fields.

Thanks for the suggestion, Tim! I've given it a shot but sadly haven't come up with anything better yet. I'll try to resolve the rest of the conversations first and get to this during the implementation -- will keep this in mind.

[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[ ] I/we understand the owners of the involved components may require updates to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotta check the box

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :)

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 29, 2024
@gauravkghildiyal
Copy link
Member Author

Hi @johnbelamaric -- would you be able to help with the PRR review for this KEP? The target is alpha release for the coming 1.30 version (Yes the freeze is just around the corner :) )

@gauravkghildiyal gauravkghildiyal force-pushed the kep-service-routing-preference branch from 6a97ac3 to 02a82a4 Compare January 29, 2024 04:58
@gauravkghildiyal
Copy link
Member Author

/assign @johnbelamaric for PRR

@k8s-ci-robot
Copy link
Contributor

@gauravkghildiyal: GitHub didn't allow me to assign the following users: for, PRR.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @johnbelamaric for PRR

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

merge and iterate - names are the big outstanding

* `Default`: Indicates no specific routing preference. The user delegates the
routing decision to the implementation, allowing it to apply its best-effort
strategy.
* `PreferEqualSpread`: Encourages an equal distribution of traffic across
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also agree to drop "Prefer" from the values if it is in the field name

Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->
None.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caveat: This being our third attempt at an API for this, there's a non-zero chance that we STILL didn't get it right, and will have to revisit again.

Caveat: Converting from the annotation to the field is not quite a mechanical change, which means users will have to think about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 1.
2 is part of Risks and Mitigations.

annotation might encounter differences in exact routing behavior:
* The new field doesn't support a routing preference that is exactly similar
to using `service.kubernetes.io/topology-mode=Auto` from the old annotation.
* If both field and the annotation are set, the annotation will take precedence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...for some number of releases, until it is deprecated and removed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

**New behaviour:** Irrespective of what the annotation
`service.kubernetes.io/topology-aware-hints` or field `routingPreference` are
set to (or even if they are not set at all), kube-proxy will always consider
EndpointSlice hints.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"... kube-proxy will always consider EndpointSlice hints (assuming this feature gate is enabled)"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 29, 2024

### Topology Aware Routing and InternalTrafficPolicy

TopologyAwareRouting together with InternalTrafficPolicy were meant to be the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TopologyAwareRouting together with InternalTrafficPolicy were meant to be the
Topology aware routing together with `internalTrafficPolicy` for Services were meant to be the

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed. Thanks!

successors of `topologyKeys` and allow implementations to be more flexible.

* TopologyAwareRouting:
* Exposes the annotation service.kubernetes.io/topology-mode. When this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Exposes the annotation service.kubernetes.io/topology-mode. When this
* Responds to the annotation `service.kubernetes.io/topology-mode`. When this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed! Thanks.

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: ServiceRoutingPreference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't quite seem right. Can we find a better name?

Comment on lines 523 to 524
* Verify that if both the annotation `service.kubernetes.io/topology-mode=Auto`
and field `routingPreference=PreferZone` are configured, precedence is given
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Verify that if both the annotation `service.kubernetes.io/topology-mode=Auto`
and field `routingPreference=PreferZone` are configured, precedence is given
* Verify that if both the annotation `service.kubernetes.io/topology-mode: Auto` is set
and the field `routingPreference` is set to PreferZone, precedence is given

prevent conflicts when multiple control planes (like the default EndpointSlice
controller) are present.

### Choice of field name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very unconvinced by the name here. This is about how a client selects which peer to connect to, or perhaps how a proxy makes that choice on behalf of a client. As API clients, we don't know that the thing making that choice is kube-proxy.

I expect lots of people to be misled by the name and for them to think it's about the choice of next hops for packets, because that is what routing very often means, and it's something that is relevant to the problem of getting traffic to backend pods.

You could even use (eg) segment routing to implement the routing preference this KEP describes.

For that reason, I urge a rethink on the naming.

prevent conflicts when multiple control planes (like the default EndpointSlice
controller) are present.

### Choice of field name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How I'd start off: pick a good descriptive name for the feature gate, and see if that helps shape what we call the API field.

-->

## Alternatives

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure that folks are going to like it, but: I'd like to see an alternative discussed where the appropriate heuristic to use is captured in a new API kind.

Let's call this TargetPodHeuristic for now.

So a Service might look like:

apiVersion: v1
kind: Service
metadata:
  name: example-service
  namespace: default
spec:
  heuristics:
    - example-heuristic
  selector:
    app.kubernetes.io/name: example
  ports:
    - protocol: TCP
      port: 24
      targetPort: 24

and the TargetPodHeuristic for that:

apiVersion: example/v1alpha1
kind: TargetPodHeuristic
metadata:
  name: example-heuristic
  namespace: default
spec:
    preferredTargets:
      - weight: 23
        matchLabelKeys:
         - topology.kubernetes.io/zone   # prefer a target in the same zone
         - topology.kubernetes.io/region # prefer a target in the same region
                                         # maybe zone names are only regionally unique
      - weight: 11
        matchLabelKeys:
         - topology.kubernetes.io/region # prefer a target in the same region
    avoidedTargets:
      - weight: 42
        matchTaints:
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          effect: "NoSchedule"
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          effect: "NoExecute"

To be clear: we can rule this out if we want to. That's absolutely fine. What I'm calling for is enough of a write-up about why it wouldn't fit / couldn't work / doesn't scale / etc. Let's not leave readers to guess this themselves.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sftim -- That's a reasonable ask. Thanks. I've tried to add something. Hope that conveys the right ideas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This skews back to topologyKeys which gave users a ton of control but made alternative implementations very difficult because it was over-precise.

It's a fine line we are walking between defining a full service mesh API and defining too little to be useful.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 31, 2024
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 2, 2024
Copy link
Member Author

@gauravkghildiyal gauravkghildiyal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added section on Pod Topology Spread Constraints. Fixed some typos.


The field will support the following initial values:

* `Default`: Indicates no specific routing preference. The user delegates the
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was/am in favour of keeping a non-nil Default (I believe kinda like we have for several other fields, including internalTrafficPolicy).

As suggested, lets differ this to the implementation?

* `Spread`: Encourages an equal distribution of traffic across
endpoints, potentially spanning multiple zones (or regions).
* `Zone`: Encourages routing traffic to endpoints within the same zone as
the client. If no endpoints are available within the zone, traffic should be
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is completely true. I believe the "looose" definition extends to all heuristics, hence I've kept a separate common note (right towards the end of this section) describing this expectation. I hope that is sufficient.

Comment on lines 344 to 352
* While it may seem redundant to populate the hints here since kube-proxy can
already derive the zone hint from the endpoints zone (as they would be the
same), we will still use this for implementation simply because of the reason
that it’s easier to implement and provides a better design. Consider an
alternative implementation where kube-proxy reads
`routingPreference=Zone` field and then constructs the route rules
accordingly. This means some extra logic needs to be baked into the kube-proxy
which could have just as easily been implemented by an already existing
extensibility mechanism (i.e. EndpointSlice hints)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tim and Rob have clarified the exact intent -- which was to prevent "special-casing" this logic in kube-proxy for PreferZone when we already had a way to avoid special-casing.

As mentioned, in the future, we can discuss possibilities of what the right way to implement PreferNode would be.

**New behaviour:** Irrespective of what the annotation
`service.kubernetes.io/topology-aware-hints` or field `routingPreference` are
set to (or even if they are not set at all), kube-proxy will always consider
EndpointSlice hints (assuming this feature-gate is enabled).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dan, in addition to what Tim already said, if I'm understanding your emphasis on "regardless", I believe the following piece of text (copied from this KEP) might help clarify things.

Although this is not an explicit design goal, an implication of the above
implementation choice means that:
* The control plane (kube-controller-manager in this case) is only concerned
  with the field `routingPreference` and populates EndpointSlice hints based
  on its value 
* The data plane (kube-proxy in this case) is only concerned with the Hints
  populated in the EndpointSlice and the fields `internal/externalTrafficPolicy`
  to make routing decisions.
* Neither the control plane nor the data plane looks at the others field.

strategy.
* `Spread`: Encourages an equal distribution of traffic across
endpoints, potentially spanning multiple zones (or regions).
* `Zone`: Encourages routing traffic to endpoints within the same zone as
Copy link
Member Author

@gauravkghildiyal gauravkghildiyal Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do understand what the intent is with your comment, though I was consciously striving for precision with the word Zone at this point in time. Here's what I was thinking, hope it makes sense:

We have the following three things in play:

  1. At the moment, we don't really have a good feedback mechanism to detect overload (and this maybe the case for some time in the future)
  2. This in some sense means that the smaller the locality (i.e. Node < Zone < Region), the higher the risk of pods getting overloaded in that locality.
  3. In the very near future, we may be asked to introduce a Node level preference and most likely even then we may not have a mechanism to consider load based feedback.

Now, If we were to choose the heuristic name of Close right now (which at the moment would mean "route traffic within the same zone") and then later on we want to introduce an equivalent to Local (still without any feedback mechanism available to us), I don't think we would be able to say that, henceforth, Close means "route traffic within the same node" because that is increasing the risk. Hence, we would end up with offering two distinct options of Close (i.e. =Zone) and Local which together tend to get confusing -- why did we have a dedicated keyword for Local but not for Zone? Even though the KEP explicitly allows implementors some freedom to change what a heuristic means, I don't think a change like this (which increases the risk) would be a proper interpretation of the following:

NOTE: Implementations reserve the right to refine the behavior associated with
  any heuristic, including standard heuristics. This means the behavior enabled
  by values such as `Default` or `Zone` might evolve over time, and some
  evolutions might interpret the heuristic goals slightly differently. For
  example, in the case of `Zone`, an implementation might initially route
  traffic within the zone without considering endpoint overload, while a future
  refinement could introduce feedback mechanisms to detect overload and route
  traffic outside the zone when necessary, optimizing overall performance. The
  decision of what constitutes an "improvement" remains at the discretion of the
  implementation.

Having precise words like Zone right now means that if/when we can get some load based feedback, that new topology might have the name Close.

Goes without saying, we would not be having this discussion if we could somehow get traffic based feedback -- Close would be the best and maybe only choice then :)

prevent conflicts when multiple control planes (like the default EndpointSlice
controller) are present.

### Choice of field name
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it seems like an intersection of everyones ideas (including ideas from other comments) is that we have:

Field name: trafficDistribution
Values:

  • Default
  • Spread (Don't think this one requires Prefer)
  • PreferZone (Also see discussion on this comment of why not PreferClose )

Shall we seal the deal?

@gauravkghildiyal gauravkghildiyal force-pushed the kep-service-routing-preference branch from 3627463 to 9a3afeb Compare February 2, 2024 08:48
@gauravkghildiyal gauravkghildiyal force-pushed the kep-service-routing-preference branch from 9a3afeb to 17f18ed Compare February 2, 2024 08:50
the client. If no endpoints are available within the zone, traffic should be
routed to other zones.

Implementations are strongly encouraged to support the standard values. While
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Strongly encoraged" != required.

Is it OK for an implementation to not support "Zone" at all? If so, should it accept the value and ignore it, or should it fail to program endpoints at all?

Copy link
Member Author

@gauravkghildiyal gauravkghildiyal Feb 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it OK for an implementation to not support "Zone" at all?

(Ignore this if the real focus was the follow up question and skip to the next answer)

Yes that should have been okay. But if we keep field names like Close and Spread, it feels like all implementations would be able to offer both, since in the worst case an implementation could interpret Close as just routing anywhere since that is the "closest" for them (all this assuming that implementations can always implement Spread).

If so, should it accept the value and ignore it, or should it fail to program endpoints at all?

My take here is that we're trying to position this new field as a "preference" rather than being "strict". If we were being "strict" and the implementation didn't respect a choice like Zone that could mean semantic incorrectness in which case it would make total sense to fail fast and not program the endpoints at all. Because we aren't offering the strictness and rather positioning the field as a "preference", maybe we should not define this as part of an API contract whether the endpoints should be programmed or not -- i.e. this is an implementation detail.

As part of the API, we're suggesting that APIs "SHOULD" (meaning equivalent to rfc2119) use the defined status conditions to inform users about the level of acceptance of the preference. (Ref. "Condition usage by other implementations" section of the KEP)

Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK to merge and iterate on names. I think the biggest question is how optional the "strong suggestions" are and what implementations should do when they don't support the spirit of a standard name.

Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even more feedback. I hope it's useful.

Comment on lines 218 to 220
* `Zone`: Encourages routing traffic to endpoints within the same zone as
the client. If no endpoints are available within the zone, traffic should be
routed to other zones.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect to see topology.kubernetes.io/zone here, not Zone.

  • it will help people making custom extensions infer the pattern
  • it matches the topology label key
  • we might one day support something else, eg topology.kubernetes.io/planet, and this more obviously leaves room for that.
  • it makes it more obvious that Default and Spread are special cases.

We can use admission time checks to ensure that a Kubernetes namespaced value isn't used unless recognised - by this mechanism, not just as a registered label. Further checks might either reject or warn about private values (ie: nudge people to use a namespace).

That does mean adding inactive support for one minor release before we allow using a new value to allow for downgrades. Eg, Kubernetes v1.41 adds support for topology.kubernetes.io/planet if set, but doesn't allow admitting it. Kubernetes v1.41 adds support for topology.kubernetes.io/planet if set, and lets you actually set it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for giving this some thought, Tim! I feel we're coming close to a name with a choice like Close (as per the discussion in #4445 (comment)) and maybe we won't have some of the problems then.

Comment on lines +188 to +191
* **Strict Routing Guarantees:** The `routingPreference` field will not
enforce deterministic routing paths. It serves as a mechanism for expressing
hints and preferences that implementations can consider when making routing
decisions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the guarantees aren't there, are we as a side effect making it harder for Kubernetes to offer those guarantees one day?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we're completely robbing away the possibility of defining stricter guarantees -- we could always define more stricter fields -- but yes any decision has it's cost.

Comment on lines +563 to +565
Upon downgrade of EndpointSlice controller (within kube-controller-manager) and
kube-proxy, the `routingPreference` will simply not be considered in any
decisions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For downgrade, how would the API server handle the routingPreference field that it finds on some persisted objects? Any concerns there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am hoping that following the standard guidelines from https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md#adding-unstable-features-to-stable-versions about adding a new field should cover all such choices -- hence haven't tried to redefine the details here. I will not try to hide the fact that I do need to re-read the guide to refresh my memory, but I'm planning to do it while we implement.

pod distribution across zones, maximizing the likelihood that the
`routingPreference` can be satisfied and reduce (although not completely
eliminate) chances of overload for a single zone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an alternative available where we put the complexity into Gateway to ensure that there is a SomethingWeNeedToName (ie, not Service) per zone, and it does, and we're happy?

In this story:

  • maybe a SomethingWeNeedToName controller manages the EndpointSlices, rather than kube-controller-manager
  • maybe you can still refer to the Service if you want to, for example using environment variables - and the Cluster IP for that Service sends traffic somewhere reasonable.
  • maybe we annotate Services when we want SomethingWeNeedToNames set up per zone.

That's not fleshed out; the key ask is: did we already consider putting the complexity [mostly] out of tree into Gateway, with kube-proxy learning how to integrate and other in-tree code remaining unaware?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like an interesting idea -- I really do appreciate you giving this some thought (including the other suggestion on documenting the alternative regarding "Granular Routing Controls"). Sadly, I may not have immediate capacity to fully think through what this would mean on a broader picture. Lets get to this separately from this KEP? I will try giving it some though later (unless there's more people who feel this needs immediate attention)

@gauravkghildiyal gauravkghildiyal force-pushed the kep-service-routing-preference branch from bb1819f to 7a18a06 Compare February 5, 2024 04:05
Copy link
Member

@BenTheElder BenTheElder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[PRR Shadow]
PRR LGTM for Alpha.

This section must be completed when targeting beta to a release.
-->

_This section will be completed when targeting beta to a release._
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's encouraged to fill this out even for alpha (consider that we hope users will try this out and will need to consider upgrade/rollout/rollback), even if it is only required for beta.

(this applies to all "required for beta" sections)

@johnbelamaric
Copy link
Member

Ok, I see Tim has approved for the SIG. The PRR looks fine for alpha. Note Ben's comment though.

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 7, 2024
@thockin
Copy link
Member

thockin commented Feb 7, 2024

@gauravkghildiyal @danwinship @robscott and I chatted briefly yesterday - is this KEP updated to reflect that?

@gauravkghildiyal
Copy link
Member Author

@gauravkghildiyal @danwinship @robscott and I chatted briefly yesterday - is this KEP updated to reflect that?

I will update in an hour and mark those things unresolved. Will revert back. Thanks for the discussion.

@gauravkghildiyal
Copy link
Member Author

@thockin, @danwinship, @robscott -- Ready for final review/approvals.

Cleaned things up as per our latest discussion. Summarizing our discussion and the accompanying deltas to this doc:

  • We will start with having only two standard values: Close and Default
    • Changed doc to clarify what Close would mean for kube-proxy/endpointslice-controller, as of now (viz. Close == PreferZone)
  • Changed “Implementations are strongly encouraged to support standard values” to “Implementations SHOULD support standard values”
    • This should allow use cases like User Story 3 involving deployments through helm charts working across implementations.
    • The only standard value with some specific meaning defined is for Close, but the meaning still allows many implementations to be compliant.
  • The name of the field routingPreference is still unresolved and will most likely be refined during implementation.

Copy link
Member

@robscott robscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gauravkghildiyal! A couple non-blocking suggestions for where we can lessen the scope of this KEP but otherwise LGTM.

EndpointSlice controller (which is acting as the control plane) will update the
Service status with the following conditions (inspired by Gateway API conditions)

* RoutingPreferenceAccepted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the primary reason we needed status on the previous iteration of topology was that there was so much confusion about when it was or was not applied. I'd recommend scrapping status from this stage of the KEP. We can choose to add it in the future if feedback suggests it would be helpful, but I'm hopeful that this will end up being closer to any other Service field where it's not really necessary.

Comment on lines 221 to 223
Implementations may support additional routing heuristics using values of the
form `<domain>/<heuristicName>`. Heuristics without a domain prefix will be
reserved for potential future standardization.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd personally recommend leaving this as an optional future enhancement to limit the scope of this KEP for now. We can always loosen validation to allow this in the future if there's demand for it.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gauravkghildiyal, johnbelamaric, robscott, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gauravkghildiyal
Copy link
Member Author

Thanks @robscott -- incorporated the suggestions by adding an optional future expansions section.

@gauravkghildiyal gauravkghildiyal force-pushed the kep-service-routing-preference branch from 7a45a43 to 4ef427b Compare February 8, 2024 00:40
@robscott
Copy link
Member

robscott commented Feb 8, 2024

Thanks @gauravkghildiyal!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2024
@k8s-ci-robot k8s-ci-robot merged commit a03be2f into kubernetes:master Feb 8, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Feb 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/network Categorizes an issue or PR as relevant to SIG Network. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants