KEP 2004: Evolving Topology Aware Routing for Services #2005
Conversation
Hi @ckyiricky. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
/assign
This proposal builds off of the [EndpointSlice API](https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/).

We propose a new Service annotation: `endpointslice.kubernetes.io/same-zone`
heh, naming party 😄; my vote is for `endpointslice.kubernetes.io/zone-affinity`
Haha, this will be a tricky name to get right. I like the idea of `zone-affinity`; what would the values be here? I'm not sure `Prefer` or `Require` would make as much sense with this name.
What about `hard`, `soft`, and `none`? This aligns with node affinity terminology: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity
One of our goals with this kind of naming pattern is that we could use the same approach for node and region (`same-node`, `same-region`). Unfortunately I think that would be difficult/confusing for an affinity-based name, as `node-affinity` already has a different meaning in k8s. Still very interested in naming suggestions here, just hesitant to use something that might not work as well for node or region.
Is this really "about" the service, or is it effectively a parameter pass-thru to the controller(s)? It seems like the latter. Maybe it is "endpoint selection policy"? Or something.

Do we want to position this as "which endpoints we choose for a zone" or "which endpoints we will choose for a given client"? What makes the most sense to people who will USE this, rather than to us who know how it works? E.g. if I am an app owner deploying my service, I think I want to reason about "given an arbitrary client, which endpoints will be chosen": "Random" vs "PreferSameZone" vs "RequireSameZone" vs ...

Let's think through possible evolution. At some point we talked about exposing the different heuristics here, but it seems like maybe we don't need to (or maybe that is a different annotation).
We've already said that same-node is probably "different".
What would we name algorithms that consider load (e.g. suppose we collect the number of connections and balance on that)?
How else might this evolve? How does that inform how we will talk about this?
Finding the appropriate name here is definitely challenging. We actually explored options similar to what you're suggesting here. My first suggestion for an annotation was `endpointslice.kubernetes.io/slicing-policy`, which seems similar to the concept of an endpoint selection policy. We ended up moving away from that since it seemed too closely tied to implementation and would not translate well for node-local config.

We also explored algorithm/policy names and created a doc with possible combinations. In this case it felt like there were simply too many potential combinations.

As I've thought about this KEP more, I think the title/scope may actually be a bit too broad here. This is fundamentally about trying to find smarter ways to subset EndpointSlices so that:
- Consumers only need to process a fraction of endpoints (a significant scalability win).
- Consumers only get the endpoints closest to them.

This does not feel like the final solution for topology aware routing, just a step in the right direction. It will still allow more advanced logic to be built into consumers that could potentially make adjustments based on feedback/metrics. If we take that kind of more limited approach to this KEP, maybe a narrower name/annotation would also make sense?
Kube-Proxy will be updated to support EndpointSlice subsetting. Each instance
will watch for EndpointSlices that have a label indicating they should be
consumed by this zone. For more information, refer to the EndpointSlice
Can we expand on this? How will kube-proxy know whether or not it has to install rules for endpoints in other zones?
I owe another KEP here. The general idea would be adding a label to EndpointSlices like `endpointslice.kubernetes.io/for-zone: zone-a`. Then kube-proxy could be configured to only subscribe to EndpointSlices with that label set to the same zone or not set at all.
We need to clarify. Selectors are pretty dumb, and I don't think there's a way to express OR? So will we install multiple selectors? E.g. one for `endpointslice.kubernetes.io/for-zone IN zone-a` and another for `endpointslice.kubernetes.io/for-zone NOTEXIST`? (See the sketch after the list below.)
Is the plan to add a bunch of these conditions up front so that we don't have to keep editing the rules? E.g.
(no-node AND no-zone AND no-region)
(no-node AND no-zone AND my-region)
(no-node AND my-zone AND no-region)
(no-node AND my-zone AND my-region)
(my-node AND no-zone AND no-region)
(my-node AND no-zone AND my-region)
(my-node AND my-zone AND no-region)
(my-node AND my-zone AND my-region)
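To make the two-selector workaround concrete, here is a minimal Go sketch (my illustration, not part of the KEP) using apimachinery label requirements. The `for-zone` label key is taken from the comment above and is not final; the package and function names are assumptions. The point is that expressing "labeled for my zone OR not labeled at all" requires two selectors, and therefore two list/watch streams per kube-proxy:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
)

// zoneSelectors returns the two selectors a kube-proxy instance would need,
// since a single label selector cannot express OR:
//   1. slices explicitly labeled for the local zone
//   2. slices with no zone label at all
func zoneSelectors(zone string) ([]labels.Selector, error) {
	const key = "endpointslice.kubernetes.io/for-zone" // label name from the discussion, not final

	sameZone, err := labels.NewRequirement(key, selection.In, []string{zone})
	if err != nil {
		return nil, err
	}
	unlabeled, err := labels.NewRequirement(key, selection.DoesNotExist, nil)
	if err != nil {
		return nil, err
	}
	return []labels.Selector{
		labels.NewSelector().Add(*sameZone),  // for-zone in (zone-a)
		labels.NewSelector().Add(*unlabeled), // !for-zone
	}, nil
}

func main() {
	sels, _ := zoneSelectors("zone-a")
	for _, s := range sels {
		fmt.Println(s.String()) // each selector implies a separate informer/watch
	}
}
```

Extending this to the eight node/zone/region combinations listed above would multiply the number of watch streams accordingly, which is the scalability concern raised in the next comment.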
Yeah that is quite messy. I'd missed that there's no way to express `OR` with selectors. This combination of selectors could work but would be bad if it dramatically increases the number of informers in kube-proxy/number of connections to the API server from each node.
This could be a compelling reason to add some kind of global label to slices that didn't apply to a zone or region. That would allow us to express all of this with a single selector but unfortunately would result in all EndpointSlices being updated whenever this upgrade happened. This would also require any EndpointSlice producers to be updated.
(no-node AND no-zone AND no-region)
(no-node AND no-zone AND my-region)
(no-node AND my-zone AND no-region)
(no-node AND my-zone AND my-region)
(my-node AND no-zone AND no-region)
(my-node AND no-zone AND my-region)
(my-node AND my-zone AND no-region)
(my-node AND my-zone AND my-region)
Surely if a Service targets my-node then it must also target either my-zone or no-zone, and likewise my-region or no-region?
And I'd assume that zones were likewise inside regions? So then it's just
(no-node AND no-zone AND no-region)
(no-node AND no-zone AND my-region)
(no-node AND my-zone)
(my-node)
Now that I've filed #2031, I think this discussion will actually make more sense there. I ended up taking a similar approach in that proposal to what @danwinship proposed above, with the exception of not supporting node labels/delivery.
The API and names are my main sticking point. I am not sure either way. Will think more, would love to hear arguments
traffic in zone.
- Provide different approaches to users who have a strong desire over the
routing policy:
  - Balanced: Use the original approach to ensure traffic is distributed across
I'm still not sure "Balanced" vs "Spread" vs "Random"
We propose a new Service annotation: `service.kubernetes.io/same-zone`
with three options:
- `Balanced`: This represents the existing approach that is already used, no
Comment got lost: I am afraid "balanced" implies more than we actually guarantee. Would "Random" be a better choice here? We don't guarantee to balance traffic, just spread it.
What about "Unaware"?
Spray-n-Pray ? :D
To go a slightly different direction here, let's say we had a different field/annotation here that encompassed both zone and region and supported values like `PreferZone`.

Maybe that annotation is something like `endpointslice.kubernetes.io/slicing-policy` and maybe the supported values are:
- One of `None`, `Random`, `Shared`, or `Balanced`
- `PreferZone`
- `RequireZone`

I don't think `same-zone: Random` or `same-zone: Balanced` make a lot of sense in comparison. This is a bit of a larger change though, and it changes the scope of this KEP + config from solving topology aware routing to defining a controller subsetting approach that helps implement it.
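Just to make the shape of that comparison concrete, a small Go sketch of the hypothetical value set being floated; none of these names are settled and the type is purely illustrative:

```go
package topology

// SlicingPolicy is a hypothetical value type for an annotation like
// endpointslice.kubernetes.io/slicing-policy, mirroring the options above.
type SlicingPolicy string

const (
	// Today's behavior: no topology-aware subsetting (spelled None, Random,
	// Shared, or Balanced in the naming discussion above).
	SlicingPolicyNone SlicingPolicy = "None"
	// Prefer endpoints in the same zone, spilling over when there are too few.
	SlicingPolicyPreferZone SlicingPolicy = "PreferZone"
	// Only use endpoints in the same zone, even if that means none at all.
	SlicingPolicyRequireZone SlicingPolicy = "RequireZone"
)
```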
number of endpoints. If a zone does not have a sufficient number of endpoints,
traffic will be routed to other zones.

This same pattern could be used for topology aware routing at node and region
I am leaning pretty hard away from multiple knobs here. I think something more abstract makes more sense. Especially if you think about multi-cluster Services here. E.g. If the MCS thingy already does some form of subsetting, what does this policy mean? @JeremyOT
| Rolling Update Test | Integration Test | 1. Use integration framework to mock pods <br> 2. Set annotation to `Prefer` <br> 3. Conduct an aggressive rolling update policy for endpoints update <br> 4. Set endpoints to test corner cases | In the desired state, EndpointSlices are distributed as expected | If capable, monitor the churn of EndpointSlices during the rolling update |
| Traffic Reliability Test | E2E Test | 1. Set annotation to `Require` and `Prefer` <br> 2. Set endpoints above `starting threshold + padding` <br> 3. Reuse the current traffic reachability test | Endpoints are reachable from different zones as expected | Step 2 is only needed for `Prefer` |
### Observability
It would be good for administrators to know which services are "blowing out" of their zone. So, while it breaks tradition, I can't think of any way to do this besides a per-service metric.
This could also be an event, but that seems un-ergonomic.
How would this metric be determined? Would it be whether endpoints are being contributed from one zone to another for a Service?
That's what I was thinking, yes. Otherwise you have no way of knowing which services need to be "re-balanced". (I'm open to other suggestions - I just want to write an alert that says "service XXX may be costing you money")
You can label the metrics with the zone name, so you can build a graph that shows the metrics per zone. I did something similar for the cidr_sets in kubernetes/kubernetes#90288, so you can have metrics for each configured cidr_set.
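For illustration only, a per-zone metric along those lines could look like this Go sketch using the Prometheus client. The metric name, label names, and function are hypothetical, not something proposed in the KEP:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// crossZoneEndpoints is a hypothetical gauge tracking how many endpoints one
// zone "contributes" to another, labeled by zone rather than by Service to
// keep cardinality bounded.
var crossZoneEndpoints = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Subsystem: "endpoint_slice_controller",
		Name:      "cross_zone_endpoints",
		Help:      "Endpoints contributed from one zone to another.",
	},
	[]string{"from_zone", "to_zone"},
)

func init() {
	prometheus.MustRegister(crossZoneEndpoints)
}

// recordCrossZone would be called from the (hypothetical) subsetting logic.
func recordCrossZone(fromZone, toZone string, count int) {
	crossZoneEndpoints.WithLabelValues(fromZone, toZone).Set(float64(count))
}
```

The choice of per-zone rather than per-Service labels is exactly the cardinality trade-off discussed in the following comments.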
I agree this would be a valuable metric to have, but I think per-Service metrics may end up becoming a scalability issue. If we made these per-zone we'd potentially end up with data evening out across all Services in that zone even if there were significant discrepancies within each Service. Any ideas @wojtek-t?
Would it be sufficient to note the number of imbalanced Services and then decorate the Service object itself (or EPSlices?) with an annotation or Condition to say "is imbalanced" ?
So, I know that @thockin was considering it: do we want to formalize limiting the scope of topologyKeys? If this moves forward, it leaves us (as proposed) with only one real topology level (save node-local). Or do you intend to just take the cartesian product of keys and use that as the load-balancing input? Edit: nevermind, I'd missed https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/1659-standard-topology-labels
It would be useful to support [...]. Also, it would be useful to provide an option for advanced users to have more control over when traffic spills over to other regions. The current proposal makes it hard to reason about when spillover happens without knowing the details of the heuristics used. Originally posted in #2004 (comment).
and `Prefer` options, there is a chance that endpoints in that zone will be
overloaded while endpoints in other zones receive little to no traffic.
Without some sort of feedback (out of scope) this will not self-rectify.
@robscott I have a doubt: can a pod in ZoneB access a Service with `service.kubernetes.io/same-zone: Require` that only has endpoints in ZoneA? I mean, the service-to-endpoint traffic will be within the zone, but at least in kube-proxy the ClusterIP -> Endpoint translation is done on the same node as the client pod. How would this traffic be redirected to the other zone?
It would not be able to. That approach relies on the user distributing workloads across all zones they intend to send traffic to. This could work well with some cross-sig work that would make this kind of scheduling easier and potentially update autoscaling to also take topology into account. I don't imagine this being a commonly used option at the beginning but I think it's still valuable for power users. I'll work on a better explanation for this in the KEP.
I've added some better clarification around this to the KEP now.
@wojtek-t I've added you as an approver to this PR based on #2031 (comment). I've also added some notes about how this would interact with HPAs in the "Risks and Mitigations" section.
/assign @wojtek-t
@mzohreva I think this is a poor fit for same-node because it assumes something about the impl. It's tricky, really. If you look at this as JUST the API, then it makes sense. But to implement the API, we'd need to split the logic between the slice controller and kube-proxy. Why? Imagine a 1K node cluster. Setting per-zone makes sense to implement via slice-subsetting. Setting per-node does NOT make sense to implement via slicing: you'd have 1000 slices. It makes more sense for that to be filtered in kube-proxy (perhaps sliced by zone first). In order to make that work, you have to assume some coordination between components, both of which COULD be replaceable.

@andrewsykim has a different KEP which defines [...].

This goes to a question I asked somewhere: is this an API or is it a hint to the slice controller? If I had a custom EPslice controller (who knows why) and it did not support zone subsetting, is that controller "broken" or valid? Another way to look at this: is the API specific to subslicing or is it more general than that?

See my up-coming review's top-level comment for more thoughts...
Here's what I am thinking. Less is more. Let's start with the smallest scope we can justify and expand as NEEDED.
IOW, treat this like a hint to the EPSlice controller. The only values it supports are "" (meaning the current behavior) and "Auto" (meaning "try to be smarter"). "Auto" is what you have defined here as "Prefer". Today we define it in terms of zone, but it really means "auto". If we decided to support multi-region, it would consider region. If we added sub-zone, it might consider that, too.
We can leave "require same zone" out for now. I don't feel comfortable enough with the use-cases to justify the sharp edges it exposes. We can always come back and add it if those become clearer.
Likewise, if this is really to become a contract (API) rather than a hint, we'll need to think about what guarantees we can make (Rob raised the issue of the EPSlice mirroring controller which doesn't even have topology information).
I'd rather get something in Alpha so we can start the clock on topologyKeys. I don't mind this sitting in alpha for a few releases if it has to.
Other notes:
- This gate should be mutually exclusive with the older ServiceTopology (aka topologyKeys)
- We need to define what happens if we find pods that do not have topology information - do they get left out entirely? Randomly assigned?
- We can come back and add more hints (like the coefficients of the heuristic or even multiple heuristic options) IF and ONLY IF needed.
Thoughts?
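A minimal Go sketch of what that hint could look like from the EndpointSlice controller's side, assuming a placeholder annotation key (the real name is still being debated in this PR); only "" and "Auto" are meaningful, as described above:

```go
package topology

import v1 "k8s.io/api/core/v1"

// hintAnnotation is a placeholder key; the final annotation name is not settled.
const hintAnnotation = "service.kubernetes.io/topology-aware-subsetting"

// subsettingRequested treats the annotation as a hint, not a contract:
// "" keeps the current behavior, "Auto" lets the controller try to be smarter
// (zone today, possibly region or sub-zone later). Any other value falls back
// to the current behavior in this sketch.
func subsettingRequested(svc *v1.Service) bool {
	return svc.Annotations[hintAnnotation] == "Auto"
}
```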
@thockin I like the simplicity of that approach. I've added one more commit to this PR that updates the KEP to take those changes into account.
Just small nits - race the mergebot!
/approve
/lgtm
#### Interoperability

This annotation may not be set on a Service if the `topologyKeys` field is set.
Do we have any precedent for failing to run if two incompatible feature gates are specified?
I don't think so, I've made this language a bit more realistic.
#### Interoperability

This annotation may not be set on a Service if the `topologyKeys` field is set.
Detail what happens. E.g. "If this is set on a Service which also has `topologyKeys` set, `topologyKeys` will take priority."
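A hedged Go sketch of that precedence rule; the annotation key shown is the `same-zone` name used earlier in this draft (still under discussion), and `TopologyKeys` refers to the alpha field on Service that this would eventually replace:

```go
package topology

import v1 "k8s.io/api/core/v1"

// topologyAnnotationActive applies the precedence described above: if the
// older topologyKeys field is set on the Service, it wins and the annotation
// is ignored entirely.
func topologyAnnotationActive(svc *v1.Service) bool {
	if len(svc.Spec.TopologyKeys) > 0 {
		return false // topologyKeys takes priority over the annotation
	}
	// Annotation key is the draft name from this KEP; not final.
	_, ok := svc.Annotations["service.kubernetes.io/same-zone"]
	return ok
}
```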
Also, can I task you with deprecating topologyKeys (comments and a release note) ? :)
Updated to clarify that `topologyKeys` should be given precedence. Will take care of deprecating `topologyKeys`.
#### Feature Gate

This functionality will be guarded by the `TopologyAwareSubsetting` feature
gate. This gate may not be enabled if the `ServiceTopology` gate is enabled.
...or else?
Yeah, good point, changed this to `should not`.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ckyiricky, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm
I had initially proposed this as an addition to the (old) ServiceTopology KEP, but at this point it probably deserves its own KEP. If we feel that node-local is definitely not going to be addressed with subsetting, I can start work on a separate KEP to tackle this more directly.
* KEP 2004: Evolving Topology Aware Routing for Services
* Updates related to PR feedback, some markdown cleanup
* Simplifying the KEP to focus on subsetting

Co-authored-by: Rob Scott <[email protected]>
Kubernetes clusters are increasingly deployed in multi-zone environments, but unfortunately network routing has not caught up with that. To mitigate this, @robscott and I are adding an automatic topology aware routing mechanism via a new Service annotation. This annotation will provide three different routing approaches. Currently our proposal focuses only on topology aware routing at the zone level.
This is closely related to both #536 and #1944, we may want to merge these into a single KEP.
Enhancement Issue: #2004
/sig network
/cc @andrewsykim @bowei @freehan @thockin
/assign @thockin