
KEP 2004: Evolving Topology Aware Routing for Services #2005

Merged · 3 commits · Oct 6, 2020

Conversation

@ckyiricky (Contributor):

Kubernetes clusters are increasingly deployed in multi-zone environments, but network routing has not kept pace. To mitigate this, @robscott and I are adding an automatic topology-aware routing mechanism driven by a new Service annotation. The annotation will provide three routing options. For now, this proposal focuses only on topology-aware routing at the zone level.

This is closely related to both #536 and #1944; we may want to merge these into a single KEP.

Enhancement Issue: #2004
/sig network
/cc @andrewsykim @bowei @freehan @thockin
/assign @thockin

@k8s-ci-robot added the sig/network label on Sep 22, 2020.
@k8s-ci-robot added the cncf-cla: yes label on Sep 22, 2020.
@k8s-ci-robot (Contributor):

Hi @ckyiricky. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test, kind/kep, and size/XL labels on Sep 22, 2020.
@robscott (Member):

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Sep 22, 2020.
@robscott (Member):

/assign


This proposal builds off of the [EndpointSlice API](https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/).

We propose a new Service annotation: `endpointslice.kubernetes.io/same-zone`
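For illustration, setting the proposed annotation on a Service might look like the sketch below. The Service name and the `Prefer` value are assumptions here; the value options are discussed later in the KEP and in this thread.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service                 # hypothetical Service
  annotations:
    # Proposed annotation; both the key and its values were still under discussion.
    endpointslice.kubernetes.io/same-zone: "Prefer"
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```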
@aojea (Member) commented Sep 23, 2020:

Heh, naming party 😄. My vote is for endpointslice.kubernetes.io/zone-affinity.

Reply (Member):

Haha, this will be a tricky name to get right. I like the idea of zone-affinity; what would the values be here? I'm not sure Prefer or Require would make as much sense with this name.

Reply (Member):

What about hard, soft, and none?

This aligns with node affinity terminology: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity

Reply (Member):

One of our goals with this kind of naming pattern is that we could use the same approach for node and region (same-node, same-region). Unfortunately I think that would be difficult/confusing for an affinity-based name, since node affinity already has a different meaning in k8s. Still very interested in naming suggestions here, just hesitant to use something that might not work as well for node or region.

Reply (Member):

Is this really "about" the Service, or is it effectively a parameter pass-through to the controller(s)? It seems like the latter. Maybe it is "endpoint selection policy"? Or something.

Do we want to position this as "which endpoints we choose for a zone" or "which endpoints we will choose for a given client"? What makes most sense to people who will USE this rather than us who know how it works?

E.g. if I am an app-owner, deploying my service, I think I want to reason about "given an arbitrary client, which endpoints will be chosen". "Random" vs "PreferSameZone" vs "RequireSameZone" vs ...

Let's think through possible evolution. At some point we talked about exposing the different heuristics here, but it seems like maybe we don't need to (or maybe that is a different annotation).

We've already said that same-node is probably "different".

What would we name algorithms that consider load (e.g., suppose we collect the number of connections and balance on that)?

How else might this evolve? How does that inform how we will talk about this?

Reply (Member):

Finding the appropriate name here is definitely challenging. We actually explored options similar to what you're suggesting here. My first suggestion for an annotation was endpointslice.kubernetes.io/slicing-policy which seems similar to the concept of an endpoint selection policy. We ended up moving away from that since it seemed too closely tied to implementation and would not translate well for node-local config.

We also explored algorithm/policy names and created a doc with possible combinations. In this case it felt like there were simply too many potential combinations.

As I've thought about this KEP more, I think the title/scope may actually be a bit too broad here. This is fundamentally about trying to find smarter ways to subset EndpointSlices so that:

  • Consumers only need to process a fraction of endpoints (significant scalability win).
  • Consumers only get the endpoints closest to them.

This does not feel like it is the final solution for topology aware routing, just a step in the right direction. This will still allow more advanced logic to be built into consumers that could potentially make adjustments based on feedback/metrics. If we take that kind of a more limited approach to this KEP, maybe a narrower name/annotation would also make sense?

Comment on lines 194 to 203
Kube-Proxy will be updated to support EndpointSlice subsetting. Each instance
will watch for EndpointSlices that have a label indicating they should be
consumed by this zone. For more information, refer to the EndpointSlice
@aojea (Member) commented Sep 23, 2020:

Can we expand on this? How will kube-proxy know whether it has to install rules for endpoints in other zones or not?

Reply (Member):

I owe another KEP here. The general idea would be adding a label to EndpointSlices like endpointslice.kubernetes.io/for-zone: zone-a. Then kube-proxy could be configured to only subscribe to EndpointSlices with that label set to the same zone or not set at all.
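As a rough sketch of that idea (the label key is the one proposed in this comment and is not final; the slice contents are illustrative):

```yaml
# An EndpointSlice the controller intends to be consumed in zone-a.
apiVersion: discovery.k8s.io/v1beta1
kind: EndpointSlice
metadata:
  name: my-service-abc12                          # hypothetical name
  labels:
    kubernetes.io/service-name: my-service
    endpointslice.kubernetes.io/for-zone: zone-a  # proposed label, not final
addressType: IPv4
endpoints:
  - addresses: ["10.1.2.3"]
    topology:
      topology.kubernetes.io/zone: zone-a
ports:
  - name: http
    port: 8080
    protocol: TCP
```

Kube-proxy instances in zone-a would then watch only slices carrying `for-zone: zone-a` or no `for-zone` label at all.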

Reply (Member):

We need to clarify. Selectors are pretty dumb, and I don't think there's a way to express OR? So will we install multiple selectors? E.g.

one for "endpointslice.kubernetes.io/for-zone IN zone-a" and another for "endpointslice.kubernetes.io/for-zone NOTEXIST" ?

Is the plan to add a bunch of these conditions up front so that we don't have to keep editing the rules? E.g.

(no-node AND no-zone AND no-region)
(no-node AND no-zone AND my-region)
(no-node AND my-zone AND no-region)
(no-node AND my-zone AND my-region)
(my-node AND no-zone AND no-region)
(my-node AND no-zone AND my-region)
(my-node AND my-zone AND no-region)
(my-node AND my-zone AND my-region)
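For reference, the two selectors mentioned above could be sketched as label selector requirements like the following; since a single selector cannot express OR, they would have to be issued as separate watches (zone-a is illustrative):

```yaml
# Watch 1: slices explicitly targeted at this node's zone.
matchExpressions:
  - key: endpointslice.kubernetes.io/for-zone
    operator: In
    values: ["zone-a"]
---
# Watch 2: slices with no zone targeting at all.
matchExpressions:
  - key: endpointslice.kubernetes.io/for-zone
    operator: DoesNotExist
```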

Reply (Member):

Yeah, that is quite messy. I'd missed that there's no way to express OR with selectors. This combination of selectors could work, but it would be bad if it dramatically increased the number of informers in kube-proxy and the number of connections to the API server from each node.

This could be a compelling reason to add some kind of global label to slices that didn't apply to a zone or region. That would allow us to express all of this with a single selector but unfortunately would result in all EndpointSlices being updated whenever this upgrade happened. This would also require any EndpointSlice producers to be updated.

Reply (Contributor):

(no-node AND no-zone AND no-region)
(no-node AND no-zone AND my-region)
(no-node AND my-zone AND no-region)
(no-node AND my-zone AND my-region)
(my-node AND no-zone AND no-region)
(my-node AND no-zone AND my-region)
(my-node AND my-zone AND no-region)
(my-node AND my-zone AND my-region)

Surely if a Service targets my-node then it must also target either my-zone or no-zone, and likewise my-region or no-region?

And I'd assume that zones were likewise inside regions? So then it's just

(no-node AND no-zone AND no-region)
(no-node AND no-zone AND my-region)
(no-node AND my-zone)
(my-node)

Reply (Member):

Now that I've filed #2031, I think this discussion will actually make more sense there. In that proposal I ended up taking a similar approach to what @danwinship described above, with the exception that there is no support for node labels/delivery.

@ckyiricky force-pushed the topology-aware-routing branch 3 times, most recently from 79180f9 to c36b4a9 on September 23, 2020 at 23:36
@ckyiricky force-pushed the topology-aware-routing branch from c36b4a9 to aa87470 on September 24, 2020 at 21:28
@thockin (Member) left a comment:

The API and names are my main sticking point. I am not sure either way. Will think more; would love to hear arguments.

keps/sig-network/2004-topology-aware-routing/README.md (outdated, resolved)
traffic in zone.
- Provide different approaches to users who have a strong desire over the
routing policy:
- Balanced: Use the original approach to ensure traffic is distributed across
Comment (Member):

I'm still not sure "Balanced" vs "Spread" vs "Random"


We propose a new Service annotation: `service.kubernetes.io/same-zone`
with three options:
- `Balanced`: This represents the existing approach that is already used, no
Comment (Member):

Comment got lost: I am afraid "balanced" implies more than we actually guarantee. Would "Random" be a better choice here? We don't guarantee to balance traffic, just spread it.

Reply:

What about "Unaware"?

@thockin (Member) commented Oct 5, 2020:

Spray-n-Pray ? :D

@robscott (Member) commented Oct 5, 2020:

To go in a slightly different direction: let's say we had a different field/annotation that encompassed both zone and region and supported values like PreferZone.

Maybe that annotation is something like endpointslice.kubernetes.io/slicing-policy and maybe the supported values are:

  • One of None, Random, Shared, or Balanced
  • PreferZone
  • RequireZone

I don't think same-zone: Random or same-zone: Balanced make a lot of sense in comparison. This is a bit of a larger change though and changes the scope of this KEP + config from solving topology aware routing to defining a controller subsetting approach that helps implement it.

number of endpoints. If a zone does not have a sufficient number of endpoints,
traffic will be routed to other zones.

This same pattern could be used for topology aware routing at node and region
Comment (Member):

I am leaning pretty hard away from multiple knobs here. I think something more abstract makes more sense. Especially if you think about multi-cluster Services here. E.g. If the MCS thingy already does some form of subsetting, what does this policy mean? @JeremyOT

keps/sig-network/2004-topology-aware-routing/README.md (outdated, resolved)
| Rolling Update Test | Integration Test |1. Use integration framework to mock pods <br> 2. Set annotation to `Prefer` <br> 3. Conduct an aggressive rolling update policy for endpoints update <br> 4. Set endpoints to test corner cases | In the desired state, EndpointSlices are distributed as expected | If capable, monitor the churn of EndpointSlices during the rolling update |
| Traffic Reliability Test | E2E Test |1. Set annotation to `Require` and `Prefer` <br> 2. Set endpoints above `starting threshold + padding` <br> 3. Reuse the current traffic reachability test | Endpoints are reachable from different zones as expected | Step 2 is only needed for `Prefer` |

### Observability
Comment:

It would be good for administrators to know which services are "blowing out" of their zone. So, while it breaks tradition, I can't think of any way to do this besides a per-service metric.

This could also be an event, but that seems un-ergonomic.

Reply (Member):

How would this metric be determined? If endpoints are being contributed from one zone to another for a Service?

Reply:

That's what I was thinking, yes. Otherwise you have no way of knowing which services need to be "re-balanced". (I'm open to other suggestions - I just want to write an alert that says "service XXX may be costing you money")

Reply (Member):

You can label the metrics with the zone name, so you can build a graph that shows the metrics per zone. I did something similar for the cidr_sets in kubernetes/kubernetes#90288, so you can have metrics for each configured cidr_set.

Reply (Member):

I agree this would be a valuable metric to have, but I think per-Service metrics may end up becoming a scalability issue. If we made these per-zone we'd potentially end up with data evening out across all Services in that zone even if there were significant discrepancies within each Service. Any ideas @wojtek-t?

Reply (Member):

Would it be sufficient to note the number of imbalanced Services and then decorate the Service object itself (or EPSlices?) with an annotation or Condition to say "is imbalanced" ?

@squeed commented Oct 5, 2020:

So, I know that @thockin was considering it: do we want to formalize limiting the scope of topologyKeys? If this moves forward, it leaves us (as proposed) with only one real topology level (save node-local). Or do you intend to just take the cartesian product of keys and use that as the load-balancing input?

Edit: never mind, I'd missed https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/1659-standard-topology-labels

@robscott force-pushed the topology-aware-routing branch from 4e9fdf8 to e829f96 on October 6, 2020 at 00:27
@mzohreva commented Oct 6, 2020:

It would be useful to support same-node as well. A typical use case for same-node: Prefer is a database service. Application Pods would prefer to use the co-located database endpoint if possible.

Also, it would be useful to provide an option for advanced users to have more control over when traffic spills over to other regions. The current proposal makes it hard to reason about when spillover happens without knowing the details of the heuristics used.

originally posted in #2004 (comment)
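For context, the pattern being requested would look roughly like this (a sketch only; same-node is a hypothetical extension of the naming scheme discussed earlier and is not part of the current proposal):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-database                  # hypothetical database Service
  annotations:
    # Hypothetical node-scoped sibling of the proposed same-zone annotation.
    endpointslice.kubernetes.io/same-node: "Prefer"
spec:
  selector:
    app: my-database
  ports:
    - port: 5432
```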

and `Prefer` options, there is a chance that endpoints in that zone will be
overloaded while endpoints in other zones receive little to no traffic.
Without some sort of feedback (out of scope) this will not self-rectify.

Comment (Member):

@robscott I have a doubt: can a pod in ZoneB access a Service with service.kubernetes.io/same-zone: Require that only has endpoints in ZoneA?
I mean, the service-to-endpoint traffic will stay within the zone, but at least in kube-proxy the ClusterIP -> endpoint translation is done on the same node as the client pod. How will this traffic be redirected to the other zone?

Reply (Member):

It would not be able to. That approach relies on the user distributing workloads across all zones they intend to send traffic to. This could work well with some cross-sig work that would make this kind of scheduling easier and potentially update autoscaling to also take topology into account. I don't imagine this being a commonly used option at the beginning but I think it's still valuable for power users. I'll work on a better explanation for this in the KEP.

Reply (Member):

I've added some better clarification around this to the KEP now.

@robscott force-pushed the topology-aware-routing branch from e829f96 to 768906b on October 6, 2020 at 15:54
@robscott (Member) commented Oct 6, 2020:

@wojtek-t I've added you as an approver to this PR based on #2031 (comment). I've also added some notes about how this would interact with HPAs in the "Risks and Mitigations" section.

/assign @wojtek-t

@thockin (Member) commented Oct 6, 2020:

@mzohreva I think this is a poor fit for same-node because it assumes something about the impl. It's tricky, really.

If you look at this as JUST the API, then it makes sense. But to implement the API, we'd need to split the logic between slice controller and kube-proxy. Why? Imagine a 1K node cluster. Setting per-zone makes sense to implement via slice-subsetting. Setting per-node does NOT make sense to implement via slicing - you'd have 1000 slices. It makes more sense for that to be filtered in kube-proxy (perhaps sliced by zone first).

In order to make that work, you have to assume some coordination between components, both of which COULD be replaceable.

@andrewsykim has a different KEP which defines internalTrafficPolicy (parallel to external...) to cover this.

This goes to a question I asked somewhere - is this an API or is it a hint to the slice controller? If I had a custom EPslice controller (who knows why) and it did not support zone subsetting, is that controller "broken" or valid?

Another way to look at this - is the API specific to subslicing or is it more general than that?

See my up-coming review's top-level comment for more thoughts...

@thockin (Member) left a comment:

Here's what I am thinking. Less is more. Let's start with the smallest scope we can justify and expand as NEEDED.

IOW, treat this like a hint to the EPSlice controller. The only values it supports are "" (meaning the current behavior) and "Auto" (meaning "try to be smarter"). Auto is what you have defined here as "Prefer". Today we define it in terms of zone, but it really means "auto". If we decided to support multi-region, it would consider region. If we added sub-zone, it might consider that, too.
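A minimal sketch of that reduced surface, assuming the annotation key proposed earlier in this thread (the final key name was still open):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    # "" (or an absent annotation) keeps today's behavior; "Auto" asks the
    # controller to be smarter. Today that means zone-aware subsetting, but
    # the exact heuristic is intentionally unspecified.
    service.kubernetes.io/same-zone: "Auto"
spec:
  selector:
    app: my-app
  ports:
    - port: 80
```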

We can leave "require same zone" out for now. I don't feel comfortable enough with the use-cases to justify the sharp edges it exposes. We can always come back and add it if those become clearer.

Likewise, if this is really to become a contract (API) rather than a hint, we'll need to think about what guarantees we can make (Rob raised the issue of the EPSlice mirroring controller which doesn't even have topology information).

I'd rather get something in Alpha so we can start the clock on topologyKeys. I don't mind this sitting in alpha for a few releases if it has to.

Other notes:

  • This gate should be mutually exclusive with the older ServiceTopology (aka topologyKeys)
  • We need to define what happens if we find pods that do not have topology information - do they get left out entirely? Randomly assigned?
  • We can come back and add more hints (like the coefficients of the heuristic or even multiple heuristic options) IF and ONLY IF needed.

Thoughts?

@robscott force-pushed the topology-aware-routing branch from 777b47b to af577cf on October 6, 2020 at 23:29
@robscott (Member) commented Oct 6, 2020:

@thockin I like the simplicity of that approach. I've added one more commit to this PR that updates the KEP to take those changes into account.

@thockin added the tide/merge-method-squash label on Oct 6, 2020.
@thockin (Member) left a comment:

Just small nits - race the mergebot!

/approve
/lgtm


#### Interoperability

This annotation may not be set on a Service if the `topologyKeys` field is set.
Comment (Member):

Do we have any precedent for failing to run if two incompatible feature gates are specified?

Reply (Member):

I don't think so; I've made this language a bit more realistic.


#### Interoperability

This annotation may not be set on a Service if the `topologyKeys` field is set.
Comment (Member):

Detail what happens. E.g., if this is set on a Service which also has topologyKeys set, topologyKeys will take priority.
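For example, the conflict case might be spelled out like this (a sketch; names and values are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    # Ignored in this case: when both are set, topologyKeys takes priority.
    service.kubernetes.io/same-zone: "Auto"
spec:
  selector:
    app: my-app
  ports:
    - port: 80
  # Older alpha ServiceTopology field; given precedence over the annotation.
  topologyKeys:
    - "topology.kubernetes.io/zone"
    - "*"
```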

Reply (Member):

Also, can I task you with deprecating topologyKeys (comments and a release note)? :)

Reply (Member):

Updated to clarify that topologyKeys should be given precedence. Will take care of deprecating topologyKeys.

#### Feature Gate

This functionality will be guarded by the `TopologyAwareSubsetting` feature
gate. This gate may not be enabled if the `ServiceTopology` gate is enabled.
Comment (Member):

...or else?

Reply (Member):

Yeah, good point, changed this to should not.
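Operationally, the gate interplay might look like the following sketch, assuming the gates are surfaced through a component's featureGates configuration (kube-proxy shown here) the same way existing gates are; the gate names are the ones proposed in this KEP:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
featureGates:
  TopologyAwareSubsetting: true   # gate proposed by this KEP
  ServiceTopology: false          # should not be enabled at the same time
```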

@k8s-ci-robot added the lgtm label on Oct 6, 2020.
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ckyiricky, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Oct 6, 2020.
@robscott force-pushed the topology-aware-routing branch from af577cf to 0df0684 on October 6, 2020 at 23:52
@k8s-ci-robot removed the lgtm label on Oct 6, 2020.
@thockin (Member) commented Oct 6, 2020:

/lgtm

@k8s-ci-robot added the lgtm label on Oct 6, 2020.
@k8s-ci-robot merged commit 7ffe8bb into kubernetes:master on Oct 6, 2020.
@k8s-ci-robot added this to the v1.20 milestone on Oct 6, 2020.
@andrewsykim (Member):

> @andrewsykim has a different KEP which defines internalTrafficPolicy (parallel to external...) to cover this.

I had initially proposed this as an addition to the (old) ServiceTopology KEP, but at this point it probably deserves its own KEP. If we feel that node-local is definitely not going to be addressed with subsetting, I can start work on a separate KEP to tackle this more directly.

SergeyKanzhelev pushed a commit to SergeyKanzhelev/enhancements that referenced this pull request Jan 8, 2021
* KEP 2004: Evolving Topology Aware Routing for Services

* Updates related to PR feedback, some markdown cleanup

* Simplifying the KEP to focus on subsetting

Co-authored-by: Rob Scott <[email protected]>