
KEP 2004: Evolving Topology Aware Routing for Services #2005

Merged · 3 commits · Oct 6, 2020

Conversation

@ckyiricky (Contributor):

Kubernetes clusters are increasingly deployed in multi-zone environments, but network routing has not kept pace. To mitigate this, @robscott and I are adding an automatic topology-aware routing mechanism driven by a new Service annotation. The annotation will provide three routing options. For now, this proposal focuses only on topology-aware routing at the zone level.

This is closely related to both #536 and #1944; we may want to merge these into a single KEP.

Enhancement Issue: #2004
/sig network
/cc @andrewsykim @bowei @freehan @thockin
/assign @thockin

@k8s-ci-robot added the sig/network label on Sep 22, 2020.
@k8s-ci-robot added the cncf-cla: yes label on Sep 22, 2020.
@k8s-ci-robot (Contributor):

Hi @ckyiricky. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test, kind/kep, and size/XL labels on Sep 22, 2020.
@robscott (Member):

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Sep 22, 2020.
@robscott (Member):

/assign


This proposal builds off of the [EndpointSlice API](https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/).

We propose a new Service annotation: `endpointslice.kubernetes.io/same-zone`
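For illustration, setting the proposed annotation on a Service might look like the sketch below. The Service name and the `Prefer` value are assumptions here; the value options are discussed later in the KEP and in this thread.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service                 # hypothetical Service
  annotations:
    # Proposed annotation; both the key and its values were still under discussion.
    endpointslice.kubernetes.io/same-zone: "Prefer"
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```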
@aojea (Member) commented Sep 23, 2020:

Heh, naming party 😄. My vote is for endpointslice.kubernetes.io/zone-affinity.

Reply (Member):

Haha, this will be a tricky name to get right. I like the idea of zone-affinity; what would the values be here? I'm not sure Prefer or Require would make as much sense with this name.

Reply (Member):

What about hard, soft, and none?

This aligns with node affinity terminology: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity

Reply (Member):

One of our goals with this kind of naming pattern is that we could use the same approach for node and region (same-node, same-region). Unfortunately I think that would be difficult/confusing for an affinity-based name, since node affinity already has a different meaning in k8s. Still very interested in naming suggestions here, just hesitant to use something that might not work as well for node or region.

Reply (Member):

Is this really "about" the Service, or is it effectively a parameter pass-through to the controller(s)? It seems like the latter. Maybe it is "endpoint selection policy"? Or something.

Do we want to position this as "which endpoints we choose for a zone" or "which endpoints we will choose for a given client"? What makes most sense to people who will USE this rather than us who know how it works?

E.g. if I am an app-owner, deploying my service, I think I want to reason about "given an arbitrary client, which endpoints will be chosen". "Random" vs "PreferSameZone" vs "RequireSameZone" vs ...

Let's think through possible evolution. At some point we talked about exposing the different heuristics here, but it seems like maybe we don't need to (or maybe that is a different annotation).

We've already said that same-node is probably "different".

What would we name algorithms that consider load (e.g., suppose we collect the number of connections and balance on that)?

How else might this evolve? How does that inform how we will talk about this?

Reply (Member):

Finding the appropriate name here is definitely challenging. We actually explored options similar to what you're suggesting here. My first suggestion for an annotation was endpointslice.kubernetes.io/slicing-policy which seems similar to the concept of an endpoint selection policy. We ended up moving away from that since it seemed too closely tied to implementation and would not translate well for node-local config.

We also explored algorithm/policy names and created a doc with possible combinations. In this case it felt like there were simply too many potential combinations.

As I've thought about this KEP more, I think the title/scope may actually be a bit too broad here. This is fundamentally about trying to find smarter ways to subset EndpointSlices so that:

  • Consumers only need to process a fraction of endpoints (significant scalability win).
  • Consumers only get the endpoints closest to them.

This does not feel like it is the final solution for topology aware routing, just a step in the right direction. This will still allow more advanced logic to be built into consumers that could potentially make adjustments based on feedback/metrics. If we take that kind of a more limited approach to this KEP, maybe a narrower name/annotation would also make sense?

Comment on lines 194 to 203
Kube-Proxy will be updated to support EndpointSlice subsetting. Each instance
will watch for EndpointSlices that have a label indicating they should be
consumed by this zone. For more information, refer to the EndpointSlice
@aojea (Member) commented Sep 23, 2020:

Can we expand on this? How will kube-proxy know whether it has to install rules for endpoints in other zones or not?

Reply (Member):

I owe another KEP here. The general idea would be adding a label to EndpointSlices like endpointslice.kubernetes.io/for-zone: zone-a. Then kube-proxy could be configured to only subscribe to EndpointSlices with that label set to the same zone or not set at all.
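As a rough sketch of that idea (the label key is the one proposed in this comment and is not final; the slice contents are illustrative):

```yaml
# An EndpointSlice the controller intends to be consumed in zone-a.
apiVersion: discovery.k8s.io/v1beta1
kind: EndpointSlice
metadata:
  name: my-service-abc12                          # hypothetical name
  labels:
    kubernetes.io/service-name: my-service
    endpointslice.kubernetes.io/for-zone: zone-a  # proposed label, not final
addressType: IPv4
endpoints:
  - addresses: ["10.1.2.3"]
    topology:
      topology.kubernetes.io/zone: zone-a
ports:
  - name: http
    port: 8080
    protocol: TCP
```

Kube-proxy instances in zone-a would then watch only slices carrying `for-zone: zone-a` or no `for-zone` label at all.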

Reply (Member):

We need to clarify. Selectors are pretty dumb, and I don't think there's a way to express OR? So will we install multiple selectors? E.g.

one for "endpointslice.kubernetes.io/for-zone IN zone-a" and another for "endpointslice.kubernetes.io/for-zone NOTEXIST" ?

Is the plan to add a bunch of these conditions up front so that we don't have to keep editing the rules? E.g.

(no-node AND no-zone AND no-region)
(no-node AND no-zone AND my-region)
(no-node AND my-zone AND no-region)
(no-node AND my-zone AND my-region)
(my-node AND no-zone AND no-region)
(my-node AND no-zone AND my-region)
(my-node AND my-zone AND no-region)
(my-node AND my-zone AND my-region)
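For reference, the two selectors mentioned above could be sketched as label selector requirements like the following; since a single selector cannot express OR, they would have to be issued as separate watches (zone-a is illustrative):

```yaml
# Watch 1: slices explicitly targeted at this node's zone.
matchExpressions:
  - key: endpointslice.kubernetes.io/for-zone
    operator: In
    values: ["zone-a"]
---
# Watch 2: slices with no zone targeting at all.
matchExpressions:
  - key: endpointslice.kubernetes.io/for-zone
    operator: DoesNotExist
```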

Reply (Member):

Yeah, that is quite messy. I'd missed that there's no way to express OR with selectors. This combination of selectors could work, but it would be bad if it dramatically increased the number of informers in kube-proxy and the number of connections to the API server from each node.

This could be a compelling reason to add some kind of global label to slices that didn't apply to a zone or region. That would allow us to express all of this with a single selector but unfortunately would result in all EndpointSlices being updated whenever this upgrade happened. This would also require any EndpointSlice producers to be updated.

Reply (Contributor):

(no-node AND no-zone AND no-region)
(no-node AND no-zone AND my-region)
(no-node AND my-zone AND no-region)
(no-node AND my-zone AND my-region)
(my-node AND no-zone AND no-region)
(my-node AND no-zone AND my-region)
(my-node AND my-zone AND no-region)
(my-node AND my-zone AND my-region)

Surely if a Service targets my-node then it must also target either my-zone or no-zone, and likewise my-region or no-region?

And I'd assume that zones were likewise inside regions? So then it's just

(no-node AND no-zone AND no-region)
(no-node AND no-zone AND my-region)
(no-node AND my-zone)
(my-node)

Reply (Member):

Now that I've filed #2031, I think this discussion will actually make more sense there. In that proposal I ended up taking a similar approach to what @danwinship described above, with the exception that there is no support for node labels/delivery.

@ckyiricky force-pushed the topology-aware-routing branch 3 times, most recently from 79180f9 to c36b4a9 on September 23, 2020 at 23:36
@ckyiricky force-pushed the topology-aware-routing branch from c36b4a9 to aa87470 on September 24, 2020 at 21:28
@thockin (Member) left a comment:

The API and names are my main sticking point. I am not sure either way. Will think more; would love to hear arguments.

keps/sig-network/2004-topology-aware-routing/README.md (outdated, resolved)
traffic in zone.
- Provide different approaches to users who have a strong desire over the
routing policy:
- Balanced: Use the original approach to ensure traffic is distributed across
Comment (Member):

I'm still not sure "Balanced" vs "Spread" vs "Random"


We propose a new Service annotation: `service.kubernetes.io/same-zone`
with three options:
- `Balanced`: This represents the existing approach that is already used, no
Comment (Member):

Comment got lost: I am afraid "balanced" implies more than we actually guarantee. Would "Random" be a better choice here? We don't guarantee to balance traffic, just spread it.

Reply:

What about "Unaware"?

@thockin (Member) commented Oct 5, 2020:

Spray-n-Pray ? :D

@robscott (Member) commented Oct 5, 2020:

To go in a slightly different direction: let's say we had a different field/annotation that encompassed both zone and region and supported values like PreferZone.

Maybe that annotation is something like endpointslice.kubernetes.io/slicing-policy and maybe the supported values are:

  • One of None, Random, Shared, or Balanced
  • PreferZone
  • RequireZone

I don't think same-zone: Random or same-zone: Balanced make a lot of sense in comparison. This is a bit of a larger change though and changes the scope of this KEP + config from solving topology aware routing to defining a controller subsetting approach that helps implement it.

number of endpoints. If a zone does not have a sufficient number of endpoints,
traffic will be routed to other zones.

This same pattern could be used for topology aware routing at node and region
Comment (Member):

I am leaning pretty hard away from multiple knobs here. I think something more abstract makes more sense. Especially if you think about multi-cluster Services here. E.g. If the MCS thingy already does some form of subsetting, what does this policy mean? @JeremyOT

keps/sig-network/2004-topology-aware-routing/README.md (outdated, resolved)
| Rolling Update Test | Integration Test |1. Use integration framework to mock pods <br> 2. Set annotation to `Prefer` <br> 3. Conduct an aggressive rolling update policy for endpoints update <br> 4. Set endpoints to test corner cases | In the desired state, EndpointSlices are distributed as expected | If capable, monitor the churn of EndpointSlices during the rolling update |
| Traffic Reliability Test | E2E Test |1. Set annotation to `Require` and `Prefer` <br> 2. Set endpoints above `starting threshold + padding` <br> 3. Reuse the current traffic reachability test | Endpoints are reachable from different zones as expected | Step 2 is only needed for `Prefer` |

### Observability
Comment:

It would be good for administrators to know which services are "blowing out" of their zone. So, while it breaks tradition, I can't think of any way to do this besides a per-service metric.

This could also be an event, but that seems un-ergonomic.

Reply (Member):

How would this metric be determined? If endpoints are being contributed from one zone to another for a Service?

Reply:

That's what I was thinking, yes. Otherwise you have no way of knowing which services need to be "re-balanced". (I'm open to other suggestions - I just want to write an alert that says "service XXX may be costing you money")

Reply (Member):

You can label the metrics with the zone name, so you can build a graph that shows the metrics per zone. I did something similar for the cidr_sets in kubernetes/kubernetes#90288, so you can have metrics for each configured cidr_set.

Reply (Member):

I agree this would be a valuable metric to have, but I think per-Service metrics may end up becoming a scalability issue. If we made these per-zone we'd potentially end up with data evening out across all Services in that zone even if there were significant discrepancies within each Service. Any ideas @wojtek-t?

Reply (Member):

Would it be sufficient to note the number of imbalanced Services and then decorate the Service object itself (or EPSlices?) with an annotation or Condition to say "is imbalanced" ?

@squeed commented Oct 5, 2020:

So, I know that @thockin was considering it: do we want to formalize limiting the scope of topologyKeys? If this moves forward, it leaves us (as proposed) with only one real topology level (save node-local). Or do you intend to just take the cartesian product of keys and use that as the load-balancing input?

Edit: never mind, I'd missed https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/1659-standard-topology-labels

@robscott force-pushed the topology-aware-routing branch from 4e9fdf8 to e829f96 on October 6, 2020 at 00:27
@mzohreva commented Oct 6, 2020:

It would be useful to support same-node as well. A typical use case for same-node: Prefer is a database service. Application Pods would prefer to use the co-located database endpoint if possible.

Also, it would be useful to provide an option for advanced users to have more control over when traffic spills over to other regions. The current proposal makes it hard to reason about when spillover happens without knowing the details of the heuristics used.

originally posted in #2004 (comment)
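For context, the pattern being requested would look roughly like this (a sketch only; same-node is a hypothetical extension of the naming scheme discussed earlier and is not part of the current proposal):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-database                  # hypothetical database Service
  annotations:
    # Hypothetical node-scoped sibling of the proposed same-zone annotation.
    endpointslice.kubernetes.io/same-node: "Prefer"
spec:
  selector:
    app: my-database
  ports:
    - port: 5432
```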

and `Prefer` options, there is a chance that endpoints in that zone will be
overloaded while endpoints in other zones receive little to no traffic.
Without some sort of feedback (out of scope) this will not self-rectify.

Comment (Member):

@robscott I have a doubt: can a pod in ZoneB access a Service with service.kubernetes.io/same-zone: Require that only has endpoints in ZoneA?
I mean, the service-to-endpoint traffic will stay within the zone, but at least in kube-proxy the ClusterIP -> endpoint translation is done on the same node as the client pod. How will this traffic be redirected to the other zone?

Reply (Member):

It would not be able to. That approach relies on the user distributing workloads across all zones they intend to send traffic to. This could work well with some cross-sig work that would make this kind of scheduling easier and potentially update autoscaling to also take topology into account. I don't imagine this being a commonly used option at the beginning but I think it's still valuable for power users. I'll work on a better explanation for this in the KEP.

Reply (Member):

I've added some better clarification around this to the KEP now.

@robscott force-pushed the topology-aware-routing branch from e829f96 to 768906b on October 6, 2020 at 15:54
@robscott (Member) commented Oct 6, 2020:

@wojtek-t I've added you as an approver to this PR based on #2031 (comment). I've also added some notes about how this would interact with HPAs in the "Risks and Mitigations" section.

/assign @wojtek-t

@thockin (Member) commented Oct 6, 2020:

@mzohreva I think this is a poor fit for same-node because it assumes something about the impl. It's tricky, really.

If you look at this as JUST the API, then it makes sense. But to implement the API, we'd need to split the logic between slice controller and kube-proxy. Why? Imagine a 1K node cluster. Setting per-zone makes sense to implement via slice-subsetting. Setting per-node does NOT make sense to implement via slicing - you'd have 1000 slices. It makes more sense for that to be filtered in kube-proxy (perhaps sliced by zone first).

In order to make that work, you have to assume some coordination between components, both of which COULD be replaceable.

@andrewsykim has a different KEP which defines internalTrafficPolicy (parallel to external...) to cover this.

This goes to a question I asked somewhere - is this an API or is it a hint to the slice controller? If I had a custom EPslice controller (who knows why) and it did not support zone subsetting, is that controller "broken" or valid?

Another way to look at this - is the API specific to subslicing or is it more general than that?

See my up-coming review's top-level comment for more thoughts...

@thockin (Member) left a comment:

Here's what I am thinking. Less is more. Let's start with the smallest scope we can justify and expand as NEEDED.

IOW, treat this like a hint to the EPSlice controller. The only values it supports are "" (meaning the current behavior) and "Auto" (meaning "try to be smarter"). Auto is what you have defined here as "Prefer". Today we define it in terms of zone, but it really means "auto". If we decided to support multi-region, it would consider region. If we added sub-zone, it might consider that, too.
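A minimal sketch of that reduced surface, assuming the annotation key proposed earlier in this thread (the final key name was still open):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    # "" (or an absent annotation) keeps today's behavior; "Auto" asks the
    # controller to be smarter. Today that means zone-aware subsetting, but
    # the exact heuristic is intentionally unspecified.
    service.kubernetes.io/same-zone: "Auto"
spec:
  selector:
    app: my-app
  ports:
    - port: 80
```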

We can leave "require same zone" out for now. I don't feel comfortable enough with the use-cases to justify the sharp edges it exposes. We can always come back and add it if those become clearer.

Likewise, if this is really to become a contract (API) rather than a hint, we'll need to think about what guarantees we can make (Rob raised the issue of the EPSlice mirroring controller which doesn't even have topology information).

I'd rather get something in Alpha so we can start the clock on topologyKeys. I don't mind this sitting in alpha for a few releases if it has to.

Other notes:

  • This gate should be mutually exclusive with the older ServiceTopology (aka topologyKeys)
  • We need to define what happens if we find pods that do not have topology information - do they get left out entirely? Randomly assigned?
  • We can come back and add more hints (like the coefficients of the heuristic or even multiple heuristic options) IF and ONLY IF needed.

Thoughts?

@robscott force-pushed the topology-aware-routing branch from 777b47b to af577cf on October 6, 2020 at 23:29
@robscott (Member) commented Oct 6, 2020:

@thockin I like the simplicity of that approach. I've added one more commit to this PR that updates the KEP to take those changes into account.

@thockin added the tide/merge-method-squash label on Oct 6, 2020.
@thockin (Member) left a comment:

Just small nits - race the mergebot!

/approve
/lgtm


#### Interoperability

This annotation may not be set on a Service if the `topologyKeys` field is set.
Comment (Member):

Do we have any precedent for failing to run if two incompatible feature gates are specified?

Reply (Member):

I don't think so; I've made this language a bit more realistic.


#### Interoperability

This annotation may not be set on a Service if the `topologyKeys` field is set.
Comment (Member):

Detail what happens. E.g., if this is set on a Service which also has topologyKeys set, topologyKeys will take priority.
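For example, the conflict case might be spelled out like this (a sketch; names and values are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    # Ignored in this case: when both are set, topologyKeys takes priority.
    service.kubernetes.io/same-zone: "Auto"
spec:
  selector:
    app: my-app
  ports:
    - port: 80
  # Older alpha ServiceTopology field; given precedence over the annotation.
  topologyKeys:
    - "topology.kubernetes.io/zone"
    - "*"
```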

Reply (Member):

Also, can I task you with deprecating topologyKeys (comments and a release note)? :)

Reply (Member):

Updated to clarify that topologyKeys should be given precedence. Will take care of deprecating topologyKeys.

#### Feature Gate

This functionality will be guarded by the `TopologyAwareSubsetting` feature
gate. This gate may not be enabled if the `ServiceTopology` gate is enabled.
Comment (Member):

...or else?

Reply (Member):

Yeah, good point, changed this to should not.
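Operationally, the gate interplay might look like the following sketch, assuming the gates are surfaced through a component's featureGates configuration (kube-proxy shown here) the same way existing gates are; the gate names are the ones proposed in this KEP:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
featureGates:
  TopologyAwareSubsetting: true   # gate proposed by this KEP
  ServiceTopology: false          # should not be enabled at the same time
```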

@k8s-ci-robot added the lgtm label on Oct 6, 2020.
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ckyiricky, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Oct 6, 2020.
@robscott force-pushed the topology-aware-routing branch from af577cf to 0df0684 on October 6, 2020 at 23:52
@k8s-ci-robot removed the lgtm label on Oct 6, 2020.
@thockin (Member) commented Oct 6, 2020:

/lgtm

@k8s-ci-robot added the lgtm label on Oct 6, 2020.
@k8s-ci-robot merged commit 7ffe8bb into kubernetes:master on Oct 6, 2020.
@k8s-ci-robot added this to the v1.20 milestone on Oct 6, 2020.
@andrewsykim (Member):

> @andrewsykim has a different KEP which defines internalTrafficPolicy (parallel to external...) to cover this.

I had initially proposed this as an addition to the (old) ServiceTopology KEP, but at this point it probably deserves its own KEP. If we feel that node-local is definitely not going to be addressed with subsetting, I can start work on a separate KEP to tackle this more directly.

SergeyKanzhelev pushed a commit to SergeyKanzhelev/enhancements that referenced this pull request Jan 8, 2021
* KEP 2004: Evolving Topology Aware Routing for Services

* Updates related to PR feedback, some markdown cleanup

* Simplifying the KEP to focus on subsetting

Co-authored-by: Rob Scott <[email protected]>