Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-3015: PreferLocal traffic policy #3016

Closed
wants to merge 1 commit into from

Conversation

danwinship
Copy link
Contributor

  • One-line PR description:
    Addition of PreferLocal (ie, Local with fallback) to externalTrafficPolicy and internalTrafficPolicy

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 21, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: danwinship
To complete the pull request process, please assign dcbw, wojtek-t after the PR has been reviewed.
You can assign the PR to them by writing /assign @dcbw @wojtek-t in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Oct 21, 2021
@robscott
Copy link
Member

/cc

@k8s-ci-robot k8s-ci-robot requested a review from robscott October 21, 2021 21:03
Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth exploring how this intersects with topology, I think. Can we guarantee that all the endpoints in zone A are hinted "for zone A" ? If not, could we end up with a node that holds local endpoints which are dedicated to a different zone (because of imbalance) ?

@robscott @swetharepakula

load balancers do not balance between nodes themselves. (eg, perhaps
you are using a `NodePort` service, but only advertising a single
node IP for clients to connect to). Does not preserve source IP, but
guarantees the traffic is always delivered.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/guarantees.../will use any available endpoint to avoid delivery failure/

"guarantee" is such a strong word

- `PreferLocal`: Prioritizes efficiency and reliability. When
possible, traffic will be delivered to a local endpoint, but when no
local endpoints exist, it will be delivered to a remote endpoint
instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"If the upstream routing over-uses a node, endpoints on that node can become overloaded."

@danwinship
Copy link
Contributor Author

It's worth exploring how this intersects with topology, I think.

Well, I think PreferLocal should intersect in exactly the same way as Local does...

Can we guarantee that all the endpoints in zone A are hinted "for zone A" ? If not, could we end up with a node that holds local endpoints which are dedicated to a different zone (because of imbalance) ?

The topology KEP says "Both ExternalTrafficPolicy and InternalTrafficPolicy will be given precedence over topology aware routing."

I guess "precedence" is kind of vague; the current implementation is that traffic policy overrides topology, so if Node 1 in Zone A has two endpoints, and one is topology-hinted for Zone A and the other is hinted for Zone B, then Local traffic can go to either local endpoint, ignoring the topology hints.

The other interpretation would be "endpoints are filtered for traffic policy first, and then for topology second" (although there isn't actually "precedence" in that case since you'd get the same answer regardless of which order you filtered them in). But I don't think we want that, because the rule for assigning some endpoints to other zones is going to thwart the administrator's attempts to ensure that every node has a local endpoint. (eg, if Node 2 in Zone A has one endpoint which is hinted as being for Zone B, then if we apply both Local traffic policy and topology, then pods on Node 2 would not be able to reach the service, even though the administrator had made sure that there was an endpoint of the service on that node.)

@robscott
Copy link
Member

There's a lot of overlap in kubernetes/kubernetes#100313 with the discussion here. I think fundamentally we should consider traffic policy as having the highest precedence for internal or external traffic. When set to "Local", there is no fallback and therefore no ambiguity on interaction with other config. On the other hand, "Cluster" is really falling back to the next level of configuration. That means either topology filtered endpoints or all endpoints.

I think "PreferLocal" is really just trying to merge "Local" with a "Cluster" fallback. In that case, I think the behavior should line up with that. That means the following approximate logic:

  1. If trafficPolicy == "PreferLocal" && len(localEndpoints) > 0, use local endpoints exclusively
  2. If topology == "Auto" && len(hintedEndpoints) > 0, use hinted endpoints exclusively
  3. Use all endpoints

As mentioned here and in kubernetes/kubernetes#100313, this does mean that we either need to maintain separate filtered sets of internal and external endpoints in the shared proxy code, or we need to send the full list of endpoints to proxiers and then use shared code to filter internal and external endpoints separately. I think @danwinship is working on that logic.

@danwinship
Copy link
Contributor Author

danwinship commented Jan 15, 2022

Updated:

  • don't say "guarantee"
  • totally rewrote the bullet points at the end of the "External Traffic Policy" section to better describe the benefits and tradeoffs (also without saying "guarantee")
  • added a section on how it interacts with topology
  • [EDIT] updated versions in kep.yaml to be alpha in 1.24

Comment on lines 259 to 261
that happens, there's a good chance that the topology code will _also_
hit its fallback case (due to there being at least one zone which is
missing an endpoint), and so topology won't even get used anyway. At
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is true. For example, there may be a Service with 3 endpoints spread over 6 nodes. That would mean "Local" would fallback to "Cluster" ~1/2 of the time, but topology would consistently be helpful. Having a "PreferLocal" mode may result in a larger set of users than the traditional "Local" use cases, including some cases where users are not intending to deploy workloads on every possible node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there may be a Service with 3 endpoints spread over 6 nodes. That would mean "Local" would fallback to "Cluster" ~1/2 of the time

(I assume you meant PreferLocal would fall back to Cluster.)

including some cases where users are not intending to deploy workloads on every possible node.

Hm... I was not considering that as a use case, although it does make sense. It's the same behavior, you're just thinking of it as "Cluster with optimization" rather than "Local with fallback". I should add a User Story for that.

At any rate, the comment here about PreferLocal and Topology failing at the same time was purely "informational" and doesn't affect the actual behavior; the proposed rule ("use a local endpoint if you can, otherwise do exactly what you would have done if the policy was Cluster") will give the right answer whether or not Topology is in effect.

@astoycos
Copy link
Contributor

/cc

@danwinship
Copy link
Contributor Author

Updated:

  • added new User Story "PreferLocal as More-Efficient Cluster"
  • removed incorrect parenthetical comment about PreferLocal vs Topology
  • updated the "SLO" table at the end for the new use case

ignore topology, just like `Local` policy does.

When delivering packets remotely, it makes the most sense to have
`PreferLocal` policy make use of topology. (This is explicitly wanted
Copy link
Member

@andrewsykim andrewsykim Jan 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the reasons we scrapped PreferLocal in the other KEP is because the assumption was you can achieve the same behavior by setting internalTrafficPolicy: Cluster and topology-aware routing that strongly prefers node-local endpoints. What is the semantic difference between that and using PreferLocal?

(sorry if we already discussed this somewhere else, catching up on some context I missed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had node-level topology, then we probably would not need PreferLocal... but do we even have a plan for node-level topology yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably if we had node-level topology, then we would not need internalTrafficPolicy: Local either, since (unlike externalTrafficPolicy: Local) it provides no advantages other than locality.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had node-level topology, then we probably would not need PreferLocal... but do we even have a plan for node-level topology yet?

I guess this is a question for @robscott -- is "requires node" a thing? My current understanding so far is that topology-aware routing is more designed to handle traffic routing automatically so it should always fall back to the next preferred topology instead of dropping. This was one of the reasons we added internalTrafficPolicy: Local because there are cases where you actually want "local or drop", which fell outside the scope of topology-aware routing.

Fwiw I do think there's a lot of value in "PreferLocal" still -- it seems particularly more useful than something like "PreferZone". Routing locally has more specific use-cases as opposed to "route to closest zone" which is almost-always around cost-savings and reducing latency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll attach my main feedback here.

Let's consider: Why do people use [EI]TP ?

ETP: I know of 2 reasons. 1) "I need to preserve source IP". 2) "Efficiency".

ITP: I know of 2 reasons. 1) "I need to stay on this node (semantic correctness)". 2) "Efficiency".

For both classes of traffic, "Local" is either a hard requirement (i.e. "prefer" is a non-starter) or just an optimization over "Cluster".

I don't like that we put the responsibility onto our users tell us which optimizations we should be using. That seems like a total failure on our part. OF COURSE they want us to optimize if we can. The specific optimization that PreferLocal dictates is pretty black or white, which means there's not much room for being "smart". This then FORCES other implementations, which might be capable of more smartness, to play dumb.

For example, it's really not hard to imagine an implementation that sends the number of active flows per-endpoint to an aggregator, and then have that aggregator give the node-agents hints about which endpoints are overloaded. It doesn't even need to be super timely to have some impact, I bet. But the semantics of "PreferLocal" disallow that.

If we think this is the optimizatrion with the biggest bang for the buck, let's talk about how we can do something like if num_endpoints > (num_nodes/2) && have_local_eps() { use_local_eps() }. I just made that up - obviously we can tune the heuristic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PreferLocal is a loadbalancing/topology algorithm problem and not a traffic policy problem, hence this is not the right place to fix it

@aojea That's a reasonable distillation :)

@terinjokes thanks for the additional input. "Local" gets better with the terminating endpoints work that is in flight, but still there's a moment when it has to die for a successor to start. If we really want to make node-level "sidecars" a thing that is reliable, we need a way to "hand off".

But such a mechanism does (kind of) exist - the updateStrategy of a DaemonSet allows maxSurge to be positive, so we can start a new replica before killing an old one. This is subject to resource limitations on heavily packed nodes. If you think that is you, you can choose to run 2 DaemonSets (e.g. "dns-1" and "dns-a") so you pre-allocate space. Maybe DaemonSet actually needs "replicas" field to run N copies per target node...

Topology, as implemented today, is not really considering node. It could, and yes my proposed heuristic is far from perfect. But not considering node was sort of intentional. In my experience, such optimizations quietly become functional requirements. Apps (or SLOs) start to depend on the optimization, and when the optimization isn't available for some reason, apps break.

So my intuitive lean is to either NOT optimize that way, or else to make it really clear that it is an optimization, not a policy. If you depend on the optimization, then "prefer" is wrong, and we should focus on making "Local" work for you. If you don't depend on it, then you should let the system try to be smart, and not tell it how to do its job.

That said I really do understand how "PreferLocal" is attractive, especially while the system is NOT as smart as you want it to be. On this topic, I have cited this before: https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine . As you said Trying to be "smart" can potentially hotspot specific instances. It's an overly-sharp tool that SEEMS like a good idea, but probably isn't.

In some sense what you are asking for is not an abstract policy, but an implementation-centric toggle that says "I know what I am doing and I know how the implementation works". If we put that into the API we get to support it FOREVER. We reduce the degrees of freedom implementations have for doing good things, and we lock them into sub-optimal semantics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ITP: I know of 2 reasons. 1) "I need to stay on this node (semantic correctness)". 2) "Efficiency".

internalTrafficPolicy: Local doesn't solve the "efficiency" use case though, because it implies "I would rather lose packets than let this connection go to another node", which is never what you want for the "efficiency" use case.

I don't like that we put the responsibility onto our users tell us which optimizations we should be using. That seems like a total failure on our part. OF COURSE they want us to optimize if we can.

But optimize for what?

"Cluster" traffic policy optimizes under the assumption that your service will work best if kube-proxy distributes connections evenly across all the endpoints. This might be the best choice if each connection creates a lot of load on the endpoint, so overloading a local endpoint would slow response time down by more than forwarding requests to another node would.

"PreferLocal" traffic policy optimizes under the assumption that the service will work best if kube-proxy routes connections to the closest endpoint. This is the right choice for something like DNS where connections are short lived, and processing them doesn't require much CPU, so it doesn't matter if lots of connections get sent to the same endpoint all at once.

The specific optimization that PreferLocal dictates is pretty black or white, which means there's not much room for being "smart".

Yeah, because actually I lied above: "PreferLocal" doesn't say "route connections to the closest endpoint", it says "route connections to a local endpoint, or if you can't, then route it anywhere", which is better than "Cluster"'s behavior, but not really exactly what you want.

This then FORCES other implementations, which might be capable of more smartness, to play dumb.

In theory an implementation might be super smart and actively monitor the cluster to figure out the best way to route connections, but I think it's more likely that you'll get good behavior if we let the user give the proxy implementation hints. We just need to figure out the right set of hints.

I guess having traffic policy encode both behavioral changes ("preserve source IP", "must stay on node for semantic correctness") and efficiency hints is probably a bad approach. But in that case we should clarify that traffic policy is only describing desired behavior, not desired efficiency.

Copy link
Member

@aojea aojea Jan 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internalTrafficPolicy: Local doesn't solve the "efficiency" use case though, because it implies "I would rather lose packets than let this connection go to another node", which is never what you want for the "efficiency" use case.

I think that this is the key ^^^ , internalTrafficPolicy: Local doesn't have any value if doesn't fall back, or at least I can't understand when and how should I use it, but we want to keep the same semantics that it has for externalTrafficPolicy and we are running in circles because of that .... I would bet that when people say Local for internalTrafficPolicy they want to mean LocalWithFallback / PreferLocal

Copy link
Member

@thockin thockin Jan 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know of cases where falling back to a non-local endpoint is semantically wrong, and isn't acceptable. Setting a policy to "Local" ought to be a rule), not a hint. IF we support "PreferLocal", it has to be distinct from "Local". It still seems wrong to conflate policy and efficiency hints.

E.g. imagine a near-ideal implementation. It would gather stats about how many connections and bytes were sent per usint time. Maybe it would consider latency (cracking flows to recognize DNS or maybe just based on ports). Maybe it would consider CPU load on the server. It would use that to make a "good" choice. Hopefully "pods on this node" would be the choice whenever possible. We don't need a user to tell us that.

Now, we don't have a near-ideal implementation, so should we:

a) give a policy knob that a closer-to-ideal solution might wrestle with

b) give a policy knob that other implementations can choose to ignore (e.g. annotation);

c) let the user offer a hint that is explicitly a hint (which implementations can choose to respect or not)

d) start building a more ideal implementation

e) ignore this problem and hope it goes away

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know of cases where falling back to a non-local endpoint is semantically wrong, and isn't acceptable. Setting a policy to "Local" ought to be a rule

Fair enough

b) seems the simplest solution

```
<<[UNRESOLVED renaming ]>>

Should we add a new alias for `Local`, such as `PreserveSourceIP`, and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +229 to +231
As a platform owner, I want to create a Service that always
directs traffic to a logging daemon on the same node. Traffic
should never bounce to a daemon on another node.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Singleton pattern? I honestly tried to find an example and I couldn't

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you're saying.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be useful if you want to have a service with special characteristics per node... I can't find an example neither, I was thinking if maybe someone want to have an unique service in the cluster that only works in one node ...

Copy link
Member

@andrewsykim andrewsykim Jan 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that you want a unique service that only works in one node, it's that some daemons are designed as "side cars for a node". Some examples I've seen in that past:

  • A CoreDNS pod per node to reduce extra hop and reduce DNS latency
  • Ingress controller on every node to reduce extra hop, but you want to still connect using Service domain or clusterIP (as opposed to local node IP for example). This is typically the case where a user thinks service-mesh is too expensive for their use-cases but they want similar functionality.
  • Logging/telemetry/monitoring agent might receive events/data from pods on that node but it shouldn't receive from another node since that would skew the telemetry data.

Some of these use-cases could be solved today though by using downward API to get your local host IP and routing there, but it would be much easier to use a consistent Service cluster domain and have that route local all the time. This was the primary driver for internalTrafficPolicy: Local|PreferLocal in the original KEP

@thockin thockin self-assigned this Jan 21, 2022
Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I am -1 on this proposal. Rationale in a line-comment below.

ignore topology, just like `Local` policy does.

When delivering packets remotely, it makes the most sense to have
`PreferLocal` policy make use of topology. (This is explicitly wanted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll attach my main feedback here.

Let's consider: Why do people use [EI]TP ?

ETP: I know of 2 reasons. 1) "I need to preserve source IP". 2) "Efficiency".

ITP: I know of 2 reasons. 1) "I need to stay on this node (semantic correctness)". 2) "Efficiency".

For both classes of traffic, "Local" is either a hard requirement (i.e. "prefer" is a non-starter) or just an optimization over "Cluster".

I don't like that we put the responsibility onto our users tell us which optimizations we should be using. That seems like a total failure on our part. OF COURSE they want us to optimize if we can. The specific optimization that PreferLocal dictates is pretty black or white, which means there's not much room for being "smart". This then FORCES other implementations, which might be capable of more smartness, to play dumb.

For example, it's really not hard to imagine an implementation that sends the number of active flows per-endpoint to an aggregator, and then have that aggregator give the node-agents hints about which endpoints are overloaded. It doesn't even need to be super timely to have some impact, I bet. But the semantics of "PreferLocal" disallow that.

If we think this is the optimizatrion with the biggest bang for the buck, let's talk about how we can do something like if num_endpoints > (num_nodes/2) && have_local_eps() { use_local_eps() }. I just made that up - obviously we can tune the heuristic.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 2, 2022
@danwinship
Copy link
Contributor Author

I'm going to close this PR and push a new more topology-focused version

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. sig/network Categorizes an issue or PR as relevant to SIG Network. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants