KEP-3015: PreferLocal traffic policy #3016
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: danwinship

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel their approval by writing `/approve cancel` in a comment.
It's worth exploring how this intersects with topology, I think. Can we guarantee that all the endpoints in zone A are hinted "for zone A"? If not, could we end up with a node that holds local endpoints which are dedicated to a different zone (because of imbalance)?
> load balancers do not balance between nodes themselves. (eg, perhaps you are using a `NodePort` service, but only advertising a single node IP for clients to connect to). Does not preserve source IP, but guarantees the traffic is always delivered.
s/guarantees.../will use any available endpoint to avoid delivery failure/
"guarantee" is such a strong word
> - `PreferLocal`: Prioritizes efficiency and reliability. When possible, traffic will be delivered to a local endpoint, but when no local endpoints exist, it will be delivered to a remote endpoint instead.
"If the upstream routing over-uses a node, endpoints on that node can become overloaded."
Well, I think
The topology KEP says "Both ExternalTrafficPolicy and InternalTrafficPolicy will be given precedence over topology aware routing." I guess "precedence" is kind of vague; the current implementation is that traffic policy overrides topology, so if Node 1 in Zone A has two endpoints, and one is topology-hinted for Zone A and the other is hinted for Zone B, then local traffic policy will use both of them, ignoring the hints.

The other interpretation would be "endpoints are filtered for traffic policy first, and then for topology second" (although there isn't actually "precedence" in that case, since you'd get the same answer regardless of which order you filtered them in). But I don't think we want that, because the rule for assigning some endpoints to other zones is going to thwart the administrator's attempts to ensure that every node has a local endpoint. (eg, if Node 2 in Zone A has one endpoint which is hinted as being for Zone B, then if we apply both filters, that node would be left with no usable local endpoints.)
There's a lot of overlap in kubernetes/kubernetes#100313 with the discussion here. I think fundamentally we should consider traffic policy as having the highest precedence for internal or external traffic. When set to "Local", there is no fallback and therefore no ambiguity on interaction with other config. On the other hand, "Cluster" is really falling back to the next level of configuration. That means either topology-filtered endpoints or all endpoints. I think "PreferLocal" is really just trying to merge "Local" with a "Cluster" fallback. In that case, I think the behavior should line up with that. That means the following approximate logic:

- `Local`: use only local endpoints, with no fallback.
- `PreferLocal`: use local endpoints if any exist; otherwise fall through to `Cluster` behavior.
- `Cluster`: use topology-filtered endpoints if topology applies; otherwise use all endpoints.
As mentioned here and in kubernetes/kubernetes#100313, this does mean that we either need to maintain separate filtered sets of internal and external endpoints in the shared proxy code, or we need to send the full list of endpoints to proxiers and then use shared code to filter internal and external endpoints separately. I think @danwinship is working on that logic.
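To pin that down, here is a rough Go sketch of that approximate logic — my own illustration of the ordering described above, not code from kube-proxy; the type and function names are invented:

```go
package main

import "fmt"

// Endpoint is a simplified stand-in for a proxy endpoint: the node it runs
// on, plus the zones its topology hints cover (if any).
type Endpoint struct {
	Node      string
	HintZones map[string]bool
}

// filterForNode applies traffic policy first, then falls through to
// "Cluster" behavior: "Local" never falls back, "PreferLocal" uses local
// endpoints whenever any exist, and the Cluster fallback honors topology
// hints when they are present.
func filterForNode(policy, node, zone string, eps []Endpoint) []Endpoint {
	var local []Endpoint
	for _, ep := range eps {
		if ep.Node == node {
			local = append(local, ep)
		}
	}
	switch policy {
	case "Local":
		return local // possibly empty: no fallback, traffic is dropped
	case "PreferLocal":
		if len(local) > 0 {
			return local
		}
	}
	// "Cluster" behavior (also PreferLocal's fallback): topology-filtered
	// endpoints if any hints match this zone, otherwise all endpoints.
	var hinted []Endpoint
	for _, ep := range eps {
		if ep.HintZones[zone] {
			hinted = append(hinted, ep)
		}
	}
	if len(hinted) > 0 {
		return hinted
	}
	return eps
}

func main() {
	eps := []Endpoint{
		{Node: "node-1", HintZones: map[string]bool{"zone-a": true}},
		{Node: "node-2", HintZones: map[string]bool{"zone-b": true}},
	}
	// No local endpoint on node-3, so PreferLocal falls back to the
	// topology-filtered set for zone-a: one endpoint.
	fmt.Println(len(filterForNode("PreferLocal", "node-3", "zone-a", eps)))
}
```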
> that happens, there's a good chance that the topology code will _also_ hit its fallback case (due to there being at least one zone which is missing an endpoint), and so topology won't even get used anyway. At…
I'm not sure this is true. For example, there may be a Service with 3 endpoints spread over 6 nodes. That would mean "Local" would fall back to "Cluster" ~1/2 of the time, but topology would consistently be helpful. Having a "PreferLocal" mode may result in a larger set of users than the traditional "Local" use cases, including some cases where users are not intending to deploy workloads on every possible node.
> there may be a Service with 3 endpoints spread over 6 nodes. That would mean "Local" would fall back to "Cluster" ~1/2 of the time

(I assume you meant `PreferLocal` would fall back to `Cluster`.)
> including some cases where users are not intending to deploy workloads on every possible node.

Hm... I was not considering that as a use case, although it does make sense. It's the same behavior, you're just thinking of it as "`Cluster` with optimization" rather than "`Local` with fallback". I should add a User Story for that.
At any rate, the comment here about `PreferLocal` and Topology failing at the same time was purely "informational" and doesn't affect the actual behavior; the proposed rule ("use a local endpoint if you can, otherwise do exactly what you would have done if the policy was `Cluster`") will give the right answer whether or not Topology is in effect.
> ignore topology, just like `Local` policy does.
>
> When delivering packets remotely, it makes the most sense to have `PreferLocal` policy make use of topology. (This is explicitly wanted…
One of the reasons we scrapped `PreferLocal` in the other KEP is that the assumption was you could achieve the same behavior by setting `internalTrafficPolicy: Cluster` plus topology-aware routing that strongly prefers node-local endpoints. What is the semantic difference between that and using `PreferLocal`?
(sorry if we already discussed this somewhere else, catching up on some context I missed)
If we had node-level topology, then we probably would not need `PreferLocal`... but do we even have a plan for node-level topology yet?
Arguably if we had node-level topology, then we would not need `internalTrafficPolicy: Local` either, since (unlike `externalTrafficPolicy: Local`) it provides no advantages other than locality.
> If we had node-level topology, then we probably would not need `PreferLocal`... but do we even have a plan for node-level topology yet?

I guess this is a question for @robscott -- is "requires node" a thing? My current understanding so far is that topology-aware routing is more designed to handle traffic routing automatically, so it should always fall back to the next preferred topology instead of dropping. This was one of the reasons we added `internalTrafficPolicy: Local`, because there are cases where you actually want "local or drop", which fell outside the scope of topology-aware routing.
Fwiw I do think there's a lot of value in "PreferLocal" still -- it seems considerably more useful than something like "PreferZone". Routing locally has more specific use-cases, as opposed to "route to the closest zone", which is almost always about cost savings and reducing latency.
I'll attach my main feedback here.
Let's consider: Why do people use [EI]TP?
ETP: I know of 2 reasons. 1) "I need to preserve source IP". 2) "Efficiency".
ITP: I know of 2 reasons. 1) "I need to stay on this node (semantic correctness)". 2) "Efficiency".
For both classes of traffic, "Local" is either a hard requirement (i.e. "prefer" is a non-starter) or just an optimization over "Cluster".
I don't like that we put the responsibility onto our users to tell us which optimizations we should be using. That seems like a total failure on our part. OF COURSE they want us to optimize if we can. The specific optimization that PreferLocal dictates is pretty black or white, which means there's not much room for being "smart". This then FORCES other implementations, which might be capable of more smartness, to play dumb.
For example, it's really not hard to imagine an implementation that sends the number of active flows per-endpoint to an aggregator, and then have that aggregator give the node-agents hints about which endpoints are overloaded. It doesn't even need to be super timely to have some impact, I bet. But the semantics of "PreferLocal" disallow that.
If we think this is the optimization with the biggest bang for the buck, let's talk about how we can do something like `if num_endpoints > (num_nodes/2) && have_local_eps() { use_local_eps() }`. I just made that up - obviously we can tune the heuristic.
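For concreteness, here is that made-up heuristic transcribed literally into Go — every name is an invented placeholder, not a real kube-proxy API:

```go
package heuristic

// chooseEndpoints sketches the ad-hoc rule above: prefer node-local
// endpoints only when the service is "dense" (more endpoints than half the
// node count), so a sparse deployment doesn't hotspot its few local pods.
// The threshold is illustrative and meant to be tuned.
func chooseEndpoints(allEndpoints, localEndpoints []string, numNodes int) []string {
	if len(allEndpoints) > numNodes/2 && len(localEndpoints) > 0 {
		return localEndpoints
	}
	return allEndpoints
}
```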
> PreferLocal is a loadbalancing/topology algorithm problem and not a traffic policy problem, hence this is not the right place to fix it
@aojea That's a reasonable distillation :)
@terinjokes thanks for the additional input. "Local" gets better with the terminating endpoints work that is in flight, but still there's a moment when it has to die for a successor to start. If we really want to make node-level "sidecars" a thing that is reliable, we need a way to "hand off".
But such a mechanism does (kind of) exist - the `updateStrategy` of a DaemonSet allows `maxSurge` to be positive, so we can start a new replica before killing an old one. This is subject to resource limitations on heavily packed nodes. If you think that is you, you can choose to run 2 DaemonSets (e.g. "dns-1" and "dns-a") so you pre-allocate space. Maybe DaemonSet actually needs a "replicas" field to run N copies per target node...
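Roughly, with the `apps/v1` Go types, the surge-based hand-off described above looks like this — a sketch of the configuration, not anything from this KEP:

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// surgeStrategy starts one replacement pod per node before the old one is
// killed, giving the "hand off" behavior discussed above. The surge pod
// needs spare capacity on the node, hence the caveat about packed nodes.
func surgeStrategy() appsv1.DaemonSetUpdateStrategy {
	maxSurge := intstr.FromInt(1)
	maxUnavailable := intstr.FromInt(0)
	return appsv1.DaemonSetUpdateStrategy{
		Type: appsv1.RollingUpdateDaemonSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDaemonSet{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
}
```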
Topology, as implemented today, is not really considering node. It could, and yes my proposed heuristic is far from perfect. But not considering node was sort of intentional. In my experience, such optimizations quietly become functional requirements. Apps (or SLOs) start to depend on the optimization, and when the optimization isn't available for some reason, apps break.
So my intuitive lean is to either NOT optimize that way, or else to make it really clear that it is an optimization, not a policy. If you depend on the optimization, then "prefer" is wrong, and we should focus on making "Local" work for you. If you don't depend on it, then you should let the system try to be smart, and not tell it how to do its job.
That said I really do understand how "PreferLocal" is attractive, especially while the system is NOT as smart as you want it to be. On this topic, I have cited this before: https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine . As you said, "Trying to be 'smart' can potentially hotspot specific instances". It's an overly-sharp tool that SEEMS like a good idea, but probably isn't.
In some sense what you are asking for is not an abstract policy, but an implementation-centric toggle that says "I know what I am doing and I know how the implementation works". If we put that into the API we get to support it FOREVER. We reduce the degrees of freedom implementations have for doing good things, and we lock them into sub-optimal semantics.
> ITP: I know of 2 reasons. 1) "I need to stay on this node (semantic correctness)". 2) "Efficiency".

`internalTrafficPolicy: Local` doesn't solve the "efficiency" use case though, because it implies "I would rather lose packets than let this connection go to another node", which is never what you want for the "efficiency" use case.
> I don't like that we put the responsibility onto our users to tell us which optimizations we should be using. That seems like a total failure on our part. OF COURSE they want us to optimize if we can.
But optimize for what?
"Cluster" traffic policy optimizes under the assumption that your service will work best if kube-proxy distributes connections evenly across all the endpoints. This might be the best choice if each connection creates a lot of load on the endpoint, so overloading a local endpoint would slow response time down by more than forwarding requests to another node would.
"PreferLocal" traffic policy optimizes under the assumption that the service will work best if kube-proxy routes connections to the closest endpoint. This is the right choice for something like DNS where connections are short lived, and processing them doesn't require much CPU, so it doesn't matter if lots of connections get sent to the same endpoint all at once.
> The specific optimization that PreferLocal dictates is pretty black or white, which means there's not much room for being "smart".
Yeah, because actually I lied above: "PreferLocal" doesn't say "route connections to the closest endpoint", it says "route connections to a local endpoint, or if you can't, then route it anywhere", which is better than "Cluster"'s behavior, but not really exactly what you want.
> This then FORCES other implementations, which might be capable of more smartness, to play dumb.
In theory an implementation might be super smart and actively monitor the cluster to figure out the best way to route connections, but I think it's more likely that you'll get good behavior if we let the user give the proxy implementation hints. We just need to figure out the right set of hints.
I guess having traffic policy encode both behavioral changes ("preserve source IP", "must stay on node for semantic correctness") and efficiency hints is probably a bad approach. But in that case we should clarify that traffic policy is only describing desired behavior, not desired efficiency.
> `internalTrafficPolicy: Local` doesn't solve the "efficiency" use case though, because it implies "I would rather lose packets than let this connection go to another node", which is never what you want for the "efficiency" use case.

I think that this is the key ^^^: `internalTrafficPolicy: Local` doesn't have any value if it doesn't fall back, or at least I can't understand when and how I should use it. But we want to keep the same semantics that it has for externalTrafficPolicy, and we are running in circles because of that... I would bet that when people say Local for internalTrafficPolicy they really mean LocalWithFallback / PreferLocal.
I know of cases where falling back to a non-local endpoint is semantically wrong, and isn't acceptable. Setting a policy to "Local" ought to be a rule, not a hint. IF we support "PreferLocal", it has to be distinct from "Local". It still seems wrong to conflate policy and efficiency hints.

E.g. imagine a near-ideal implementation. It would gather stats about how many connections and bytes were sent per unit time. Maybe it would consider latency (cracking flows to recognize DNS, or maybe just based on ports). Maybe it would consider CPU load on the server. It would use that to make a "good" choice. Hopefully "pods on this node" would be the choice whenever possible. We don't need a user to tell us that.
Now, we don't have a near-ideal implementation, so should we:

a) give a policy knob that a closer-to-ideal solution might wrestle with;
b) give a policy knob that other implementations can choose to ignore (e.g. an annotation);
c) let the user offer a hint that is explicitly a hint (which implementations can choose to respect or not);
d) start building a more ideal implementation;
e) ignore this problem and hope it goes away.
> I know of cases where falling back to a non-local endpoint is semantically wrong, and isn't acceptable. Setting a policy to "Local" ought to be a rule
Fair enough
b) seems the simplest solution
> <<[UNRESOLVED renaming ]>>
>
> Should we add a new alias for `Local`, such as `PreserveSourceIP`, and…
> As a platform owner, I want to create a Service that always directs traffic to a logging daemon on the same node. Traffic should never bounce to a daemon on another node.
Singleton pattern? I honestly tried to find an example and I couldn't
I'm not sure what you're saying.
this can be useful if you want to have a service with special characteristics per node... I can't find an example either; I was thinking maybe someone would want to have a unique service in the cluster that only works on one node...
It's not that you want a unique service that only works on one node, it's that some daemons are designed as "sidecars for a node". Some examples I've seen in the past:
- A CoreDNS pod per node to reduce extra hop and reduce DNS latency
- Ingress controller on every node to reduce extra hop, but you want to still connect using Service domain or clusterIP (as opposed to local node IP for example). This is typically the case where a user thinks service-mesh is too expensive for their use-cases but they want similar functionality.
- Logging/telemetry/monitoring agent might receive events/data from pods on that node but it shouldn't receive from another node since that would skew the telemetry data.
Some of these use-cases could be solved today by using the downward API to get your local host IP and routing there, but it would be much easier to use a consistent Service cluster domain and have that route locally all the time. This was the primary driver for `internalTrafficPolicy: Local|PreferLocal` in the original KEP.
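To illustrate the downward-API workaround mentioned above — a sketch using the `core/v1` Go types, where `NODE_IP` is an arbitrary name chosen for the example:

```go
package example

import corev1 "k8s.io/api/core/v1"

// nodeIPEnv exposes the node's IP to a container via the downward API, so
// the pod can reach a node-local daemon directly instead of relying on a
// PreferLocal-style Service.
func nodeIPEnv() corev1.EnvVar {
	return corev1.EnvVar{
		Name: "NODE_IP",
		ValueFrom: &corev1.EnvVarSource{
			FieldRef: &corev1.ObjectFieldSelector{
				FieldPath: "status.hostIP",
			},
		},
	}
}
```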
Overall I am -1 on this proposal. Rationale in my line-comment above.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
I'm going to close this PR and push a new, more topology-focused version.
Addition of `PreferLocal` (ie, `Local` with fallback) to `externalTrafficPolicy` and `internalTrafficPolicy`, based on discussion in KEP-2086: Add internalTrafficPolicy: PreferLocal #3010, which I'll close now.
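As a sketch of what the proposal would look like from the API side, using the `core/v1` Go types — note that the `PreferLocal` value is what this KEP proposes and does not exist in the released API:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// "PreferLocal" is the value proposed by this KEP; the released API
	// only defines "Cluster" and "Local" today.
	preferLocal := corev1.ServiceInternalTrafficPolicyType("PreferLocal")

	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "node-local-dns"},
		Spec: corev1.ServiceSpec{
			Selector:              map[string]string{"app": "node-local-dns"},
			Ports:                 []corev1.ServicePort{{Name: "dns", Port: 53}},
			InternalTrafficPolicy: &preferLocal,
		},
	}
	fmt.Printf("%s: internalTrafficPolicy=%s\n", svc.Name, *svc.Spec.InternalTrafficPolicy)
}
```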