KEP-1669: updates for ProxyTerminatingEndpoints feature in v1.24 #3174
Conversation
Force-pushed from 1c219bd to 81320e4
* update kube-proxy endpoints info to track endpoint readiness based on endpoint.condition.ready in EndpointSlice
* within the scope of the traffic policy for a Service, iterate the following sets of endpoints, picking the first set that has at least 1 ready endpoint:
  * ready endpoints that are not terminating
  * ready endpoints that are terminating
Worth noting here that in the previous version of this KEP, we said we would fall back to not-ready terminating endpoints as a last resort. After thinking about this more, I think it's better to only fall back to ready terminating endpoints. It's generally more predictable and safer IMO. Thoughts?
SGTM
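To make that concrete, here is a minimal, purely illustrative sketch of the proposed selection order (the types and helper below are made up for this comment, not kube-proxy's actual data structures; "ready" for a terminating endpoint corresponds to the serving condition discussed later in this thread):

```go
package main

import "fmt"

// Illustrative sketch, not the actual kube-proxy code: partition the endpoints
// for a Service (already scoped to its traffic policy) into tiers and use the
// first tier that has at least one endpoint. Per the comment above, not-ready
// terminating endpoints are never used as a fallback.
type endpoint struct {
	addr        string
	ready       bool // readiness checks passing (the "serving" condition for terminating pods)
	terminating bool // pod has a deletion timestamp
}

func selectEndpoints(all []endpoint) []endpoint {
	var readyNotTerminating, readyTerminating []endpoint
	for _, ep := range all {
		switch {
		case ep.ready && !ep.terminating:
			readyNotTerminating = append(readyNotTerminating, ep)
		case ep.ready && ep.terminating:
			readyTerminating = append(readyTerminating, ep)
		}
	}
	if len(readyNotTerminating) > 0 {
		return readyNotTerminating
	}
	return readyTerminating // may be empty, in which case traffic is dropped
}

func main() {
	eps := []endpoint{
		{addr: "10.0.0.1", ready: true, terminating: true},
		{addr: "10.0.0.2", ready: false, terminating: true},
	}
	fmt.Println(selectEndpoints(eps)) // only 10.0.0.1 is selected
}
```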
Force-pushed from 81320e4 to 8e5b24e
* feature is only enabled if the feature gate `ProxyTerminatingEndpoints` is on.
* unit tests in kube-proxy.

#### Beta
Added the Beta criteria here to keep it in mind going forward, but I think target for v1.24 is alpha.
@@ -26,7 +26,7 @@ stage: alpha
# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.22"
Alternatively, we can try to push the existing feature as-is to Beta in v1.24, which only handles external traffic.
naah, let's sync it up
/assign @thockin @robscott @bowei @danwinship @aojea
Currently there are several workarounds that users can leverage:
* Use Kubernetes scheduling/deployment features such that a node would never only have terminating endpoints. For example, always scheduling two pods on a node and only allowing 1 pod to update at a time.
* Reducing the load balancer health check interval to very small time windows. This may not always be possible based on the load balancer implementation.
is this an option for users on all Service LoadBalancer implementations, or just a few of them?
yeah, I am not sure this applies to more than a handful of settings.
Thanks!
/lgtm
/hold to get more eyeballs
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, thockin

The full list of commands accepted by this bot can be found here. The pull request process is described here.
### Risks and Mitigations

There are scalability implications to tracking termination state in EndpointSlice. For now we are assuming that the performance trade-offs are worthwhile but future testing may change this decision. See KEP 1672 for more details.
this was brought up during the last sig-net meeting; maybe it's worth merging the API and the implementation into the same KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/1672-tracking-terminating-endpoints
but I really don't know the consequences or the feasibility
I originally split them off because there are going to be many ways to consume the new terminating condition outside of the way kube-proxy will use it. For example, ingress controllers may want to implement their own termination logic.
However, I do agree with pacing those two KEPs together, since there's a dependency here on KEP-1672.
from the load balancer. The worst-case scenario is a node with 1 local endpoint and a load balancer with a long health check interval.

Currently there are several workarounds that users can leverage:
* Use Kubernetes scheduling/deployment features such that a node would never only have terminating endpoints. For example, always scheduling two pods on a node and only allowing 1 pod to update at a time.
I think that this is what PDB tries to solve https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
Yeah, but relying solely on pod placement to avoid network loss during rolling updates is not good enough
### Goals

* Reduce potential traffic loss from kube-proxy that occurs on rolling updates because trafffic is sent to Pods that are terminating.
Suggested change (trafffic → traffic):
* Reduce potential traffic loss from kube-proxy that occurs on rolling updates because traffic is sent to Pods that are terminating.
thanks, fixed
nit comments from someone reading keps friday night :)

@danwinship has a related KEP in flight #3016
It will need more PRR work for Beta, but given it stays in Alpha, it looks fine.
@danwinship has a related KEP in flight #3016
we should wait for him to avoid possible conflicts
I don't think this will conflict much with PreferLocal, but any implementation PRs should probably wait until after kubernetes/kubernetes#106497 merges.
### Example: all endpoints terminating and traffic policy is "Cluster"

When the traffic policy is "Cluster" and all endpoints are terminating, then traffic should be routed to any terminating endpoint that is ready.
This is a change in this updated version of the KEP, isn't it? The current code only implements terminating endpoint support for "Local" traffic policy.
Why would someone want this semantic? In the "Local" case, a single node might have only terminating endpoints because of a rolling update. But in the "Cluster" case, if the entire cluster has only terminating endpoints, that suggests that the service is being shut down entirely.
(You also don't have a user story for the "Cluster" case.)
Why would someone want this semantic? In the "Local" case, a single node might have only terminating endpoints because of a rolling update. But in the "Cluster" case, if the entire cluster has only terminating endpoints, that suggests that the service is being shut down entirely.
I agree it's less useful for Cluster, but you can still end up with scenarios where all endpoints are terminating. For example, all pods might be on the same node that is draining, or a 1-pod StatefulSet, etc.
my main concern here is whether this needs additional logic that we have to support, increasing the bug surface area... the use case seems very rare... but if this adds new branches in the code I think we should evaluate the cost
if you have a 1 pod service, then you presumably don't expect it to be "high-availability" in the face of updates / node drains / etc.
if you have a 1 pod service, then you presumably don't expect it to be "high-availability" in the face of updates / node drains / etc.
Sure, this is just one example where this can happen though. It is also more likely to occur on smaller clusters (like 1-3 nodes)
### Risks and Mitigations

There are scalability implications to tracking termination state in EndpointSlice. For now we are assuming that the performance trade-offs are worthwhile but future testing may change this decision. See KEP 1672 for more details.
linkify that
#### Story 1

As a user I would like to do a rolling update of a Deployment fronted by a Service Type=LoadBalancer with ExternalTrafficPolicy=Local.
If the only pod of said Deployment on a node goes into the `Terminating` state, all traffic to that node is dropped until either a new pod comes up or my cloud provider removes the node from the load balancer's node pool. Ideally the terminating pod should gracefully handle traffic to this node until either of those conditions is satisfied.
This user story is too tied to the specific feature proposal. I think the real user story is "I would like to do a rolling update of a Deployment fronted by a Service Type=LoadBalancer with ExternalTrafficPolicy=Local without having any noticeable interruption of the service".
And ProxyTerminatingEndpoints, as it currently exists, does not do a great job of addressing that user story. It makes it less likely that there will be an interruption, but people don't want just "less likely".
That's why I started the PreferLocal KEP; because if you want "efficient and no disruption" and don't need "preserves source IPs" then it's a much better solution than ProxyTerminatingEndpoints.
I think if we want to be able to do Local with no disruption, we need a feature that integrates more directly with the load balancer health checking somehow and actually ensures that the endpoint remains active until after the load balancer forgets about it.
It makes it less likely that there will be an interruption, but people don't want just "less likely".
Can you expand on this a bit? Nothing is guaranteed, but what are the gaps right now that make it not "less likely" enough?
I think if we want to be able to do Local with no disruption, we need a feature that integrates more directly with the load balancer health checking somehow and actually ensures that the endpoint remains active until after the load balancer forgets about it.
The current feature as-is does interact with load balancer health checking, in that it fails the load balancer health check if all endpoints are terminating, but it still routes traffic to terminating ones in case the LB's health check is delayed (see https://github.com/kubernetes/kubernetes/blob/03fc2eeda270c9244dc23c75424db3a20bf61d7b/pkg/proxy/endpoints.go#L396). This way you fail the health check immediately when you only have terminating endpoints, but you allow any traffic in case there are delays between health check probes. This should reduce traffic loss pretty significantly assuming SIGTERM is handled by the app.
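Roughly, that interaction could be sketched like this (a hedged sketch with invented helper and handler names, not the real kube-proxy healthcheck server; only the default healthz port 10256 is a real kube-proxy default):

```go
package main

import (
	"net/http"
	"strconv"
)

// Sketch: the node's LB health check endpoint reports 503 as soon as there are
// zero ready, non-terminating local endpoints, so the load balancer starts
// draining the node, while the data path can keep forwarding to
// terminating-but-still-serving endpoints until the LB actually reacts.
func newHealthzHandler(readyNonTerminatingLocal func() int) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		n := readyNonTerminatingLocal()
		if n == 0 {
			w.WriteHeader(http.StatusServiceUnavailable) // tell the LB to remove this node
		} else {
			w.WriteHeader(http.StatusOK)
		}
		_, _ = w.Write([]byte("localEndpoints: " + strconv.Itoa(n)))
	}
}

func main() {
	count := func() int { return 0 } // e.g. all local endpoints are terminating
	http.Handle("/healthz", newHealthzHandler(count))
	_ = http.ListenAndServe(":10256", nil) // 10256 is kube-proxy's default healthz port
}
```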
This should reduce traffic loss pretty significantly assuming SIGTERM is handled by the app
assuming the app handles SIGTERM, and assuming that the pod's terminationGracePeriodSeconds is larger than EndpointsController lag + kube-proxy Endpoints-update processing lag + load balancer health check interval, which the pod has no way of ensuring.
I dunno. Maybe 30s is pretty much guaranteed to be enough... I had been thinking we tested this in OpenShift and found that it didn't always work, but maybe we just didn't have fully-working ProxyTerminatingEndpoints support at that point...
Back when the alpha implementation of this feature was being developed, there was some testing done on OpenShift, see kubernetes/kubernetes#97238 (comment) -- although that comment is pretty old so maybe things have changed. But my impression was that the feature was addressing the issue. If you're seeing otherwise on OpenShift, I'd love to dig deeper and figure out why.
I would also think 30s is plenty, but it does depend on a lot of factors like you mentioned. But if we assume that a Pod handles SIGTERM and then actually serves traffic during the full 30s before terminating, I would think it should work pretty well.
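As a rough, hedged illustration of that timing argument, here is a back-of-the-envelope check where every lag value is an assumption made up for this comment (only the 30s grace period is a real Kubernetes default):

```go
package main

import (
	"fmt"
	"time"
)

// Back-of-the-envelope check with assumed numbers; real values vary per
// cluster, cloud provider, and load balancer configuration.
func main() {
	const (
		terminationGracePeriod  = 30 * time.Second // Kubernetes default for pods
		endpointSliceLag        = 2 * time.Second  // assumption: controller notices the terminating pod
		kubeProxySyncLag        = 3 * time.Second  // assumption: kube-proxy programs the data plane
		lbProbeInterval         = 10 * time.Second // assumption
		failedProbesToUnhealthy = 2                // assumption: probes needed before the LB drains the node
	)
	worstCase := endpointSliceLag + kubeProxySyncLag + failedProbesToUnhealthy*lbProbeInterval
	fmt.Printf("worst-case drain time %v, grace period %v, covered: %v\n",
		worstCase, terminationGracePeriod, terminationGracePeriod > worstCase)
}
```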
Force-pushed from 8e5b24e to bea551e
@danwinship @aojea gonna defer LGTM to both of you, seems like there are two outstanding issues:
I prefer to be conservative here and keep the current behavior; once the feature gate is removed, the Cluster traffic will behave differently.
Dan would know better, but I think that ovn-kubernetes is replacing kube-proxy soon in OpenShift; I don't know if it is easy to check this change now.
I don't think we could pull this off as a bug-fix in the future; it would probably require a new feature gate starting at alpha -- which was why I thought it was worthwhile to consolidate the effort here. I see your point about being conservative here though 🤔
why not? If there is a reasonable and reproducible bug because the traffic somehow doesn't work correctly or is dropped because of this... this is a kube-proxy implementation detail; the ipvs version, for example, has different implementation details, same as ovnk and others... my personal opinion though.

The thing is that we are not able to come up with a solid reason to change the traffic for Cluster. We've speculated about several scenarios and I don't think we are convinced that this is going to fix something, so there is a possibility that instead we can break something... like the conntrack logic for the UDP stale connections 😨
That's fair, I'm fine to defer it, but I really don't think it can be considered a bug fix at that point, because it would be introducing new functionality/behavior into kube-proxy, not fixing an issue with an existing one.
Imagine a pod that does not close its listening socket when it receives SIGTERM, but it starts doing other cleanup, which has the side effect that if you were to make a request to the pod 10 seconds after SIGTERM, it would return error messages in response to valid requests. In current kubernetes, those semantics would never be noticed and would not cause problems. If kube-proxy started using terminating endpoints for Cluster services, then the service would become broken whenever it had only terminating endpoints. (Though arguably this was true for adding PTE to Local services too...)

Anyway, I still don't think there's a good use case for Cluster policy PTE; it is one thing to go to extra lengths to ensure that a service remains available when it seems like that's what the user wanted. But if there's a service with only 1 endpoint, or with endpoints on only 1 node which is about to go away, then I think we can say "clearly the user didn't expect this service to have 100% uptime", and just let it fail.
My main goal with advocating for this at Cluster was simplicity. It should be simpler to explain.

Dan, I think your "10 seconds after SIGTERM" error case is true today, in that a kube-proxy could be out-to-lunch (control plane) while the data plane still routed traffic to such pods. For a service which has only terminating endpoints, the choice is really "blackhole" or "try to use terminating endpoints". I don't think PTE for cluster changes anything except when that happens.

Note: we only use LB healthchecks when ETP is local. We could use it for ETP=cluster, but that is a big change, I think.

Assume a cloud LB -> node -> cluster setup. Today we get:

If we did PTE for cluster:

The difference is in t4-t5.

Is that at all convincing? :)
why should you use Cluster for that when we have a canonical way of using LB for these setups with Local?
I take no real opinion on whether one should use Cluster or Local for this sort of setting, but both are valid.

If PTE-Cluster is net simpler to document and explain ("we always use ready endpoints unless none exist, in which case we use terminating endpoints if they are still serving" vs all sorts of conditional clauses around policy) and the only real difference is when we switch from using terminating endpoints to blackholes, is that justifiable?

NOTE: I haven't tried to look at the code for this, but I can't see how fewer special cases won't be better.
well, this is alpha, we can always try and see what happens, it seems the best way to figure it out
Agreed, more than happy to discuss this in more detail during the implementation PR when we're in front of the actual code :)

/hold cancel
## Proposal

This KEP proposes that if all endpoints for a given Service scoped to its traffic policy are terminating (i.e. pod.DeletionTimestamp != nil), then all traffic should be sent to terminating Pods that are still Ready. Note that the EndpointSlice API introduced a new condition called "Serving" which is semantically equivalent to "Ready" except that the Ready condition must always be "False" for terminating pods.
nit: I found it very confusing to read this paragraph, which claims "the Ready condition must always be "False" for terminating pods", and then the following examples, which make statements like "then traffic should be routed to any terminating endpoint that is ready."
I think what you mean in the examples is readiness checks, not the Ready condition. Could we note that distinction between the use of "ready endpoints" in most of the text vs. the "Ready" condition in the EndpointSlice API, and that the document mostly uses the former to mean readiness checks, not the Ready condition?
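To illustrate the distinction being asked about, here is a small hedged sketch using the real `discovery.k8s.io/v1` condition fields (the helper function itself is invented for this comment, not kube-proxy code):

```go
package endpointsketch

import discoveryv1 "k8s.io/api/discovery/v1"

// Sketch of the distinction above: "Ready" is always false for terminating
// endpoints, so "terminating but still passing readiness checks" is expressed
// as Serving=true together with Terminating=true.
func isRoutable(ep discoveryv1.Endpoint) bool {
	ready := ep.Conditions.Ready != nil && *ep.Conditions.Ready
	serving := ep.Conditions.Serving != nil && *ep.Conditions.Serving
	terminating := ep.Conditions.Terminating != nil && *ep.Conditions.Terminating

	if !terminating {
		// Normal case: trust the Ready condition.
		return ready
	}
	// Fallback case from this KEP: a terminating endpoint is usable only if it
	// is still Serving (i.e. its readiness checks still pass).
	return serving
}
```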
Currently there are several workarounds that users can leverage:
* Use Kubernetes scheduling/deployment features such that a node would never only have terminating endpoints. For example, always scheduling two pods on a node and only allowing 1 pod to update at a time.
* Reducing the load balancer health check interval to very small time windows. This may not always be possible based on the load balancer implementation.
* Use a preStop hook in the Pod to delay the time between a Pod terminating and the process receiving SIGTERM.
nit: I don't believe this helps at all with the loadbalancer issue, as the pod has already entered terminating status and is no longer routable before the preStop hook is executed. Maybe we should remove it to avoid confusion?
Signed-off-by: Andrew Sy Kim [email protected]
After some thought, I decided to consolidate handling both external/internal traffic under a single feature. I think overall this is simpler. The implication for v1.24 though is that the feature will remain alpha since we haven't actually implemented the feature at all for internal traffic yet and we probably want that for at least 1 release before thinking about the promotion to beta.