
KEP-1669: updates for ProxyTerminatingEndpoints feature in v1.24 #3174

Merged
merged 2 commits into kubernetes:master on Jan 27, 2022

Conversation

andrewsykim
Member

Signed-off-by: Andrew Sy Kim [email protected]

  • One-line PR description: v1.24 updates for KEP-1669.
  • Other comments:

After some thought, I decided to consolidate handling both external/internal traffic under a single feature. I think overall this is simpler. The implication for v1.24 though is that the feature will remain alpha since we haven't actually implemented the feature at all for internal traffic yet and we probably want that for at least 1 release before thinking about the promotion to beta.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 21, 2022
@k8s-ci-robot k8s-ci-robot requested review from dcbw and thockin January 21, 2022 17:02
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/network Categorizes an issue or PR as relevant to SIG Network. labels Jan 21, 2022
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 21, 2022
* update kube-proxy endpoints info to track endpoint readiness based on endpoint.condition.ready in EndpointSlice
* within the scope of the traffic policy for a Service, iterate over the following sets of endpoints, picking the first set that has at least 1 ready endpoint (see the sketch after this list):
* ready endpoints that are not terminating
* ready endpoints that are terminating
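
To make the fallback order concrete, here is a minimal Go sketch of the selection described above. It is illustrative only; the `Endpoint` struct and `selectEndpoints` helper are hypothetical stand-ins, not kube-proxy's actual types.

```go
package main

import "fmt"

// Endpoint is a hypothetical stand-in for kube-proxy's internal endpoint info,
// tracking readiness and termination as reported by EndpointSlice conditions.
type Endpoint struct {
	Address     string
	Ready       bool // passing readiness checks (serving, for terminating endpoints)
	Terminating bool // endpoint has a deletion timestamp
}

// selectEndpoints returns the first non-empty set in priority order:
//  1. ready endpoints that are not terminating
//  2. ready endpoints that are terminating
// If both sets are empty, nil is returned and traffic is dropped.
func selectEndpoints(all []Endpoint) []Endpoint {
	var readyNotTerminating, readyTerminating []Endpoint
	for _, ep := range all {
		switch {
		case ep.Ready && !ep.Terminating:
			readyNotTerminating = append(readyNotTerminating, ep)
		case ep.Ready && ep.Terminating:
			readyTerminating = append(readyTerminating, ep)
		}
	}
	if len(readyNotTerminating) > 0 {
		return readyNotTerminating
	}
	return readyTerminating
}

func main() {
	eps := []Endpoint{
		{Address: "10.0.0.1:8080", Ready: true, Terminating: true},
		{Address: "10.0.0.2:8080", Ready: false, Terminating: true},
	}
	// Only terminating-but-ready endpoints remain, so traffic falls back to them.
	fmt.Println(selectEndpoints(eps))
}
```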
Member Author

Worth noting here that in the previous version of this KEP, we said we would fall back to not-ready terminating endpoints as a last resort. After thinking about this more, I think it's better to only fall back to ready terminating endpoints. It's generally more predictable and safer IMO. Thoughts?

Member

SGTM

* feature is only enabled if the feature gate `ProxyTerminatingEndpoints` is on (see the sketch after this list).
* unit tests in kube-proxy.
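
For context, this is roughly how an alpha feature gate of this kind is declared, as a simplified sketch using `k8s.io/component-base/featuregate`; the real definition and registration live in the Kubernetes feature-gate files, and the variable name below is made up.

```go
package features

import "k8s.io/component-base/featuregate"

const (
	// ProxyTerminatingEndpoints enables the fallback to ready terminating
	// endpoints when no ready, non-terminating endpoints exist.
	ProxyTerminatingEndpoints featuregate.Feature = "ProxyTerminatingEndpoints"
)

// exampleFeatureGates sketches how an alpha gate is registered: it is
// disabled by default until the feature graduates to beta.
var exampleFeatureGates = map[featuregate.Feature]featuregate.FeatureSpec{
	ProxyTerminatingEndpoints: {Default: false, PreRelease: featuregate.Alpha},
}
```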

#### Beta
Member Author

Added the Beta criteria here to keep them in mind going forward, but I think the target for v1.24 is alpha.

@@ -26,7 +26,7 @@ stage: alpha
# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.22"
Member Author

Alternatively, we can try to push the existing feature as-is to Beta in v1.24, which only handles external traffic.

Member

naah, let's sync it up

@andrewsykim
Member Author

/assign @thockin @robscott @bowei @danwinship @aojea

@andrewsykim andrewsykim changed the title KEP-1669: updates for v1.24 KEP-1669: updates for ProxyTerminatingEndpoints feature in v1.24 Jan 21, 2022

Currently there are several workarounds that users can leverage:
* Use Kubernetes scheduling/deployment features such that a node would never only have terminating endpoints. For example, always scheduling two pods on a node and only allowing 1 pod to update at a time.
* Reducing the load balancer health check interval to very small time windows. This may not always be possible based on the load balancer implementation.
Member

is this an option for users on all Service LoadBalancer implementations, or just a few of them?

Member

yeah, I am not sure this applies to more than a handful of settings.

@thockin
Member

thockin commented Jan 21, 2022

Thanks!

/lgtm
/approve
/hold

hold to get more eyeballs

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jan 21, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 21, 2022
### Risks and Mitigations

There are scalability implications to tracking termination state in EndpointSlice. For now we are assuming that the performance trade-offs are worthwhile but
future testing may change this decision. See KEP 1672 for more details.
Member

this was brought up during the last sig-net meeting, maybe it is worth merging the API and implementation into the same KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/1672-tracking-terminating-endpoints
but I really don't know the consequences and the feasibility

Member Author

I originally split them off because there are going to be many ways to consume the new terminating condition outside of the way kube-proxy will use it. For example, ingress controllers may want to implement their own termination logic.

However, I do agree with pacing those two KEPs together since there's a dependency here on KEP-1672.

from the load balancer. The worst-case scenario is a node with 1 local endpoint and a load balancer with a long health check interval.

Currently there are several workarounds that users can leverage:
* Use Kubernetes scheduling/deployment features such that a node would never only have terminating endpoints. For example, always scheduling two pods on a node and only allowing 1 pod to update at a time.
Member

I think that this is what PDB tries to solve https://kubernetes.io/docs/concepts/workloads/pods/disruptions/

Member Author

Yeah, but relying solely on pod placement to avoid network loss on rolling update is not good enough
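
For reference, the PDB-based mitigation mentioned above would look roughly like the sketch below (made-up names, built with the `k8s.io/api/policy/v1` types). Note that a PodDisruptionBudget only limits voluntary disruptions such as node drains, so it narrows the "only terminating endpoints on a node" window rather than closing it.

```go
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	minAvailable := intstr.FromInt(1)

	// Keep at least one "app: example" pod available during voluntary
	// disruptions (for example, node drains via the eviction API).
	pdb := policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "example-pdb", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "example"},
			},
		},
	}
	fmt.Printf("%+v\n", pdb)
}
```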


### Goals

* Reduce potential traffic loss from kube-proxy that occurs on rolling updates because trafffic is sent to Pods that are terminating.
Member

Suggested change
* Reduce potential traffic loss from kube-proxy that occurs on rolling updates because trafffic is sent to Pods that are terminating.
* Reduce potential traffic loss from kube-proxy that occurs on rolling updates because traffic is sent to Pods that are terminating.

Member Author

thanks, fixed

@aojea
Member

aojea commented Jan 21, 2022

nit comments from someone reading keps friday night :)
lgtm

@danwinship has a related KEP in flight #3016
we should wait for him to avoid possible conflicts

@wojtek-t
Member

It will need more PRR work for Beta, but given it stays in Alpha - it looks fine.

Contributor

@danwinship left a comment

@danwinship has a related KEP in flight #3016
we should wait for him to avoid possible conflicts

I don't think this will conflict much with PreferLocal, but any implementation PRs should probably wait until after kubernetes/kubernetes#106497 merges.


### Example: all endpoints terminating and traffic policy is "Cluster"

When the traffic policy is "Cluster" and all endpoints are terminating, then traffic should be routed to any terminating endpoint that is ready.
Contributor

This is a change in this updated version of the KEP, isn't it? The current code only implements terminating endpoint support for "Local" traffic policy.

Why would someone want this semantic? In the "Local" case, a single node might have only terminating endpoints because of a rolling update. But in the "Cluster" case, if the entire cluster has only terminating endpoints, that suggests that the service is being shut down entirely.

(You also don't have a user story for the "Cluster" case.)

Member Author

Why would someone want this semantic? In the "Local" case, a single node might have only terminating endpoints because of a rolling update. But in the "Cluster" case, if the entire cluster has only terminating endpoints, that suggests that the service is being shut down entirely.

I agree it's less useful for Cluster, but you can still end up in scenarios where all endpoints are terminating. For example, all pods might be on the same node that is draining, or it's a 1-pod StatefulSet, etc.

Member

my main concern here is whether this needs to add additional logic that we have to support and increases the bug surface area... the use case seems very rare... but if this adds new branches in the code I think we should evaluate the cost

Contributor

if you have a 1 pod service, then you presumably don't expect it to be "high-availability" in the face of updates / node drains / etc.

Member Author

if you have a 1 pod service, then you presumably don't expect it to be "high-availability" in the face of updates / node drains / etc.

Sure, this is just one example where this can happen though. It is also more likely to occur on smaller clusters (like 1-3 nodes)

### Risks and Mitigations

There are scalability implications to tracking termination state in EndpointSlice. For now we are assuming that the performance trade-offs are worthwhile but
future testing may change this decision. See KEP 1672 for more details.
Contributor

linkify that

Comment on lines +125 to +130
#### Story 1

As a user I would like to do a rolling update of a Deployment fronted by a Service Type=LoadBalancer with ExternalTrafficPolicy=Local.
If a node that has only 1 pod of said deployment goes into the `Terminating` state, all traffic to that node is dropped until either a new pod
comes up or my cloud provider removes the node from the load balancer's node pool. Ideally the terminating pod should gracefully handle traffic to this node
until either one of those conditions is satisfied.
Contributor

This user story is too tied to the specific feature proposal. I think the real user story is "I would like to do a rolling update of a Deployment fronted by a Service Type=LoadBalancer with ExternalTrafficPolicy=Local without having any noticeable interruption of the service".

And ProxyTerminatingEndpoints, as it currently exists, does not do a great job of addressing that user story. It makes it less likely that there will be an interruption, but people don't want just "less likely".

That's why I started the PreferLocal KEP; because if you want "efficient and no disruption" and don't need "preserves source IPs" then it's a much better solution than ProxyTerminatingEndpoints.

I think if we want to be able to do Local with no disruption, we need a feature that integrates more directly with the load balancer health checking somehow and actually ensures that the endpoint remains active until after the load balancer forgets about it.

Member Author

@andrewsykim Jan 24, 2022

It makes it less likely that there will be an interruption, but people don't want just "less likely".

Can you expand on this a bit? Nothing is guaranteed, but what are the gaps right now that make it not "less likely" enough?

I think if we want to be able to do Local with no disruption, we need a feature that integrates more directly with the load balancer health checking somehow and actually ensures that the endpoint remains active until after the load balancer forgets about it.

The current feature as-is does interact with load balancer health checking, in that it fails the load balancer health check if all endpoints are terminating, but it still routes traffic to terminating ones in case the LB's health check is delayed (see https://github.com/kubernetes/kubernetes/blob/03fc2eeda270c9244dc23c75424db3a20bf61d7b/pkg/proxy/endpoints.go#L396). This way you fail the health check immediately when you only have terminating endpoints, but you still allow traffic in case there are delays between health check probes. This should reduce traffic loss pretty significantly assuming SIGTERM is handled by the app

Contributor

This should reduce traffic loss pretty significantly assuming SIGTERM is handled by the app

assuming the app handles SIGTERM, and assuming that the pod's terminationGracePeriodSeconds is larger than EndpointsController lag + kube-proxy processing Endpoints updates lag + load balancer health check interval, which the pod has no way of ensuring.

I dunno. Maybe 30s is pretty much guaranteed to be enough... I had been thinking we tested this in OpenShift and found that it didn't always work, but maybe we just didn't have fully-working ProxyTerminatingEndpoints support at that point...

Member Author

@andrewsykim Jan 25, 2022

Back when the alpha implementation of this feature was being developed, there was some testing done on OpenShift, see kubernetes/kubernetes#97238 (comment) -- although that comment is pretty old so maybe things have changed. But my impression was that the feature was addressing the issue. If you're seeing otherwise on OpenShift, I'd love to dig deeper and figure out why.

I would also think 30s is plenty, but it does depend on a lot of factors like you mentioned. But if we assume that a Pod handles SIGTERM and then actually serves traffic during the full 30s before terminating, I would think it should work pretty well.
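
To make the interaction described in this thread concrete, here is a rough sketch of the behaviour using a hypothetical health-check helper (this is not kube-proxy's actual healthcheck server): the node starts failing the load balancer health check as soon as only terminating endpoints remain, while traffic that still arrives is forwarded to terminating-but-ready endpoints until the load balancer reacts.

```go
package main

import (
	"fmt"
	"net/http"
)

// localEndpoints is a hypothetical summary of node-local endpoints for a
// Service with externalTrafficPolicy=Local.
type localEndpoints struct {
	readyNotTerminating int
	readyTerminating    int
}

// healthCheckStatus fails the load balancer health check as soon as no ready,
// non-terminating local endpoints remain, even though kube-proxy would still
// forward any traffic that arrives to ready terminating endpoints.
func healthCheckStatus(eps localEndpoints) int {
	if eps.readyNotTerminating > 0 {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

func main() {
	// Rolling update in progress: the only local pod is terminating but still ready.
	eps := localEndpoints{readyNotTerminating: 0, readyTerminating: 1}
	fmt.Println("health check:", healthCheckStatus(eps)) // 503, so the LB drains the node
	fmt.Println("dataplane still has", eps.readyTerminating, "terminating endpoint(s) for in-flight traffic")
}
```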

Signed-off-by: Andrew Sy Kim <[email protected]>
Signed-off-by: Andrew Sy Kim <[email protected]>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 25, 2022
@andrewsykim
Member Author

@danwinship @aojea gonna defer LGTM to both of you, seems like there are two outstanding issues:

  1. Whether falling back to Cluster is really worth doing.
  2. If the existing feature is fixing the issue in OpenShift.

@aojea
Member

aojea commented Jan 26, 2022

  1. Whether falling back to Cluster is really worth doing.

I prefer to be conservative here and keep the current behavior; once the feature gate is removed, the Cluster traffic will behave differently.
If in the future we have real evidence that there is an issue from not using terminating pods with Cluster, like the one reported for externalTrafficPolicy and load balancers, we can always change it as a bugfix, but right now I don't see that we have a solid use case

  1. If the existing feature is fixing the issue in OpenShift.

Dan should know better; I think that ovn-kubernetes is replacing kube-proxy soon in OpenShift, so I don't know if it is easy to check this change now

@andrewsykim
Member Author

like the one reported for externalTrafficPolicy and load balancers, we can always change it as a bugfix, but right now I don't see that we have a solid use case

I don't think we could pull this off as a bug-fix in the future, it would probably require a new feature gate starting at alpha -- which was why I thought it was worthwhile to consolidate the effort here. I see your point about being conservative here though 🤔

@aojea
Member

aojea commented Jan 26, 2022

I don't think we could pull this off as a bug-fix in the future, it would probably require a new feature gate starting at alpha -- which was why I thought it was worthwhile to consolidate the effort here. I see your point about being conservative here though 🤔

why not? if there is a reasonable and reproducible bug because the traffic somehow doesn't work correctly or is dropped because of this... this is a kube-proxy implementation detail; the ipvs version, for example, has different implementation details, same as ovn-kubernetes and others... my personal opinion though

The thing is that we are not able to come up with a solid reason to change the traffic for Cluster; we've speculated about several scenarios and I don't think we are convinced that this is going to fix something, so there is a possibility that we instead break something... like the conntrack logic for stale UDP connections 😨

@andrewsykim
Member Author

That's fair, I'm fine to defer it, but I really don't think it can be considered a bug fix at that point, because it would be introducing new functionality/behavior into kube-proxy, not fixing an issue with an existing one.

@danwinship
Contributor

I don't think we could pull this off as a bug-fix in the future, it would probably require a new feature gate starting at alpha -- which was why I thought it was worthwhile to consolidate the effort here. I see your point about being conservative here though 🤔

why not? if there is a reasonable and reproducible bug because the traffic somehow doesn't work correctly or is dropped because of this... this is a kube-proxy implementation detail; the ipvs version, for example, has different implementation details, same as ovn-kubernetes and others... my personal opinion though

Imagine a pod that does not close its listening socket when it receives SIGTERM, but it starts doing other cleanup, which has the side effect that if you were to make a request to the pod 10 seconds after SIGTERM, it would return error messages in response to valid requests.

In current kubernetes, those semantics would never be noticed and would not cause problems. If kube-proxy started using terminating endpoints for Cluster services, then the service would become broken whenever it had only terminating endpoints.

(Though arguably this was true for adding PTE to Local services too...)

Anyway, I still don't think there's a good use case for Cluster policy PTE; it is one thing to go to extra lengths to ensure that a service remains available when it seems like that's what the user wanted. But if there's a service with only 1 endpoint, or with endpoints on only 1 node which is about to go away, then I think we can say "clearly the user didn't expect this service to have 100% uptime", and just let it fail.

@thockin
Member

thockin commented Jan 26, 2022

My main goal with advocating for this at Cluster was simplicity.

It should be simpler to explain.
It should be simpler to implement.
It should be simpler to test.

Dan, I think your "10 seconds after SIGTERM" error case is true today in that a kube-proxy could be out-to-lunch (control plane) while the data plane still routed traffic to such pods.

For a service which has only terminating endpoints, the choice is really "blackhole" or "try to use terminating endpoints". I don't think PTE for cluster changes anything except when that happens.

Note: we only use LB healthchecks when ETP is local. We could use it for ETP=cluster, but that is a big change, I think.

Assume a cloud LB -> node -> cluster setup.

Today we get:
t0: packet flow is LB -> node -> backend pod (happy)
t1: delete pod (assume reasonable graceful term) in API
t2: packet flow is LB -> node -> backend pod (terminating)
t3: endpoint is marked terminating
t4: packet flow is LB -> node -> blackhole
t5: pod is actually deleted
t6: packet flow is LB -> node -> blackhole

If we did PTE for cluster:
t0: packet flow is LB -> node -> backend pod (happy)
t1: delete pod (assume reasonable graceful term) in API
t2: packet flow is LB -> node -> backend pod (terminating)
t3: endpoint is marked terminating
t4: packet flow is LB -> node -> backend pod (terminating)
t5: pod is actually deleted
t6: packet flow is LB -> node -> blackhole

The difference is in t4-t5

Is that at all convincing? :)

@aojea
Member

aojea commented Jan 26, 2022

The difference is in t4-t5

Is that at all convincing? :)

why should you use Cluster for that when we have a canonical way of using LB for these setups with Local?

@thockin
Member

thockin commented Jan 26, 2022 via email

@aojea
Member

aojea commented Jan 26, 2022

well, this is alpha, we can always try and see what happens, it seems the best way to figure it out
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 26, 2022
@andrewsykim
Member Author

Agreed, more than happy to discuss this in more detail during the implementation PR when we're in front of the actual code :)

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 27, 2022
@k8s-ci-robot k8s-ci-robot merged commit 8c81ae0 into kubernetes:master Jan 27, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Jan 27, 2022
## Proposal

This KEP proposes that if all endpoints for a given Service scoped to its traffic policy are terminating (i.e. pod.DeletionTimestamp != nil), then all traffic should be sent to
terminating Pods that are still Ready. Note that the EndpointSlice API introduced a new condition called "Serving", which is semantically equivalent to "Ready" except that the Ready condition must always be "False" for terminating pods.


nit: I found it very confusing to read this paragraph that claims "the Ready condition must always be "False" for terminating pods" and the following examples which make statements like "then traffic should be routed to any terminating endpoint that is ready."

I think what you mean in the examples is readiness checks, not the Ready condition. Could we note that distinction between the use of "ready endpoints" in most of the text vs. the "Ready" condition on the EndpointSlice API, and clarify that the document mostly uses the former to mean readiness checks, not the Ready condition?
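
To illustrate the distinction being discussed, here is a small sketch using the `k8s.io/api/discovery/v1` conditions; the `usable` helper is illustrative, not an existing library function.

```go
package main

import (
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
)

// usable reflects how the KEP text uses "ready": for non-terminating endpoints
// this is the Ready condition; for terminating endpoints Ready is always false,
// so Serving (readiness checks still passing) is what matters.
func usable(ep discoveryv1.Endpoint) bool {
	terminating := ep.Conditions.Terminating != nil && *ep.Conditions.Terminating
	if terminating {
		return ep.Conditions.Serving != nil && *ep.Conditions.Serving
	}
	// A nil Ready condition is treated as ready by convention.
	return ep.Conditions.Ready == nil || *ep.Conditions.Ready
}

func main() {
	ready, serving, terminating := false, true, true
	ep := discoveryv1.Endpoint{
		Addresses: []string{"10.0.0.1"},
		Conditions: discoveryv1.EndpointConditions{
			Ready:       &ready,
			Serving:     &serving,
			Terminating: &terminating,
		},
	}
	fmt.Println(usable(ep)) // true: terminating but still passing readiness checks
}
```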

Currently there are several workarounds that users can leverage:
* Use Kubernetes scheduling/deployment features such that a node would never only have terminating endpoints. For example, always scheduling two pods on a node and only allowing 1 pod to update at a time.
* Reducing the load balancer health check interval to very small time windows. This may not always be possible based on the load balancer implementation.
* Use a preStop hook in the Pod to delay the time between a Pod terminating and the process receiving SIGTERM (see the sketch after this list).
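
For completeness, the preStop-hook workaround in the last bullet would look roughly like this sketch (made-up image, command, and delay, using the `k8s.io/api/core/v1` types); `terminationGracePeriodSeconds` has to stay larger than the added delay for this to help.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Sleep in a preStop hook before the process receives SIGTERM, giving load
	// balancers and kube-proxy time to stop sending new connections to the pod.
	container := corev1.Container{
		Name:    "app",
		Image:   "example.com/app:latest",
		Command: []string{"/app"},
		Lifecycle: &corev1.Lifecycle{
			PreStop: &corev1.LifecycleHandler{
				Exec: &corev1.ExecAction{Command: []string{"sleep", "15"}},
			},
		},
	}
	fmt.Printf("%+v\n", container)
}
```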


nit: I don't believe this helps at all with the loadbalancer issue, as the pod has already entered terminating status and is no longer routable before the preStop hook is executed. Maybe we should remove it to avoid confusion?
