From afa2ffc402ba0d8e043218dd61869954f7856415 Mon Sep 17 00:00:00 2001 From: Andrew Sy Kim Date: Fri, 21 Jan 2022 11:51:34 -0500 Subject: [PATCH 1/2] KEP-1669: updates for v1.24 Signed-off-by: Andrew Sy Kim --- .../README.md | 425 ------------------ .../README.md | 378 ++++++++++++++++ .../kep.yaml | 6 +- 3 files changed, 381 insertions(+), 428 deletions(-) delete mode 100644 keps/sig-network/1669-graceful-termination-local-external-traffic-policy/README.md create mode 100644 keps/sig-network/1669-proxy-terminating-endpoints/README.md rename keps/sig-network/{1669-graceful-termination-local-external-traffic-policy => 1669-proxy-terminating-endpoints}/kep.yaml (90%) diff --git a/keps/sig-network/1669-graceful-termination-local-external-traffic-policy/README.md b/keps/sig-network/1669-graceful-termination-local-external-traffic-policy/README.md deleted file mode 100644 index ea2eadc8563..00000000000 --- a/keps/sig-network/1669-graceful-termination-local-external-traffic-policy/README.md +++ /dev/null @@ -1,425 +0,0 @@ -# KEP-1669: Graceful Termination for Local External Traffic Policy - - -- [Release Signoff Checklist](#release-signoff-checklist) -- [Summary](#summary) -- [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) -- [Proposal](#proposal) - - [User Stories (optional)](#user-stories-optional) - - [Story 1](#story-1) - - [Risks and Mitigations](#risks-and-mitigations) -- [Design Details](#design-details) - - [Additions to EndpointSlice](#additions-to-endpointslice) - - [kube-proxy](#kube-proxy) - - [Test Plan](#test-plan) - - [Unit Tests](#unit-tests) - - [E2E Tests](#e2e-tests) - - [Graduation Criteria](#graduation-criteria) - - [Alpha](#alpha) - - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - - [Version Skew Strategy](#version-skew-strategy) -- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) - - [Feature Enablement and Rollback](#feature-enablement-and-rollback) - - [Rollout, 
Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) - - [Monitoring Requirements](#monitoring-requirements) - - [Dependencies](#dependencies) - - [Scalability](#scalability) - - [Troubleshooting](#troubleshooting) -- [Implementation History](#implementation-history) -- [Drawbacks](#drawbacks) -- [Alternatives](#alternatives) - - -## Release Signoff Checklist - -- [X] Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) -- [X] KEP approvers have approved the KEP status as `implementable` -- [X] Design details are appropriately documented -- [X] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input -- [X] Graduation criteria is in place -- [ ] "Implementation History" section is up-to-date for milestone -- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] -- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes - -[kubernetes.io]: https://kubernetes.io/ -[kubernetes/enhancements]: https://git.k8s.io/enhancements -[kubernetes/kubernetes]: https://git.k8s.io/kubernetes -[kubernetes/website]: https://git.k8s.io/website - -## Summary - -Services with externalTrafficPolicy=Local lack the ability to gracefully handle traffic from a loadbalancer when it goes from N to 0 endpoints. -Since terminating pods are never considered "ready" in Endpoints/EndpointSlice, a node with only terminating endpoints would drop traffic even though -it may still be part of a loadbalancer's node pool. Even with loadbalancer health checks, there is usually a delay between when the health check -fails and when a node is completely decommissioned. This KEP proposes changes to gracefully handle traffic to a node that has only terminating endpoints -for a Service with externalTrafficPolicy=Local. 
- -## Motivation - -### Goals - -* enable zero downtime rolling updates for Services with ExternalTrafficPolicy=Local via nodeports/loadbalancerIPs/externalIPs. - -### Non-Goals - -* changing the behavior of terminating pods/endpoints outside the scope of Services with ExternalTrafficPolicy=Local via a nodeport/loadbalancerIPs/externalIPs. - -## Proposal - -This KEP proposes that if all endpoints for a given Service (with externalTrafficPolicy=Local) within the bounds of a node are terminating (i.e pod.DeletionTimestamp != nil), -then all external traffic on this node should be sent to **ready** and **not ready** terminating endpoints, preferring the former if there are any. This ensures that traffic -is not dropped between the time a node fails its health check (has 0 endpoints) and when a node is decommissioned from the loadbalancer's node pool. - -The proposed changes in this KEP depend on KEP-1672 and the EndpointSlice API. - -### User Stories (optional) - -#### Story 1 - -As a user I would like to do a rolling update of a Deployment fronted by a Service Type=LoadBalancer with ExternalTrafficPolicy=Local. -If a node that has only 1 pod of said deployment goes into the `Terminating` state, all traffic to that node is dropped until either a new pod -comes up or my cloud provider removes the node from the loadbalancer's node pool. Ideally the terminating pod should gracefully handle traffic to this node -until either one of the conditions are satisfied. - -### Risks and Mitigations - -There are scalability implications to tracking termination state in EndpointSlice. For now we are assuming that the performance trade-offs are worthwhile but -future testing may change this decision. See KEP 1672 for more details. - -## Design Details - -### Additions to EndpointSlice - -This work depends on the `Terminating` condition existing on the EndpointSlice API (see KEP 1672) in order to check the termination state of an endpoint. 
- -### kube-proxy - -Updates to kube-proxy when watching EndpointSlice: -* update kube-proxy endpoints info to track terminating endpoints based on endpoint.condition.terminating in EndpointSlice. -* update kube-proxy endpoints info to track endpoint readiness based on endpoint.condition.ready in EndpointSlice -* if externalTrafficPolicy=Local, record all local endpoints that are ready && terminating and endpoints that are !ready && terminating. When there are no local ready endpoints, fall back in the preferred order: - * local ready & terminating endpoints - * local not ready & terminating endpoints - * blackhole traffic -* for all other traffic (i.e. externalTrafficPolicy=Cluster), preserve existing behavior where traffic is only sent to ready && !terminating endpoints. - -In addition, kube-proxy's node port health check should fail if there are only `Terminating` endpoints, regardless of their readiness in order to: -* remove the node from a loadbalancer's node pool as quickly as possible -* gracefully handle any new connections that arrive before the loadbalancer is able to remove the node -* allow existing connections to gracefully terminate - -### Test Plan - -#### Unit Tests - -kube-proxy unit tests: - -* Unit tests will validate the correct behavior when there are only local terminating endpoints. -* Unit tests will validate the new change in behavior only applies for Services with ExternalTrafficPolicy=Local via nodeports/loadbalancerIPs/externalIPs. -* Existing unit tests will validate that terminating endpoints are only used when there are no ready endpoints AND externalTrafficPolicy=Local, otherwise ready && !terminating endpoints are used. -* Unit tests will validate health check node port succeeds only when there are ready && !terminating endpoints. - -#### E2E Tests - -E2E tests will be added to validate that no traffic is dropped during a rolling update for a Service with ExternalTrafficPolicy=Local. 
- -All existing E2E tests for Services should continue to pass. - -### Graduation Criteria - -#### Alpha - -* kube-proxy internally tracks the `terminating` and `serving` condition from EndpointSlice -* kube-proxy falls back to terminating endpoints if and only if they are the only available endpoints. -* feature is only enabled if the feature gate `ProxyTerminatingEndpoints` is on. -* unit tests in kube-proxy. - -### Upgrade / Downgrade Strategy - -Behavioral changes to terminating endpoints will apply once kube-proxy is upgraded to v1.19 and the `EndpointSlice`/`EndpointSliceProxying` feature gates are enabled. -On downgrade, the worse case scenario is that kube-proxy falls back to the existing behavior. See [Version Skew Strategy](#version-skew-strategy) below. - -### Version Skew Strategy - -The worse case version skew scenario is that kube-proxy falls back to the existing behavior today where traffic does not fall back to terminating endpoints. -This would either happen if a version of the control plane was not aware of the additions to EndpointSlice or if the version of kube-proxy did not know to consume the additions to EndpointSlice. - -There's not much risk involved as the worse case scenario is falling back to existing behavior. - -## Production Readiness Review Questionnaire - -### Feature Enablement and Rollback - -###### How can this feature be enabled / disabled in a live cluster? - -- [X] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: ProxyTerminatingEndpoints - - Components depending on the feature gate: kube-proxy - -###### Does enabling the feature change any default behavior? - -Yes, when externalTrafficPolicy=Local and there are only terminating endpoints, -kube-proxy will route traffic to those endpoints. Before this change, kube-proxy -dropped this traffic instead. - -###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? - -Yes. 
- -###### What happens if we reenable the feature if it was previously rolled back? - -kube-proxy will no longer drop traffic if only terminating endpoints are available. - -###### Are there any tests for feature enablement/disablement? - -Yes, there will be unit tests in kube-proxy with the feature gate enabled and disabled. - -### Rollout, Upgrade and Rollback Planning - - - -###### How can a rollout fail? Can it impact already running workloads? - - - -TBD for beta. - -###### What specific metrics should inform a rollback? - - - -TBD for beta. - -###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? - - - -TBD for beta. - -###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? - - - -TBD for beta. - -### Monitoring Requirements - - - -###### How can an operator determine if the feature is in use by workloads? - - - -TBD for beta. - -###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? - - - -TBD for beta. - -- [ ] Metrics - - Metric name: - - [Optional] Aggregation method: - - Components exposing the metric: -- [ ] Other (treat as last resort) - - Details: - -###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs? - - - -TBD for beta. - -###### Are there any missing metrics that would be useful to have to improve observability of this feature? - - - -TBD for beta. - -### Dependencies - - - -###### Does this feature depend on any specific services running in the cluster? - - - -TBD for beta. - -### Scalability - - - -TBD for beta. - -###### Will enabling / using this feature result in any new API calls? - - - -TBD for beta. - -###### Will enabling / using this feature result in introducing new API types? - - - -TBD for beta. - -###### Will enabling / using this feature result in any new calls to the cloud provider? 
- - - -###### Will enabling / using this feature result in increasing size or count of the existing API objects? - - - -###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? - - - -###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? - - - -### Troubleshooting - - - -###### How does this feature react if the API server and/or etcd is unavailable? - -###### What are other known failure modes? - - - -###### What steps should be taken if SLOs are not being met to determine the problem? - -## Implementation History - -- [x] 2020-04-23: KEP accepted as implementable for v1.19 - -## Drawbacks - -* scalability: this KEP (and KEP 1672) would add more writes per endpoint to EndpointSlice as each terminating endpoint adds at least 1 and at -most 2 additional writes - 1 write for marking an endpoint as "terminating" and another if an endpoint changes it's readiness during termination. -* complexity: an additional corner case is added to kube-proxy adding to it's complexity. - -## Alternatives - -Some users work around this issue today by adding a preStop hook that sleeps for some duration. Though this may work in some scenarios, better handling from kube-proxy -would alleviate the need for this work around altogether. 
- diff --git a/keps/sig-network/1669-proxy-terminating-endpoints/README.md b/keps/sig-network/1669-proxy-terminating-endpoints/README.md new file mode 100644 index 00000000000..49c726bee51 --- /dev/null +++ b/keps/sig-network/1669-proxy-terminating-endpoints/README.md @@ -0,0 +1,378 @@ +# KEP-1669: Proxy Terminating Endpoints + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [Example: only some endpoints terminating when traffic policy is "Cluster"](#example-only-some-endpoints-terminating-when-traffic-policy-is-cluster) + - [Example: only some endpoints terminating on a node when traffic policy is "Local"](#example-only-some-endpoints-terminating-on-a-node-when-traffic-policy-is-local) + - [Example: all endpoints terminating and traffic policy is "Cluster"](#example-all-endpoints-terminating-and-traffic-policy-is-cluster) + - [Example: all endpoints terminating on a node when traffic policy is "Local"](#example-all-endpoints-terminating-on-a-node-when-traffic-policy-is-local) + - [Handling terminating endpoints that are not ready.](#handling-terminating-endpoints-that-are-not-ready) + - [User Stories (optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Additions to EndpointSlice](#additions-to-endpointslice) + - [kube-proxy](#kube-proxy) + - [Test Plan](#test-plan) + - [Unit Tests](#unit-tests) + - [E2E Tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback 
Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + + +## Release Signoff Checklist + +- [X] Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [X] KEP approvers have approved the KEP status as `implementable` +- [X] Design details are appropriately documented +- [X] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input +- [X] Graduation criteria is in place +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +This KEP proposes some enhancements to kube-proxy to handle terminating endpoints in an effort to improve the traffic engineering capabilities and overall relability of Kubernetes. +These changes will depend on recent changes to the EndpointSlice API as part of KEP-1672 to include terminating pods in the EndpointSlice API. + +## Motivation + +Historically, Kubernetes has ignored terminating Pods from both the Endpoints and EndpointSlice API. KEP-1672 recently introduced an API change in EndpointSlice +where terminating endpoints are now included in EndpointSlice, with the addition of two new endpoint conditions "Serving" and "Terminating". 
Even though the EndpointSlice +API now includes terminating endpoints, kube-proxy strictly forwards traffic to only "Ready" pods that are not terminating. There are several scenarios where not handling +terminating endpoints can lead to traffic loss. It's worth diving into one specific scenario described in [this issue](https://github.com/kubernetes/kubernetes/issues/85643): + +When using Service Type=LoadBalancer w/ externalTrafficPolicy=Local, the availability of node backend is determined by the healthCheckNodePort served by kube-proxy. +Kube-proxy returns a "200 OK" http response on this endpoint if there is a local ready endpoint for a Serivce, otherwise it returns 500 http response signalling to the load balancer that the node should be removed +from the backend pool. Upon performing a rolling update of a Deployment, there can be a small window of time where old pods on a node are terminating (hence not "Ready") but the load balancer +has not probed kube-proxy's healthCheckNodePort yet. In this event, there is traffic loss because the load balancer is routing traffic to a node where the proxy rules will blackhole +the traffic due to a lack of local endpoints. The likihood of this traffic loss is impacted by two factors: the number of local endpoints on the node and the interval between health checks +from the load balancer. The worse case scenario is a node with 1 local endpoint and a load balancer with a long health check interval. + +Currently there are several workarounds that users can leverage: +* Use Kubernetes scheduling/deployment features such that a node would never only have terminating endpoints. For example, always scheduling two pods on a node and only allowing 1 pod to update at a time. +* Reducing the load balancer health check interval to very small time windows. This may not alwyays be possible based on the load balancer implementation. 
+* Use a preStop hook in the Pod to delay the time between a Pod terminating and the process receiving SIGTERM. + +While some of these solutions help, there's more that Kubernetes can do to handle this complexity for users. + +### Goals + +* Reduce potential traffic loss from kube-proxy that occurs on rolling updates because trafffic is sent to Pods that are terminating. + +### Non-Goals + +* Changing the behavior of how pods terminate. +* Handling terminating endpoints for other consumers of the EndpointSlice API, such as ingress controllers or external load balancers. + +## Proposal + +This KEP proposes that if all endpoints for a given Service scoped to its traffic policy are terminating (i.e. pod.DeletionTimestamp != nil), then all traffic should be sent to +terminating Pods that are still Ready. Note that the EndpointSlice API introduced a new condition called "Serving" which is semantically equivalent to "Ready" except that the Ready condition +must always be "False" for terminating pods for compatibility reasons. For consumers of the EndpointSLice API that want to route traffic strictly based on a Pod's readiness ignoring +it's terminating state, they should be reading the Serving condition going forward. Below are some examples to help illustrate the proposed behavior: + +### Example: only some endpoints terminating when traffic policy is "Cluster" + +When the traffic policy is "Cluster" and some endpoints are terminating, all traffic should be routed to the ready endpoints that are not terminating.. + +### Example: only some endpoints terminating on a node when traffic policy is "Local" + +When the traffic policy is "Local" and some endpoints are terminating within a single node, traffic should be routed to ready endpoints on that node that are not terminating. 
+ +### Example: all endpoints terminating and traffic policy is "Cluster" + +When the traffic policy is "Cluster" and all endpoints are terminating, then traffic should be routed to any terminating endpoint that is ready. + +### Example: all endpoints terminating on a node when traffic policy is "Local" + +When the traffic policy is "Local" and all endpoints are terminating within a single node, then traffic should be routed to any terminating endpoint that is ready on that node. + + +### Handling terminating endpoints that are not ready. + +It is worth noting that traffic should not be routed to terminating pods if their readiness probe is failing, even if it is the only endpoints remaining. This is to give workloads +the flexibility/control to opt out of this behavior by either exiting immediately or failing the readiness probe when receiving SIGTERM from kubelet. This would also be counter-intuitive +to the current understanding of readiness probes. + + +### User Stories (optional) + +#### Story 1 + +As a user I would like to do a rolling update of a Deployment fronted by a Service Type=LoadBalancer with ExternalTrafficPolicy=Local. +If a node that has only 1 pod of said deployment goes into the `Terminating` state, all traffic to that node is dropped until either a new pod +comes up or my cloud provider removes the node from the loadbalancer's node pool. Ideally the terminating pod should gracefully handle traffic to this node +until either one of the conditions are satisfied. + +### Risks and Mitigations + +There are scalability implications to tracking termination state in EndpointSlice. For now we are assuming that the performance trade-offs are worthwhile but +future testing may change this decision. See KEP 1672 for more details. + +## Design Details + +### Additions to EndpointSlice + +This work depends on the `Terminating` condition existing on the EndpointSlice API (see KEP 1672) in order to check the termination state of an endpoint. 
+ +### kube-proxy + +Updates to kube-proxy when watching EndpointSlice: +* update kube-proxy endpoints info to track terminating endpoints based on endpoint.condition.terminating in EndpointSlice. +* update kube-proxy endpoints info to track endpoint readiness based on endpoint.condition.ready in EndpointSlice +* within the scope of the traffic policy for a Service, iterate the following set of endpoints, picking the first set that has at least 1 ready endpoint: + * ready endpoints that are not terminating + * ready endpoints that are terminating + +In addition, kube-proxy's node port health check should fail if there are only `Terminating` endpoints, regardless of their readiness in order to: +* remove the node from a loadbalancer's node pool as quickly as possible +* gracefully handle any new connections that arrive before the loadbalancer is able to remove the node +* allow existing connections to gracefully terminate + +### Test Plan + +#### Unit Tests + +kube-proxy unit tests: + +* Unit tests will validate the correct behavior when there are only local terminating endpoints. +* Unit tests will validate the changein behavior against the matrix of possible Service configurations using both internalTrafficPolicy and externalTrafficPolicy. +* Existing unit tests will validate that terminating endpoints are only used when there are no ready endpoints, otherwise ready && !terminating endpoints are used. +* Unit tests will validate health check node port succeeds only when there are ready && !terminating endpoints. + +#### E2E Tests + +E2E tests will be added to validate that no traffic is dropped during a rolling update for a Service. E2E tests should cover all permutations of externalTrafficPolicy +and internalTrafficPolicy. + +All existing E2E tests for Services should continue to pass. 
+ +### Graduation Criteria + +#### Alpha + +* kube-proxy internally tracks the `terminating` and `serving` condition from EndpointSlice +* kube-proxy falls back to terminating endpoints if and only if they are the only available endpoints. +* feature is only enabled if the feature gate `ProxyTerminatingEndpoints` is on. +* unit tests in kube-proxy. + +#### Beta + +* E2E tests are in place, exercising all permutations of internalTrafficPolicy and externalTrafficPolicy. +* Metrics to publish how many Services/Endpoints are routing traffic to terminating endpoints. + +### Upgrade / Downgrade Strategy + +Behavioral changes to terminating endpoints will apply when the feature gate is enabled. It is required that the cluster has the EndpointSlice API enabled and +the EndpointSliceTerminatingCondition feature is also enabled. On downgrade, the worse case scenario is that kube-proxy falls back to the existing behavior where it always +excludes terminating endpoints. See [Version Skew Strategy](#version-skew-strategy) below. + +### Version Skew Strategy + +The worse case version skew scenario is that kube-proxy falls back to the existing behavior today where traffic does not fall back to terminating endpoints. +This would either happen if a version of the control plane was not aware of the additions to EndpointSlice or if the version of kube-proxy did not know to consume the terminating condintion in EndpointSlice. + +There's not much risk involved as the worse case scenario is falling back to existing behavior. + +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? + +- [X] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: ProxyTerminatingEndpoints + - Components depending on the feature gate: kube-proxy + +###### Does enabling the feature change any default behavior? 
+ +Yes, when there are only terminating (and ready) endpoints, kube-proxy will route traffic to those endpoints. Before this change, kube-proxy +dropped or disallowed this traffic instead. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes. + +###### What happens if we reenable the feature if it was previously rolled back? + +kube-proxy will no longer drop traffic if only terminating endpoints are available. + +###### Are there any tests for feature enablement/disablement? + +Yes, there will be unit tests in kube-proxy with the feature gate enabled and disabled. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout fail? Can it impact already running workloads? + + + +A rollout can be negatively impacted if workloads are currently dependant on kube-proxy's behavior to never forward traffic to terminating endpoints. +Ideally workloads are configured such that their readiness probes fail when traffic is not desired, but workloads may exist relying on the current behavior. +When the rollout happens, workloads may unexpectedly receive traffic when terminating. + + +###### What specific metrics should inform a rollback? + +There will be metrics added to publish how many Services/Endpoints are routing to terminating pods. It may be expected that clusters +route to many terminating pods at once, especially during rolling updates, but users can correlate this metric with other factors to +gauge if a rollback is necessary. + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + +Upgrade->downgrade->upgrade path has not been tested yet. We may want to require this for beta or GA. + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + +No. + +### Monitoring Requirements + +###### How can an operator determine if the feature is in use by workloads? 
+ +In theory, all workloads receiving traffic through kube-proxy will be impacted by this feature when enabled. However, like other existing capabilities, +the traffic can be controlled by workloads through their readiness probes. Operators should assume that workloads passing the readiness probes can now receive traffic +regardless of their termination state. If this is undesired, workloads should be updated such that the readiness probe fails on termination. + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + +- [X] Metrics + - Metric name: TBD + - [Optional] Aggregation method: + - Components exposing the metric: + - kube-proxy +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs? + +It is difficult to gauge a reasonable SLO because it could be expected for a cluster to be handling many terminating endpoints at a time +during large rolling updates. Whether those terminating pods should receive traffic is also dependant on the cluster topology and the +the workload characteristics. + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + +No. + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + +This feature only depends on core components and APIs. + +### Scalability + +The biggest scalability concern is additional read/writes to the EndpointSlice API for track terminating endpoints. This is covered in more depth +as part of KEP-1672. + +###### Will enabling / using this feature result in any new API calls? + +No. + +###### Will enabling / using this feature result in introducing new API types? + +No. New additions to EndpointSlice is covered in KEP-1672. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No. 
+ +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +No. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +This could impact the existing SLI: + +"Latency of programming in-cluster load balancing mechanism (e.g. iptables), measured from when service spec or list of its Ready pods change to when it is reflected in load balancing mechanism, measured as 99th percentile over last 5 minutes aggregated across all programmers." + +This is because kube-proxy will be updated to handle terminating endpoints, expanding the total set of endpoints it needs to reconcile. + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + +More CPU/RAM may be consumed by kube-proxy to handle terminating endpoints, however we don't anticipate that it will be significant. + +### Troubleshooting + +###### How does this feature react if the API server and/or etcd is unavailable? + +kube-proxy may forward traffic to an endpoint that has terminated already. However, this scenario +is possible today if apiserver becomes unavailable. + +###### What are other known failure modes? + + + +- Traffic is sent to terminating endpoints when the user did not want it. + - Detection: typically by the workload + - Mitigations: workload should be updated to fail readiness probe on termination. + - Diagnostics: metrics should indicate that traffic is being forwarded to terminating endpoints. + - Testing: there are no tests for this failure mode since routing traffic to terminating endpoints based on their readiness is the desired behavior. + +###### What steps should be taken if SLOs are not being met to determine the problem? 
+ +## Implementation History + +- [x] 2020-04-23: KEP accepted as implementable for v1.19 +- [x] 2021-01-21: KEP scope expanded to include both internal and external traffic. + +## Drawbacks + +* scalability: this KEP (and KEP 1672) would add more writes per endpoint to EndpointSlice as each terminating endpoint adds at least 1 and at +most 2 additional writes - 1 write for marking an endpoint as "terminating" and another if an endpoint changes its readiness during termination. +* complexity: an additional corner case is added to kube-proxy, adding to its complexity. + +## Alternatives + +Some users work around this issue today by adding a preStop hook that sleeps for some duration. Though this may work in some scenarios, better handling from kube-proxy +would alleviate the need for this workaround altogether. + diff --git a/keps/sig-network/1669-graceful-termination-local-external-traffic-policy/kep.yaml b/keps/sig-network/1669-proxy-terminating-endpoints/kep.yaml similarity index 90% rename from keps/sig-network/1669-graceful-termination-local-external-traffic-policy/kep.yaml rename to keps/sig-network/1669-proxy-terminating-endpoints/kep.yaml index 2e586daa551..c0d9d5c52c8 100644 --- a/keps/sig-network/1669-graceful-termination-local-external-traffic-policy/kep.yaml +++ b/keps/sig-network/1669-proxy-terminating-endpoints/kep.yaml @@ -1,4 +1,4 @@ -title: Graceful Termination for Local External Traffic Policy +title: Proxy Terminating Endpoints kep-number: 1669 authors: - "@andrewsykim" @@ -14,7 +14,7 @@ approvers: prr-approvers: - "@johnbelamaric" creation-date: 2020-04-07 -last-updated: 2020-04-07 +last-updated: 2022-01-21 status: implementable see-also: - "/keps/sig-network/1672-tracking-terminating-endpoints/README.md" @@ -26,7 +26,7 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on.
-latest-milestone: "v1.22" +latest-milestone: "v1.24" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: From bea551e531afc97acac141a7a4561c71f0267a7c Mon Sep 17 00:00:00 2001 From: Andrew Sy Kim Date: Mon, 24 Jan 2022 23:14:41 -0500 Subject: [PATCH 2/2] kep-1669: fix some typos Signed-off-by: Andrew Sy Kim --- .../1669-proxy-terminating-endpoints/README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/keps/sig-network/1669-proxy-terminating-endpoints/README.md b/keps/sig-network/1669-proxy-terminating-endpoints/README.md index 49c726bee51..10a165b2570 100644 --- a/keps/sig-network/1669-proxy-terminating-endpoints/README.md +++ b/keps/sig-network/1669-proxy-terminating-endpoints/README.md @@ -67,10 +67,10 @@ API now includes terminating endpoints, kube-proxy strictly forwards traffic to terminating endpoints can lead to traffic loss. It's worth diving into one specific scenario described in [this issue](https://github.com/kubernetes/kubernetes/issues/85643): When using Service Type=LoadBalancer w/ externalTrafficPolicy=Local, the availability of node backend is determined by the healthCheckNodePort served by kube-proxy. -Kube-proxy returns a "200 OK" http response on this endpoint if there is a local ready endpoint for a Serivce, otherwise it returns 500 http response signalling to the load balancer that the node should be removed +Kube-proxy returns a "200 OK" http response on this endpoint if there is a local ready endpoint for a Service, otherwise it returns 500 http response signalling to the load balancer that the node should be removed from the backend pool. Upon performing a rolling update of a Deployment, there can be a small window of time where old pods on a node are terminating (hence not "Ready") but the load balancer has not probed kube-proxy's healthCheckNodePort yet. 
In this event, there is traffic loss because the load balancer is routing traffic to a node where the proxy rules will blackhole -the traffic due to a lack of local endpoints. The likihood of this traffic loss is impacted by two factors: the number of local endpoints on the node and the interval between health checks +the traffic due to a lack of local endpoints. The likihood of this traffic loss is impacted by two factors: the number of local endpoints on the node and the interval between health checks from the load balancer. The worse case scenario is a node with 1 local endpoint and a load balancer with a long health check interval. Currently there are several workarounds that users can leverage: @@ -82,7 +82,7 @@ While some of these solutions help, there's more that Kubernetes can do to handl ### Goals -* Reduce potential traffic loss from kube-proxy that occurs on rolling updates because trafffic is sent to Pods that are terminating. +* Reduce potential traffic loss from kube-proxy that occurs on rolling updates because traffic is sent to Pods that are terminating. ### Non-Goals @@ -132,7 +132,7 @@ until either one of the conditions are satisfied. ### Risks and Mitigations There are scalability implications to tracking termination state in EndpointSlice. For now we are assuming that the performance trade-offs are worthwhile but -future testing may change this decision. See KEP 1672 for more details. +future testing may change this decision. See [KEP 1672](../1672-tracking-terminating-endpoints) for more details. ## Design Details
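As a reading aid for the healthCheckNodePort behavior described in the Motivation hunk above, the following Go sketch models a handler that answers 200 when a node has at least one local ready endpoint and 500 otherwise. It is a minimal illustration, not kube-proxy's implementation: `localEndpoints`, `healthStatus`, `healthHandler`, `probe`, and the `/healthz` path are hypothetical names, and the 500 status simply follows the wording of the text (the exact code returned by a real kube-proxy should be treated as an assumption).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// localEndpoints is a hypothetical per-Service summary of the endpoints
// running on this node. It is not a kube-proxy type.
type localEndpoints struct {
	ready       int // endpoints that are ready and not terminating
	terminating int // endpoints that are terminating; ignored by the check below
}

// healthStatus follows the rule described in the text: 200 if there is a
// local ready endpoint, otherwise 500 (status code assumed from the text).
func healthStatus(eps localEndpoints) int {
	if eps.ready > 0 {
		return http.StatusOK
	}
	return http.StatusInternalServerError
}

// healthHandler serves the status for whatever the get callback reports.
func healthHandler(get func() localEndpoints) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		eps := get()
		w.WriteHeader(healthStatus(eps))
		fmt.Fprintf(w, "localReadyEndpoints=%d\n", eps.ready)
	}
}

// probe plays the role of a load balancer health check against the handler
// and returns the HTTP status code it observes.
func probe(eps localEndpoints) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/healthz", nil)
	healthHandler(func() localEndpoints { return eps })(rec, req)
	return rec.Code
}

func main() {
	// A node with one ready endpoint stays in the backend pool.
	fmt.Println(probe(localEndpoints{ready: 1}))
	// A node with only a terminating endpoint reports unhealthy.
	fmt.Println(probe(localEndpoints{terminating: 1}))
}
```

The second case is the window the text is concerned with: the node already reports unhealthy, but the load balancer only learns this on its next probe, so traffic arriving in between depends on how the proxy rules treat the terminating endpoint.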