KEP-1669: promote ProxyTerminatingEndpoints to beta in v1.23 #2952
Conversation
andrewsykim commented on Sep 7, 2021
- One-line PR description: Promote ProxyTerminatingEndpoints to beta in v1.23
- Issue link: Proxy Terminating Endpoints #1669
- Other comments: the initial PR was opened in [WIP] Promote KEP-1672 to GA #2938. Splitting each KEP into its own PR as requested by @wojtek-t.
Thanks!
/lgtm
/approve
Re-adding my questions from the original PR; neither of them was answered.
TBD for beta.
Roll out can fail if there are pods receiving traffic during termination but are unable to handle it.
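As a hedged illustration of what "able to handle traffic during termination" means in practice (this is not part of the KEP itself), here is a minimal Go HTTP server that keeps serving after SIGTERM and drains in-flight requests before exiting; pods that exit immediately on SIGTERM are the ones this rollout concern is about.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: ":8080",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("ok"))
		}),
	}

	go func() {
		// ErrServerClosed is expected after Shutdown below.
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Keep serving after SIGTERM: requests routed to this terminating pod
	// (e.g. by a load balancer that has not yet removed the node) still succeed.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	// Drain in-flight requests within the pod's termination grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```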
L182 - have those been added? If so, can you please link them?
TBD for beta.
Application-level metrics should be used to determine if traffic received during termination is causing issues.
That's far from ideal, as often cluster admins may not understand application-level metrics.
I agree it might be hard to reflect the exact user-oriented behavior, but can we at least expose some kube-proxy-level metrics showing how many
(a) terminating & ready
(b) terminating & not-ready
endpoints can in theory be targeted?
[i.e. counters]
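A rough sketch of what such counters could look like, using the plain Prometheus Go client rather than kube-proxy's real metrics plumbing; the metric names and the RecordTerminatingEndpoints helper are hypothetical, not existing kube-proxy metrics.

```go
// Sketch only: not kube-proxy's actual metrics code. Metric names are made up.
package proxymetrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Terminating endpoints that are still serving (ready): the ones
	// ProxyTerminatingEndpoints may fall back to.
	terminatingServingEndpoints = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "kubeproxy_terminating_serving_endpoints", // hypothetical
		Help: "Number of endpoints that are terminating and still serving.",
	})

	// Terminating endpoints that are no longer serving: traffic sent here
	// would likely be dropped.
	terminatingNotServingEndpoints = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "kubeproxy_terminating_not_serving_endpoints", // hypothetical
		Help: "Number of endpoints that are terminating and not serving.",
	})
)

func init() {
	prometheus.MustRegister(terminatingServingEndpoints, terminatingNotServingEndpoints)
}

// RecordTerminatingEndpoints would be called after each proxy sync, with counts
// derived from EndpointSlice conditions (ready/serving/terminating).
func RecordTerminatingEndpoints(serving, notServing int) {
	terminatingServingEndpoints.Set(float64(serving))
	terminatingNotServingEndpoints.Set(float64(notServing))
}
```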
TBD for beta.
No, but upgrade testing should be done prior to Beta.
Automated or just manual? [I guess manual - if so, let's make it explicit]
Manual tests are still to be run.
Also - please ensure that the results are added here to the KEP before the actual graduation happens in k/k code [something like: https://github.com//pull/2538/files ]
ack, will update. I think @aojea mentioned that OpenShift does have some automated testing around this, but I'm not sure we'll get an upstream signal for it.
@smarterclayton @danwinship I think we ended up switching to externalTrafficPolicy=Cluster for the PDB tests. Do you remember if there are others with externalTrafficPolicy=Local? If you don't remember off the top of your head, I'll dig into the current tests to find out.
- [X] Other (treat as last resort)
  - Details: SLIs are difficult to measure for this feature since the health of a service is dependent on the underlying process in the Pod as well as the load balancer implementation fronting the service.
I think that we're mixing two things here:
(a) what is the end-user experience [i.e. if user requests are being served correctly]
(b) if the feature itself works correctly at the k8s level [e.g. if there is a terminating endpoint, we send traffic to it instead of black-holing the traffic]
Your answer is generally about (a). But if answering (a) is hard, we should at least try to answer (b).
And answering (b) sounds possible to me.
As an example - this SLI (and corresponding SLO) should actually serve the purpose relatively well - and should require just adjusting some labels in the reported metrics....
TBD for beta.
* A Service Type=LoadBalancer sets externalTrafficPolicy=Local.
* The load balancer implementation uses `spec.healthCheckNodePort` for node health checking.
* A Pod can receive traffic during termination.
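For illustration only (names are hypothetical), a client-go sketch of a Service matching the conditions above; with Type=LoadBalancer and externalTrafficPolicy=Local, the API server allocates spec.healthCheckNodePort, which load balancer implementations probe to find nodes that have local, ready endpoints.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Hypothetical Service of the shape the KEP text describes.
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "example", Namespace: "default"},
		Spec: corev1.ServiceSpec{
			Type:                  corev1.ServiceTypeLoadBalancer,
			ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
			Selector:              map[string]string{"app": "example"},
			Ports: []corev1.ServicePort{{
				Port:       80,
				TargetPort: intstr.FromInt(8080),
			}},
		},
	}
	// spec.healthCheckNodePort is left unset here; the API server assigns it
	// when externalTrafficPolicy is Local on a LoadBalancer Service.
	fmt.Println(svc.Name, svc.Spec.ExternalTrafficPolicy)
}
```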
Those conditions are non-trivial for cluster operators to determine.
How about having the metrics I suggested above? [This won't allow answering for a specific service, but at least for the aggregate of all services.]
TBD for beta.
We may consider adding metrics for total endpoints that are in the terminating state -- this will be evaluated based on the cardinality of such metrics.
I think I'm not fully following - can you clarify?
This is saying that we can add some level of metrics for terminating endpoints, but we need to be careful about the labels we apply to them. For example, if we included the endpoint as part of the metric labels, we would be exploding metrics cardinality because every endpoint is unique. So maybe per-endpoint metrics are not possible, but a total endpoint count is. But I'm on the fence about how useful a total count is if you can't map it back to which nodes and pods it applies to.
Regardless, I'm going to take a stab at adding metrics for this feature, but I think it wouldn't be that useful if the metric does not surface per-pod/per-endpoint details.
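To make the cardinality trade-off concrete, here is a sketch with the plain Prometheus client (metric names hypothetical, not real kube-proxy metrics): a gauge vector labeled by endpoint IP creates one time series per pod and grows without bound, while a gauge split only by serving state stays at two series but cannot be mapped back to specific nodes or pods.

```go
// Standalone sketch of the cardinality trade-off; metric names are made up.
package proxymetrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// High cardinality: one time series per terminating endpoint. Every pod IP
	// is unique, so this grows with churn and is generally a bad idea.
	terminatingEndpointInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "kubeproxy_terminating_endpoint_info", // hypothetical
		Help: "1 for each endpoint currently in the terminating state.",
	}, []string{"namespace", "service", "endpoint_ip"})

	// Bounded cardinality: only two series (serving="true"/"false"), but the
	// operator cannot map the count back to specific nodes or pods.
	terminatingEndpointsTotal = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "kubeproxy_terminating_endpoints", // hypothetical
		Help: "Number of endpoints currently in the terminating state.",
	}, []string{"serving"})
)

func init() {
	prometheus.MustRegister(terminatingEndpointInfo, terminatingEndpointsTotal)
}
```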
Should we hold off on moving this forward until we sort out kubernetes/kubernetes#100313 and kubernetes/kubernetes#106030 (comment)?
Yeah, I think we should at least make sure we have a path forward that we're happy with, even if we haven't started to implement it yet.
Signed-off-by: Andrew Sy Kim <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andrewsykim, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@andrewsykim: The following test failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/hold
/close
Closing this in favor of #3174
@andrewsykim: Closed this PR.