
KEP-1669: promote ProxyTerminatingEndpoints to beta in v1.23 #2952

Closed
wants to merge 1 commit

Conversation

@andrewsykim (Member):

  • One-line PR description: Promote ProxyTerminatingEndpoints to beta in v1.23

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/network Categorizes an issue or PR as relevant to SIG Network. labels Sep 7, 2021
@thockin (Member) left a comment:

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 7, 2021
@wojtek-t (Member) left a comment:

Re-adding my questions from the original PR; neither of them was answered.


TBD for beta.
Rollout can fail if there are pods receiving traffic during termination that are unable to handle it.
Comment (Member):

L182 - have those been added? If so, can you please link them?


TBD for beta.
Application-level metrics should be used to determine if traffic received during termination is causing issues.
Comment (Member):

That's far from ideal, as cluster admins often may not understand application-level metrics.

I agree it might be hard to reflect the exact user-oriented behavior, but can we at least expose some kube-proxy-level metric showing how many:
(a) terminating & ready
(b) terminating & not-ready
endpoints can in theory be targeted?
[i.e. counters]
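The two counters suggested here could be sketched as follows. This is a minimal illustration, not kube-proxy's actual implementation; the `Endpoint` struct and function name are hypothetical stand-ins for the ready/terminating conditions that EndpointSlice exposes:

```go
package main

import "fmt"

// Endpoint is a hypothetical, simplified view of the two
// EndpointSlice conditions relevant to this feature.
type Endpoint struct {
	Ready       bool
	Terminating bool
}

// countTerminating tallies the two suggested counters:
// terminating & ready, and terminating & not-ready.
func countTerminating(eps []Endpoint) (termReady, termNotReady int) {
	for _, ep := range eps {
		if !ep.Terminating {
			continue
		}
		if ep.Ready {
			termReady++
		} else {
			termNotReady++
		}
	}
	return termReady, termNotReady
}

func main() {
	eps := []Endpoint{
		{Ready: true, Terminating: false},
		{Ready: true, Terminating: true},
		{Ready: false, Terminating: true},
		{Ready: false, Terminating: true},
	}
	r, nr := countTerminating(eps)
	// terminating&ready=1 terminating&not-ready=2
	fmt.Printf("terminating&ready=%d terminating&not-ready=%d\n", r, nr)
}
```

In a real implementation these tallies would presumably be published as kube-proxy gauges or counters rather than returned from a function, but the classification logic is the same.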


TBD for beta.
No, but upgrade testing should be done prior to Beta.
Comment (Member):

Automated or just manual? [I guess manual - if so, let's make it explicit]

Manual tests are still to be run.

Also, please ensure that the results are added to the KEP here before the actual graduation happens in k/k code [something like: https://github.com//pull/2538/files ].

Comment (Member Author):

Ack, will update. I think @aojea mentioned that OpenShift does have some automated testing around this, but I'm not sure we'll get an upstream signal for it.

Comment (Member):

@smarterclayton @danwinship I think we ended up switching to externalTrafficPolicy=Cluster for the PDB tests; do you remember if there are others with externalTrafficPolicy=Local?
If you don't remember off the top of your head, I'll dig into the current tests to find out.

- [X] Other (treat as last resort)
  - Details: SLIs are difficult to measure for this feature since the health of a service is dependent on the underlying process in the Pod as well as the load balancer implementation fronting the service.
Comment (Member):

I think that we're mixing two things here:
(a) what is the end-user experience [i.e. if user requests are being served correctly]
(b) if the feature itself works correctly at k8s level [e.g. if there is a terminating endpoint we send traffic to it instead of black-holing the traffic]

Your answer is generally about (a). But if answering (a) is hard, we should at least try to answer (b).
And answering (b) sounds possible to me.

As an example, this SLI (and corresponding SLO) should actually serve the purpose relatively well, and should require just adjusting some labels in the reported metrics.

TBD for beta.
* A Service Type=LoadBalancer sets externalTrafficPolicy=Local.
* The load balancer implementation uses `spec.healthCheckNodePort` for node health checking.
* A Pod can receive traffic during termination.
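The scenario described by these conditions can be illustrated with a minimal Service manifest (names are hypothetical; `healthCheckNodePort` is normally allocated by the API server when `externalTrafficPolicy: Local` is set, and is shown explicitly here only for clarity):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app            # hypothetical name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # route only to endpoints on the receiving node
  healthCheckNodePort: 32000    # used by the LB implementation for node health checks
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
```

With this configuration, a node whose only local endpoints are terminating would ordinarily start failing the load balancer's health check; ProxyTerminatingEndpoints lets kube-proxy keep routing to terminating-but-still-ready endpoints during that window instead of black-holing the traffic.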
Comment (Member):

Those conditions are non-trivial for cluster operators to determine.

How about having the metrics I suggested above? [This won't allow an answer for a specific service, but at least for the aggregate across all services.]


TBD for beta.
We may consider adding metrics for the total number of endpoints in the terminating state; this will be evaluated based on the cardinality of such metrics.
Comment (Member):

I think I'm not fully following - can you clarify?

Comment (Member Author):

This is saying that we can add some level of metrics for terminating endpoints, but we need to be careful about the labels we apply to them. For example, if we included the endpoint as part of the metric's labels, we would explode metrics cardinality because every endpoint is unique. So per-endpoint metrics are probably not possible, but a total-endpoints count is. That said, I'm on the fence about how useful a total count is if you can't map it back to the nodes and pods it applies to.
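The cardinality trade-off described above can be sketched as follows. This is an illustration only (the label handling is simulated with map keys; a real kube-proxy metric would use the Prometheus client library):

```go
package main

import "fmt"

// series simulates a metrics time-series store keyed by label set.
type series map[string]int

// perEndpoint keeps one series per unique endpoint address, so the
// number of series grows linearly with the number of endpoints --
// the cardinality explosion the comment warns about.
func perEndpoint(addrs []string) series {
	s := series{}
	for _, a := range addrs {
		s[fmt.Sprintf(`endpoint="%s"`, a)]++
	}
	return s
}

// aggregate keeps a single series regardless of endpoint count, at
// the cost of losing the node/pod mapping mentioned in the comment.
func aggregate(addrs []string) series {
	return series{`state="terminating"`: len(addrs)}
}

func main() {
	addrs := []string{"10.0.0.1:80", "10.0.0.2:80", "10.0.0.3:80"}
	fmt.Println("per-endpoint series:", len(perEndpoint(addrs))) // 3 series
	fmt.Println("aggregate series:", len(aggregate(addrs)))      // 1 series
}
```

With thousands of endpoints, the per-endpoint approach produces thousands of series per metric, which is why an aggregate count is the safer default even though it is less actionable.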

Comment (Member Author):

Regardless, I'm going to take a stab at adding metrics for this feature, but I think it wouldn't be that useful if the metric does not surface per-pod / per-endpoint level details.

@wojtek-t wojtek-t self-assigned this Sep 8, 2021
@thockin (Member) commented Nov 8, 2021:

Should we hold on moving this forward until we sort out kubernetes/kubernetes#100313 and kubernetes/kubernetes#106030 (comment)?

@danwinship

@thockin thockin removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 8, 2021
@danwinship (Contributor):

Yeah, I think we should at least make sure we have a path forward that we're happy with, even if we haven't started to implement it yet.

Signed-off-by: Andrew Sy Kim <[email protected]>
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 21, 2022
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 21, 2022
@k8s-ci-robot (Contributor):

@andrewsykim: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-verify | 1c219bd | link | true | /test pull-enhancements-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@andrewsykim (Member Author):

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 21, 2022
@andrewsykim (Member Author):

/close

Closing this in favor of #3174

@k8s-ci-robot (Contributor):

@andrewsykim: Closed this PR.

In response to this:

/close

Closing this in favor of #3174

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
- `approved` Indicates a PR has been approved by an approver from all required OWNERS files.
- `cncf-cla: yes` Indicates the PR's author has signed the CNCF CLA.
- `do-not-merge/hold` Indicates that a PR should not merge because someone has issued a /hold command.
- `kind/kep` Categorizes KEP tracking issues and PRs modifying the KEP directory.
- `sig/network` Categorizes an issue or PR as relevant to SIG Network.
- `size/M` Denotes a PR that changes 30-99 lines, ignoring generated files.
6 participants