Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Promote KEP-1672 to GA #2938

Closed

Conversation

andrewsykim
Copy link
Member

  • One-line PR description: Promote feature EndpointSliceTerminatingCondition to GA and ProxyTerminatingEndpoints to Beta.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 2, 2021
@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Sep 2, 2021
@andrewsykim andrewsykim changed the title Promote KEP-1672 to GA & KEP-1669 to Beta [WIP] Promote KEP-1672 to GA & KEP-1669 to Beta Sep 2, 2021
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 2, 2021
@andrewsykim
Copy link
Member Author

Marking this one WIP for now, I think some of the PRR questions need to be answered for KEP-1669

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 2, 2021
@andrewsykim andrewsykim force-pushed the terminating-endpoints-keps branch from ac4b84d to 5e7ed98 Compare September 2, 2021 21:56
@wojtek-t
Copy link
Member

wojtek-t commented Sep 3, 2021

/assign

Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 3, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andrewsykim, thockin
To complete the pull request process, please ask for approval from wojtek-t after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.20"
beta: "v1.22"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we went Beta in 1.22 without:
(a) having this tracked appropriately by RT?
(b) having PRR approved?

This sounds very wrong to me...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a last-minute decision during a SIG Network call to include this feature as beta in v1.22 -- I guess we slipped the KEP updates though. Sorry, that's my bad :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no particular recollection of how this came to be, and certainly did not mean to bypass any process.

@@ -20,16 +20,18 @@ see-also:
replaces: []

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: stable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fill in the PRR questionaire for this KEP.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm there's no diff for PRR cause all the questions were answered in the last release, let me know if there's a specific question I missed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a diff - there are couple questions that weren't answered back then, including:

Are there any tests for feature enablement/disablement?

Have the tests been added? If so, please link them. If not - we shouldn't even go to beta...

What specific metrics should inform a rollback?

We were discussing adding a metric - has that happened?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Has that happened? Findings?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Has the metric mentioned there been added?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have the tests been added? If so, please link them. If not - we shouldn't even go to beta...

Oh yes, of course we added tests for feature enablement, do we want the PRR to link to every test case we added? The answer to the question says:

Yes, there will be strategy API unit tests validating if the new API field is allowed based on the feature gate.

Is that enough or do you want more details on the specific test cases (there's a lot)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me add more details on metrics, I think there was some back and forth on the viability of per endpoint metrics due to cardinality. But maybe total endpoints by condition is acceptable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that enough or do you want more details on the specific test cases (there's a lot)?

Please link them - no need to describe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added all the PRs that adds unit, integration or e2e tests for this feature under Are there any tests for feature enablement/disablement?

@@ -189,34 +194,25 @@ Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
-->

TBD for beta.
Roll out can fail if there are pods receiving traffic during termination but are unable to handle it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - can't comment on unchanged lines:

L182 - have those been added? If so, can you please link them?

-->

TBD for beta.
Application-level metrics should be used to determine if traffic received during termination is causing issues.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's far from ideal, as often cluster admins may not understand application-level metrics.

I agree it might be hard to reflect the exact user-oriented behavior, but can we e.g. at least expose some kube-proxy level metric that will be showing how many:
(a) terminating & ready
(b) terminating & not-ready
endpoints can in theory be targetted?
[i.e. counters]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of me thinks it's weird that we would have endpoints-level metrics for exclusively termianting endpoints. Should we address this more holsitically and metrics for all condition types?

e.g. at least expose some kube-proxy level metric that will be showing how many:

If we only care about total endpoints by condition should it be aggregated total from endpointslice controller or do we still want per-node terminating condition metrics?

-->

TBD for beta.
No, but upgrade testing should be done prior to Beta.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated or just manual? [I guess manual - if so, let's make it explicit]

Manual tests are still to be run.

Also - please ensure that the results will be added here to the KEP before the actual graduation will happen in k/k code [something like: https://github.com//pull/2538/files ]

TBD for beta.
* A Service Type=LoadBalancer sets externalTrafficPolicy=Local.
* The load balancer implementation uses `spec.healthCheckNodePort` for node health checking.
* A Pod can receive traffic during termination.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those conditions are non-trivial to determine for cluster operators.

How about having the metrics I suggested above [this won't allow the answer for specific service, but at least for conjunctions of all services.

- [ ] Other (treat as last resort)
- Details:
- [X] Other (treat as last resort)
- Details: SLIs are difficult to measure for this feature since health of a service is dependant on the underlying process in the Pod as well as the load balancer implementation fronting the service.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we're mixing two things here:
(a) what is the end-user experience [i.e. if user requests are being served correctly]
(b) if the feature itself works correctly at k8s level [e.g. if there is a terminating endpoint we send traffic to it instead of black-holing the traffic]

Your answer is generally about (a). But if answering (a) is hard, we should at least try to answer (b).
And answering (b) sounds possible to me.

As an example - this SLI (and corresponding SLO) should actually serve the purpose relatively well - and should require just adjusting some labels in the reported metrics....

@@ -270,7 +266,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
implementation difficulties, etc.).
-->

TBD for beta.
We may consider adding metrics for total endpoints that are in the terminating state -- this will be evaluated based on the
cardinality of such metrics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm not fully following - can you clarify?

keps/prod-readiness/sig-network/1672.yaml Show resolved Hide resolved
@gracenng
Copy link
Member

gracenng commented Sep 7, 2021

Hi @andrewsykim 1.23 enhancements shadow here. The current status of this enhancement is at risk as PRR is not approved and GA graduation criteria is missing. Friendly reminder that enhancement freeze is Sept 9th. Feel free to ping me on Slack once this PR is merged. Thanks

@andrewsykim andrewsykim force-pushed the terminating-endpoints-keps branch from 5e7ed98 to e3fd196 Compare September 7, 2021 21:32
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 7, 2021
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 7, 2021
@thockin
Copy link
Member

thockin commented Sep 8, 2021

I think I got my head screwed up. I was thinking that KEP freeze meant that all KEP PRs had to merge, but that can't be right. The KEP update for alpha->beta or beta->GA is a trailing indicator - and should only be merged AFTER the code is merged.

We can say these intend to change in 1.23 and merge changes to the KEP body, but we should NOT update the stage until AFTER the code is in.

So it's fine to block these PRs until code lands. That way, we don't end up with a KEP saying "beta" but the code not being complete (which is probably what happened here).

That's my fault, but I will ask sig-release to clarify and amplify that.

@andrewsykim andrewsykim changed the title [WIP] Promote KEP-1672 to GA & KEP-1669 to Beta [WIP] Promote KEP-1672 to GA Sep 8, 2021
@andrewsykim andrewsykim force-pushed the terminating-endpoints-keps branch from e3fd196 to 53568f6 Compare September 9, 2021 14:08
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 8, 2021
@thockin thockin removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 10, 2021
@andrewsykim
Copy link
Member Author

/close

@k8s-ci-robot
Copy link
Contributor

@andrewsykim: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/network Categorizes an issue or PR as relevant to SIG Network. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants