-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Promote KEP-1672 to GA #2938
[WIP] Promote KEP-1672 to GA #2938
Conversation
andrewsykim
commented
Sep 2, 2021
- One-line PR description: Promote feature EndpointSliceTerminatingCondition to GA and ProxyTerminatingEndpoints to Beta.
- Issue link: Proxy Terminating Endpoints #1669 & Tracking Terminating Endpoints #1672
- Other comments:
Marking this one WIP for now, I think some of the PRR questions need to be answered for KEP-1669 |
ac4b84d
to
5e7ed98
Compare
/assign |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: andrewsykim, thockin The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
||
# The milestone at which this feature was, or is targeted to be, at each stage. | ||
milestone: | ||
alpha: "v1.20" | ||
beta: "v1.22" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we went Beta in 1.22 without:
(a) having this tracked appropriately by RT?
(b) having PRR approved?
This sounds very wrong to me...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a last-minute decision during a SIG Network call to include this feature as beta in v1.22 -- I guess we slipped the KEP updates though. Sorry, that's my bad :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no particular recollection of how this came to be, and certainly did not mean to bypass any process.
@@ -20,16 +20,18 @@ see-also: | |||
replaces: [] | |||
|
|||
# The target maturity stage in the current dev cycle for this KEP. | |||
stage: alpha | |||
stage: stable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fill in the PRR questionaire for this KEP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm there's no diff for PRR cause all the questions were answered in the last release, let me know if there's a specific question I missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a diff - there are couple questions that weren't answered back then, including:
Are there any tests for feature enablement/disablement?
Have the tests been added? If so, please link them. If not - we shouldn't even go to beta...
What specific metrics should inform a rollback?
We were discussing adding a metric - has that happened?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Has that happened? Findings?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
Has the metric mentioned there been added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have the tests been added? If so, please link them. If not - we shouldn't even go to beta...
Oh yes, of course we added tests for feature enablement, do we want the PRR to link to every test case we added? The answer to the question says:
Yes, there will be strategy API unit tests validating if the new API field is allowed based on the feature gate.
Is that enough or do you want more details on the specific test cases (there's a lot)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me add more details on metrics, I think there was some back and forth on the viability of per endpoint metrics due to cardinality. But maybe total endpoints by condition is acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that enough or do you want more details on the specific test cases (there's a lot)?
Please link them - no need to describe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added all the PRs that adds unit, integration or e2e tests for this feature under Are there any tests for feature enablement/disablement?
@@ -189,34 +194,25 @@ Try to be as paranoid as possible - e.g., what if some components will restart | |||
mid-rollout? | |||
--> | |||
|
|||
TBD for beta. | |||
Roll out can fail if there are pods receiving traffic during termination but are unable to handle it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - can't comment on unchanged lines:
L182 - have those been added? If so, can you please link them?
--> | ||
|
||
TBD for beta. | ||
Application-level metrics should be used to determine if traffic received during termination is causing issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's far from ideal, as often cluster admins may not understand application-level metrics.
I agree it might be hard to reflect the exact user-oriented behavior, but can we e.g. at least expose some kube-proxy level metric that will be showing how many:
(a) terminating & ready
(b) terminating & not-ready
endpoints can in theory be targetted?
[i.e. counters]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of me thinks it's weird that we would have endpoints-level metrics for exclusively termianting endpoints. Should we address this more holsitically and metrics for all condition types?
e.g. at least expose some kube-proxy level metric that will be showing how many:
If we only care about total endpoints by condition should it be aggregated total from endpointslice controller or do we still want per-node terminating condition metrics?
--> | ||
|
||
TBD for beta. | ||
No, but upgrade testing should be done prior to Beta. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Automated or just manual? [I guess manual - if so, let's make it explicit]
Manual tests are still to be run.
Also - please ensure that the results will be added here to the KEP before the actual graduation will happen in k/k code [something like: https://github.com//pull/2538/files ]
TBD for beta. | ||
* A Service Type=LoadBalancer sets externalTrafficPolicy=Local. | ||
* The load balancer implementation uses `spec.healthCheckNodePort` for node health checking. | ||
* A Pod can receive traffic during termination. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those conditions are non-trivial to determine for cluster operators.
How about having the metrics I suggested above [this won't allow the answer for specific service, but at least for conjunctions of all services.
- [ ] Other (treat as last resort) | ||
- Details: | ||
- [X] Other (treat as last resort) | ||
- Details: SLIs are difficult to measure for this feature since health of a service is dependant on the underlying process in the Pod as well as the load balancer implementation fronting the service. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we're mixing two things here:
(a) what is the end-user experience [i.e. if user requests are being served correctly]
(b) if the feature itself works correctly at k8s level [e.g. if there is a terminating endpoint we send traffic to it instead of black-holing the traffic]
Your answer is generally about (a). But if answering (a) is hard, we should at least try to answer (b).
And answering (b) sounds possible to me.
As an example - this SLI (and corresponding SLO) should actually serve the purpose relatively well - and should require just adjusting some labels in the reported metrics....
@@ -270,7 +266,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co | |||
implementation difficulties, etc.). | |||
--> | |||
|
|||
TBD for beta. | |||
We may consider adding metrics for total endpoints that are in the terminating state -- this will be evaluated based on the | |||
cardinality of such metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm not fully following - can you clarify?
Hi @andrewsykim 1.23 enhancements shadow here. The current status of this enhancement is |
5e7ed98
to
e3fd196
Compare
New changes are detected. LGTM label has been removed. |
I think I got my head screwed up. I was thinking that KEP freeze meant that all KEP PRs had to merge, but that can't be right. The KEP update for alpha->beta or beta->GA is a trailing indicator - and should only be merged AFTER the code is merged. We can say these intend to change in 1.23 and merge changes to the KEP body, but we should NOT update the stage until AFTER the code is in. So it's fine to block these PRs until code lands. That way, we don't end up with a KEP saying "beta" but the code not being complete (which is probably what happened here). That's my fault, but I will ask sig-release to clarify and amplify that. |
Signed-off-by: Andrew Sy Kim <[email protected]>
Signed-off-by: Andrew Sy Kim <[email protected]>
e3fd196
to
53568f6
Compare
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/close |
@andrewsykim: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |