Update random pod scaledown KEP for beta #2691
Conversation
/cc @alculquicondor @soltysh
I've tried updating this with the PRR answers required for beta. Though, imo many of them seem irrelevant since this is a minor change to a previously-undocumented behavior. For example, I don't think we need a metric to expose this behavior (but it could be easy to add a simple boolean, or in some way measure the logarithmic performance of cumulative scaledowns on the cluster). Please let me know if you feel there are better responses we could provide.
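For context on the behavior under discussion, here is a rough sketch (all names hypothetical; this is not the actual kube-controller-manager code) of what the LogarithmicScaleDown feature gate enables: pod ages are bucketed on a log2 scale, so pods of roughly the same age tie and the scaledown victim is chosen randomly among the youngest bucket rather than strictly youngest-first.

```go
package main

import (
	"math"
	"time"

	v1 "k8s.io/api/core/v1"
)

// ageRank returns a coarse, logarithmic age bucket for a pod. Two pods whose
// ages fall into the same power-of-two bucket compare as equal during victim
// selection, and ties are broken randomly.
func ageRank(pod *v1.Pod, now time.Time) int {
	age := now.Sub(pod.CreationTimestamp.Time).Seconds()
	if age < 1 {
		age = 1 // clamp so very young pods all land in bucket 0
	}
	return int(math.Floor(math.Log2(age)))
}
```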
You also need to update the file in
/assign
@@ -263,37 +263,36 @@ _This section must be completed when targeting alpha to a release._
_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
  Try to be as paranoid as possible - e.g., what if some components will restart
  mid-rollout?
  Rollouts should have no effect, as this only relates to scaledown of replicasets
One can imagine a bug that the new logic panics in some cases.
[agree that it should never impact running workloads]
@wojtek-t maybe you can clarify, because I'm a bit confused by the rollout/rollback terminology: is this referring to a rollout as in a new deployment, or rolling out a cluster upgrade?
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
N/A
It's always applicable.
Every single feature should be upgrade->downgrade->upgrade tested.
@alculquicondor could you please provide manual testing for this?
Sure, I can do it once the code changes are in.
Please indicate the steps that should be applied in the manual test (probably just create a ReplicaSet and downscale?)
> Sure, I can do it once the code changes are in.
Would that just be the beta promotion change kubernetes/kubernetes#101767? I'm also confused about how this should be tested with the upgrade->downgrade->upgrade path
Please add text stating that you will test this before the feature graduates to beta.
Also - please update the KEP within the release cycle (see https://github.com//pull/2538 as an example where we did that).
So manual testing is literal: you create a cluster, you upgrade, you run something, you downgrade, run the same thing, upgrade again, report.
Now, the question is, what is the "something" that we should run? Ideally the results can be observed in metrics, but I don't think we can do that for this feature.
EDIT: I just noticed the recommended metric below :)
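For concreteness, a minimal sketch of the "something" to run at each step of the upgrade->downgrade->upgrade cycle: create a ReplicaSet, scale it down, and record which pods survive. This assumes client-go against a default kubeconfig; the namespace, labels, and image are illustrative, not from the KEP.

```go
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	labels := map[string]string{"app": "scaledown-test"}
	rs := &appsv1.ReplicaSet{
		ObjectMeta: metav1.ObjectMeta{Name: "scaledown-test", Namespace: "default"},
		Spec: appsv1.ReplicaSetSpec{
			Replicas: int32Ptr(10),
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{Containers: []corev1.Container{
					{Name: "pause", Image: "registry.k8s.io/pause:3.9"},
				}},
			},
		},
	}
	created, err := cs.AppsV1().ReplicaSets("default").Create(ctx, rs, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}

	// (Wait here for all 10 pods to be Running; ideally scale up in a few
	// batches so the pods have meaningfully different ages.)

	// Scale down to 2 replicas and let the controller pick the victims.
	created.Spec.Replicas = int32Ptr(2)
	if _, err := cs.AppsV1().ReplicaSets("default").Update(ctx, created, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}

	// With LogarithmicScaleDown enabled, the survivors should not always be
	// exactly the two oldest pods; compare surviving timestamps across runs.
	pods, err := cs.CoreV1().Pods("default").List(ctx, metav1.ListOptions{LabelSelector: "app=scaledown-test"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Println(p.Name, p.CreationTimestamp)
	}
}
```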
- [ ] Other (treat as last resort)
  - Details:
- [x] Other (treat as last resort)
  - Details: The health of this feature is dependent on the overall health of the
This isn't helpful - it's true for almost any feature.
What we need is a signal to determine if the kcm is doing what it should be doing.
@soltysh are you aware of any KCM metrics (or where I could find a list) that would be good indicators for this feature?
Maybe something measuring successful deletions, or deletions-vs-recreations to detect errant victim selection?
I doubt we have any metrics around that; that particular piece of code doesn't have any specific metrics. Looking at the overall deletions vs creations won't help either, since the summary numbers should not change: the new algorithm affects which pods we pick, not how many we pick.
Can we come up with some metrics for this? Even if whitebox-y, that's still much better than nothing.
Maybe we can expose a histogram of something like:
- |age of deleted pod| / |age of youngest pod|

[We expect all values to be within [1,2], so having values outside of it is a good signal that something is wrong.]
[Well - technically it will collide with https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2255-pod-cost so maybe we additionally need a metric label for whether such an annotation was set or not.]
[And then the SLO would be that there are no values >2 when pod_cost_set=false.]
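To make the proposal concrete, a sketch of such a histogram using the plain Prometheus client (rather than k8s.io/component-base/metrics). The metric and label names just follow this thread's suggestions; nothing here is an implemented KCM metric.

```go
package metrics

import (
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var deletedPodAgeRatio = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "deleted_pod_age_ratio", // hypothetical name from this thread
		Help: "Ratio of the deleted pod's age to the youngest pod's age in the ReplicaSet.",
		// Values are expected to fall within [1,2]; buckets above 2 exist to catch violations.
		Buckets: []float64{0.5, 1, 1.25, 1.5, 1.75, 2, 4, 8},
	},
	[]string{"pod_cost_set"},
)

func init() {
	prometheus.MustRegister(deletedPodAgeRatio)
}

// ObserveDeletion records one scaledown victim. podCostSet should be true if
// any pod of the ReplicaSet carried the pod-deletion-cost annotation, since
// that legitimately overrides age-based selection.
func ObserveDeletion(deletedPodAge, youngestPodAge time.Duration, podCostSet bool) {
	deletedPodAgeRatio.WithLabelValues(strconv.FormatBool(podCostSet)).Observe(
		deletedPodAge.Seconds() / youngestPodAge.Seconds(),
	)
}
```

With the pod_cost_set label in place, the SLO check becomes trivial: alert only on samples above 2 where pod_cost_set="false".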
> |age of deleted pod| / |age of youngest pod|
are these currently available, or is there work needed to add them to KCM? @soltysh where are the KCM metrics set up so that we can start work on that?
- Details: The health of this feature is dependent on the overall health of the
  control plane services, specifically kube-controller-manager but also to an extent
  services like API server, kubelet, etc.
  We could add a metric to KCM to indicate that this feature gate is enabled

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
Please fill in these two questions too.
@damemi suggestion: that the previous deletion-vs-recreation rates are identical, or close, to the current rates.
I understand the intention, but I don't think it's a good SLO - SLI/SLO needs precise definition and this value will heavily depend on the load (and can change over time).
That may be a good metric informing rollback though...
@wojtek-t do you want to use the metric you suggested above in #2691 (comment)?
I think we should build SLO on that metric - yes.
@@ -369,18 +356,11 @@ details). For now, we leave it here.
_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
  It would be irrelevant, because the feature relies on Pod timestamps. If the API server
kcm won't be able to scale up/down replicasets, so it doesn't matter.
The important thing is that it's not a feature of "running workloads".
* **What are other known failure modes?**
  For each of them, fill in the following information by copying the below template:
add n/a (if none)
same below
@@ -263,37 +263,39 @@ _This section must be completed when targeting alpha to a release._
_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
  Try to be as paranoid as possible - e.g., what if some components will restart
  mid-rollout?
Missing answer here?
I'm waiting for some clarification on #2691 (comment)
Maybe something along the lines of:
It shouldn't impact already running workloads. One possible problem during rollout is a panic in the algorithm which might crash the controller-manager.
@@ -369,27 +359,21 @@ details). For now, we leave it here.
_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
  N/a - this is not a feature of running workloads
The controller won't work if the API server and/or etcd is unavailable, thus it can't delete and create pods.
* **What are the SLIs (Service Level Indicators) an operator can use to determine
  the health of the service?**
  - [ ] Metrics
    - Metric name:
    - [Optional] Aggregation method:
    - Components exposing the metric:
  - [ ] Other (treat as last resort)
    - Details:
  - [x] Other (treat as last resort)
This is not "other". It's a metric, just one that needs to be added.
Some suggestions/answers - hope these are useful.
- [ ] Metrics
  - Metric name:
- [x] Metrics
  - Metric name: deleted_pod_age_ratio
If this is going to be the metric that I proposed in #2691 (comment), can you please define it somewhere in the KEP (and probably add a beta criterion that this metric is implemented)?
[If it's going to be a different metric, please define that too :)]
- 99% percentile over day of absolute value from (job creation time minus expected
  job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code
There should be no values `>2` in the above metric when the Pod Cost annotation is set
s/set/unset/, right?
[Or to be more specific - if this is unset for all pods of the replicaset...]
Basically (this is a comment toward my ask above to define the metric in the KEP), I think that the metric should have a field of "pod-cost-annotation-exists" (better name welcome :-) ) that the replicaset controller will be computing and setting when exporting that metric (and then the check is trivial).
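A minimal sketch (the helper name is hypothetical) of how the ReplicaSet controller could compute that label when exporting the metric. The annotation key itself is real: core/v1 defines PodDeletionCost as "controller.kubernetes.io/pod-deletion-cost".

```go
package main

import v1 "k8s.io/api/core/v1"

// podCostAnnotationExists reports whether any pod of the ReplicaSet sets the
// pod-deletion-cost annotation. When true, ratios above 2 are expected,
// because the annotation legitimately overrides age-based victim selection.
func podCostAnnotationExists(pods []*v1.Pod) bool {
	for _, pod := range pods {
		if _, ok := pod.Annotations[v1.PodDeletionCost]; ok {
			return true
		}
	}
	return false
}
```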
/assign @soltysh [we will need SIG approval too]
One suggestion for that one missing bit; for the rest:
/lgtm
/approve
Force-pushed from 1fe373c to 95abb6b.
Updated with the latest feedback, and squashed. One note about the metric (mentioned this to @alculquicondor offline) is that the defined range of `[1,2]` is not strictly guaranteed: similar to how this conflicts with the Pod Cost annotation, there are other criteria which could put a pod older than twice the youngest first in the deletion order.
/lgtm
/lgtm Thanks!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Ref: #2185