Update random pod scaledown KEP for beta #2691

Merged: 1 commit into kubernetes:master on May 12, 2021

Conversation

@damemi (Contributor) commented May 7, 2021:

Ref: #2185

@k8s-ci-robot added labels: cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.), kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory.), sig/apps (Categorizes an issue or PR as relevant to SIG Apps.) on May 7, 2021
@k8s-ci-robot requested review from kow3ns and mattfarina on May 7, 2021 18:20
@k8s-ci-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on May 7, 2021

@damemi (Contributor, PR author) left a comment:
/cc @alculquicondor @soltysh
I've tried updating this with the PRR answers required for beta. Though, imo many of them seem irrelevant since this is a minor change to a previously-undocumented behavior. For example, I don't think we need a metric to expose this behavior (but it could be easy to add a simple boolean, or in some way measure the logarithmic performance of cumulative scaledowns on the cluster). Please let me know if you feel there are better responses we could provide.

@alculquicondor (Member):
You also need to update the file in /keps/prod-readiness

@wojtek-t (Member):
/assign

@@ -263,37 +263,36 @@ _This section must be completed when targeting alpha to a release._
_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
Rollouts should have no effect, as this only relates to scaledown of replicasets

Member:
One can imagine a bug where the new logic panics in some cases.

[agree that it should never impact running workloads]

@damemi (Contributor, PR author):
@wojtek-t maybe you can clarify, because I'm a bit confused by the rollout/rollback terminology: is this referring to a rollout as in a new deployment, or rolling out a cluster upgrade?

Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
N/A

Member:
It's always applicable.

Every single feature should be upgrade->downgrade->upgrade tested.

@damemi (Contributor, PR author):
@alculquicondor could you please provide manual testing for this?

Member:
Sure, I can do it once the code changes are in.

Member:
Please indicate the steps that should be applied in the manual test (probably just create a ReplicaSet and scale it down?)
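
(A minimal sketch of that manual step, assuming a small client-go driver; the ReplicaSet name, namespace, image and replica counts are all illustrative, and the equivalent kubectl create / kubectl scale commands work just as well.)

```go
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Create a ReplicaSet with 5 pause pods.
	rs := &appsv1.ReplicaSet{
		ObjectMeta: metav1.ObjectMeta{Name: "scaledown-test", Namespace: "default"},
		Spec: appsv1.ReplicaSetSpec{
			Replicas: int32Ptr(5),
			Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "scaledown-test"}},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "scaledown-test"}},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "pause", Image: "k8s.gcr.io/pause:3.2"}},
				},
			},
		},
	}
	if _, err := client.AppsV1().ReplicaSets("default").Create(ctx, rs, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// ...wait for the pods to come up and note their creation timestamps...

	// Scale down to 2 and record which pods get deleted; repeat across the
	// upgrade -> downgrade -> upgrade sequence and compare the victims chosen
	// with and without this KEP's feature gate enabled.
	scale, err := client.AppsV1().ReplicaSets("default").GetScale(ctx, "scaledown-test", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	scale.Spec.Replicas = 2
	if _, err := client.AppsV1().ReplicaSets("default").UpdateScale(ctx, "scaledown-test", scale, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("scaled scaledown-test down to 2 replicas; inspect which pods were removed")
}
```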

@damemi (Contributor, PR author):
> Sure, I can do it once the code changes are in.

Would that just be the beta promotion change kubernetes/kubernetes#101767? I'm also confused about how this should be tested with the upgrade->downgrade->upgrade path.

Member:
Please add text stating that you will test this before the feature graduates to beta.
Also - please update the KEP within the release cycle [see https://github.com//pull/2538 as an example where we did that].

@alculquicondor (Member) commented May 11, 2021:
So manual testing is literal: you create a cluster, you upgrade, you run something, you downgrade, run the same thing, upgrade again, report.

Now, the question is, what is the "something" that we should run? Ideally the results can be observed in metrics, but I don't think we can do that for this feature.

EDIT: I just noticed the recommended metric below :)

- [ ] Other (treat as last resort)
- Details:
- [x] Other (treat as last resort)
- Details: The health of this feature is dependent on the overall health of the

Member:
This isn't helpful - it's true for almost any feature.

What we need is a signal to determine if the kcm is doing what it should be doing.

@damemi (Contributor, PR author):
@soltysh are you aware of any KCM metrics (or where I could find a list) that would be good indicators for this feature?

Maybe something measuring successful deletions, or deletions-vs-recreations to detect errant victim selection?

Contributor:
I doubt we have any metrics around that; that particular piece of code doesn't have any specific metrics. Looking at the overall deletions vs creations won't help either, since the summary numbers should not change: the new algorithm affects which pods we pick, not how many we pick.

Member:
Can we come up with some metrics for this? Even if whitebox-y, that's still much better than nothing.

Maybe we can expose a histogram of something like:

  • |age of deleted pod| / |age of youngest pod|
    [We expect all values to be within [1,2], so having values outside of it is a good signal that something is wrong.]

[Well - technically it will collide with https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2255-pod-cost so maybe we additionally need a metric label indicating whether such an annotation was set or not]
[And then the SLO would be that there are no values >2 when pod_cost_set=false]
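
(A minimal sketch of how such a histogram could look if wired into kube-controller-manager, assuming a plain Prometheus client; the metric name, buckets and the pod_cost_set label are illustrative, not an agreed API.)

```go
package replicaset

import "github.com/prometheus/client_golang/prometheus"

// deletedPodAgeRatio is a hypothetical histogram of
// |age of deleted pod| / |age of youngest pod| observed per scale-down victim.
// Values are expected to fall within [1, 2]; the buckets above 2 exist so
// that violations show up instead of being clipped into one catch-all bucket.
var deletedPodAgeRatio = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "replicaset_controller_deleted_pod_age_ratio",
		Help:    "Ratio of the deleted pod's age to the youngest pod's age at scale-down time.",
		Buckets: []float64{1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 4.0, 8.0},
	},
	// pod_cost_set marks scale-downs where the pod deletion cost annotation
	// (KEP-2255) was present and intentionally overrides the age ordering.
	[]string{"pod_cost_set"},
)

func init() {
	prometheus.MustRegister(deletedPodAgeRatio)
}

// recordScaleDownVictim records one observation per deleted pod.
func recordScaleDownVictim(ratio float64, podCostSet bool) {
	label := "false"
	if podCostSet {
		label = "true"
	}
	deletedPodAgeRatio.WithLabelValues(label).Observe(ratio)
}
```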

@damemi (Contributor, PR author):
> |age of deleted pod| / |age of youngest pod|

are these currently available, or is there work needed to add them to KCM? @soltysh where are the KCM metrics set up so that we can start work on that?

- Details: The health of this feature is dependent on the overall health of the
control plane services, specifically kube-controller-manager but also to an extent
services like API server, kubelet, etc.
We could add a metric to KCM to indicate that this feature gate is enabled

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

Member:
Please fill in these two questions too.

Contributor:
@damemi suggestion: the previous deletion-vs-recreation rates are identical to the current rates, or close to them.

Member:
I understand the intention, but I don't think it's a good SLO - SLI/SLO needs precise definition and this value will heavily depend on the load (and can change over time).

That may be a good metric informing rollback though...

@damemi (Contributor, PR author):
@wojtek-t do you want to use the metric you suggested above in #2691 (comment)?

Member:
I think we should build SLO on that metric - yes.

@@ -369,18 +356,11 @@ details). For now, we leave it here.
_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
It would be irrelevant, because the feature relies on Pod timestamps. If the API server

Member:
kcm won't be able to scale-up/down replicasets, so it doesn't matter.

The important thing is that it's not a feature of "running workloads".


* **What are other known failure modes?**
For each of them, fill in the following information by copying the below template:

Member:
add n/a (if none)

Member:
same below

@@ -263,37 +263,39 @@ _This section must be completed when targeting alpha to a release._
_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?

Contributor:
Missing answer here?

@damemi (Contributor, PR author):
I'm waiting for some clarification on #2691 (comment)

Contributor:
Maybe something along:

It shouldn't impact already running workloads. One possible problem during rollout is a panic in the algorithm which might crash controller-manager.

@@ -369,27 +359,21 @@ details). For now, we leave it here.
_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
N/a - this is not a feature of running workloads

Contributor:
The controller won't work if the API server and/or etcd is unavailable, thus it can't delete and create pods.


* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
- [ ] Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- [ ] Other (treat as last resort)
- Details:
- [x] Other (treat as last resort)

Member:
This is not "other". It's a metric, just one that needs to be added.

@wojtek-t (Member) left a comment:
Some suggestions/answers - hope these are useful.

- [ ] Metrics
- Metric name:
- [x] Metrics
- Metric name: deleted_pod_age_ratio

Member:
If this is going to be the metric that I proposed in: #2691 (comment)

can you please define it somewhere in the KEP (and probably add a beta criterion that this metric is implemented)?

[If it's going to be a different metric, please define that too :) ]

- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code
There should be no values `>2` in the above metric when the Pod Cost annotation is set

Member:
s/set/unset/, right?

[Or to be more specific - if this is unset for all pods of the replicaset...]

Basically (it's a comment toward my ask above to define the metric in the KEP), I think that the metric should have a field of "pod-cost-annotation-exists" (better name welcome :-) ) that the replicaset controller will be computing and setting when exporting that metric (and then the check is trivial).
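
(A rough sketch of how the replicaset controller could compute that label when exporting the metric, assuming the controller.kubernetes.io/pod-deletion-cost annotation key from KEP-2255; the helper name and signature are illustrative.)

```go
package replicaset

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// podCostAnnotation is the pod deletion cost annotation introduced by KEP-2255.
const podCostAnnotation = "controller.kubernetes.io/pod-deletion-cost"

// ageRatioSample returns the histogram value and the pod_cost_set label for a
// single deleted pod: the ratio of its age to the youngest pod's age, plus
// whether any pod of the ReplicaSet carries the deletion-cost annotation (in
// which case the [1,2] expectation intentionally no longer applies).
func ageRatioSample(deletedPodAge, youngestPodAge time.Duration, pods []*v1.Pod) (ratio float64, podCostSet bool) {
	for _, p := range pods {
		if _, ok := p.Annotations[podCostAnnotation]; ok {
			podCostSet = true
			break
		}
	}
	if youngestPodAge <= 0 {
		// Guard against dividing by zero for a pod created "just now".
		return 0, podCostSet
	}
	return deletedPodAge.Seconds() / youngestPodAge.Seconds(), podCostSet
}
```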

@wojtek-t (Member):
/assign @soltysh [we will need SIG approval too]

@soltysh (Contributor) left a comment:
One suggestion for that one missing bit; for the rest:
/lgtm
/approve


@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on May 12, 2021
@damemi force-pushed the random-downscale-beta branch from 1fe373c to 95abb6b on May 12, 2021 14:23
@k8s-ci-robot removed the lgtm label on May 12, 2021

@damemi (Contributor, PR author) commented May 12, 2021:
Updated with the latest feedback, and squashed.

One note about the metric (mentioned this to @alculquicondor offline) is that the defined range of [1,2] is only valid for pods whose deletion was based on sorting by timestamp.

Similar to how this conflicts with the Pod Cost Annotation, there are other criteria which could put a pod older than 2 * youngest_pod_age as higher priority for deletion, such as a Pending status phase. I've tried to make this clear in the definitions of the new metric.

@soltysh (Contributor) commented May 12, 2021:
/lgtm

@k8s-ci-robot added the lgtm label on May 12, 2021

@wojtek-t (Member):
/lgtm
/approve

Thanks!

@k8s-ci-robot:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi, soltysh, wojtek-t


@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on May 12, 2021
@k8s-ci-robot merged commit e0f4972 into kubernetes:master on May 12, 2021
@k8s-ci-robot added this to the v1.22 milestone on May 12, 2021
@damemi deleted the random-downscale-beta branch on May 12, 2021 15:08