From 95abb6b9d1664824d1d1fe5937ff748bbd4a3eea Mon Sep 17 00:00:00 2001
From: Mike Dame
Date: Fri, 7 May 2021 14:19:45 -0400
Subject: [PATCH] Update random pod scaledown KEP for beta

---
 keps/prod-readiness/sig-apps/2185.yaml |  2 +
 .../README.md                          | 80 ++++++++-----------
 .../kep.yaml                           |  4 +-
 3 files changed, 37 insertions(+), 49 deletions(-)

diff --git a/keps/prod-readiness/sig-apps/2185.yaml b/keps/prod-readiness/sig-apps/2185.yaml
index b262b3c5e01..307942a1d87 100644
--- a/keps/prod-readiness/sig-apps/2185.yaml
+++ b/keps/prod-readiness/sig-apps/2185.yaml
@@ -1,3 +1,5 @@
 kep-number: 2185
 alpha:
   approver: "@wojtek-t"
+beta:
+  approver: "@wojtek-t"
diff --git a/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md b/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md
index aa4fbb7293f..91f701a4e50 100644
--- a/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md
+++ b/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md
@@ -34,15 +34,15 @@
 Items marked with (R) are required *prior to targeting to a milestone / release*.
 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
 - [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
 - [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
 - [x] (R) Graduation criteria is in place
 - [x] (R) Production readiness review completed
 - [x] Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] "Implementation History" section is up-to-date for milestone
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
 - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
 
@@ -206,6 +206,7 @@ Alpha (v1.21):
 
 Beta (v1.22):
 - Enable LogarithmicScaleDown feature gate by default
+- Enable `deleted_pod_age_ratio` metric
 
 Stable (v1.23):
 - Remove LogarithmicScaleDown feature gate
@@ -263,46 +264,48 @@ _This section must be completed when targeting alpha to a release._
 
 _This section must be completed when targeting beta graduation to a release._
 
 * **How can a rollout fail? Can it impact already running workloads?**
+  This should not affect running workloads, though there is a possibility that the scaledown
+  logic panics, which would cause kube-controller-manager to crash.
 
 * **What specific metrics should inform a rollback?**
+  Increased pod deletions could indicate runaway/hot-loop failures in the scaledown logic.
+  Application availability may also be affected: though the intent of this feature is to provide
+  better availability through more distributed victim selection, in cases where binpacking is
+  desired, pods may remain running on undesired nodes.
 
 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
-  Describe manual testing that was done and the outcomes.
-  Longer term, we may want to require automated upgrade/rollback tests, but we
-  are missing a bunch of machinery and tooling and can't do that now.
+  This will be manually tested before graduation to beta.
 
 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
   fields of API types, flags, etc.?**
-  Even if applying deprecation policies, they may still surprise some users.
+  No
 
 ### Monitoring Requirements
 
 _This section must be completed when targeting beta graduation to a release._
 
 * **How can an operator determine if the feature is in use by workloads?**
-  Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
-  checking if there are objects with field X set) may be a last resort. Avoid
-  logs or events for this purpose.
+  The scaledown behavior of all ReplicaSets is affected while this feature gate is enabled,
+  so monitoring ReplicaSet scaledowns (for example, through the `deleted_pod_age_ratio`
+  metric described below) is how an operator can determine that the feature is in use.
 
 * **What are the SLIs (Service Level Indicators) an operator can use to determine the
 health of the service?**
-  - [ ] Metrics
-    - Metric name:
+  - [x] Metrics
+    - Metric name: deleted_pod_age_ratio
     - [Optional] Aggregation method:
-    - Components exposing the metric:
+    - Components exposing the metric: kube-controller-manager
   - [ ] Other (treat as last resort)
-    - Details:
+
+The metric `deleted_pod_age_ratio` will provide a histogram of the ratio of the chosen
+deleted pod's age to the current youngest pod's age, for pods where the sort algorithm
+falls back to age. (Pod age is the final criterion in the sorting algorithm, so we don't
+want to measure this ratio for deletions which don't use this feature, as those may validly
+fall outside the desired range.)
 
 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-  At a high level, this usually will be in the form of "high percentile of SLI
-  per day <= X". It's impossible to provide comprehensive guidance, but at the very
-  high level (needs more precise definitions) those may be things like:
-  - per-day percentage of API calls finishing with 5XX errors <= 1%
-  - 99% percentile over day of absolute value from (job creation time minus expected
-    job creation time) for cron job <= 10%
-  - 99,9% of /health requests per day finish with 200 code
+  There should be no values `>2` in the above metric when the Pod Cost annotation is unset
+  (see https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2255-pod-cost) and
+  the pod's deletion was based on a timestamp comparison (rather than, for example, pod state).
 
 * **Are there any missing metrics that would be useful to have to improve observability
 of this feature?**
 
@@ -314,19 +317,7 @@
 
 _This section must be completed when targeting beta graduation to a release._
 
 * **Does this feature depend on any specific services running in the cluster?**
-  Think about both cluster-level services (e.g. metrics-server) as well
-  as node-level agents (e.g. specific version of CRI). Focus on external or
-  optional services that are needed. For example, if this feature depends on
-  a cloud provider API, or upon an external software-defined storage or network
-  control plane.
-
-  For each of these, fill in the following—thinking about running existing user workloads
-  and creating new ones, as well as about cluster-level services (e.g. DNS):
-  - [Dependency name]
-    - Usage description:
-      - Impact of its outage on the feature:
-      - Impact of its degraded performance or high-error rates on the feature:
+  No, it is part of the kube-controller-manager.
 
 ### Scalability
 
@@ -369,27 +360,22 @@ details). For now, we leave it here.
 
 _This section must be completed when targeting beta graduation to a release._
 
 * **How does this feature react if the API server and/or etcd is unavailable?**
+  N/A - this is not a feature of running workloads. If the API server or etcd is unavailable,
+  the controller will not work, and will be unable to scale up or down.
 
 * **What are other known failure modes?**
-  For each of them, fill in the following information by copying the below template:
-  - [Failure mode brief description]
-    - Detection: How can it be detected via metrics? Stated another way:
-      how can an operator troubleshoot without logging into a master or worker node?
-    - Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    - Diagnostics: What are the useful log messages and their required logging
-      levels that could help debug the issue?
-      Not required until feature graduated to beta.
-    - Testing: Are there any tests for failure mode? If not, describe why.
+  N/A
 
 * **What steps should be taken if SLOs are not being met to determine the problem?**
+  N/A
 
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 
 ## Implementation History
 
-- 2020-01-06: Initial KEP submitted
+- 2021-01-06: Initial KEP submitted
+- 2021-05-07: Updated KEP for graduation to beta
 
 ## Drawbacks
 
diff --git a/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/kep.yaml b/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/kep.yaml
index 1bd8c470215..bdd2c3fa683 100644
--- a/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/kep.yaml
+++ b/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/kep.yaml
@@ -22,8 +22,8 @@ see-also:
   - "/keps/sig-apps/1828-delete-priority-annotations"
 replaces:
 
-stage: alpha
-latest-milestone: "v1.21"
+stage: beta
+latest-milestone: "v1.22"
 milestone:
   alpha: "v1.21"
   beta: "v1.22"
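
Editor's note: to make the `deleted_pod_age_ratio` answer above concrete, here is a minimal
sketch of how such a histogram could be computed and recorded. It uses the plain
`prometheus/client_golang` API rather than the metrics wrappers Kubernetes components use
internally; the bucket layout, package name, and `RecordDeletedPodAgeRatio` helper are
illustrative assumptions, not the actual kube-controller-manager implementation.

```go
// Editor's sketch (not the actual kube-controller-manager code): recording
// the deleted_pod_age_ratio histogram described in the Monitoring
// Requirements answer above.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// deletedPodAgeRatio tracks, per downscale decision, the ratio of the chosen
// (deleted) pod's age to the youngest pod's age. Bucket boundaries here are
// illustrative assumptions.
var deletedPodAgeRatio = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "deleted_pod_age_ratio",
	Help:    "Ratio of deleted pod's age to youngest pod's age, for downscales decided by pod age.",
	Buckets: prometheus.ExponentialBuckets(0.25, 2, 6), // 0.25 up to 8
})

func init() {
	prometheus.MustRegister(deletedPodAgeRatio)
}

// RecordDeletedPodAgeRatio (hypothetical helper) should be called only when
// the downscale sort fell through every earlier criterion and compared
// creation timestamps, so the histogram reflects only age-based deletions.
func RecordDeletedPodAgeRatio(deletedPodCreated, youngestPodCreated, now time.Time) {
	youngestAge := now.Sub(youngestPodCreated)
	if youngestAge <= 0 {
		return // brand-new pod; avoid a zero or negative denominator
	}
	ratio := float64(now.Sub(deletedPodCreated)) / float64(youngestAge)
	deletedPodAgeRatio.Observe(ratio)
}
```

Under the SLO stated in the patch, observations from age-based deletions should stay at or
below 2 once LogarithmicScaleDown's randomized bucketing works as intended; a tail of values
well above 2 would suggest victim selection is straying far from the youngest pods.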