From 074f9c7d2e5275e152306209c1fd41f7b7b91a71 Mon Sep 17 00:00:00 2001
From: Mike Dame
Date: Thu, 14 Jan 2021 10:26:14 -0500
Subject: [PATCH] (squash) feedback

---
 .../README.md | 31 +++++++------------
 1 file changed, 11 insertions(+), 20 deletions(-)

diff --git a/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md b/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md
index 45ed07496d96..53cc498f6248 100644
--- a/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md
+++ b/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale/README.md
@@ -9,7 +9,6 @@
 - [Proposal](#proposal)
   - [User Stories (Optional)](#user-stories-optional)
     - [Story 1](#story-1)
-    - [Story 2](#story-2)
   - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
@@ -27,7 +26,8 @@
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
-- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
+  - [Make downscale heuristic an option](#make-downscale-heuristic-an-option)
+  - [Compare pods using their distribution in the failure domains](#compare-pods-using-their-distribution-in-the-failure-domains)
 <!-- /toc -->
 
 ## Release Signoff Checklist
@@ -96,22 +96,11 @@ and how a randomized approach solves the issue.
 
 This story shows an imbalance cycle after a failure domain fails or gets
 upgraded.
 
-1. Assume a ReplicaSet has 3N pods evenly distributed across 3 failure domains,
+1. Assume a ReplicaSet has 2N pods evenly distributed across 2 failure domains,
    thus each has N pods.
-2. A failure or an upgrade happens in one of the domains. The N pods from this
-   domain get re-scheduled into the other 2 domains. Note that this N pods are
-   now the youngest.
-3. The domain recovers or finishes upgrading.
-4. ReplicaSet is downscaled to 2N, due to user action or HPA recommendation.
-   Given the downscaling algorithm, 2 domains end up with N nodes each, the 2N
-   Pods that were never restarted, and the remaining domain has 0 Pods.
-   There is nothing to be done here. A random approach would obtain the same
-   result.
-5. The ReplicaSet is upscaled to 3N again, due to user action or HPA
-   recommendation. Due to Pod spreading during scheduling, each domain has N
-   Pods. Balance is recovered. However, one failure domain holds the youngest
-   Pods.
-6. ReplicaSet is downscaled to 2N again. Due to the downscaling preference, all
+2. An upgrade happens, adding a new available domain, and the ReplicaSet is upscaled
+   to 3N. The new domain now holds all the youngest pods due to scheduler spreading.
+3. ReplicaSet is downscaled to 2N again. Due to the downscaling preference, all
    the Pods from one domain are removed, leading to imbalance.
 
 The situation doesn't improve with repeated upscale and downscale steps.
 Instead, a randomized approach leaves about 2/3*N nodes in each
@@ -155,14 +144,16 @@ there are a number of reasons why we don't need to preserve such behavior as is:
 We propose a randomized approach to the algorithm for Pod victim selection
 during ReplicaSet downscale:
 
-1. Do a random shuffle of ReplicaSet Pods.
+1. Sort ReplicaSet pods by pod UUID.
 2. Obtain wall time, and add it to
    [`ActivePodsWithRanks`](https://github.com/kubernetes/kubernetes/blob/dc39ab2417bfddcec37be4011131c59921fdbe98/pkg/controller/controller_utils.go#L815)
 3. Call sorting algorithm with a modified time comparison for start and
    creation timestamp.
+
 Instead of directly comparing timestamps, the algorithm compares the elapsed
-times since the timestamp until the current time but in a logarithmic scale,
-floor rounded. This has the effect of treating elapsed times as equals when they
+times from the creation and ready timestamps to the current time, on a
+logarithmic scale, floor rounded. These serve as the sorting criteria.
+This has the effect of treating elapsed times as equal when they
 have the same scale. That is, Pods that have been running for a few nanoseconds
 are equal, but they are different from pods that have been running for a few
 seconds or a few days.
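
The floor-rounded logarithmic comparison described in the patch lends itself to a small illustration. Below is a minimal, self-contained Go sketch; it is not the patched controller code. The helper names (`logAge`, `compareByLogAge`), the use of the natural logarithm over seconds, and the plain `time.Time` inputs are assumptions made for the example; the actual proposal operates on the ranked Pod list referenced via `ActivePodsWithRanks`.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// logAge buckets the elapsed time since t (relative to now) by the floor of
// its natural logarithm in seconds. Elapsed times of the same order of
// magnitude land in the same bucket and therefore compare as equal.
func logAge(t, now time.Time) int {
	elapsed := now.Sub(t).Seconds()
	if elapsed < 1 {
		return 0 // treat anything under a second as the same bucket
	}
	return int(math.Floor(math.Log(elapsed)))
}

// compareByLogAge returns -1 if the pod created at a should sort as older than
// the pod created at b, 1 if younger, and 0 on a tie. A tie means the two
// elapsed times have the same scale, so a different criterion (for example,
// the pod UUID ordering from step 1 of the proposal) would break it.
func compareByLogAge(a, b, now time.Time) int {
	ba, bb := logAge(a, now), logAge(b, now)
	switch {
	case ba > bb:
		return -1
	case ba < bb:
		return 1
	default:
		return 0
	}
}

func main() {
	now := time.Now()
	thirtySec := now.Add(-30 * time.Second) // running for a few seconds
	fortySec := now.Add(-40 * time.Second)  // same order of magnitude
	threeDays := now.Add(-72 * time.Hour)   // running for a few days

	fmt.Println(compareByLogAge(thirtySec, fortySec, now)) // 0: treated as equal
	fmt.Println(compareByLogAge(threeDays, thirtySec, now)) // -1: days vs seconds
}
```

The choice of logarithm base in the sketch only changes how wide the buckets are; any base preserves the property the patch describes, namely that Pods running for a few nanoseconds compare as equal to each other but different from Pods running for a few seconds or a few days.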