
Commit

(squash) feedback
damemi committed Jan 14, 2021
1 parent db57fbd commit 074f9c7
Showing 1 changed file with 11 additions and 20 deletions.
@@ -9,7 +9,6 @@
 - [Proposal](#proposal)
   - [User Stories (Optional)](#user-stories-optional)
     - [Story 1](#story-1)
-    - [Story 2](#story-2)
   - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
@@ -27,7 +26,8 @@
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
-- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
+  - [Make downscale heuristic an option](#make-downscale-heuristic-an-option)
+  - [Compare pods using their distribution in the failure domains](#compare-pods-using-their-distribution-in-the-failure-domains)
 <!-- /toc -->

 ## Release Signoff Checklist
@@ -96,22 +96,11 @@ and how a randomized approach solves the issue.
 This story shows an imbalance cycle after a failure domain fails or gets
 upgraded.

-1. Assume a ReplicaSet has 3N pods evenly distributed across 3 failure domains,
+1. Assume a ReplicaSet has 2N pods evenly distributed across 2 failure domains,
    thus each has N pods.
-2. A failure or an upgrade happens in one of the domains. The N pods from this
-   domain get re-scheduled into the other 2 domains. Note that this N pods are
-   now the youngest.
-3. The domain recovers or finishes upgrading.
-4. ReplicaSet is downscaled to 2N, due to user action or HPA recommendation.
-   Given the downscaling algorithm, 2 domains end up with N nodes each, the 2N
-   Pods that were never restarted, and the remaining domain has 0 Pods.
-   There is nothing to be done here. A random approach would obtain the same
-   result.
-5. The ReplicaSet is upscaled to 3N again, due to user action or HPA
-   recommendation. Due to Pod spreading during scheduling, each domain has N
-   Pods. Balance is recovered. However, one failure domain holds the youngest
-   Pods.
-6. ReplicaSet is downscaled to 2N again. Due to the downscaling preference, all
+2. An upgrade happens adding a new available domain and the ReplicaSet is upscaled
+   to 3N. The new domain now holds all the youngest pods due to scheduler spreading.
+3. ReplicaSet is downscaled to 2N again. Due to the downscaling preference, all
    the Pods from one domain are removed, leading to imbalance. The situation
    doesn't improve with repeated upscale and downscale steps. Instead, a
    randomized approach leaves about 2/3*N nodes in each
@@ -155,14 +144,16 @@ there are a number of reasons why we don't need to preserve such behavior as is:
 We propose a randomized approach to the algorithm for Pod victim selection
 during ReplicaSet downscale:

-1. Do a random shuffle of ReplicaSet Pods.
+1. Sort ReplicaSet pods by pod UUID.
+2. Obtain wall time, and add it to [`ActivePodsWithRanks`](https://github.com/kubernetes/kubernetes/blob/dc39ab2417bfddcec37be4011131c59921fdbe98/pkg/controller/controller_utils.go#L815)
 2. Call sorting algorithm with a modified time comparison for start and
    creation timestamp.


 Instead of directly comparing timestamps, the algorithm compares the elapsed
-times since the timestamp until the current time but in a logarithmic scale,
-floor rounded. This has the effect of treating elapsed times as equals when they
+times since the creation and ready timestamps until the current time but in a
+logarithmic scale, floor rounded. These serve as sorting criteria.
+This has the effect of treating elapsed times as equals when they
 have the same scale. That is, Pods that have been running for a few nanoseconds
 are equal, but they are different from pods that have been running for a few
 seconds or a few days.
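
For illustration, the comparison described above can be sketched as follows. This is a minimal, self-contained Go example of the general idea, not the actual controller change: the `pod` struct, the `elapsedRank` helper, and the base-2 bucketing are assumptions made here for clarity (the real change applies to `ActivePodsWithRanks` in controller_utils.go, linked above). The key point is that a floor-rounded logarithmic rank makes Pods of a similar age compare as equal, so the earlier sort by UUID decides among them effectively at random.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// pod is a stand-in for the controller's pod records; the real code ranks
// *v1.Pod objects via ActivePodsWithRanks.
type pod struct {
	uid     string
	created time.Time
}

// elapsedRank buckets the time elapsed since ts on a floor-rounded
// logarithmic (here base-2) scale. Durations that fall into the same
// power-of-two bucket get the same rank and therefore compare as equal,
// while durations that differ by orders of magnitude get clearly
// different ranks.
func elapsedRank(ts, now time.Time) int {
	elapsed := now.Sub(ts)
	if elapsed <= 0 {
		return 0
	}
	return int(math.Floor(math.Log2(float64(elapsed))))
}

func main() {
	now := time.Now()
	pods := []pod{
		{uid: "c3d9a1f0", created: now.Add(-6 * time.Second)}, // a few seconds old
		{uid: "18af44b2", created: now.Add(-5 * time.Second)}, // also a few seconds old
		{uid: "9b71e6c7", created: now.Add(-96 * time.Hour)},  // days old
	}

	// Step 1: order by UID so that pods which later tie on rank are left in
	// an effectively random relative order.
	sort.Slice(pods, func(i, j int) bool { return pods[i].uid < pods[j].uid })

	// Steps 2-3: stable-sort by the bucketed age, youngest (smallest rank)
	// first, so the front of the slice would be preferred for deletion on
	// downscale. The two "few seconds old" pods share a bucket, so their
	// UID order from step 1 is preserved.
	sort.SliceStable(pods, func(i, j int) bool {
		return elapsedRank(pods[i].created, now) < elapsedRank(pods[j].created, now)
	})

	for _, p := range pods {
		fmt.Printf("uid=%s rank=%d\n", p.uid, elapsedRank(p.created, now))
	}
}
```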
