generated from kubernetes/kubernetes-template-project
Commit
KEP: Add an exponential backoff mechanism to the requeueing strategy (#1608)

* Add an exponential backoff mechanism to the requeueing strategy
  Signed-off-by: tenzen-y <[email protected]>
* Rephrase 'maxBackOffRetry' with 'backOffLimit'
  Signed-off-by: tenzen-y <[email protected]>
  Signed-off-by: Yuki Iwai <[email protected]>
* Improve expressions
  Signed-off-by: Yuki Iwai <[email protected]>
* Move backOffLimitTimeout to an alternative section
  Signed-off-by: Yuki Iwai <[email protected]>
* Replace backOff with backoff
  Signed-off-by: Yuki Iwai <[email protected]>
* Additional eviction reasons to story 2
  Signed-off-by: Yuki Iwai <[email protected]>
* Update an API comment for backoffLimitCount
  Signed-off-by: Yuki Iwai <[email protected]>
* Update story3
  Signed-off-by: Yuki Iwai <[email protected]>
* Move backoffTimeout to an alternative section
  Signed-off-by: Yuki Iwai <[email protected]>
* Update workload API
  Signed-off-by: Yuki Iwai <[email protected]>
* Rephrase strory2
  Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: tenzen-y <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
Showing 2 changed files with 167 additions and 17 deletions.
@@ -8,12 +8,18 @@ | |
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [API Changes](#api-changes)
    - [KueueConfig](#kueueconfig)
    - [Workload](#workload)
  - [Changes to Queue Sorting](#changes-to-queue-sorting)
    - [Existing Sorting](#existing-sorting)
    - [Proposed Sorting](#proposed-sorting)
  - [Exponential Backoff Mechanism](#exponential-backoff-mechanism)
    - [Evaluation](#evaluation)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
|
@@ -22,6 +28,10 @@ | |
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Create "FrontOfQueue" and "BackOfQueue"](#create-frontofqueue-and-backofqueue)
  - [Configure at the ClusterQueue level](#configure-at-the-clusterqueue-level)
  - [Make knob to be possible to set timeout until the workload is deactivated](#make-knob-to-be-possible-to-set-timeout-until-the-workload-is-deactivated)
    - [Evaluation](#evaluation-1)
<!-- /toc -->
|
||
## Summary | ||
|
@@ -56,6 +66,33 @@ Consider the following scenario: | |
|
||
In this case, the administrator would like the evicted workload to be requeued as soon as possible on the newly available capacity. | ||
|
||
#### Story 2 | ||
|
||
In the story 1 scenario, when we set `waitForPodsReady.requeuingStrategy.timestamp=Creation`,
the workload can be put back at the front of the queue repeatedly, or even endlessly, after eviction for the following reasons:

1. The workload doesn't have the proper configuration, such as image pull credentials or a PVC name.
2. The cluster can meet flavorQuotas, but no single node has the resources that each podSet requests.
3. If there are multiple resource flavors that match the workload (for example, flavors 1 & 2)
   and the workload was running on flavor 2, it's likely that the workload will be readmitted
   on the same flavor indefinitely.
|
||
Specifically, the second reason will often occur if the available quota is fragmented across multiple nodes,
such that the workload can't be scheduled on any single node even though there is enough quota in the cluster.
|
||
For example, suppose a workload requesting 2 GPUs is submitted to a cluster that
has 2 worker nodes with 4 GPUs each, and 3 GPUs are in use on each node (which means 1 GPU is free on each node).
The workload will be repeatedly evicted because no single node has enough free resources, even though the cluster as a whole has enough capacity.
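
The fragmentation case can be reproduced with a few lines of Go. This is only an illustrative sketch: the `node` type and the `fitsQuota` and `fitsOnSomeNode` helpers are hypothetical names, not Kueue or kube-scheduler APIs, and the sketch assumes a single Pod that requests 2 GPUs.

```go
package main

import "fmt"

// node is a hypothetical, simplified view of a worker node's GPU capacity.
type node struct {
	capacity int
	used     int
}

// free returns the GPUs that are still unallocated on the node.
func (n node) free() int { return n.capacity - n.used }

// fitsQuota reports whether the cluster-wide free GPUs cover the request,
// which is roughly what a quota-level check sees.
func fitsQuota(nodes []node, request int) bool {
	total := 0
	for _, n := range nodes {
		total += n.free()
	}
	return total >= request
}

// fitsOnSomeNode reports whether any single node can host the whole request,
// which is what scheduling the Pod actually requires.
func fitsOnSomeNode(nodes []node, request int) bool {
	for _, n := range nodes {
		if n.free() >= request {
			return true
		}
	}
	return false
}

func main() {
	// Two nodes with 4 GPUs each, 3 GPUs in use on each node.
	nodes := []node{{capacity: 4, used: 3}, {capacity: 4, used: 3}}
	fmt.Println("quota satisfied:", fitsQuota(nodes, 2))
	fmt.Println("schedulable on one node:", fitsOnSomeNode(nodes, 2))
}
```

The quota-level check passes while the per-node check fails, which is the gap that causes the admission and eviction cycle described above.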
|
||
In this case, to avoid rapid repetition of the admission and eviction cycle, | ||
the administrator would like to use an exponential backoff mechanism and add a maximum number of retries. | ||
|
||
#### Story 3 | ||
|
||
In the story 2 scenario, after the evicted workload reaches the maximum number of retries
and is no longer requeued with backoff, we want to easily requeue the workload without recreating the job.
This is possible if the Workload is deactivated (`.spec.active`=`false`) instead of being deleted.
|
||
### Risks and Mitigations | ||
|
||
<!-- | ||
|
@@ -75,25 +112,77 @@ Consider including folks who also work outside the SIG or subproject. | |
|
||
### API Changes | ||
|
||
Add an additional field to the Kueue ConfigMap to allow administrators to specify what timestamp to consider during queue sorting (under the pre-existing waitForPodsReady block). | ||
#### KueueConfig | ||
|
||
Add fields to the KueueConfig to allow administrators to specify what timestamp to consider during queue sorting (under the pre-existing waitForPodsReady block). | ||
|
||
Possible settings: | ||
|
||
* `Eviction` (Back of queue) | ||
* `Creation` (Front of queue) | ||
|
||
```yaml
kind: ConfigMap
metadata:
  name: kueue-manager-config
  namespace: kueue-system
data:
  controller_manager_config.yaml: |
    apiVersion: config.kueue.x-k8s.io/v1beta1
    kind: Configuration
    # ...
    waitForPodsReady:
      requeuingTimestamp: Creation | Eviction # <-- New field
```
```go
type WaitForPodsReady struct {
	// ...
	// requeuingStrategy defines the strategy for requeuing a Workload
	// +optional
	RequeuingStrategy *RequeuingStrategy `json:"requeuingStrategy,omitempty"`
}

type RequeuingStrategy struct {
	// timestamp defines the timestamp used for requeuing a Workload
	// that was evicted due to Pod readiness. Defaults to Eviction.
	// +optional
	Timestamp *RequeuingTimestamp `json:"timestamp,omitempty"`

	// backoffLimitCount defines the maximum number of requeuing retries.
	// When the number is reached, the workload is deactivated (`.spec.active`=`false`).
	//
	// Defaults to null.
	// +optional
	BackOffLimitCount *int32 `json:"backoffLimitCount,omitempty"`
}

type RequeuingTimestamp string

const (
	// creationTimestamp timestamp (from Workload .metadata.creationTimestamp).
	CreationTimestamp RequeuingTimestamp = "Creation"

	// evictionTimestamp timestamp (from Workload .status.conditions).
	EvictionTimestamp RequeuingTimestamp = "Eviction"
)
```
|
||
#### Workload | ||
|
||
Add a new field, `requeueState`, to the Workload status to record how many times a workload has been requeued (`count`) and when it is requeued (`requeueAt`).
|
||
```go
type WorkloadStatus struct {
	// ...
	// requeueState holds the state of the requeued Workload according to the requeueing strategy.
	//
	// +optional
	RequeueState *RequeueState `json:"requeueState,omitempty"`
}

type RequeueState struct {
	// count records the number of times a workload has been requeued.
	// When a deactivated (`.spec.active`=`false`) workload is reactivated (`.spec.active`=`true`),
	// this count would be reset to 0.
	//
	// +optional
	Count *int32 `json:"count,omitempty"`

	// requeueAt records the time when a workload is requeued.
	// When a deactivated (`.spec.active`=`false`) workload is reactivated (`.spec.active`=`true`),
	// this time would be reset to null.
	//
	// +optional
	RequeueAt *metav1.Time `json:"requeueAt,omitempty"`
}
```
|
||
### Changes to Queue Sorting | ||
|
@@ -104,10 +193,28 @@ Currently, workloads within a ClusterQueue are sorted based on 1. Priority and 2 | |
|
||
#### Proposed Sorting | ||
|
||
The `pkg/workload` package could be modified to include a conditional (`if evictionReason == kueue.WorkloadEvictedByPodsReadyTimeout`) that controls which timestamp to return based on the configured ordering strategy. The same sorting logic would also be used when sorting the heads of queues. | ||
The `pkg/workload` package could be modified to include a conditional (`if evictionReason == kueue.WorkloadEvictedByPodsReadyTimeout`) | ||
that controls which timestamp to return based on the configured ordering strategy. | ||
The same sorting logic would also be used when sorting the heads of queues. | ||
|
||
Update the `apis/config/<version>` package to include `Creation` and `Eviction` constants. | ||
|
||
### Exponential Backoff Mechanism | ||
|
||
When the kueueConfig `backoffLimitCount` is set and there are workloads evicted by waitForPodsReady,
the queueManager holds the evicted workloads with an exponential backoff.
During this time, other workloads will have a chance to be admitted.
|
||
The queueManager calculates the exponential backoff duration using [the Step function](https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff.Step).
|
||
#### Evaluation | ||
|
||
When a workload eviction is issued with the `PodsReadyTimeout` condition,
the workload controller increments the workload's `.status.requeueState.count` by 1 each time.

After that, when the workload's `.status.requeueState.count` reaches the kueueConfig `.waitForPodsReady.requeuingStrategy.backoffLimitCount`,
the workload is deactivated by setting `.spec.active` to `false` instead of being suspended in the jobframework reconciler.
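
The counting and deactivation rule can be sketched as follows. The `workload` and `requeueState` types are simplified, hypothetical stand-ins for the real API objects, and `onPodsReadyTimeoutEviction` and `reactivate` are illustrative helper names rather than Kueue functions.

```go
package main

import "fmt"

// requeueState is a simplified copy of the RequeueState proposed above.
type requeueState struct {
	count int32
}

// workload is a hypothetical stand-in holding only the fields used here.
type workload struct {
	active  bool          // corresponds to .spec.active
	requeue *requeueState // corresponds to .status.requeueState
}

// onPodsReadyTimeoutEviction increments the requeue count and deactivates the
// workload once the count reaches backoffLimitCount. A nil limit means that
// the workload is requeued without any upper bound.
func onPodsReadyTimeoutEviction(w *workload, backoffLimitCount *int32) {
	if w.requeue == nil {
		w.requeue = &requeueState{}
	}
	w.requeue.count++
	if backoffLimitCount != nil && w.requeue.count >= *backoffLimitCount {
		w.active = false
	}
}

// reactivate models a user setting .spec.active back to true,
// which resets the requeue count to 0.
func reactivate(w *workload) {
	w.active = true
	w.requeue = &requeueState{}
}

func main() {
	limit := int32(2)
	w := &workload{active: true}

	onPodsReadyTimeoutEviction(w, &limit)
	fmt.Println(w.active, w.requeue.count)

	onPodsReadyTimeoutEviction(w, &limit)
	fmt.Println(w.active, w.requeue.count)

	reactivate(w)
	fmt.Println(w.active, w.requeue.count)
}
```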
|
||
### Test Plan | ||
|
||
[X] I/we understand the owners of the involved components may require updates to | ||
|
@@ -152,6 +259,8 @@ milestones with these graduation criteria: | |
|
||
## Implementation History | ||
|
||
- Jan 18th: Implemented the requeuing strategy for workloads evicted due to pods-ready (story 1) [#1311](https://github.com/kubernetes-sigs/kueue/pulls/1311)
|
||
<!-- | ||
Major milestones in the lifecycle of a KEP should be tracked in this section. | ||
Major milestones might include: | ||
|
@@ -165,10 +274,50 @@ Major milestones might include: | |
|
||
## Drawbacks | ||
|
||
* When used with `StrictFIFO`, the `requeuingTimestamp: Creation` (front of queue) policy could lead to a blocked queue. This was called out in the issue that set the hardcoded [back-of-queue behavior](https://github.com/kubernetes-sigs/kueue/issues/599). This could be mitigated by recommending administrators select `BestEffortFIFO` when using this setting. | ||
* When used with `StrictFIFO`, the `requeuingStrategy.timestamp: Creation` (front of queue) policy could lead to a blocked queue. This was called out in the issue that set the hardcoded [back-of-queue behavior](https://github.com/kubernetes-sigs/kueue/issues/599). | ||
This could be mitigated by recommending administrators select `BestEffortFIFO` when using this setting. | ||
* Pods that never become ready due to invalid images will constantly be requeued to the front of the queue when the creation timestamp is used. [See Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/122300). | ||
|
||
## Alternatives | ||
|
||
* The same concepts could be exposed to users based on `FrontOfQueue` or `BackOfQueue` settings instead of `Creation` and `Eviction` timestamps. These terms would imply that the workload would be prioritized over higher priority workloads in the queue. This is probably not desired (would likely lead to rapid preemption upon admission when preemption based on priority is enabled). | ||
* These concepts could be configured in the ClusterQueue resource. This alternative would increase flexibility. Without a clear need for this level of granularity, it might be better to set these options at the controller level where `waitForPodsReady` settings already exist. Furthermore, configuring these settings at the ClusterQueue level introduces the question of what timestamp to use when sorting the heads of all ClusterQueues. | ||
### Create "FrontOfQueue" and "BackOfQueue" | ||
|
||
The same concepts could be exposed to users based on `FrontOfQueue` or `BackOfQueue` settings instead of `Creation` and `Eviction` timestamps. | ||
These terms would imply that the workload would be prioritized over higher priority workloads in the queue. | ||
This is probably not desired (would likely lead to rapid preemption upon admission when preemption based on priority is enabled). | ||
|
||
### Configure at the ClusterQueue level | ||
|
||
These concepts could be configured in the ClusterQueue resource. This alternative would increase flexibility. | ||
Without a clear need for this level of granularity, it might be better to set these options at the controller level where `waitForPodsReady` settings already exist. | ||
Furthermore, configuring these settings at the ClusterQueue level introduces the question of what timestamp to use when sorting the heads of all ClusterQueues. | ||
|
||
### Make knob to be possible to set timeout until the workload is deactivated | ||
|
||
With the `backoffLimitCount` knob alone, it is difficult to estimate how many hours jobs will actually be retried (requeued).
So, it might be useful to add a knob that sets a timeout after which the workload is deactivated.
For the first iteration, we don't add this knob since `backoffLimitCount` alone is enough for the current stories.
|
||
```go
type RequeuingStrategy struct {
	// ...
	// backoffLimitTimeout defines the time for a workload that
	// has once been admitted to reach the PodsReady=true condition.
	// When the time is reached, the workload is deactivated.
	//
	// Defaults to null.
	// +optional
	BackOffLimitTimeout *int32 `json:"backoffLimitTimeout,omitempty"`
}
```
|
||
#### Evaluation | ||
|
||
When a workload's duration $currentTime - queueOrderingTimestamp$ reaches the kueueConfig `waitForPodsReady.requeuingStrategy.backoffLimitTimeout`,
the workload controller and the queueManager set `.spec.active` to `false`.
After that, the jobframework reconciler deactivates the workload.

Before the jobframework reconciler deactivates a workload,
the workload controller sets `.spec.active` to `false` after the workload reconciler checks whether the workload is finished.
In addition, when the kueue scheduler gets headWorkloads from clusterQueues,
the queueManager sets `.spec.active` to `false` for any workloads it finds exceeding `backoffLimitTimeout`.