KEP: Add an exponential backoff mechanism to the requeueing strategy (#1608)

* Add an exponential backoff mechanism to the requeueing strategy

Signed-off-by: tenzen-y <[email protected]>

* Rephrase 'maxBackOffRetry' with 'backOffLimit'

Signed-off-by: tenzen-y <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>

* Improve expressions

Signed-off-by: Yuki Iwai <[email protected]>

* Move backOffLimitTimeout to an alternative section

Signed-off-by: Yuki Iwai <[email protected]>

* Replace backOff with backoff

Signed-off-by: Yuki Iwai <[email protected]>

* Additional eviction reasons to story 2

Signed-off-by: Yuki Iwai <[email protected]>

* Update an API comment for backoffLimitCount

Signed-off-by: Yuki Iwai <[email protected]>

* Update story3

Signed-off-by: Yuki Iwai <[email protected]>

* Move backoffTimeout to an alternative section

Signed-off-by: Yuki Iwai <[email protected]>

* Update workload API

Signed-off-by: Yuki Iwai <[email protected]>

* Rephrase story 2

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: tenzen-y <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
tenzen-y authored Feb 6, 2024
1 parent 1e1fe6a commit 8cf1893
Showing 2 changed files with 167 additions and 17 deletions.
183 changes: 166 additions & 17 deletions keps/1282-pods-ready-requeue-strategy/README.md
@@ -8,12 +8,18 @@
- [Proposal](#proposal)
- [User Stories (Optional)](#user-stories-optional)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Story 3](#story-3)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [API Changes](#api-changes)
- [KueueConfig](#kueueconfig)
- [Workload](#workload)
- [Changes to Queue Sorting](#changes-to-queue-sorting)
- [Existing Sorting](#existing-sorting)
- [Proposed Sorting](#proposed-sorting)
- [Exponential Backoff Mechanism](#exponential-backoff-mechanism)
- [Evaluation](#evaluation)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
- [Unit Tests](#unit-tests)
@@ -22,6 +28,10 @@
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Create "FrontOfQueue" and "BackOfQueue"](#create-frontofqueue-and-backofqueue)
- [Configure at the ClusterQueue level](#configure-at-the-clusterqueue-level)
- [Add a knob to set a timeout until the workload is deactivated](#add-a-knob-to-set-a-timeout-until-the-workload-is-deactivated)
- [Evaluation](#evaluation-1)
<!-- /toc -->

## Summary
@@ -56,6 +66,33 @@ Consider the following scenario:

In this case, the administrator would like the evicted workload to be requeued as soon as possible on the newly available capacity.

#### Story 2

In the Story 1 scenario, when we set `waitForPodsReady.requeuingStrategy.timestamp=Creation`,
the workload can repeatedly, or even endlessly, be put at the front of the queue after eviction for the following reasons:

1. The workload doesn't have a proper configuration, such as valid image pull credentials or an existing PVC name.
2. The cluster can satisfy the flavorQuotas, but no single node has the resources that each podSet requests.
3. If there are multiple resource flavors that match the workload (for example, flavors 1 & 2)
and the workload was running on flavor 2, it's likely that the workload will be readmitted
on the same flavor indefinitely.

Specifically, the second reason often occurs when the available quota is fragmented across multiple nodes,
such that the workload can't be scheduled on any single node even though there is enough quota in the cluster.

For example, given a workload requesting 2 GPUs submitted to a cluster with 2 worker nodes of 4 GPUs each,
where 3 GPUs are in use on each node (leaving 1 GPU free per node),
the workload will be repeatedly evicted for lack of resources on each node even though the cluster as a whole has enough capacity.

In this case, to avoid rapid repetition of the admission and eviction cycle,
the administrator would like to use an exponential backoff mechanism and set a maximum number of retries.

#### Story 3

In the Story 2 scenario, after the evicted workload reaches the retry limit and is no longer requeued,
we want to easily requeue the workload without recreating the job.
This is possible if the Workload is deactivated (`.spec.active`=`false`) as opposed to deleted.

### Risks and Mitigations

<!--
@@ -75,25 +112,77 @@

### API Changes

#### KueueConfig

Add fields to the KueueConfig to allow administrators to specify which timestamp to consider during queue sorting (under the pre-existing `waitForPodsReady` block).

Possible settings:

* `Eviction` (Back of queue)
* `Creation` (Front of queue)

```yaml
kind: ConfigMap
metadata:
  name: kueue-manager-config
  namespace: kueue-system
data:
  controller_manager_config.yaml: |
    apiVersion: config.kueue.x-k8s.io/v1beta1
    kind: Configuration
    # ...
    waitForPodsReady:
      requeuingStrategy:               # <-- New field
        timestamp: Creation | Eviction
        backoffLimitCount: 5           # example value
```

The corresponding additions to the Configuration Go types:

```go
type WaitForPodsReady struct {
	...
	// requeuingStrategy defines the strategy for requeuing a Workload.
	// +optional
	RequeuingStrategy *RequeuingStrategy `json:"requeuingStrategy,omitempty"`
}

type RequeuingStrategy struct {
	// timestamp defines the timestamp used for requeuing a Workload
	// that was evicted due to Pod readiness. Defaults to Eviction.
	// +optional
	Timestamp *RequeuingTimestamp `json:"timestamp,omitempty"`

	// backoffLimitCount defines the maximum number of requeuing retries.
	// When the number is reached, the workload is deactivated (`.spec.active`=`false`).
	//
	// Defaults to null.
	// +optional
	BackoffLimitCount *int32 `json:"backoffLimitCount,omitempty"`
}

type RequeuingTimestamp string

const (
	// CreationTimestamp is the Workload's creation timestamp (from .metadata.creationTimestamp).
	CreationTimestamp RequeuingTimestamp = "Creation"

	// EvictionTimestamp is the Workload's eviction timestamp (from the Evicted condition in .status.conditions).
	EvictionTimestamp RequeuingTimestamp = "Eviction"
)
```

#### Workload

Add a new field, `requeueState`, to the Workload status to record the requeuing state, including the number of times the workload has been requeued.

```go
type WorkloadStatus struct {
	...
	// requeueState holds the state of the requeued Workload according to the requeuing strategy.
	//
	// +optional
	RequeueState *RequeueState `json:"requeueState,omitempty"`
}

type RequeueState struct {
	// count records the number of times a workload has been requeued.
	// When a deactivated (`.spec.active`=`false`) workload is reactivated (`.spec.active`=`true`),
	// this count is reset to 0.
	//
	// +optional
	Count *int32 `json:"count,omitempty"`

	// requeueAt records the time when the workload will be requeued.
	// When a deactivated (`.spec.active`=`false`) workload is reactivated (`.spec.active`=`true`),
	// this time is reset to null.
	//
	// +optional
	RequeueAt *metav1.Time `json:"requeueAt,omitempty"`
}
```
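
The reset semantics in the comments above could look roughly like this in the Workload controller (a sketch; the helper name and reconciler wiring are illustrative, and `ptr` is `k8s.io/utils/ptr`):

```go
import (
	"k8s.io/utils/ptr"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// resetRequeueState sketches resetting the requeue bookkeeping once a
// deactivated Workload is reactivated (`.spec.active` back to true).
func resetRequeueState(wl *kueue.Workload) {
	if ptr.Deref(wl.Spec.Active, true) && wl.Status.RequeueState != nil {
		wl.Status.RequeueState.Count = ptr.To[int32](0) // count reset to 0
		wl.Status.RequeueState.RequeueAt = nil          // time reset to null
	}
}
```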

### Changes to Queue Sorting
@@ -104,10 +193,28 @@

#### Existing Sorting

Currently, workloads within a ClusterQueue are sorted based on 1. Priority and 2. Timestamp of creation.

#### Proposed Sorting

The `pkg/workload` package could be modified to include a conditional (`if evictionReason == kueue.WorkloadEvictedByPodsReadyTimeout`) that controls which timestamp to return based on the configured ordering strategy. The same sorting logic would also be used when sorting the heads of queues.
The `pkg/workload` package could be modified to include a conditional (`if evictionReason == kueue.WorkloadEvictedByPodsReadyTimeout`)
that controls which timestamp to return based on the configured ordering strategy.
The same sorting logic would also be used when sorting the heads of queues.

Update the `apis/config/<version>` package to include `Creation` and `Eviction` constants.
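
As a rough sketch of that conditional (the helper name and exact wiring are illustrative, not the actual Kueue code; import paths assume the in-tree API packages):

```go
import (
	"time"

	apimeta "k8s.io/apimachinery/pkg/api/meta"
	config "sigs.k8s.io/kueue/apis/config/v1beta1"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// queueOrderingTimestamp returns the eviction time only when the configured
// strategy asks for it and the workload was evicted by the PodsReady timeout;
// otherwise it falls back to the creation timestamp.
func queueOrderingTimestamp(wl *kueue.Workload, strategy config.RequeuingTimestamp) time.Time {
	if strategy == config.EvictionTimestamp {
		cond := apimeta.FindStatusCondition(wl.Status.Conditions, kueue.WorkloadEvicted)
		if cond != nil && cond.Reason == kueue.WorkloadEvictedByPodsReadyTimeout {
			return cond.LastTransitionTime.Time
		}
	}
	return wl.CreationTimestamp.Time
}
```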

### Exponential Backoff Mechanism

When the KueueConfig `backoffLimitCount` is set and workloads are evicted by waitForPodsReady,
the queueManager holds the evicted workloads back with an exponential backoff.
During this time, other workloads have a chance to be admitted.

The queueManager calculates the exponential backoff duration using [the Step function](https://pkg.go.dev/k8s.io/apimachinery/pkg/util/[email protected]#Backoff.Step).
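
For illustration only (the base delay and factor below are assumptions, not settled defaults), the delay for a workload already requeued `count` times could be derived like this:

```go
import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// requeueDelay sketches how the queueManager might derive the backoff delay.
// Each Step() call consumes one step and returns the next delay, so the
// result grows as Duration * Factor^count.
func requeueDelay(count int32) time.Duration {
	backoff := wait.Backoff{
		Duration: 10 * time.Second, // assumed base delay
		Factor:   2,                // assumed growth factor
		Steps:    int(count) + 1,
	}
	var delay time.Duration
	for i := int32(0); i <= count; i++ {
		delay = backoff.Step()
	}
	return delay
}
```

The queueManager would then record `now + delay` in `.status.requeueState.requeueAt` and hold the workload back until that time.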

#### Evaluation

When a workload eviction is issued with the `PodsReadyTimeout` condition reason,
the workload controller increments the workload's `.status.requeueState.count` by 1.

After that, once the workload's `.status.requeueState.count` reaches the KueueConfig `.waitForPodsReady.requeuingStrategy.backoffLimitCount`,
the jobframework reconciler deactivates the workload by setting `.spec.active` to `false` instead of suspending it.
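
A minimal sketch of that limit check (hypothetical helper; the actual placement in the jobframework reconciler may differ):

```go
import (
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// reachedBackoffLimit reports whether a workload has used up its requeuing
// retries and should be deactivated instead of requeued again.
func reachedBackoffLimit(wl *kueue.Workload, backoffLimitCount *int32) bool {
	if backoffLimitCount == nil || wl.Status.RequeueState == nil || wl.Status.RequeueState.Count == nil {
		return false // no limit configured, or nothing recorded yet
	}
	return *wl.Status.RequeueState.Count >= *backoffLimitCount
}
```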

### Test Plan

[X] I/we understand the owners of the involved components may require updates to
Expand Down Expand Up @@ -152,6 +259,8 @@ milestones with these graduation criteria:

## Implementation History

- Jan 18th, 2024: Implemented the requeuing strategy for workloads evicted due to pods-ready (Story 1) [#1311](https://github.com/kubernetes-sigs/kueue/pulls/1311)

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
@@ -165,10 +274,50 @@

## Drawbacks

* When used with `StrictFIFO`, the `requeuingStrategy.timestamp: Creation` (front of queue) policy could lead to a blocked queue. This was called out in the issue that set the hardcoded [back-of-queue behavior](https://github.com/kubernetes-sigs/kueue/issues/599).
This could be mitigated by recommending administrators select `BestEffortFIFO` when using this setting.
* Pods that never become ready due to invalid images will constantly be requeued to the front of the queue when the creation timestamp is used. [See Kubernetes issue](https://github.com/kubernetes/kubernetes/issues/122300).

## Alternatives

### Create "FrontOfQueue" and "BackOfQueue"

The same concepts could be exposed to users based on `FrontOfQueue` or `BackOfQueue` settings instead of `Creation` and `Eviction` timestamps.
These terms would imply that the workload would be prioritized over higher priority workloads in the queue.
This is probably not desired (would likely lead to rapid preemption upon admission when preemption based on priority is enabled).

### Configure at the ClusterQueue level

These concepts could be configured in the ClusterQueue resource. This alternative would increase flexibility.
Without a clear need for this level of granularity, it might be better to set these options at the controller level where `waitForPodsReady` settings already exist.
Furthermore, configuring these settings at the ClusterQueue level introduces the question of what timestamp to use when sorting the heads of all ClusterQueues.

### Add a knob to set a timeout until the workload is deactivated

With the `backoffLimitCount` knob alone, it is difficult to estimate how long jobs will actually be retried (requeued).
So, it might be useful to add a knob that sets a timeout until the workload is deactivated.
For the first iteration, we don't add this knob since `backoffLimitCount` alone is enough for the current stories.

```go
type RequeuingStrategy struct {
	...
	// backoffLimitTimeout defines the time for a workload that
	// has once been admitted to reach the PodsReady=true condition.
	// When the timeout is reached, the workload is deactivated.
	//
	// Defaults to null.
	// +optional
	BackoffLimitTimeout *int32 `json:"backoffLimitTimeout,omitempty"`
}
```

#### Evaluation

When a workload's elapsed time, $currentTime - queueOrderingTimestamp$, reaches the KueueConfig `waitForPodsReady.requeuingStrategy.backoffLimitTimeout`,
the workload controller and the queueManager set `.spec.active` to `false`.
After that, the jobframework reconciler deactivates the workload.

Specifically, the workload controller sets `.spec.active` to `false` when it checks whether a workload is finished,
and, when the kueue scheduler gets the head workloads from the clusterQueues,
the queueManager sets `.spec.active` to `false` on any workload that exceeds `backoffLimitTimeout`.
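
A sketch of that evaluation follows; treating `backoffLimitTimeout` as seconds is an assumption here, since the `*int32` field doesn't encode a unit:

```go
import "time"

// exceedsBackoffLimitTimeout sketches the timeout evaluation: has the
// workload been waiting since its queue ordering timestamp for longer
// than the configured limit?
func exceedsBackoffLimitTimeout(now, queueOrderingTimestamp time.Time, backoffLimitTimeout *int32) bool {
	if backoffLimitTimeout == nil {
		return false // knob not set; never deactivate on timeout
	}
	return now.Sub(queueOrderingTimestamp) >= time.Duration(*backoffLimitTimeout)*time.Second
}
```
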
1 change: 1 addition & 0 deletions keps/1282-pods-ready-requeue-strategy/kep.yaml
@@ -2,6 +2,7 @@ title: Pods Ready Requeue Strategy
kep-number: 1282
authors:
- "@nstogner"
- "@tenzen-y"
status: provisional
creation-date: 2023-11-01
reviewers:
