KEP: Introduce a new timeout to WaitForPodsReady config #2737

Open
wants to merge 12 commits into base: main
93 changes: 82 additions & 11 deletions keps/349-all-or-nothing/README.md
@@ -24,6 +24,7 @@ tags, and then generate with `hack/update-toc.sh`.
- [Proposal](#proposal)
- [User Stories (Optional)](#user-stories-optional)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
@@ -92,6 +93,7 @@ demonstrate the interest in a KEP within the wider Kubernetes community.
unsuspended by Kueue
- a timeout on getting the physical resources assigned by a Job since
unsuspended by Kueue
- a timeout on replacing a failed pod

<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
@@ -102,8 +104,7 @@ know that this has succeeded?

- guarantee that two jobs would not schedule pods concurrently. Example
scenarios in which two jobs may still concurrently schedule their pods:
- when succeeded pods are replaced with new because job's parallelism is less than its completions;
- when a failed pod gets replaced
- when succeeded pods are replaced with new ones because the job's parallelism is less than its completions.

<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
@@ -147,6 +148,11 @@ the configured cluster queue quota and when the Jobs don't specify priorities
My use case can be supported by enabling `waitForPodsReady` in the Kueue
configuration.

#### Story 2

As a Kueue administrator, I want to ensure that a Workload is evicted after a
configured timeout if a Pod fails during its execution and the replacement Pod can't be scheduled.

### Notes/Constraints/Caveats (Optional)

<!--
@@ -162,8 +168,8 @@ If a workload fails to schedule its pods it could block admission of other
workloads indefinitely.

To mitigate this issue we introduce a timeout on reaching the `PodsReady`
condition by a workload since its job start (see:
[Timeout on reaching the PodsReady condition](#timeout-on-reaching-the-podsready-condition)).
condition by a workload since its job start, and a timeout on re-reaching the `PodsReady` condition after one of its Pods has failed
(see: [Timeout on reaching the PodsReady condition](#timeout-on-reaching-the-podsready-condition)).

<!--
What are the risks of this proposal, and how do we mitigate? Think broadly.
@@ -226,6 +232,16 @@ type WaitForPodsReady struct {
// RequeuingStrategy defines the strategy for requeuing a Workload.
// +optional
RequeuingStrategy *RequeuingStrategy `json:"requeuingStrategy,omitempty"`

// RecoveryTimeout defines an optional timeout, measured since the
// last transition to the PodsReady=false condition after a Workload is Admitted and running.
// Such a transition may happen when a Pod fails and the replacement Pod
// is waiting to be scheduled.
// After exceeding the timeout the corresponding job gets suspended again
// and requeued after the backoff delay. The timeout is enforced only if waitForPodsReady.enable=true.
// Defaults to 3 minutes.
// +optional
RecoveryTimeout *metav1.Duration `json:"recoveryTimeout,omitempty"`
}

type RequeuingStrategy struct {
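
For illustration, this is roughly how an administrator could set the new field in the Kueue Configuration. This is a sketch assuming the API shape above; the concrete values and the `requeuingStrategy.backoffLimitCount` setting are only examples:

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  # Startup timeout: time allowed between unsuspending the job and PodsReady=true.
  timeout: 5m
  # Proposed field: time allowed to recover PodsReady=true after a Pod failure.
  recoveryTimeout: 3m
  requeuingStrategy:
    backoffLimitCount: 3
```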
@@ -315,9 +331,11 @@ type RequeueState struct {

We introduce a new workload condition, called `PodsReady`, to indicate
if the workload's startup requirements are satisfied. More precisely, we add
the condition when `job.status.ready + job.status.succeeded` is greater or equal
the condition when `job.status.ready + len(job.status.uncountedTerminatedPods.succeeded) + job.status.succeeded` is greater than or equal
**Review comment (Member):** Does this indicate

    func (j *Job) PodsReady() bool {
    	ready := ptr.Deref(j.Status.Ready, 0)
    	return j.Status.Succeeded+ready >= j.podsCount()
    }

? In that case, what about jobs other than batch/v1 Jobs?

to `job.spec.parallelism`.

Note that we count `job.status.uncountedTerminatedPods` - this is meant to prevent flickering of the `PodsReady` condition when pods are transitioning to the `Succeeded` state.
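
To make the check above concrete, here is a minimal Go sketch for a batch/v1 Job; the helper name is illustrative and not part of the proposal, and other job types would need their own readiness logic:

```go
package podsready

import (
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/ptr"
)

// jobPodsReady mirrors the formula above: ready pods, plus pods that already
// succeeded but are not yet reflected in status.succeeded, plus succeeded pods,
// compared against the job's parallelism.
func jobPodsReady(job *batchv1.Job) bool {
	ready := ptr.Deref(job.Status.Ready, 0)
	var uncountedSucceeded int32
	if job.Status.UncountedTerminatedPods != nil {
		uncountedSucceeded = int32(len(job.Status.UncountedTerminatedPods.Succeeded))
	}
	parallelism := ptr.Deref(job.Spec.Parallelism, 1)
	return ready+uncountedSucceeded+job.Status.Succeeded >= parallelism
}
```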
**Review comment (Member):** Does this improve the existing waitForPodsReady feature even if we do not introduce the recoveryTimeout?

**Contributor Author:** Yes

**Member:** That makes sense. In that case, are there any improvements related to the flickering issue for other Jobs like RayJob or MPIJob?

**Member:** I know that other job types do not use the active pod count. My question was, is there any solution to mitigate the flickering of the PodsReady condition in other Job types?


Note that we don't take failed pods into account when verifying if the
`PodsReady` condition should be added. However, a buggy admitted workload is
eliminated as the corresponding job fails due to exceeding the `.spec.backoffLimit`
@@ -345,12 +363,64 @@ condition, so the corresponding job is unsuspended without further waiting.

### Timeout on reaching the PodsReady condition

We introduce a timeout, defined in the `waitForPodsReady.timeoutSeconds` field, on reaching the `PodsReady` condition since the job
is unsuspended (the time of unsuspending a job is marked by the Job's
`job.status.startTime` field). When the timeout is exceeded, the Kueue's Job

We introduce two timeouts defined in the `waitForPodsReady.timeout` and `waitForPodsReady.recoveryTimeout` fields.

The first one applies before the job has started: it tracks the time between the job being unsuspended for the first time (the moment of unsuspending a job is marked by the Job's
`job.status.startTime` field) and reaching the `PodsReady=true` condition (marked by the condition's `.lastTransitionTime`).

```mermaid
flowchart TD;
start@{ shape: f-circ};
id1(Suspended=true);
id2("PodsReady=false
waits for .timeoutSeconds");
id3(PodsReady=true);
id4("Suspended=true (Requeued)");

start--Workload gets admitted-->id1;
id1 --> id2;

id2 --"Doesn't exceed the timeout" --> id3
id2 --"Exceeds the timeout" --> id4
```
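
As a rough illustration of this first timeout, a minimal Go sketch; the helper name and the way the configured timeout is passed in are assumptions, not the actual Kueue implementation:

```go
package podsready

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
)

// startupTimeoutExceeded reports whether waitForPodsReady.timeout has elapsed
// since the job was unsuspended (job.status.startTime was set) without the
// workload reaching the PodsReady=true condition.
func startupTimeoutExceeded(job *batchv1.Job, timeout time.Duration, now time.Time) bool {
	if job.Status.StartTime == nil {
		// The job has not been unsuspended yet, so the clock has not started.
		return false
	}
	return now.Sub(job.Status.StartTime.Time) > timeout
}
```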


The second one applies after the job has already started and a pod fails while the job is running: it tracks the time between the `PodsReady` condition changing to `false` and reaching the
`PodsReady=true` condition once again.
**Review comment (Member) on lines +389 to +390:** What if the Job fails again after it gets ready following the recovery? I mean, what if the Job falls into the loop below?

    flowchart TD;
    	id3(PodsReady=true);
    	id4("PodsReady=false(2nd)
    	waits for
    	.recoveryTimeout");
    	id3 --"Pod failed"--> id4
    	id4 --"Pod recovered"--> id3

**Contributor Author (@PBundyra, Jan 2, 2025):** We allow multiple (infinite) recoveries if they fit within the timeout.

**Member:** I'm worried about falling into the infinite loop, with the Pods then accidentally dominating the resources. Should we introduce mitigations for the loop? We probably want a comprehensive timeout, separate from RecoveryTimeout, so that we can evict the infinite-loop Pods. Is there any idea other than a comprehensive timeout?

```mermaid
flowchart TD;
start@{ shape: f-circ};
id1(Suspended=true);
id2("PodsReady=false(1st)");
id3(PodsReady=true);
id4("PodsReady=false(2nd)
waits for
.recoveryTimeout");
id5("Suspended=true (Requeued)");


start--Workload gets admitted-->id1;
id1 --> id2;

id2 --"Job started" --> id3
id3 --"Pod failed"--> id4
id4 --"Pod recovered"--> id3
id4 --"timeout exceeded"--> id5
```

We introduce new `WorkloadWaitForPodsStart` and `WorkloadWaitForPodsRecovery` reasons to distinguish why the `PodsReady=false` condition was set.
**Review comment (Member):** How can existing users migrate the previous "PodsReady" reason to the new reasons? Are there any migration plans?

`WorkloadWaitForPodsStart` will be set before the job has started, and `WorkloadWaitForPodsRecovery` after it has.

When any of the timeouts is exceeded, Kueue's Job
Controller suspends the Job corresponding to the workload and puts it into the
ClusterQueue's `inadmissibleWorkloads` list. The timeout is enforced only when
`waitForPodsReady` is enabled.
ClusterQueue's `inadmissibleWorkloads` list. It also updates the Workload's `.requeueState` field.
When `.requeueState.count` surpasses `waitForPodsReady.requeuingBackoffLimitCount`, the workload gets
deactivated and won't be requeued.

Both timeouts apply only when `waitForPodsReady` is enabled.
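
For the eviction logic, a minimal sketch of how the recovery timeout might be evaluated from the Workload's `PodsReady` condition; the reason string, helper name, and the way the configuration reaches the controller are assumptions, not the final Kueue code:

```go
package podsready

import (
	"time"

	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// recoveryTimeoutExceeded reports whether an admitted, previously started
// workload has stayed in PodsReady=false (waiting for a failed Pod's
// replacement) for longer than waitForPodsReady.recoveryTimeout.
func recoveryTimeoutExceeded(conds []metav1.Condition, recoveryTimeout time.Duration, now time.Time) bool {
	cond := apimeta.FindStatusCondition(conds, "PodsReady")
	if cond == nil || cond.Status != metav1.ConditionFalse {
		return false
	}
	// Only the recovery case is handled here; the startup case is governed by
	// waitForPodsReady.timeout and the WaitForPodsStart reason instead.
	if cond.Reason != "WaitForPodsRecovery" { // assumed reason value
		return false
	}
	return now.Sub(cond.LastTransitionTime.Time) > recoveryTimeout
}
```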


### Test Plan

@@ -412,7 +482,8 @@ extending the production code to implement this enhancement.
The following scenarios will be covered with integration tests when `waitForPodsReady` is enabled:
- no workloads are admitted if there is already an admitted workload which is not in the `PodsReady` condition
- a workload gets admitted if all other admitted workloads are in the `PodsReady` condition
- a workload which exceeds the `waitForPodsReady.timeoutSeconds` timeout is suspended and put into the `inadmissibleWorkloads` list
- a workload which exceeds the `waitForPodsReady.timeout` timeout is suspended and put into the `inadmissibleWorkloads` list
- a workload which exceeds the `waitForPodsReady.recoveryTimeout` timeout is suspended and put into the `inadmissibleWorkloads` list

<!--
Describe what tests will be added to ensure proper quality of the enhancement.
4 changes: 4 additions & 0 deletions keps/349-all-or-nothing/kep.yaml
@@ -2,13 +2,17 @@ title: All-or-nothing semantics for workload resource assignment
kep-number: 349
authors:
- "@mimowo"
- "@pbundyra" # for recoveryTimeout extension
status: provisional
creation-date: 2022-11-23
reviewers:
- "@kerthcet"
- "@alculquicondor"
- "@mimowo" # for recoveryTimeout extension
- "@yaroslava-serdiuk" # for recoveryTimeout extension
approvers:
- "@ahg-g"
- "@mimowo" # for recoveryTimeout extension

# The target maturity stage in the current dev cycle for this KEP.
stage: stable