KEP: Introduce a new timeout to WaitForPodsReady config #2737

Open
wants to merge 12 commits into base: main
93 changes: 82 additions & 11 deletions keps/349-all-or-nothing/README.md
@@ -24,6 +24,7 @@ tags, and then generate with `hack/update-toc.sh`.
- [Proposal](#proposal)
- [User Stories (Optional)](#user-stories-optional)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
@@ -92,6 +93,7 @@ demonstrate the interest in a KEP within the wider Kubernetes community.
unsuspended by Kueue
- a timeout on getting the physical resources assigned by a Job since
unsuspended by Kueue
- a timeout on replacing a failed pod

<!--
List the specific goals of the KEP. What is it trying to achieve? How will we
@@ -102,8 +104,7 @@ know that this has succeeded?

- guarantee that two jobs would not schedule pods concurrently. Example
scenarios in which two jobs may still concurrently schedule their pods:
- when succeeded pods are replaced with new because job's parallelism is less than its completions;
- when a failed pod gets replaced
- when succeeded pods are replaced with new ones because the job's parallelism is less than its completions.

<!--
What is out of scope for this KEP? Listing non-goals helps to focus discussion
@@ -147,6 +148,11 @@ the configured cluster queue quota and when the Jobs don't specify priorities
My use case can be supported by enabling `waitForPodsReady` in the Kueue
configuration.

#### Story 2

As a Kueue administrator, I want to ensure that a Workload is evicted after a
configured timeout if a Pod fails during its execution and the replacement Pod can't be scheduled.

### Notes/Constraints/Caveats (Optional)

<!--
@@ -162,8 +168,8 @@ If a workload fails to schedule its pods it could block admission of other
workloads indefinitely.

To mitigate this issue we introduce a timeout on reaching the `PodsReady`
condition by a workload since its job start (see:
[Timeout on reaching the PodsReady condition](#timeout-on-reaching-the-podsready-condition)).
condition by a workload since its job start, and a timeout on re-reaching the `PodsReady` condition after one of its Pods has failed
(see: [Timeout on reaching the PodsReady condition](#timeout-on-reaching-the-podsready-condition)).

<!--
What are the risks of this proposal, and how do we mitigate? Think broadly.
@@ -226,6 +232,16 @@ type WaitForPodsReady struct {
// RequeuingStrategy defines the strategy for requeuing a Workload.
// +optional
RequeuingStrategy *RequeuingStrategy `json:"requeuingStrategy,omitempty"`

// RecoveryTimeout defines an optional timeout, measured since the
// last transition to the PodsReady=false condition after a Workload is Admitted and running.
// Such a transition may happen when a Pod fails and the replacement Pod
// is waiting to be scheduled.
// After exceeding the timeout the corresponding job gets suspended again
// and requeued after the backoff delay. The timeout is enforced only if waitForPodsReady.enable=true.
// Defaults to 3 minutes.
// +optional
RecoveryTimeout *metav1.Duration `json:"recoveryTimeout,omitempty"`
}

type RequeuingStrategy struct {
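
For illustration, this is roughly how an administrator could set the new field in the Kueue Configuration. This is a sketch assuming the API shape above; the concrete values and the `requeuingStrategy.backoffLimitCount` setting are only examples:

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  # Startup timeout: time allowed between unsuspending the job and PodsReady=true.
  timeout: 5m
  # Proposed field: time allowed to recover PodsReady=true after a Pod failure.
  recoveryTimeout: 3m
  requeuingStrategy:
    backoffLimitCount: 3
```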
@@ -315,9 +331,11 @@ type RequeueState struct {

We introduce a new workload condition, called `PodsReady`, to indicate
if the workload's startup requirements are satisfied. More precisely, we add
the condition when `job.status.ready + job.status.succeeded` is greater or equal
the condition when `job.status.ready + len(job.status.uncountedTerminatedPods.succeeded) + job.status.succeeded` is greater than or equal
**Review comment (Member):** Does this indicate

    func (j *Job) PodsReady() bool {
    	ready := ptr.Deref(j.Status.Ready, 0)
    	return j.Status.Succeeded+ready >= j.podsCount()
    }

? In that case, what about jobs other than batch/v1 Jobs?

to `job.spec.parallelism`.

Note that we count `job.status.uncountedTerminatedPods` - this is meant to prevent flickering of the `PodsReady` condition when pods are transitioning to the `Succeeded` state.
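
To make the check above concrete, here is a minimal Go sketch for a batch/v1 Job; the helper name is illustrative and not part of the proposal, and other job types would need their own readiness logic:

```go
package podsready

import (
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/ptr"
)

// jobPodsReady mirrors the formula above: ready pods, plus pods that already
// succeeded but are not yet reflected in status.succeeded, plus succeeded pods,
// compared against the job's parallelism.
func jobPodsReady(job *batchv1.Job) bool {
	ready := ptr.Deref(job.Status.Ready, 0)
	var uncountedSucceeded int32
	if job.Status.UncountedTerminatedPods != nil {
		uncountedSucceeded = int32(len(job.Status.UncountedTerminatedPods.Succeeded))
	}
	parallelism := ptr.Deref(job.Spec.Parallelism, 1)
	return ready+uncountedSucceeded+job.Status.Succeeded >= parallelism
}
```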
**Review comment (Member):** Does this improve the existing waitForPodsReady feature even if we do not introduce the recoveryTimeout?

**Contributor Author:** Yes

**Member:** That makes sense. In that case, are there any improvements related to the flickering issue for other Jobs like RayJob or MPIJob?

**Member:** I know that other job types do not use the active pod count. My question was, is there any solution to mitigate the flickering of the PodsReady condition in other Job types?


Note that we don't take failed pods into account when verifying if the
`PodsReady` condition should be added. However, a buggy admitted workload is
eliminated as the corresponding job fails due to exceeding the `.spec.backoffLimit`
@@ -345,12 +363,64 @@ condition, so the corresponding job is unsuspended without further waiting.

### Timeout on reaching the PodsReady condition

We introduce a timeout, defined in the `waitForPodsReady.timeoutSeconds` field, on reaching the `PodsReady` condition since the job
is unsuspended (the time of unsuspending a job is marked by the Job's
`job.status.startTime` field). When the timeout is exceeded, the Kueue's Job

We introduce two timeouts defined in the `waitForPodsReady.timeout` and `waitForPodsReady.recoveryTimeout` fields.

The first one applies before the job has started: it tracks the time between the job being unsuspended for the first time (the moment of unsuspending a job is marked by the Job's
`job.status.startTime` field) and reaching the `PodsReady=true` condition (marked by the condition's `.lastTransitionTime`).

```mermaid
flowchart TD;
start@{ shape: f-circ};
id1(Suspended=true);
id2("PodsReady=false
waits for .timeoutSeconds");
id3(PodsReady=true);
id4("Suspended=true (Requeued)");

start--Workload gets admitted-->id1;
id1 --> id2;

id2 --"Doesn't exceed the timeout" --> id3
id2 --"Exceeds the timeout" --> id4
```
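
As a rough illustration of this first timeout, a minimal Go sketch; the helper name and the way the configured timeout is passed in are assumptions, not the actual Kueue implementation:

```go
package podsready

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
)

// startupTimeoutExceeded reports whether waitForPodsReady.timeout has elapsed
// since the job was unsuspended (job.status.startTime was set) without the
// workload reaching the PodsReady=true condition.
func startupTimeoutExceeded(job *batchv1.Job, timeout time.Duration, now time.Time) bool {
	if job.Status.StartTime == nil {
		// The job has not been unsuspended yet, so the clock has not started.
		return false
	}
	return now.Sub(job.Status.StartTime.Time) > timeout
}
```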


The second one applies after the job has already started and a pod fails while the job is running: it tracks the time between the `PodsReady` condition changing to `false` and reaching the
`PodsReady=true` condition once again.
**Review comment (Member) on lines +389 to +390:** What if the Job fails again after it gets ready following the recovery? I mean, what if the Job falls into the loop below?

    flowchart TD;
    	id3(PodsReady=true);
    	id4("PodsReady=false(2nd)
    	waits for
    	.recoveryTimeout");
    	id3 --"Pod failed"--> id4
    	id4 --"Pod recovered"--> id3

**Contributor Author (@PBundyra, Jan 2, 2025):** We allow multiple (infinite) recoveries if they fit within the timeout.

**Member:** I'm worried about falling into the infinite loop, with the Pods then accidentally dominating the resources. Should we introduce mitigations for the loop? We probably want a comprehensive timeout, separate from RecoveryTimeout, so that we can evict the infinite-loop Pods. Is there any idea other than a comprehensive timeout?

```mermaid
flowchart TD;
start@{ shape: f-circ};
id1(Suspended=true);
id2("PodsReady=false(1st)");
id3(PodsReady=true);
id4("PodsReady=false(2nd)
waits for
.recoveryTimeout");
id5("Suspended=true (Requeued)");


start--Workload gets admitted-->id1;
id1 --> id2;

id2 --"Job started" --> id3
id3 --"Pod failed"--> id4
id4 --"Pod recovered"--> id3
id4 --"timeout exceeded"--> id5
```

We introduce new `WorkloadWaitForPodsStart` and `WorkloadWaitForPodsRecovery` reasons to distinguish why the `PodsReady=false` condition was set.
**Review comment (Member):** How can existing users migrate the previous "PodsReady" reason to the new reasons? Are there any migration plans?

`WorkloadWaitForPodsStart` will be set before the job has started, and `WorkloadWaitForPodsRecovery` after it has.

When any of the timeouts is exceeded, Kueue's Job
Controller suspends the Job corresponding to the workload and puts it into the
ClusterQueue's `inadmissibleWorkloads` list. The timeout is enforced only when
`waitForPodsReady` is enabled.
ClusterQueue's `inadmissibleWorkloads` list. It also updates the Workload's `.requeueState` field.
When `.requeueState.count` surpasses `waitForPodsReady.requeuingBackoffLimitCount`, the workload gets
deactivated and won't be requeued.

Both timeouts apply only when `waitForPodsReady` is enabled.
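
For the eviction logic, a minimal sketch of how the recovery timeout might be evaluated from the Workload's `PodsReady` condition; the reason string, helper name, and the way the configuration reaches the controller are assumptions, not the final Kueue code:

```go
package podsready

import (
	"time"

	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// recoveryTimeoutExceeded reports whether an admitted, previously started
// workload has stayed in PodsReady=false (waiting for a failed Pod's
// replacement) for longer than waitForPodsReady.recoveryTimeout.
func recoveryTimeoutExceeded(conds []metav1.Condition, recoveryTimeout time.Duration, now time.Time) bool {
	cond := apimeta.FindStatusCondition(conds, "PodsReady")
	if cond == nil || cond.Status != metav1.ConditionFalse {
		return false
	}
	// Only the recovery case is handled here; the startup case is governed by
	// waitForPodsReady.timeout and the WaitForPodsStart reason instead.
	if cond.Reason != "WaitForPodsRecovery" { // assumed reason value
		return false
	}
	return now.Sub(cond.LastTransitionTime.Time) > recoveryTimeout
}
```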


### Test Plan

@@ -412,7 +482,8 @@ extending the production code to implement this enhancement.
The following scenarios will be covered with integration tests when `waitForPodsReady` is enabled:
- no workloads are admitted if there is already an admitted workload which is not in the `PodsReady` condition
- a workload gets admitted if all other admitted workloads are in the `PodsReady` condition
- a workload which exceeds the `waitForPodsReady.timeoutSeconds` timeout is suspended and put into the `inadmissibleWorkloads` list
- a workload which exceeds the `waitForPodsReady.timeout` timeout is suspended and put into the `inadmissibleWorkloads` list
- a workload which exceeds the `waitForPodsReady.recoveryTimeout` timeout is suspended and put into the `inadmissibleWorkloads` list

<!--
Describe what tests will be added to ensure proper quality of the enhancement.
4 changes: 4 additions & 0 deletions keps/349-all-or-nothing/kep.yaml
@@ -2,13 +2,17 @@ title: All-or-nothing semantics for workload resource assignment
kep-number: 349
authors:
- "@mimowo"
- "@pbundyra" # for recoveryTimeout extension
status: provisional
creation-date: 2022-11-23
reviewers:
- "@kerthcet"
- "@alculquicondor"
- "@mimowo" # for recoveryTimeout extension
- "@yaroslava-serdiuk" # for recoveryTimeout extension
approvers:
- "@ahg-g"
- "@mimowo" # for recoveryTimeout extension

# The target maturity stage in the current dev cycle for this KEP.
stage: stable