KEP: Introduce a new timeout to WaitForPodsReady config #2737
base: main
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
f01fc98 to 8a49825
cc @tenzen-y
/retest
b72652f to 3ee57bd
/cc
It's missing details
keps/349-all-or-nothing/README.md
Outdated
First one tracks the time between job getting unsuspended (the time of unsuspending a job is marked by the Job's
`job.status.startTime` field) and reaching the `PodsReady=true` condition.

Second one tracks the time between changing `PodsReady` condition to `false` after the job is running and reaching the
how do you plan to track this time?
I plan to use the timestamp in the `PodsReady` condition to calculate how much time has passed.
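As a rough illustration of the timestamp-based approach described in this reply, the elapsed time can be derived from the condition's last transition time. This is only a sketch: `Condition` is a hypothetical, simplified stand-in for the real Kubernetes condition type, and `timeSinceNotReady` and `timeoutExceeded` are names invented for this example, not functions from Kueue.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a hypothetical, simplified stand-in for a Kubernetes
// condition; only the fields needed for the calculation are kept.
type Condition struct {
	Type               string
	Status             bool // true means PodsReady=true
	LastTransitionTime time.Time
}

// timeSinceNotReady returns how long the condition has been false.
// The second return value is false when the condition is currently true.
func timeSinceNotReady(c Condition, now time.Time) (time.Duration, bool) {
	if c.Status {
		return 0, false
	}
	return now.Sub(c.LastTransitionTime), true
}

// timeoutExceeded reports whether the condition has been false for
// strictly longer than the given timeout.
func timeoutExceeded(c Condition, now time.Time, timeout time.Duration) bool {
	elapsed, notReady := timeSinceNotReady(c, now)
	return notReady && elapsed > timeout
}

func main() {
	lost := time.Date(2024, 8, 1, 12, 0, 0, 0, time.UTC)
	cond := Condition{Type: "PodsReady", Status: false, LastTransitionTime: lost}
	now := lost.Add(90 * time.Second)

	fmt.Println(timeoutExceeded(cond, now, 60*time.Second))  // 90s elapsed vs. a 60s timeout
	fmt.Println(timeoutExceeded(cond, now, 120*time.Second)) // 90s elapsed vs. a 120s timeout
}
```

Because the calculation relies only on the stored transition time, nothing extra would have to be persisted on the Workload, assuming the condition's timestamp is updated exactly when its status flips.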
Please add it to the doc
If we wanted to support additional fault recovery capabilities, it could be desirable to have a separate `PodsUnhealthy` or `WorkloadUnhealthy` condition on the Workload instead of overloading `PodsReady`.
What semantic differences do you imagine between `PodsUnhealthy`/`WorkloadUnhealthy` and `PodsReady`?
Throughout this KEP, I feel that details are lacking.
So, could you explain:
- when this timeout is applied to the workloads.
- how many times we allow the workload to fail after it has been healthy once.
- how to implement this mechanism. It would be great to clarify each component's responsibilities for this feature. For example, the JobFramework reconciler is responsible for ... once the Ready Pod crashes ...
This is an interesting extension to the state machine for a Workload. It is getting pretty close to what one might need for a fairly general fault detection/recovery mechanism. Is there interest in pursuing that angle? We've explored this space fairly extensively for AppWrappers (https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/) and would be quite interested in bringing similar capabilities to Kueue's GenericJob framework.
3ee57bd to 87c6ff2
025dfef to c3b44c5
I prefer a separate timeout, because it will typically be much smaller than the timeout for all pods (first start).
LGTM overall
Co-authored-by: Michał Woźniak <[email protected]>
Please update
7fb75aa to ce5b174
/lgtm
LGTM label has been added. Git tree hash: 28e85bd3b3d6c1f7cb41ad315a401eceddf6d269
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mimowo, PBundyra
LGTM
Co-authored-by: Yaroslava Serdiuk <[email protected]>
New changes are detected. LGTM label has been removed.
lgtm
LGTM even better 👍
@@ -315,9 +331,11 @@ type RequeueState struct {

We introduce a new workload condition, called `PodsReady`, to indicate
if the workload's startup requirements are satisfied. More precisely, we add
the condition when `job.status.ready + job.status.succeeded` is greater or equal
the condition when `job.status.ready + len(job.status.uncountedTerminatedPods.succeeded) + job.status.succeeded` is greater or equal
Does this indicate the following?
kueue/pkg/controller/jobs/job/job_controller.go
Lines 313 to 316 in 6ac33de
func (j *Job) PodsReady() bool {
	ready := ptr.Deref(j.Status.Ready, 0)
	return j.Status.Succeeded+ready >= j.podsCount()
}
In that case, what about jobs other than batch/v1 Jobs?
than `job.spec.parallelism`.

Note that we count `job.status.uncountedTerminatedPods` - this is meant to prevent flickering of the `PodsReady` condition when pods are transitioning to the `Succeeded` state.
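The readiness check described in this hunk could look roughly like the sketch below. It is an assumption-laden illustration: `JobStatus` is a hypothetical, trimmed-down struct rather than the real batch/v1 type, `UncountedSucceeded` stands in for `uncountedTerminatedPods.succeeded`, and the comparison uses a plain `parallelism` argument as in the KEP text instead of the controller's `podsCount()` helper.

```go
package main

import "fmt"

// JobStatus is a hypothetical, trimmed-down stand-in for the batch/v1
// Job status fields used in the PodsReady calculation.
type JobStatus struct {
	Ready     int32
	Succeeded int32
	// UncountedSucceeded holds UIDs of pods that have terminated
	// successfully but are not yet reflected in Succeeded.
	UncountedSucceeded []string
}

// podsReady reports whether ready, uncounted-succeeded, and succeeded
// pods together reach the requested parallelism.
func podsReady(s JobStatus, parallelism int32) bool {
	uncounted := int32(len(s.UncountedSucceeded))
	return s.Ready+uncounted+s.Succeeded >= parallelism
}

func main() {
	// One pod has just finished: it is no longer Ready, and Succeeded
	// has not been incremented yet, so it is visible only as uncounted.
	inTransition := JobStatus{Ready: 2, Succeeded: 0, UncountedSucceeded: []string{"pod-uid-1"}}
	fmt.Println(podsReady(inTransition, 3))

	// Ignoring the uncounted pod makes the same job look not ready,
	// which is the flicker the note describes.
	withoutUncounted := JobStatus{Ready: 2, Succeeded: 0}
	fmt.Println(podsReady(withoutUncounted, 3))
}
```

The second call shows the transient state that would flip `PodsReady` to `false` for a moment if uncounted pods were left out of the sum.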
Does this improve the existing waitForPodsReady feature even if we do not introduce the recoveryTimeout?
Yes
That makes sense.
In that case, are there any improvements related to the flickering issue for other Jobs like RayJob or MPIJob?
Second one applies when the job has already started and some pod failed while the job is running. It tracks the time between changing `PodsReady` condition to `false` and reaching the
`PodsReady=true` condition once again.
What if the Job fails again after it becomes ready following a recovery?
I mean, what if the Job falls into the loop below?
flowchart TD;
id3(PodsReady=true);
id4("PodsReady=false(2nd)
waits for
.recoveryTimeout");
id3 --"Pod failed"--> id4
id4 --"Pod recovered"--> id3
id4 --"timeout exceeded"--> id5
```

We introduce new `WorkloadWaitForPodsStart` and `WorkloadWaitForPodsRecovery` reasons to distinguish the reasons of setting the `PodsReady=false` condition.
How can existing users migrate from the previous "PodsReady" reason to the new reasons?
Are there any migration plans?
Co-authored-by: Yuki Iwai <[email protected]>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Described in #2732
Which issue(s) this PR fixes:
Part of #2732
Special notes for your reviewer:
As an alternative, instead of adding a new timeout, we could reuse the existing one to provide the needed functionality
Pros:
Cons:
Does this PR introduce a user-facing change?