Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP: Introduce a new timeout to WaitForPodsReady config #2737

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

PBundyra
Copy link
Contributor

@PBundyra PBundyra commented Aug 1, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Described in #2732

Which issue(s) this PR fixes:

Part of #2732

Special notes for your reviewer:

As an alternative, instead of adding a new timeout, we could reuse the existing one to provide the needed functionality

Pros:

  • no API changes requried

Cons:

  • Users cannot set different timeouts for different scenarios

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. labels Aug 1, 2024
@k8s-ci-robot k8s-ci-robot requested review from denkensk and trasc August 1, 2024 08:33
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 1, 2024
@PBundyra
Copy link
Contributor Author

PBundyra commented Aug 1, 2024

/assign @mimowo
cc @mwielgus

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 1, 2024
Copy link

netlify bot commented Aug 1, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 3d6bb18
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/677400b03fd33e000962686c

@PBundyra PBundyra force-pushed the KEP-wait-for-pods-ready branch 2 times, most recently from f01fc98 to 8a49825 Compare August 1, 2024 08:35
@PBundyra
Copy link
Contributor Author

PBundyra commented Aug 1, 2024

cc @tenzen-y

keps/349-all-or-nothing/README.md Outdated Show resolved Hide resolved
keps/349-all-or-nothing/README.md Outdated Show resolved Hide resolved
keps/349-all-or-nothing/README.md Outdated Show resolved Hide resolved
keps/349-all-or-nothing/README.md Outdated Show resolved Hide resolved
@PBundyra
Copy link
Contributor Author

/retest

@PBundyra PBundyra force-pushed the KEP-wait-for-pods-ready branch from b72652f to 3ee57bd Compare August 13, 2024 07:43
@alculquicondor
Copy link
Contributor

/cc

Copy link
Contributor

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's missing details

keps/349-all-or-nothing/README.md Outdated Show resolved Hide resolved
First one tracks the time between job getting unsuspended (the time of unsuspending a job is marked by the Job's
`job.status.startTime` field) and reaching the `PodsReady=true` condition.

Second one tracks the time between changing `PodsReady` condition to `false` after the job is running and reaching the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do you plan to track this time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to use timestamp in the PodsReady condition to calculate how much time passed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add it to the doc

Copy link
Contributor

@dgrove-oss dgrove-oss Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we wanted to support additional fault recovery capabilities, it could be desirable to have a separate PodsUnhealthy or WorkloadUnhealthy condition on the Workload instead of overloading PodsReady.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What semantic differences do you imagine between PodsUnhealthy/WorkloadUnhealthy and PodsReady?

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throughout this KEP, I feel that this is lack of details.
So, could you explain

  • the timing when this timeout is applied to the worloads.
  • how many times we allow the workload failure after the workload is healthy once.
  • how to implement this mechanism. I guess that it would be great to clarify the responsibilities for this feature each in components. For example, the JobFramework reconcile is responsible for ... once the Ready Pod crash ...

keps/349-all-or-nothing/README.md Outdated Show resolved Hide resolved
keps/349-all-or-nothing/README.md Outdated Show resolved Hide resolved
@dgrove-oss
Copy link
Contributor

This is an interesting extension to the state machine for a Workload. It is getting pretty close to what one might need for a fairly general fault detection/recovery mechanism. Is there interest in pursuing that angle?

We've explored this space fairly extensively for AppWrappers (https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/) and would be quite interested in bringing similar capabilities to Kueue's GenericJob framework.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 15, 2024
@PBundyra PBundyra closed this Nov 5, 2024
@PBundyra PBundyra deleted the KEP-wait-for-pods-ready branch November 5, 2024 13:14
@PBundyra PBundyra restored the KEP-wait-for-pods-ready branch December 6, 2024 13:45
@PBundyra PBundyra reopened this Dec 6, 2024
@PBundyra PBundyra force-pushed the KEP-wait-for-pods-ready branch from 3ee57bd to 87c6ff2 Compare December 11, 2024 10:33
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 11, 2024
@PBundyra PBundyra force-pushed the KEP-wait-for-pods-ready branch from 025dfef to c3b44c5 Compare December 18, 2024 14:30
@mimowo
Copy link
Contributor

mimowo commented Dec 18, 2024

As an alternative, instead of adding a new timeout, we could reuse the existing one to provide the needed functionality

I prefer a separate timeout, because it will typically be much smaller than the timeout for all pods (first start).

@mimowo
Copy link
Contributor

mimowo commented Dec 18, 2024

LGTM overall

Co-authored-by: Michał Woźniak <[email protected]>
@mimowo
Copy link
Contributor

mimowo commented Dec 18, 2024

Please update keps/349-all-or-nothing/kep.yaml to add me to reviewers and approvers, but add a comment: # for recoveryTimeout extension. @PBundyra please also add yourself and @yaroslava-serdiuk to the appropriate sections, possibly with a comment.

@PBundyra PBundyra force-pushed the KEP-wait-for-pods-ready branch from 7fb75aa to ce5b174 Compare December 18, 2024 14:58
@mimowo
Copy link
Contributor

mimowo commented Dec 18, 2024

/lgtm
/approve
/hold
Leaving the final tagging to @tenzen-y

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Dec 18, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 28e85bd3b3d6c1f7cb41ad315a401eceddf6d269

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, PBundyra

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 18, 2024
@dgrove-oss
Copy link
Contributor

LGTM

keps/349-all-or-nothing/README.md Outdated Show resolved Hide resolved
keps/349-all-or-nothing/README.md Outdated Show resolved Hide resolved
keps/349-all-or-nothing/README.md Outdated Show resolved Hide resolved
Co-authored-by: Yaroslava Serdiuk <[email protected]>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 19, 2024
@k8s-ci-robot k8s-ci-robot requested a review from mimowo December 19, 2024 09:22
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@yaroslava-serdiuk
Copy link
Contributor

lgtm

@mimowo
Copy link
Contributor

mimowo commented Dec 19, 2024

LGTM even better 👍

keps/349-all-or-nothing/README.md Outdated Show resolved Hide resolved
@@ -315,9 +331,11 @@ type RequeueState struct {

We introduce a new workload condition, called `PodsReady`, to indicate
if the workload's startup requirements are satisfied. More precisely, we add
the condition when `job.status.ready + job.status.succeeded` is greater or equal
the condition when `job.status.ready + len(job.status.uncountedTerimnatedPods.succeeded) + job.status.succeeded` is greater or equal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this indicate

func (j *Job) PodsReady() bool {
ready := ptr.Deref(j.Status.Ready, 0)
return j.Status.Succeeded+ready >= j.podsCount()
}
?

In that case, what if other jobs except batch/v1 Jobs?

than `job.spec.parallelism`.

Note that we count `job.status.uncountedTerminatedPods` - this is meant to prevent flickering of the `PodsReady` condition when pods are transitioning to the `Succeeded` state.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this improve the existing waitForPodsReady feature even if we do not introduce the recoveryTimeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.
In that case, are there any improvements related to the flickering issue for other Jobs like RayJob MPIJob?

Comment on lines +389 to +390
Second one applies when the job has already started and some pod failed while the job is running. It tracks the time between changing `PodsReady` condition to `false` and reaching the
`PodsReady=true` condition once again.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the Job is re-failed after the Job gets ready after the recovery?

I meant, what if the Job falls down the below loop?

flowchart TD;
	id3(PodsReady=true);
	id4("PodsReady=false(2nd)
	waits for
	.recoveryTimeout");
	id3 --"Pod failed"--> id4
	id4 --"Pod recovered"--> id3
Loading

id4 --"timeout exceeded"--> id5
```

We introduce new `WorkloadWaitForPodsStart` and `WorkloadWaitForPodsRecovery` reasons to distinguish the reasons of setting the `PodsReady=false` condition.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can existing users migrate the previous "PodsReady" reason to the new reason?
Is there any migration plans?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/feature Categorizes issue or PR as related to a new feature. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants