KEP: Add an exponential backoff mechanism to the requeueing strategy #1608

tenzen-y · 2024-01-18T18:39:04Z

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Which issue(s) this PR fixes:

Part-of: #1282

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

k8s-ci-robot · 2024-01-18T18:39:06Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

netlify · 2024-01-18T18:39:10Z

✅ Deploy Preview for kubernetes-sigs-kueue canceled.

Name	Link
🔨 Latest commit	`3829f79`
🔍 Latest deploy log	https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65c2a5799ecbe20008f5e012

keps/1282-pods-ready-requeue-strategy/README.md

tenzen-y · 2024-01-26T23:30:08Z

Due to the failure to set up the webhook server:

A BeforeSuite node failed so all tests were skipped.

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/1608/pull-kueue-test-e2e-main-1-29/1751021175854600192

/test pull-kueue-test-e2e-main-1-29

keps/1282-pods-ready-requeue-strategy/README.md

alculquicondor · 2024-02-05T15:18:09Z

keps/1282-pods-ready-requeue-strategy/README.md

+	//
+	// Defaults to null. 
+	// +optional
+	BackOffLimitCount *int32 `json:"backOffLimitCount,omitempty"`


Please describe the implicit behavior of the backoff. What is the base backoff, and what is the exponent?

It makes sense.

astefanutti · 2024-02-05T18:06:37Z

keps/1282-pods-ready-requeue-strategy/README.md

+	// When a deactivated workload is reactivated, this count is reset to 0. 
+	//
+	// +optional
+	RequeuedCount *int32 `json:"requeuedCount,omitempty"`


How about adding the last requeue / backoff timestamp and the backoff duration, so it's possible to know when the workload will be "retried", without recomputing the exponential backoff logic for an external observer?

+1
I wanted to keep the API as minimal as possible, but I agree that usability can be compromised without a maxBackoff

SGTM
I will extend this API.

can you add it? We are almost closing the release.

I should do it ASAP...

@alculquicondor Here is a quick sketch of the API design. Please let me know if it isn't what you expected.

type WorkloadStatus struct {
	...
	// requeueState records
	//
	// +optional
	RequeueState *RequeueState `json:"requeueState,omitempty"`
}

type RequeueState struct {
	// count records the number of times a workload has been requeued.
	// When a deactivated workload is reactivated, this count is reset to 0.
	//
	// +optional
	Count *int32 `json:"requeuedCount,omitempty"`

	// +optional
	LastBackoffTime metav1.Time `json:"lastRequeuedTime,omitempty"`

	// +optional
	BackoffDuration metav1.Duration `json:"backoffDuration,omitempty"`
}

I'm organizing the above API.

Instead of BackoffDuration and LastBackoffTime, it might be better to have just requeueAt (a metav1.Time) that indicates the time at which Kueue will consider this workload again for admission.

The time when the workload was evicted is already visible in the condition.

That makes sense. Even if we add only requeueAt, we can avoid recomputing the exponential backoff logic.
If user feedback later shows that we truly need the duration, we can extend the API.

keps/1282-pods-ready-requeue-strategy/README.md

alculquicondor · 2024-02-06T19:55:24Z

keps/1282-pods-ready-requeue-strategy/README.md

+
+1. The workload don't have the proper configurations like image pull credential and pvc name, etc.
+2. The cluster can meet flavorQuotas, but each node doesn't have the resources that each podSet requests.  
+3. Multiple flavors are matched for the workload, but the workload can not be launched on the backed flavors (which means non-primary flavor).


was this always there? I don't really understand it.

I added this story based on https://github.com/kubernetes-sigs/kueue/pull/1608/files#r1460075048.

oh right. Let me rephrase:

If there are multiple resource flavors that match the workload (for example, flavors 1 & 2) and the workload was running on flavor 2, it's likely that the workload will be readmitted on the same flavor indefinitely.

Thank you for this suggestion!

alculquicondor · 2024-02-06T19:58:18Z

keps/1282-pods-ready-requeue-strategy/README.md

+	// When a deactivated workload is reactivated, this count is reset to 0. 
+	//
+	// +optional
+	RequeuedCount *int32 `json:"requeuedCount,omitempty"`


can you add it? We are almost closing the release.

tenzen-y · 2024-02-06T21:34:47Z

The only pending item is here: #1608 (comment)

alculquicondor · 2024-02-06T21:34:59Z

/lgtm
/approve

k8s-ci-robot · 2024-02-06T21:35:04Z

LGTM label has been added.

Git tree hash: 3982a805d53cc49ef1ad9117f0ced1a31f9cfed8

k8s-ci-robot · 2024-02-06T21:35:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [alculquicondor,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tenzen-y · 2024-02-06T21:50:20Z

/hold cancel

…ubernetes-sigs#1608)

* Add an exponential backoff mechanism to the requeueing strategy
  Signed-off-by: tenzen-y <[email protected]>
* Rephrase 'maxBackOffRetry' with 'backOffLimit'
  Signed-off-by: tenzen-y <[email protected]>
  Signed-off-by: Yuki Iwai <[email protected]>
* Improve expressions
  Signed-off-by: Yuki Iwai <[email protected]>
* Move backOffLimitTimeout to an alternative section
  Signed-off-by: Yuki Iwai <[email protected]>
* Replace backOff with backoff
  Signed-off-by: Yuki Iwai <[email protected]>
* Additional eviction reasons to story 2
  Signed-off-by: Yuki Iwai <[email protected]>
* Update an API comment for backoffLimitCount
  Signed-off-by: Yuki Iwai <[email protected]>
* Update story3
  Signed-off-by: Yuki Iwai <[email protected]>
* Move backoffTimeout to an alternative section
  Signed-off-by: Yuki Iwai <[email protected]>
* Update workload API
  Signed-off-by: Yuki Iwai <[email protected]>
* Rephrase strory2
  Signed-off-by: Yuki Iwai <[email protected]>

---------
Signed-off-by: tenzen-y <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>

k8s-ci-robot requested review from denkensk and kerthcet January 18, 2024 18:39

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 18, 2024

tenzen-y changed the title ~~WIP: Add an exponential backoff mechanism to the requeueing strategy~~ WIP: KEP: Add an exponential backoff mechanism to the requeueing strategy Jan 18, 2024

nstogner reviewed Jan 20, 2024

keps/1282-pods-ready-requeue-strategy/README.md (outdated, resolved)

nstogner reviewed Jan 20, 2024

keps/1282-pods-ready-requeue-strategy/README.md (resolved)

tenzen-y force-pushed the add-beckoff-limit-for-requeue branch 2 times, most recently from 93dde79 to fd11474 Compare January 26, 2024 00:02

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 26, 2024

tenzen-y force-pushed the add-beckoff-limit-for-requeue branch 7 times, most recently from c48fd96 to 371e451 Compare January 26, 2024 01:00

tenzen-y changed the title ~~WIP: KEP: Add an exponential backoff mechanism to the requeueing strategy~~ KEP: Add an exponential backoff mechanism to the requeueing strategy Jan 26, 2024

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 26, 2024

tenzen-y marked this pull request as ready for review January 26, 2024 01:02

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 26, 2024

k8s-ci-robot requested a review from mimowo January 26, 2024 01:02

tenzen-y force-pushed the add-beckoff-limit-for-requeue branch from 0932906 to 88f2302 Compare January 26, 2024 23:15

alculquicondor reviewed Feb 1, 2024

alculquicondor reviewed Feb 5, 2024

astefanutti reviewed Feb 5, 2024

tenzen-y force-pushed the add-beckoff-limit-for-requeue branch from 405775e to 290d8c1 Compare February 5, 2024 23:51

tenzen-y added 9 commits February 6, 2024 13:52

Add an exponential backoff mechanism to the requeueing strategy

3da98ca

Signed-off-by: tenzen-y <[email protected]>

Rephrase 'maxBackOffRetry' with 'backOffLimit'

1f0af2c

Signed-off-by: tenzen-y <[email protected]> Signed-off-by: Yuki Iwai <[email protected]>

Improve expressions

457c094

Signed-off-by: Yuki Iwai <[email protected]>

Move backOffLimitTimeout to an alternative section

572786c

Signed-off-by: Yuki Iwai <[email protected]>

Replace backOff with backoff

c48b142

Signed-off-by: Yuki Iwai <[email protected]>

Additional eviction reasons to story 2

b02ec96

Signed-off-by: Yuki Iwai <[email protected]>

Update an API comment for backoffLimitCount

2b1d7d3

Signed-off-by: Yuki Iwai <[email protected]>

Update story3

4814a03

Signed-off-by: Yuki Iwai <[email protected]>

Move backoffTimeout to an alternative section

cf1be07

Signed-off-by: Yuki Iwai <[email protected]>

tenzen-y force-pushed the add-beckoff-limit-for-requeue branch from 7fdf2a2 to cf1be07 Compare February 6, 2024 04:54

alculquicondor reviewed Feb 6, 2024

tenzen-y added 2 commits February 7, 2024 06:31

Update workload API

2b5574e

Signed-off-by: Yuki Iwai <[email protected]>

Rephrase strory2

3829f79

Signed-off-by: Yuki Iwai <[email protected]>

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2024

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 6, 2024

k8s-ci-robot merged commit 8cf1893 into kubernetes-sigs:main Feb 6, 2024
14 checks passed

k8s-ci-robot added this to the v0.6 milestone Feb 6, 2024

tenzen-y deleted the add-beckoff-limit-for-requeue branch February 12, 2024 00:16
