Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP: Add an exponential backoff mechanism to the requeueing strategy #1608

Merged

Conversation

tenzen-y
Copy link
Member

@tenzen-y tenzen-y commented Jan 18, 2024

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Which issue(s) this PR fixes:

Part-of: #1282

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/documentation Categorizes issue or PR as related to documentation. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 18, 2024
Copy link

netlify bot commented Jan 18, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 3829f79
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65c2a5799ecbe20008f5e012

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 18, 2024
@tenzen-y tenzen-y changed the title WIP: Add an exponential backoff mechanism to the requeueing strategy WIP: KEP: Add an exponential backoff mechanism to the requeueing strategy Jan 18, 2024
@tenzen-y tenzen-y force-pushed the add-beckoff-limit-for-requeue branch 2 times, most recently from 93dde79 to fd11474 Compare January 26, 2024 00:02
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 26, 2024
@tenzen-y tenzen-y force-pushed the add-beckoff-limit-for-requeue branch 7 times, most recently from c48fd96 to 371e451 Compare January 26, 2024 01:00
@tenzen-y tenzen-y changed the title WIP: KEP: Add an exponential backoff mechanism to the requeueing strategy KEP: Add an exponential backoff mechanism to the requeueing strategy Jan 26, 2024
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 26, 2024
@tenzen-y tenzen-y marked this pull request as ready for review January 26, 2024 01:02
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 26, 2024
@k8s-ci-robot k8s-ci-robot requested a review from mimowo January 26, 2024 01:02
@tenzen-y tenzen-y force-pushed the add-beckoff-limit-for-requeue branch from 0932906 to 88f2302 Compare January 26, 2024 23:15
@tenzen-y
Copy link
Member Author

Due to the failure to set up the webhook server:

A BeforeSuite node failed so all tests were skipped.

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/1608/pull-kueue-test-e2e-main-1-29/1751021175854600192

/test pull-kueue-test-e2e-main-1-29

keps/1282-pods-ready-requeue-strategy/README.md Outdated Show resolved Hide resolved
keps/1282-pods-ready-requeue-strategy/README.md Outdated Show resolved Hide resolved
keps/1282-pods-ready-requeue-strategy/README.md Outdated Show resolved Hide resolved
keps/1282-pods-ready-requeue-strategy/README.md Outdated Show resolved Hide resolved
keps/1282-pods-ready-requeue-strategy/README.md Outdated Show resolved Hide resolved
keps/1282-pods-ready-requeue-strategy/README.md Outdated Show resolved Hide resolved
keps/1282-pods-ready-requeue-strategy/README.md Outdated Show resolved Hide resolved
//
// Defaults to null.
// +optional
BackOffLimitCount *int32 `json:"backOffLimitCount,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please describe the implicit behavior of the backoff. What is the base backoff, and what is the exponent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense.

// When a deactivated workload is reactivated, this count is reset to 0.
//
// +optional
RequeuedCount *int32 `json:"requeuedCount,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding the last requeue / backoff timestamp and the backoff duration, so it's possible to know when the workload will be "retried", without recomputing the exponential backoff logic for an external observer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
I wanted to keep the API as minimal as possible, but I agree that usability can be compromised without a maxBackoff

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM
I will extend this API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add it? We are almost closing the release.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should do ASAP...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alculquicondor Here are quick API design sharing: Please let me know if the API design wouldn't be expected.

type WorkloadStatus struct {
	...
	// requeueState records 
	// 
	// +optional
	RequeueState *RequeueState `json:"requeueState,omitempty"` 
}

type RequeueState struct {
	// count records the number of times a workload has been requeued.
	// When a deactivated workload is reactivated, this count is reset to 0. 
	//
	// +optional
	Count *int32 `json:"requeuedCount,omitempty"`

	// +optional
	LastBackoffTime metav1.Time `json:"lastRequeuedTime,omitempty"`
	
	// +optional
	BackoffDuration metav1.Duration `json:"backoffDuration,omitempty"`
}

I'm organizing the above API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of BackoffDuration and lastBackoffTime, it might be better to have just requeueAt metav1.Time that indicates the time at which Kueue will consider this workload again for admission.

The time when the workload was evicted is already visible in the condition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of BackoffDuration and lastBackoffTime, it might be better to have just requeueAt metav1.Time that indicates the time at which Kueue will consider this workload again for admission.

The time when the workload was evicted is already visible in the condition.

That makes sense. Even if we add only requeueAt, we can avoid recomputing the exponential backoff logic.
After we truly want to have duration based on user feedback, we can extend API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

keps/1282-pods-ready-requeue-strategy/README.md Outdated Show resolved Hide resolved
@tenzen-y tenzen-y force-pushed the add-beckoff-limit-for-requeue branch from 405775e to 290d8c1 Compare February 5, 2024 23:51
@tenzen-y tenzen-y force-pushed the add-beckoff-limit-for-requeue branch from 7fdf2a2 to cf1be07 Compare February 6, 2024 04:54

1. The workload don't have the proper configurations like image pull credential and pvc name, etc.
2. The cluster can meet flavorQuotas, but each node doesn't have the resources that each podSet requests.
3. Multiple flavors are matched for the workload, but the workload can not be launched on the backed flavors (which means non-primary flavor).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this always there? I don't really understand it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@alculquicondor alculquicondor Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh right. Let me rephrase:

If there are multiple resource flavors that match the workload (for example, flavors 1 & 2)
and the workload was running on flavor 2, it's likely that the workload will be readmitted
on the same flavor indefinitely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this suggestion!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// When a deactivated workload is reactivated, this count is reset to 0.
//
// +optional
RequeuedCount *int32 `json:"requeuedCount,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add it? We are almost closing the release.

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
@tenzen-y
Copy link
Member Author

tenzen-y commented Feb 6, 2024

Pending is only here: #1608 (comment)

@alculquicondor
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 3982a805d53cc49ef1ad9117f0ced1a31f9cfed8

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alculquicondor,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Copy link
Member Author

tenzen-y commented Feb 6, 2024

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 6, 2024
@k8s-ci-robot k8s-ci-robot merged commit 8cf1893 into kubernetes-sigs:main Feb 6, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Feb 6, 2024
@tenzen-y tenzen-y deleted the add-beckoff-limit-for-requeue branch February 12, 2024 00:16
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
…ubernetes-sigs#1608)

* Add an exponential backoff mechanism to the requeueing strategy

Signed-off-by: tenzen-y <[email protected]>

* Rephrase 'maxBackOffRetry' with 'backOffLimit'

Signed-off-by: tenzen-y <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>

* Improve expressions

Signed-off-by: Yuki Iwai <[email protected]>

* Move backOffLimitTimeout to an alternative section

Signed-off-by: Yuki Iwai <[email protected]>

* Replace backOff with backoff

Signed-off-by: Yuki Iwai <[email protected]>

* Additional eviction reasons to story 2

Signed-off-by: Yuki Iwai <[email protected]>

* Update an API comment for backoffLimitCount

Signed-off-by: Yuki Iwai <[email protected]>

* Update story3

Signed-off-by: Yuki Iwai <[email protected]>

* Move backoffTimeout to an alternative section

Signed-off-by: Yuki Iwai <[email protected]>

* Update workload API

Signed-off-by: Yuki Iwai <[email protected]>

* Rephrase strory2

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: tenzen-y <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants