KEP: Add an exponential backoff mechanism to the requeueing strategy #1608
Due to the failure to set up the webhook server:
/test pull-kueue-test-e2e-main-1-29
//
// Defaults to null.
// +optional
BackOffLimitCount *int32 `json:"backOffLimitCount,omitempty"`
Please describe the implicit behavior of the backoff. What is the base backoff, and what is the exponent?
It makes sense.
// When a deactivated workload is reactivated, this count is reset to 0.
//
// +optional
RequeuedCount *int32 `json:"requeuedCount,omitempty"`
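The interaction between requeuedCount and backOffLimitCount can be sketched as follows. The helper requeueDecision is hypothetical, and two points are assumptions rather than statements from this thread: that a null limit means "requeue without limit", and that the workload is deactivated once the count exceeds the limit.

```go
package main

import "fmt"

// requeueDecision is a hypothetical helper. It increments the requeue count
// and reports whether the workload should be deactivated.
// Assumption: a nil backoffLimitCount means there is no limit.
// Assumption: deactivation happens when the new count exceeds the limit.
func requeueDecision(requeuedCount, backoffLimitCount *int32) (next int32, deactivate bool) {
	if requeuedCount != nil {
		next = *requeuedCount
	}
	next++
	if backoffLimitCount != nil && next > *backoffLimitCount {
		return next, true
	}
	return next, false
}

func main() {
	limit := int32(2)
	var count *int32 // nil: the workload has never been requeued
	for i := 0; i < 3; i++ {
		next, deactivate := requeueDecision(count, &limit)
		fmt.Printf("requeuedCount=%d deactivate=%t\n", next, deactivate)
		count = &next
	}
}
```

Resetting the count on reactivation, as the field comment describes, would then amount to setting requeuedCount back to 0 before the next call.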
How about adding the last requeue / backoff timestamp and the backoff duration, so it's possible to know when the workload will be "retried", without recomputing the exponential backoff logic for an external observer?
+1
I wanted to keep the API as minimal as possible, but I agree that usability can be compromised without a maxBackoff
SGTM
I will extend this API.
can you add it? We are almost closing the release.
I will do it ASAP.
@alculquicondor Here is a quick sketch of the API design. Please let me know if it is not what you expected.
type WorkloadStatus struct {
	...
	// requeueState records
	//
	// +optional
	RequeueState *RequeueState `json:"requeueState,omitempty"`
}

type RequeueState struct {
	// count records the number of times a workload has been requeued.
	// When a deactivated workload is reactivated, this count is reset to 0.
	//
	// +optional
	Count *int32 `json:"requeuedCount,omitempty"`
	// +optional
	LastBackoffTime metav1.Time `json:"lastRequeuedTime,omitempty"`
	// +optional
	BackoffDuration metav1.Duration `json:"backoffDuration,omitempty"`
}
I'm organizing the above API.
Instead of BackoffDuration and lastBackoffTime, it might be better to have just requeueAt metav1.Time that indicates the time at which Kueue will consider this workload again for admission.
The time when the workload was evicted is already visible in the condition.
instead of BackoffDuration and lastBackoffTime, it might be better to have just requeueAt metav1.Time that indicates the time at which Kueue will consider this workload again for admission.
The time when the workload was evicted is already visible in the condition.
That makes sense. Even if we add only requeueAt, we can avoid recomputing the exponential backoff logic.
If user feedback shows that we truly need the duration, we can extend the API later.
Done.
1. The workload don't have the proper configurations like image pull credential and pvc name, etc.
2. The cluster can meet flavorQuotas, but each node doesn't have the resources that each podSet requests.
3. Multiple flavors are matched for the workload, but the workload can not be launched on the backed flavors (which means non-primary flavor).
was this always there? I don't really understand it.
I added this story based on https://github.com/kubernetes-sigs/kueue/pull/1608/files#r1460075048.
oh right. Let me rephrase:
If there are multiple resource flavors that match the workload (for example, flavors 1 & 2)
and the workload was running on flavor 2, it's likely that the workload will be readmitted
on the same flavor indefinitely.
Thank you for this suggestion!
Done.
The only pending item is here: #1608 (comment)
/lgtm
LGTM label has been added. Git tree hash: 3982a805d53cc49ef1ad9117f0ced1a31f9cfed8
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alculquicondor, tenzen-y
/hold cancel
…ubernetes-sigs#1608)
* Add an exponential backoff mechanism to the requeueing strategy
  Signed-off-by: tenzen-y <[email protected]>
* Rephrase 'maxBackOffRetry' with 'backOffLimit'
  Signed-off-by: tenzen-y <[email protected]>
  Signed-off-by: Yuki Iwai <[email protected]>
* Improve expressions
  Signed-off-by: Yuki Iwai <[email protected]>
* Move backOffLimitTimeout to an alternative section
  Signed-off-by: Yuki Iwai <[email protected]>
* Replace backOff with backoff
  Signed-off-by: Yuki Iwai <[email protected]>
* Additional eviction reasons to story 2
  Signed-off-by: Yuki Iwai <[email protected]>
* Update an API comment for backoffLimitCount
  Signed-off-by: Yuki Iwai <[email protected]>
* Update story3
  Signed-off-by: Yuki Iwai <[email protected]>
* Move backoffTimeout to an alternative section
  Signed-off-by: Yuki Iwai <[email protected]>
* Update workload API
  Signed-off-by: Yuki Iwai <[email protected]>
* Rephrase strory2
  Signed-off-by: Yuki Iwai <[email protected]>
---------
Signed-off-by: tenzen-y <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Which issue(s) this PR fixes:
Part-of: #1282
Special notes for your reviewer:
Does this PR introduce a user-facing change?