KEP update: Allow replacement pods in groups of pods #1338
Conversation
Change-Id: I894785325d44ff3cc3287f9717e96add499a2b48
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alculquicondor. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
Change-Id: I5b07d1ad71c99a23bdb612f1662df7dc8b991bec
Force-pushed from 4913c48 to 88aa7d5.
Change-Id: Ie0281d6ebd6a022abc85ae245a22b3282dd23881
Force-pushed from 88aa7d5 to f1b5f78.
/hold I'm rethinking whether all Pods owning the Workload is the best idea.
keps/976-plain-pods/README.md (Outdated)

### Retrying Failed Pods

The Pod group will generally only be considered finished if all the Pods finish
What about pod groups that do not replace failed pods? We have a use case where a user does not allow retries (and I imagine this is a common batch case). In that case, a pod group would be finished once all pods have exited, regardless of success or failure.
I think one way to do this is for every pod in the group to have the kueue.x-k8s.io/last-in-group: true annotation, assuming "group finished" does not mean the Workload is cleaned up or other running pods in the group are affected. This seems weird given the name, though, so would that be the recommendation? Is it worth some kind of alternative annotation to configure this, like pod-group-mode: Batch or pod-group-retry: false?
Just having one pod with last-in-group has the semantics you want: the pod group will be considered Finished when there are no more Running Pods and there is at least one pod with the annotation.
But I agree that the name is maybe not the best fit for this use case. I'm thinking of alternatives. pod-group-mode: Batch is definitely not accurate, as batch doesn't imply that retries are impossible, and pod-group-retry doesn't quite fit the semantics I originally wanted.
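As a rough illustration of that semantics, a Go sketch of the check might look like the following; the helper name and the exact handling of pod phases are assumptions for illustration, not part of the KEP.

```go
// Sketch only (not the KEP implementation): a group is Finished when no pod is
// still Running and at least one pod carries the last-in-group annotation.
package sketch

import corev1 "k8s.io/api/core/v1"

const lastInGroupAnnotation = "kueue.x-k8s.io/last-in-group"

func isGroupFinished(pods []corev1.Pod) bool {
	sawLastInGroup := false
	for _, p := range pods {
		if p.Status.Phase == corev1.PodRunning {
			// A pod is still running, so the group cannot be Finished yet.
			return false
		}
		if p.Annotations[lastInGroupAnnotation] == "true" {
			sawLastInGroup = true
		}
	}
	return sawLastInGroup
}
```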
I decided to go with retriable-in-group: false, similar to your proposal.
But I also added another mechanism for termination: just delete the Workload.
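For illustration only, a pod in a non-retriable group might be constructed like this; the full annotation key is an assumption based on the kueue.x-k8s.io prefix used for last-in-group, and is not confirmed by this thread.

```go
// Hypothetical example: building a pod that opts its group out of retries.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func nonRetriablePod(name string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: name,
			Annotations: map[string]string{
				// Assumed key: signals that failed pods in this group are not replaced.
				"kueue.x-k8s.io/retriable-in-group": "false",
			},
		},
		Spec: corev1.PodSpec{ /* containers omitted for brevity */ },
	}
}
```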
Change-Id: Ia36c32fdf6862881fe0002d3b565c11da5978a54
keps/976-plain-pods/README.md (Outdated)

Note that we are only removing Pod finalizers once the Workload is finished or if the Pods are
Failed. This is a simple way of managing finalizers, but it might lead to too many Pods lingering
Suggested change:

Note that we are only removing Pod finalizers once the Workload is finished or if the Pods are
Failed and replaced. This is a simple way of managing finalizers, but it might lead to too many Pods lingering
Oops, I actually wanted to say that we remove finalizers only when the workload is finished.
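To make the clarified rule concrete, here is a hedged sketch of a controller helper that releases pod finalizers only once the Workload is finished; the finalizer name and the "Finished" condition type are placeholders, not the KEP's API.

```go
// Sketch only: remove the kueue pod finalizer after the owning Workload finishes.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

const podFinalizer = "kueue.x-k8s.io/managed" // assumed finalizer name

func cleanupPodFinalizers(ctx context.Context, c client.Client, wl *kueue.Workload, pods []corev1.Pod) error {
	// Keep finalizers in place until the Workload itself reports Finished.
	if !apimeta.IsStatusConditionTrue(wl.Status.Conditions, "Finished") {
		return nil
	}
	for i := range pods {
		if controllerutil.RemoveFinalizer(&pods[i], podFinalizer) {
			if err := c.Update(ctx, &pods[i]); err != nil {
				return err
			}
		}
	}
	return nil
}
```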
Change-Id: If5cf5dc60dfbf1f7afb45beee974e92bffa979b8
Force-pushed from c0f87ba to e0a3af3.
Change-Id: I34fbfdac3e056f559167795938c3c2cab1f41977
oh, and @nstogner
/assign
lgtm, thanks Aldo
I'm wondering if it is possible to configure conditions that say the job failed. In a group of pods, users may want to mark the job as a failure if the driver pod fails to start.
However, I'm not sure if it would be worth it. Maybe we should suggest that users migrate to batch/Job or a custom job.
Note that fields like `env` and `command` can sometimes change among all the pods of a group and
they don't influence scheduling, so they are safe to skip. `volumes` can influence scheduling, but
they can be parameterized, like in StatefulSets, so we will ignore them for now.
Currently, we compare the entire container specs to verify whether the existing workload matches the desired workload:
kueue/pkg/util/equality/podset.go
Lines 29 to 34 in 1ecd79f
func comparePodTemplate(a, b *corev1.PodSpec) bool {
	if !equality.Semantic.DeepEqual(a.InitContainers, b.InitContainers) {
		return false
	}
	return equality.Semantic.DeepEqual(a.Containers, b.Containers)
}
So, I'm wondering if we should update envs and commands. WDYT?
We could, but I'm not sure about the usefulness. In Pod groups it is useful because each Pod might have a slightly different spec, but in Jobs there is only one template.
That is a separate discussion, nevertheless.
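As a sketch of what skipping those fields could look like for pod groups (not the current implementation, and the helper name is made up), the comparison could clear env and command on copies before the semantic DeepEqual:

```go
// Illustrative variant of comparePodTemplate that ignores env and command, so
// pods of a group that differ only in those fields still match the stored spec.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/equality"
)

func comparePodSpecIgnoringEnvAndCommand(a, b *corev1.PodSpec) bool {
	strip := func(in *corev1.PodSpec) *corev1.PodSpec {
		out := in.DeepCopy()
		for i := range out.InitContainers {
			out.InitContainers[i].Env = nil
			out.InitContainers[i].Command = nil
		}
		for i := range out.Containers {
			out.Containers[i].Env = nil
			out.Containers[i].Command = nil
		}
		return out
	}
	return equality.Semantic.DeepEqual(strip(a), strip(b))
}
```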
That sounds reasonable. I may start a separate discussion about this. Thanks.
Workload. If there is an existing Workload in the cache and it has smaller Pod counters than the
in-memory Workload, then it is considered unmatching and the Workload is evicted.
What happens if a cached existing Workload has larger Pod counters than the in-memory Workload? Will the reconciler evict the Workload the same way as in the smaller case?
We only evict if the counters in the Workload are smaller, not larger.
It makes sense. Thanks.
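A hedged sketch of that rule, with names and field access assumed for illustration: the cached Workload is treated as non-matching only when one of its PodSet counts is smaller than the one rebuilt from the live pods.

```go
// Sketch only: report whether the cached Workload has any PodSet count that is
// smaller than the in-memory Workload's; larger counts do not trigger eviction.
package sketch

import kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"

func hasSmallerPodCounts(existing, inMemory *kueue.Workload) bool {
	want := map[string]int32{}
	for _, ps := range inMemory.Spec.PodSets {
		want[string(ps.Name)] = ps.Count
	}
	for _, ps := range existing.Spec.PodSets {
		if ps.Count < want[string(ps.Name)] {
			return true
		}
	}
	return false
}
```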
Change-Id: I971e0e87fb09ad9fba8a3805a61d2cc267b29bda
/hold cancel
@tenzen-y anything to add before merging?
Thanks!
/lgtm
LGTM label has been added. Git tree hash: d235a6b17f4db2dfeab923b528fcb2a1a63778d8
…#1338)

* KEP: Simpler algorithm for admitting groups of Pods (Change-Id: I894785325d44ff3cc3287f9717e96add499a2b48)
* Clarify that workload is automatically cleaned up (Change-Id: I5b07d1ad71c99a23bdb612f1662df7dc8b991bec)
* Last attempt annotation and reclaimable quota (Change-Id: Ie0281d6ebd6a022abc85ae245a22b3282dd23881)
* Simplify design (Change-Id: Ia36c32fdf6862881fe0002d3b565c11da5978a54)
* fix note about failed pods (Change-Id: If5cf5dc60dfbf1f7afb45beee974e92bffa979b8)
* Clarify that some fields will be excluded (Change-Id: I34fbfdac3e056f559167795938c3c2cab1f41977)
* Add dynamic reclaim for non retriable groups (Change-Id: I971e0e87fb09ad9fba8a3805a61d2cc267b29bda)
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Which issue(s) this PR fixes:
Part of #976
Special notes for your reviewer:
Does this PR introduce a user-facing change?