
KEP update: Allow replacement pods in groups of pods #1338

Merged

7 commits merged into kubernetes-sigs:main on Nov 29, 2023

Conversation

alculquicondor
Contributor

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Which issue(s) this PR fixes:

Part of #976

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Change-Id: I894785325d44ff3cc3287f9717e96add499a2b48
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/documentation Categorizes issue or PR as related to documentation. labels Nov 16, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alculquicondor
Contributor Author

cc @achernevskii

@k8s-ci-robot k8s-ci-robot requested review from mimowo and trasc November 16, 2023 20:46
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 16, 2023

netlify bot commented Nov 16, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 054a843 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/656510676749ff0008bc3cb8 |

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 16, 2023
Change-Id: I5b07d1ad71c99a23bdb612f1662df7dc8b991bec
@alculquicondor alculquicondor changed the title WIP KEP: Allow replacement pods in groups of pods KEP update: Allow replacement pods in groups of pods Nov 16, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 16, 2023
Change-Id: Ie0281d6ebd6a022abc85ae245a22b3282dd23881
@alculquicondor
Contributor Author

/hold

I'm rethinking whether having all Pods own the Workload is the best idea.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 17, 2023

### Retrying Failed Pods

The Pod group will generally only be considered finished if all the Pods finish
@ahaysx ahaysx Nov 18, 2023

What about pod groups that do not replace failed pods? We have this use case where a user does not allow retries (and I imagine it is a common batch case). So a pod group would be finished if all pods have exited, regardless of success or failure.

I think one way to do this is for every pod in the group to have the kueue.x-k8s.io/last-in-group: true annotation, if "group finished" does not mean the Workload is cleaned up or other running pods in the group are affected.

This seems weird given the name, but would this be the recommendation? Is it worth some kind of alternative annotation configuring this, like pod-group-mode: Batch or pod-group-retry: false?

Contributor Author

Just having one pod with last-in-group has the semantics you want: The pod group will be considered Finished when there are no more Running Pods and there is at least one pod with the annotation.

But I agree that the name is maybe not the best for this use case. I'm thinking of alternatives.

Definitely pod-group-mode: Batch is not accurate, as batch doesn't imply that retries are not possible.

pod-group-retry doesn't quite fit the semantics I originally had in mind.
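To make those semantics concrete, here is a minimal Go sketch of the Finished check described above: the group is done when no Pod is still Running and at least one Pod carries the annotation. The types and function names are hypothetical stand-ins, not Kueue's actual implementation.

```go
package main

import "fmt"

// pod is a minimal stand-in for a Pod in the group (not Kueue's types).
type pod struct {
	running     bool
	lastInGroup bool // carries kueue.x-k8s.io/last-in-group: "true"
}

// groupFinished sketches the semantics described above: the group is
// considered Finished when no Pod is Running and at least one Pod
// carries the last-in-group annotation.
func groupFinished(pods []pod) bool {
	anyRunning, anyLast := false, false
	for _, p := range pods {
		anyRunning = anyRunning || p.running
		anyLast = anyLast || p.lastInGroup
	}
	return !anyRunning && anyLast
}

func main() {
	fmt.Println(groupFinished([]pod{{running: false, lastInGroup: true}})) // true: all exited, annotation present
	fmt.Println(groupFinished([]pod{{running: true, lastInGroup: true}}))  // false: a pod is still running
}
```

Note how annotating every pod, as suggested above, naturally yields the no-retries behavior: the first check to pass after all pods exit marks the group Finished.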

Contributor Author

I decided to go with retriable-in-group: false, similar to your proposal.
But I also added another mechanism for termination: just delete the Workload.
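A pod in a non-retriable group would then carry the chosen annotation. A minimal sketch, assuming the `kueue.x-k8s.io/` prefix carries over from the `last-in-group` example above; the pod name and image are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker-0
  annotations:
    # Group finishes when all pods exit, regardless of success or failure.
    kueue.x-k8s.io/retriable-in-group: "false"
spec:
  containers:
  - name: worker
    image: registry.example.com/batch-worker:latest
```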

Change-Id: Ia36c32fdf6862881fe0002d3b565c11da5978a54
keps/976-plain-pods/README.md Show resolved Hide resolved
Comment on lines 559 to 560
Note that we are only removing Pod finalizers once the Workload is finished or if the Pods are
Failed. This is a simple way of managing finalizers, but it might lead to too many Pods lingering
Contributor

Suggested change

```diff
-Note that we are only removing Pod finalizers once the Workload is finished or if the Pods are
-Failed. This is a simple way of managing finalizers, but it might lead to too many Pods lingering
+Note that we are only removing Pod finalizers once the Workload is finished or if the Pods are
+Failed and replaced. This is a simple way of managing finalizers, but it might lead to too many Pods lingering
```

Contributor Author

Oops, I actually wanted to say that we remove finalizers only when the workload is finished.

Change-Id: If5cf5dc60dfbf1f7afb45beee974e92bffa979b8
Change-Id: I34fbfdac3e056f559167795938c3c2cab1f41977
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 21, 2023
@alculquicondor
Copy link
Contributor Author

@tenzen-y @ahaysx any additional feedback?

@alculquicondor
Copy link
Contributor Author

oh, and @nstogner

@tenzen-y
Member

/assign

@ahaysx

ahaysx commented Nov 23, 2023

lgtm, thanks Aldo

Member

@tenzen-y tenzen-y left a comment

I'm wondering if it is possible to make the conditions that mark the job as failed configurable.
In a group of pods, users may want to mark the job as failed if the driver pod fails to start.

However, I'm not sure if it would be worth it. Maybe we should suggest users migrate to batch/job or custom job.

keps/976-plain-pods/README.md Show resolved Hide resolved
Comment on lines +436 to +438
Note that fields like `env` and `command` can sometimes change among all the pods of a group and
they don't influence scheduling, so they are safe to skip. `volumes` can influence scheduling, but
they can be parameterized, like in StatefulSets, so we will ignore them for now.
Member

Currently, we compare the entire containers to verify if the existing workload matches the desired workload:

```go
func comparePodTemplate(a, b *corev1.PodSpec) bool {
	if !equality.Semantic.DeepEqual(a.InitContainers, b.InitContainers) {
		return false
	}
	return equality.Semantic.DeepEqual(a.Containers, b.Containers)
}
```

So, I'm wondering if we should update envs and commands. WDYT?

Contributor Author

We could, but I'm not sure about the usefulness.

In Pod groups, it is useful because each Pod might have a slightly different spec. But in Jobs, there is only one template.

But that is a separate discussion nevertheless.

Member

That sounds reasonable. I may start a separate discussion about this.
Thanks.

Comment on lines +547 to +548
Workload. If there is an existing Workload in the cache and it has smaller Pod counters than the
in-memory Workload, then it is considered unmatching and the Workload is evicted.
Member

What happens if a cached existing Workload has larger pod counters than the in-memory Workload?
Will the reconciler evict the Workload the same way as in the smaller case?

Contributor Author

We only evict if the counters in the Workload are smaller, not larger.
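In other words, the check is one-directional. A hypothetical Go sketch of that comparison, using local types rather than Kueue's actual implementation:

```go
package main

import "fmt"

// podCounts maps a role (pod set name) to its Pod count.
type podCounts map[string]int

// hasSmallerCounters sketches the eviction rule described above: the cached
// Workload is considered unmatching, and therefore evicted, only when one of
// its counters is smaller than the counters rebuilt in memory from the Pods.
// Larger or equal counters do not trigger eviction.
func hasSmallerCounters(cached, inMemory podCounts) bool {
	for name, want := range inMemory {
		if cached[name] < want { // missing names count as zero
			return true
		}
	}
	return false
}

func main() {
	cached := podCounts{"worker": 3}
	fmt.Println(hasSmallerCounters(cached, podCounts{"worker": 4})) // true: smaller, evict
	fmt.Println(hasSmallerCounters(cached, podCounts{"worker": 2})) // false: larger, keep
}
```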

Member

It makes sense. Thanks.

keps/976-plain-pods/README.md Show resolved Hide resolved
keps/976-plain-pods/README.md Show resolved Hide resolved
Change-Id: I971e0e87fb09ad9fba8a3805a61d2cc267b29bda
@alculquicondor
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 27, 2023
@alculquicondor
Contributor Author

@tenzen-y anything to add before merging?

Member

@tenzen-y tenzen-y left a comment

Thanks!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 29, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: d235a6b17f4db2dfeab923b528fcb2a1a63778d8

@k8s-ci-robot k8s-ci-robot merged commit 8431cbd into kubernetes-sigs:main Nov 29, 2023
3 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Nov 29, 2023
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
…#1338)

* KEP: Simpler algorithm for admitting groups of Pods

Change-Id: I894785325d44ff3cc3287f9717e96add499a2b48

* Clarify that workload is automatically cleaned up

Change-Id: I5b07d1ad71c99a23bdb612f1662df7dc8b991bec

* Last attempt annotation and reclaimable quota

Change-Id: Ie0281d6ebd6a022abc85ae245a22b3282dd23881

* Simplify design

Change-Id: Ia36c32fdf6862881fe0002d3b565c11da5978a54

* fix note about failed pods

Change-Id: If5cf5dc60dfbf1f7afb45beee974e92bffa979b8

* Clarify that some fields will be excluded

Change-Id: I34fbfdac3e056f559167795938c3c2cab1f41977

* Add dynamic reclaim for non retriable groups

Change-Id: I971e0e87fb09ad9fba8a3805a61d2cc267b29bda