Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TAS: TopologyUngater #3266

Merged
merged 2 commits into from
Oct 22, 2024
Merged

Conversation

mimowo
Copy link
Contributor

@mimowo mimowo commented Oct 18, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Part of #2724

Special notes for your reviewer:

Done since the initial version:

  • use the TAS dedicated label to filiter out non-TAS pods (DONE)
  • expectations mechanism to prevent race conditions (DONE)

I have also tested the handlers already in the e2e prototype PR: #3218

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note-none Denotes a PR that doesn't merit a release note. labels Oct 18, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 18, 2024
Copy link

netlify bot commented Oct 18, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit bff9583
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/67167713329ee20008829e00

@mimowo
Copy link
Contributor Author

mimowo commented Oct 18, 2024

@mimowo mimowo force-pushed the tas-topology-ungater branch from 621c469 to 1671521 Compare October 18, 2024 17:01
@mimowo mimowo changed the title TAS: TopologyUngater WIP: TAS: TopologyUngater Oct 18, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 18, 2024
pkg/controller/tas/indexer.go Outdated Show resolved Hide resolved
}

func (h *podHandler) queueReconcileForPod(pod *corev1.Pod, q workqueue.TypedRateLimitingInterface[reconcile.Request]) {
if pod == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if pod == nil {
if pod == nil || !utilpod.HasGate(pod, kueuealpha.TopologySchedulingGate) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have refactored this a bit:

  • passing object instead of pod to the function to commonize between event handlers
  • I don't need to check the scheduling gate (it was just a workaround to check "is it a TAS pod")
  • proposed a new Label to use here: TAS: introduce dedicated TAS label #3271

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in #3271 (comment), we will go with the TAS label solution during the Alpha phase.

pkg/controller/tas/topology_ungater.go Outdated Show resolved Hide resolved
pkg/controller/tas/topology_ungater.go Outdated Show resolved Hide resolved
pkg/controller/tas/topology_ungater.go Outdated Show resolved Hide resolved
Comment on lines 143 to 162
if wl.Status.Admission == nil {
log.Info("workload is not admitted")
return reconcile.Result{}, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if wl.Status.Admission == nil {
log.Info("workload is not admitted")
return reconcile.Result{}, nil
}

Duplicated check with the predicates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to keep the duplicate checks to avoid panicing on race conditions:

  • the predicates schedule reconcile by workload key , but then the workload is evicted
  • the workload is recreated with the same key, but admitted without TAS

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I commented above, after refactoring, these verifications seem to be needed.

pkg/controller/tas/topology_ungater.go Outdated Show resolved Hide resolved
pkg/controller/tas/topology_ungater.go Outdated Show resolved Hide resolved
pkg/util/tas/tas.go Show resolved Hide resolved
pkg/util/testingjobs/pod/wrappers.go Outdated Show resolved Hide resolved
@mimowo mimowo force-pushed the tas-topology-ungater branch from 1671521 to a7569a3 Compare October 21, 2024 06:48
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 21, 2024
@mimowo mimowo changed the title WIP: TAS: TopologyUngater TAS: TopologyUngater Oct 21, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 21, 2024
@mimowo mimowo force-pushed the tas-topology-ungater branch 2 times, most recently from eafdc7f to 3d54a9a Compare October 21, 2024 12:38
# Conflicts:
#	go.mod
@mimowo mimowo force-pushed the tas-topology-ungater branch from 3d54a9a to c00c9ed Compare October 21, 2024 15:41
@mimowo mimowo force-pushed the tas-topology-ungater branch from c00c9ed to bff9583 Compare October 21, 2024 15:45
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole mechanism looks excellent. But I left trivial comments.
@mimowo Could you open a follow-up PR?
/lgtm
/approve

pkg/controller/tas/topology_ungater.go Show resolved Hide resolved

allToUngate := make([]podWithUngateInfo, 0)
for _, psa := range wl.Status.Admission.PodSetAssignments {
if psa.TopologyAssignment != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. After refactoring, podHandler has never been checked if the workload has topologyAssignment.

}
}
var err error
if len(allToUngate) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we can consolidate expectations and podWithUngateInfo into the TAS dedicated cache.
And then, we can decouple TASController including reconciler, eventHandler, and Predicator, and TASUngator including this ungating processes.
Becuase ideally, in general, we should not spawn the separate go routine in the reconciler. The separate Go routine could delay the next Reconcile and, the object information in reconcile queue could older.

Comment on lines 143 to 162
if wl.Status.Admission == nil {
log.Info("workload is not admitted")
return reconcile.Result{}, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I commented above, after refactoring, these verifications seem to be needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementing unit testing for the exposed function should be better.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 22, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 2fe42177c76b465861ca7a645f7b794f28f8d6c5

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 1ad442b into kubernetes-sigs:main Oct 22, 2024
15 of 16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.9 milestone Oct 22, 2024
@mimowo
Copy link
Contributor Author

mimowo commented Oct 22, 2024

@mimowo Could you open a follow-up PR?

Thanks, sure.

@mimowo
Copy link
Contributor Author

mimowo commented Oct 22, 2024

Though thinking about it I think the refactoring to split TopologyUngater will not be that trivial imo.

Please also note that we already were spawning new goroutines in the pod group integration, and goroutines are also spawned by the k8s code job controller when creating new pods.

I agree that ideally we can avoid it, but I would prefer to prioritize the functional follow ups first and e2e tests.

@tenzen-y
Copy link
Member

Please also note that we already were spawning new goroutines in the pod group integration, and goroutines are also spawned by the k8s code job controller when creating new pods.

I agree that ideally we can avoid it, but I would prefer to prioritize the functional follow ups first and e2e tests.

I agree. I do not claim decoupling should be done before 0.9.

PBundyra pushed a commit to PBundyra/kueue that referenced this pull request Nov 5, 2024
* TAS: TopologyUngater

# Conflicts:
#	go.mod

* review comments
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
* TAS: TopologyUngater

# Conflicts:
#	go.mod

* review comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants