Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TAS: validation for CQs using TAS #3320

Merged
merged 2 commits into from
Oct 29, 2024

Conversation

mimowo
Copy link
Contributor

@mimowo mimowo commented Oct 25, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Part of #2724

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. labels Oct 25, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 25, 2024
Copy link

netlify bot commented Oct 25, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 17f5909
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/671fc5a73b68f70008e7e0ca

@mimowo
Copy link
Contributor Author

mimowo commented Oct 25, 2024

/retest

@mimowo mimowo force-pushed the tas-validate-clusterqueue branch 2 times, most recently from 3de949f to 68ceed2 Compare October 28, 2024 07:46
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 28, 2024
@mimowo mimowo force-pushed the tas-validate-clusterqueue branch 2 times, most recently from 8eb92e2 to af4c6d8 Compare October 28, 2024 09:48
@mimowo mimowo changed the title WIP TAS: validation for CQs using TAS TAS: validation for CQs using TAS Oct 28, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 28, 2024
@mimowo
Copy link
Contributor Author

mimowo commented Oct 28, 2024

/cc @PBundyra @gabesaba @tenzen-y

Comment on lines +332 to +335
if !features.Enabled(features.TopologyAwareScheduling) || len(c.tasFlavors) == 0 {
return false
}
return c.HasParent() ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we verify this during the validation?
I'm wondering if we can follow the similar approach with the MK managedBy validations:

clusterQueueName, ok := w.queues.ClusterQueueFromLocalQueue(queue.QueueKey(job.ObjectMeta.Namespace, localQueueName))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I guess we could but there are downsides:

  1. this mechanism would be per Job, so plenty of code which is going to be removed when we start to support the features
  2. it would be inconsistent with other CQ validation scenarios when we take the lazy validation approach
  3. it requires the webhooks to access the cache (yes, we did it for managedBy, but there it was last resort option because the managedBy field is immutable)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we could support the validations in the CQ, RF, and Topology webhook.
Could you expand the reason why the validations should be done in the Kueue supported Jobs webhooks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, I was mistaken (was thinking about the annotations somehow we discussed in the morning). Indeed, it would be validation in the cq_webhook only for the scenarios in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, we already have mechanics to validate CQ and deactivate, so the approach implemented here is more consistent with what we have.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed this with @mimowo offline.
Technically, we can validate these points in the CQ, RF, and Topology validating webhooks.

But, the webhook validation solutions have some issues like performance and complexity due to cross Object validations. Those fields are especially not immutable, and we can assume the validations will be performed many times. That could bring us cluster performance issues.

So, we decided to select this non-webhook solutions.

@@ -380,6 +420,7 @@ func (c *clusterQueue) updateWithAdmissionChecks(checks map[string]AdmissionChec
slices.Sort(flavorIndependentCheckOnFlavors)
slices.Sort(perFlavorMultiKueueChecks)
multiKueueAdmissionChecksList := sets.List(multiKueueAdmissionChecks)
provisioningChecksList := sets.List(provisioningAdmissionChecks)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
provisioningChecksList := sets.List(provisioningAdmissionChecks)
provisioningChecks := sets.List(provisioningAdmissionChecks)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, also renamed multiKueueAdmissionChecksList -> multiKueueChecks

Comment on lines +335 to +339
return c.HasParent() ||
c.Preemption.WithinClusterQueue != kueue.PreemptionPolicyNever ||
len(c.multiKueueAdmissionChecks) > 0 ||
len(c.provisioningAdmissionChecks) > 0
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we support these combinations between TAS and other features in the future (Beta or GA)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is in graduation criteria for GA, but if we get requests from users I think we can accelerate support for some of the features.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed the GA criteria. SGTM.

Comment on lines +337 to +338
len(c.multiKueueAdmissionChecks) > 0 ||
len(c.provisioningAdmissionChecks) > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that we support TAS collaboration with other third-party AdmissionCheck controllers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means we don't support validation OOTB, if you use an external admission check with TAS you need to validate yourself. We could maybe make the mechanism under #3106 flexible enough to cover TAS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could maybe make the mechanism under #3106 flexible enough to cover TAS.

Let's investigate the possibility of this in the future.
But, I guess that we can reject all CQs with any admissionCheck names (.spec.admissionChecks or .spec.admissionChecksStrategy), right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we did that, then it would be stronger condition - meaning that TAS cannot be used with some external ACs, even if there is no issue. For example, the AC checks team's budget.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, the AC checks team's budget.

That makes sense. In that case, we should reject the CQs with AC only related to the scheduling like ProvisioningRequest and MultiKueue. Thanks.

@@ -343,6 +379,7 @@ func (c *clusterQueue) updateWithAdmissionChecks(checks map[string]AdmissionChec
checksPerController := make(map[string][]string, len(c.AdmissionChecks))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you extend the unit testing?

func TestClusterQueueUpdate(t *testing.T) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer in a follow up.

I actually tried to extend the unit tests but they seem very unintuitive, because they cover functions without any-user facing impact, so it is hard to interpret them, like: updateClusterQueue, UpdateWithFlavors. So, I thought the integration tests are more reliable here.

I would prefer unit tests here. My idea was to assert on inactiveReason function, and use updateClusterQueue for setup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with a follow-up.
As I mentioned here (https://github.com/kubernetes-sigs/kueue/pull/3320/files#r1819498096), let's update the spreadsheet.

Copy link
Contributor Author

@mimowo mimowo Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is already covered by the "Add unit test for CQ validation for unsupported configurations (the inactiveReason)" point. In this idea ClusterQueueUpdate will be used for setup, and inactiveReason will be actually asserted.

@@ -315,6 +347,7 @@ func (c *clusterQueue) UpdateWithFlavors(flavors map[kueue.ResourceFlavorReferen

func (c *clusterQueue) updateLabelKeys(flavors map[kueue.ResourceFlavorReference]*kueue.ResourceFlavor) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you extend the unit testing?

func TestClusterQueueUpdateWithFlavors(t *testing.T) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the unit tests here aren't very useful as suggested in another comment. I would prefer to cover the inactiveReason function as in the other comment.

Comment on lines +303 to +321
if features.Enabled(features.TopologyAwareScheduling) && len(c.tasFlavors) > 0 {
if c.HasParent() {
reasons = append(reasons, kueue.ClusterQueueActiveReasonNotSupportedWithTopologyAwareScheduling)
messages = append(messages, "TAS is not supported for cohorts")
}
if c.Preemption.WithinClusterQueue != kueue.PreemptionPolicyNever {
reasons = append(reasons, kueue.ClusterQueueActiveReasonNotSupportedWithTopologyAwareScheduling)
messages = append(messages, "TAS is not supported for preemption within cluster queue")
}
if len(c.multiKueueAdmissionChecks) > 0 {
reasons = append(reasons, kueue.ClusterQueueActiveReasonNotSupportedWithTopologyAwareScheduling)
messages = append(messages, "TAS is not supported with MultiKueue admission check")
}
if len(c.provisioningAdmissionChecks) > 0 {
reasons = append(reasons, kueue.ClusterQueueActiveReasonNotSupportedWithTopologyAwareScheduling)
messages = append(messages, "TAS is not supported with ProvisioningRequest admission check")
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand this unit testing?

func TestClusterQueueUpdateWithAdmissionCheck(t *testing.T) {

Or, we may want to prepare the dedicated TestClusterQueueUpdateWithTAS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The selected code actually comes from the inactiveReason function. My idea is to write a unit test for this function, using updateClusterQueue as setup. However, I would prefer this as follow up to focus on completing the functional PRs, the integration tests we have cover all the cases here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I would prefer this as follow up to focus on completing the functional PRs, the integration tests we have cover all the cases here.

That makes sense. Let's update the spread sheet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added as a new "Add unit test for CQ validation for unsupported configurations (the inactiveReason)" entry

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@mimowo mimowo force-pushed the tas-validate-clusterqueue branch from af4c6d8 to 17f5909 Compare October 28, 2024 17:11
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 29, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: ed890b171c815ca60668df142b2ecba418287ef9

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 038a563 into kubernetes-sigs:main Oct 29, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.9 milestone Oct 29, 2024
PBundyra pushed a commit to PBundyra/kueue that referenced this pull request Nov 5, 2024
* TAS: soft validation for ClusterQueues

* Review remarks
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
* TAS: soft validation for ClusterQueues

* Review remarks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants