TAS: validation for CQs using TAS #3320
Conversation
/retest
Force-pushed from 3de949f to 68ceed2
Force-pushed from 8eb92e2 to af4c6d8
if !features.Enabled(features.TopologyAwareScheduling) || len(c.tasFlavors) == 0 {
	return false
}
return c.HasParent() ||
Could we verify this during the validation?
I'm wondering if we can follow a similar approach to the MK managedBy validations:
clusterQueueName, ok := w.queues.ClusterQueueFromLocalQueue(queue.QueueKey(job.ObjectMeta.Namespace, localQueueName))
Well, I guess we could, but there are downsides:
- this mechanism would be per Job, so there would be plenty of code to remove once we start supporting these features
- it would be inconsistent with the other CQ validation scenarios, where we take the lazy validation approach
- it requires the webhooks to access the cache (yes, we did it for managedBy, but there it was a last-resort option because the managedBy field is immutable)
I was wondering if we could support the validations in the CQ, RF, and Topology webhooks.
Could you expand on the reason why the validations should be done in the Kueue-supported Job webhooks?
Oh, sorry, I was mistaken (I was thinking about the annotations we discussed in the morning). Indeed, it would be validation in the cq_webhook only for the scenarios in this PR.
Still, we already have mechanics to validate a CQ and deactivate it, so the approach implemented here is more consistent with what we have.
I discussed this with @mimowo offline.
Technically, we can validate these points in the CQ, RF, and Topology validating webhooks.
But the webhook validation solutions have some issues, like performance and complexity, due to cross-object validations. Additionally, these fields are not immutable, so we can assume the validations would be performed many times, which could cause cluster performance issues.
So, we decided to go with this non-webhook solution.
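For readers following along, here is a minimal, standalone sketch of the cache-side flow being described: validation messages are collected from the already-synced ClusterQueue and surfaced as an Active=False condition, instead of rejecting the object in a webhook. The helper name, condition fields, and message prefix are illustrative assumptions, not the actual Kueue identifiers; the real logic lives in pkg/cache/clusterqueue.go (see the inactiveReason hunk later in this thread).

```go
package main

import (
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildInactiveCondition is a hypothetical helper showing how collected
// validation messages could be turned into the Active=False condition that
// the ClusterQueue status controller publishes. The condition type, reason,
// and message prefix below are illustrative, not the exact Kueue strings.
func buildInactiveCondition(reason string, messages []string) metav1.Condition {
	return metav1.Condition{
		Type:    "Active",
		Status:  metav1.ConditionFalse,
		Reason:  reason,
		Message: "Can't admit new workloads: " + strings.Join(messages, "; "),
	}
}

func main() {
	// Example: a ClusterQueue that uses TAS flavors but also belongs to a cohort.
	cond := buildInactiveCondition(
		"NotSupportedWithTopologyAwareScheduling",
		[]string{"TAS is not supported for cohorts"},
	)
	fmt.Printf("%s=%s (%s): %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}
```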
pkg/cache/clusterqueue.go (outdated)
@@ -380,6 +420,7 @@ func (c *clusterQueue) updateWithAdmissionChecks(checks map[string]AdmissionChec
	slices.Sort(flavorIndependentCheckOnFlavors)
	slices.Sort(perFlavorMultiKueueChecks)
	multiKueueAdmissionChecksList := sets.List(multiKueueAdmissionChecks)
	provisioningChecksList := sets.List(provisioningAdmissionChecks)
Suggested change:
provisioningChecksList := sets.List(provisioningAdmissionChecks)
provisioningChecks := sets.List(provisioningAdmissionChecks)
Done, also renamed multiKueueAdmissionChecksList -> multiKueueChecks.
return c.HasParent() ||
	c.Preemption.WithinClusterQueue != kueue.PreemptionPolicyNever ||
	len(c.multiKueueAdmissionChecks) > 0 ||
	len(c.provisioningAdmissionChecks) > 0
}
Will we support these combinations of TAS and other features in the future (Beta or GA)?
It is in the graduation criteria for GA, but if we get requests from users, I think we can accelerate support for some of the features.
Oh, I missed the GA criteria. SGTM.
len(c.multiKueueAdmissionChecks) > 0 ||
	len(c.provisioningAdmissionChecks) > 0
Does this mean that we support using TAS in combination with other, third-party AdmissionCheck controllers?
It means we don't support validation OOTB; if you use an external admission check with TAS, you need to validate it yourself. We could maybe make the mechanism under #3106 flexible enough to cover TAS.
> We could maybe make the mechanism under #3106 flexible enough to cover TAS.

Let's investigate that possibility in the future.
But I guess that we can reject all CQs with any admissionCheck names (.spec.admissionChecks or .spec.admissionChecksStrategy), right?
If we did that, then it would be a stronger condition, meaning that TAS could not be used with some external ACs even if there is no issue; for example, an AC that checks a team's budget.
> For example, an AC that checks a team's budget.

That makes sense. In that case, we should reject only the CQs with ACs related to scheduling, like ProvisioningRequest and MultiKueue. Thanks.
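A small standalone illustration of the distinction drawn here: only checks owned by scheduling-related controllers (ProvisioningRequest, MultiKueue) would make a TAS ClusterQueue invalid, while arbitrary third-party checks, such as a budget check, remain allowed. The controller-name strings and the helper are written out literally below and should be read as assumptions for the sketch, not the exact Kueue constants.

```go
package main

import "fmt"

// Assumed controller names; the real constants live in the Kueue API
// package, the values here are illustrative.
const (
	multiKueueController   = "kueue.x-k8s.io/multikueue"
	provisioningController = "kueue.x-k8s.io/provisioning-request"
)

// admissionCheck is a simplified stand-in for the cached AdmissionCheck.
type admissionCheck struct {
	Name       string
	Controller string
}

// schedulingRelatedChecks returns only the checks whose controllers
// influence scheduling; other, third-party checks (e.g. a budget check)
// are left alone and stay usable together with TAS.
func schedulingRelatedChecks(checks []admissionCheck) []string {
	var out []string
	for _, ac := range checks {
		if ac.Controller == multiKueueController || ac.Controller == provisioningController {
			out = append(out, ac.Name)
		}
	}
	return out
}

func main() {
	checks := []admissionCheck{
		{Name: "prov-check", Controller: provisioningController},
		{Name: "budget-check", Controller: "example.com/team-budget"},
	}
	// Only "prov-check" is reported; "budget-check" does not block TAS.
	fmt.Println(schedulingRelatedChecks(checks))
}
```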
@@ -343,6 +379,7 @@ func (c *clusterQueue) updateWithAdmissionChecks(checks map[string]AdmissionChec
	checksPerController := make(map[string][]string, len(c.AdmissionChecks))
Could you extend the unit tests?
kueue/pkg/cache/clusterqueue_test.go, line 400 in 5203ac9:
func TestClusterQueueUpdate(t *testing.T) {
I would prefer to do this in a follow-up.
I actually tried to extend the unit tests, but they seem very unintuitive, because they cover functions without any user-facing impact (like updateClusterQueue and UpdateWithFlavors), so it is hard to interpret them. So, I thought the integration tests are more reliable here.
For the unit tests, my idea was to assert on the inactiveReason function and use updateClusterQueue for setup.
I'm ok with a follow-up.
As I mentioned here (https://github.com/kubernetes-sigs/kueue/pull/3320/files#r1819498096), let's update the spreadsheet.
I think this is already covered by the "Add unit test for CQ validation for unsupported configurations (the inactiveReason)" point. In this idea, ClusterQueueUpdate will be used for setup, and inactiveReason will actually be asserted.
@@ -315,6 +347,7 @@ func (c *clusterQueue) UpdateWithFlavors(flavors map[kueue.ResourceFlavorReferen

func (c *clusterQueue) updateLabelKeys(flavors map[kueue.ResourceFlavorReference]*kueue.ResourceFlavor) {
Could you extend the unit tests?
kueue/pkg/cache/clusterqueue_test.go, line 38 in 5203ac9:
func TestClusterQueueUpdateWithFlavors(t *testing.T) {
I think the unit tests here aren't very useful, as suggested in another comment. I would prefer to cover the inactiveReason function, as in the other comment.
if features.Enabled(features.TopologyAwareScheduling) && len(c.tasFlavors) > 0 {
	if c.HasParent() {
		reasons = append(reasons, kueue.ClusterQueueActiveReasonNotSupportedWithTopologyAwareScheduling)
		messages = append(messages, "TAS is not supported for cohorts")
	}
	if c.Preemption.WithinClusterQueue != kueue.PreemptionPolicyNever {
		reasons = append(reasons, kueue.ClusterQueueActiveReasonNotSupportedWithTopologyAwareScheduling)
		messages = append(messages, "TAS is not supported for preemption within cluster queue")
	}
	if len(c.multiKueueAdmissionChecks) > 0 {
		reasons = append(reasons, kueue.ClusterQueueActiveReasonNotSupportedWithTopologyAwareScheduling)
		messages = append(messages, "TAS is not supported with MultiKueue admission check")
	}
	if len(c.provisioningAdmissionChecks) > 0 {
		reasons = append(reasons, kueue.ClusterQueueActiveReasonNotSupportedWithTopologyAwareScheduling)
		messages = append(messages, "TAS is not supported with ProvisioningRequest admission check")
	}
}
Could you expand this unit test?
kueue/pkg/cache/clusterqueue_test.go, line 487 in 5203ac9:
func TestClusterQueueUpdateWithAdmissionCheck(t *testing.T) {
Or we may want to prepare a dedicated TestClusterQueueUpdateWithTAS.
The selected code actually comes from the inactiveReason function. My idea is to write a unit test for this function, using updateClusterQueue as setup. However, I would prefer this as a follow-up to focus on completing the functional PRs; the integration tests we have cover all the cases here.
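For reference, a sketch of the kind of table-driven test discussed above. Since the real setup would go through the cache's internal updateClusterQueue and assert on inactiveReason, this standalone version substitutes a toy tasMessages function with the same shape of output, only to illustrate the structure of the proposed follow-up test; the test and helper names are hypothetical.

```go
package sketch

import (
	"testing"

	"github.com/google/go-cmp/cmp"
)

// tasMessages is a toy stand-in for the inactiveReason logic quoted above;
// the real follow-up test would build the clusterQueue via updateClusterQueue
// and assert on what inactiveReason reports, rather than reimplementing the
// checks like this.
func tasMessages(hasParent bool, multiKueueChecks, provisioningChecks int) []string {
	var msgs []string
	if hasParent {
		msgs = append(msgs, "TAS is not supported for cohorts")
	}
	if multiKueueChecks > 0 {
		msgs = append(msgs, "TAS is not supported with MultiKueue admission check")
	}
	if provisioningChecks > 0 {
		msgs = append(msgs, "TAS is not supported with ProvisioningRequest admission check")
	}
	return msgs
}

func TestTASUnsupportedConfigurations(t *testing.T) {
	cases := map[string]struct {
		hasParent    bool
		multiKueue   int
		provisioning int
		want         []string
	}{
		"supported configuration": {},
		"cluster queue in a cohort": {
			hasParent: true,
			want:      []string{"TAS is not supported for cohorts"},
		},
		"provisioning admission check": {
			provisioning: 1,
			want:         []string{"TAS is not supported with ProvisioningRequest admission check"},
		},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			got := tasMessages(tc.hasParent, tc.multiKueue, tc.provisioning)
			if diff := cmp.Diff(tc.want, got); diff != "" {
				t.Errorf("unexpected messages (-want +got):\n%s", diff)
			}
		})
	}
}
```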
> However, I would prefer this as a follow-up to focus on completing the functional PRs; the integration tests we have cover all the cases here.

That makes sense. Let's update the spreadsheet.
Added as a new "Add unit test for CQ validation for unsupported configurations (the inactiveReason)" entry.
Thanks.
Force-pushed from af4c6d8 to 17f5909
Thanks!
/lgtm
/approve
LGTM label has been added. Git tree hash: ed890b171c815ca60668df142b2ecba418287ef9
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, tenzen-y

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
* TAS: soft validation for ClusterQueues
* Review remarks
What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Part of #2724
Special notes for your reviewer:
Does this PR introduce a user-facing change?