Topology Aware Scheduling KEP #2725
Conversation
Tracking of nodes by Kueue will increase its code complexity and memory
consumption.

## Alternatives
Could you add some comments about how this would interface with scheduler- or kubelet-based policies?
https://scheduler-plugins.sigs.k8s.io/docs/kep/119-node-resource-topology-aware-scheduling/readme/
At the end of the day, users can specify this at the Kueue level, but if the scheduler or kubelet are not set up accordingly, I don't see how you can guarantee this at the Kueue level.
Maybe there is an unfortunate naming coincidence, but I don't think the feature in Kueue is related to that plugin. We don't require it to run. Also, this feature is not related to NUMA nodes. Let me know if I'm missing something.
The aim of this feature is to provide compact placement of pods on nodes, using information about node placement in the data center.
So this feature is not really about topology on a node (which Topology Manager or topology-aware scheduling would take into account). You are proposing inter-node topology-aware scheduling (node-to-node placement). I guess that is the distinction.
Wearing my Kueue follower hat, I think this is a neat feature.
Wearing my OpenShift hat, I am concerned that this feature supports a particular cloud vendor without really publishing how one can label nodes so that the feature can be used elsewhere.
Are you planning to propose official labels/annotations in k8s to walk through how other cloud providers (or on-prem admins) can actually set up their clusters so that this feature is usable?
Don't take this as blocking, but I would like to see some work done to formalize how a cluster should be labeled. Otherwise I don't know how other vendors or on-prem clusters could use this feature.
I attended a wg-serving meeting earlier today where someone was talking about the ways data centers are laid out and what is necessary for higher-speed interconnects. I think the point of that meeting was to start defining what kinds of labels/information would be necessary.
I think this is a hard problem.
Networking in data centers is complex, for sure. Agreed, this proposal may not reflect every nuance of data center networking, especially since the details, such as the depth of the hierarchy, will differ between data centers of different cloud providers. The details may even differ between multiple data centers of the same cloud provider, and may also differ between machine types: TPU vs GPU.
This is the first iteration of support for node-to-node Topology Aware Scheduling in Kueue. Every journey needs to start with a first step. Thus, the proposed model is deliberately simplified so that it can work as a common denominator between different cloud providers.
Note that the label names cloud-provider.com/topology-block and cloud-provider.com/topology-rack used in the Hierarchy organization section are just examples; they can be customized in the admin-facing API for a ResourceFlavor.
The only thing that matters for the model is that there is some hierarchy of nodes in a data center, which is almost certainly true. Since almost every data center has some concept of racks and blocks, I use such labels to appeal to intuition, but the naming is flexible. The mechanics of exposing this information as labels are left to cloud providers.
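To make the customization concrete, here is a minimal Go sketch, assuming the TopologyLevel type discussed later in this thread (the label values are placeholders an admin would replace with provider-specific labels):

```go
package main

import "fmt"

// TopologyLevel mirrors the API type sketched later in this thread; it is
// treated here as a standalone type purely for illustration.
type TopologyLevel struct {
	NodeLabel string `json:"nodeLabel,omitempty"`
}

func main() {
	// Example label names only; admins substitute whatever labels their cloud
	// provider (or on-prem tooling) sets on nodes, ordered from the broadest
	// domain (block) down to the narrowest (rack).
	levels := []TopologyLevel{
		{NodeLabel: "cloud-provider.com/topology-block"},
		{NodeLabel: "cloud-provider.com/topology-rack"},
	}
	for i, level := range levels {
		fmt.Printf("level %d: %s\n", i+1, level.NodeLabel)
	}
}
```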
LGTM
FYI to all reviewers: I have updated a couple of sections of the KEP based on the offline review and discussions with @alculquicondor and @mwielgus. PTAL.
Then, we compute the assignments per pod sets corresponding to ReplicatedJobs, rather than for the entire PodSet. Finally, the `PodSetUngater` ungates pods per ReplicatedJob assignment.
Don't we also need changes to the `TopologyAssignment`?
How can the user specify that they want each replicated job to be in a particular topology domain?
> Don't we also need changes to the `TopologyAssignment`?

I don't think this is needed, as the TopologyAssignment contains a list of PodSetAssignments. If there are 2 ReplicatedJobs, there would be two items on the list. An alternative API would be a ReplicatedJobAssignment holding the ReplicatedJobKey and the PodSetAssignment. However, we would need to do it this way from day 0, so I proposed the incremental approach.
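For concreteness, a sketch of what that alternative could have looked like (the type and field names are guesses based on the comment above, not a real Kueue API):

```go
package kueue

// PodSetAssignment stands in for Kueue's existing type; fields elided here.
type PodSetAssignment struct{}

// ReplicatedJobAssignment is the hypothetical alternative (not adopted): one
// assignment entry per ReplicatedJob instance instead of one per PodSet.
type ReplicatedJobAssignment struct {
	// ReplicatedJobKey would identify a single ReplicatedJob within a JobSet.
	ReplicatedJobKey string `json:"replicatedJobKey"`
	// PodSetAssignment would hold the topology assignment for that replica's pods.
	PodSetAssignment PodSetAssignment `json:"podSetAssignment"`
}
```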
> How can the user specify that they want each replicated job to be in a particular topology domain?

A user can use "require: rack" or "prefer: rack". In the case of "require: rack" we require that each instance of a ReplicatedJob is fully contained within a rack, but two instances might land on the same rack. When using "prefer", a single ReplicatedJob may span two or more racks, but we try to compact the placement.
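As a toy illustration of the "require" semantics (the capacity map and function are hypothetical stand-ins, not Kueue internals): with "require" there is no fallback up the hierarchy, so a replica that fits no single rack is simply not admitted.

```go
package main

import "fmt"

// assignRequired illustrates "require: rack": if no single rack can hold all
// pods of the ReplicatedJob instance, the assignment fails (no fallback).
func assignRequired(rackFreeSlots map[string]int, pods int) (string, bool) {
	for rack, free := range rackFreeSlots {
		if free >= pods {
			return rack, true
		}
	}
	return "", false
}

func main() {
	racks := map[string]int{"rack-a": 4, "rack-b": 8}
	if rack, ok := assignRequired(racks, 6); ok {
		fmt.Println("instance admitted on", rack) // fits on rack-b
	} else {
		fmt.Println("instance stays pending: no single rack fits")
	}
}
```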
So, in any other job, if I say `require: rack`, it means that the entire job will land on one rack. But in JobSet, it means that each replica will land on one rack. Or is the annotation only going to be in the Job templates? That's not clearly explained.
This is related to the other question about the scope of topology. There are actually 3 possible levels where we want to require/prefer a topology domain:
L1. Workload level (comprises multiple PodSets for MPIJob, JobSet and others)
L2. PodSet (comprises multiple ReplicatedJobs in case of JobSet)
L3. ReplicatedJob
I discussed with @mwielgus that all are valid use cases we may want to do at some point. The open questions are:
Q1: what are the APIs to request topology at every such level
Q2: which we prioritize for Alpha and which we can defer for Beta
Regarding Q1 I think we can have:
L1: "preferred-topology / required-topology" annotations at the workload level
L2: "replicaset-preferred-topology / replicaset-required-topology" annotations at the PodTemplate level
L3: "replica-preferred-topology / replica-required-topology" annotations at the PodTemplate level
I'm open to other/better names. Marcin suggests starting with whatever is simplest to deliver. Looking at how the flavorassigner works in Kueue, I think L2 is the simplest, even if not the most functional. WDYT @alculquicondor @tenzen-y? I will update the KEP accordingly.
I think we should start with the most useful scenario, one that we know is already in use. And that's L3.
I agree that L3 is probably the most useful. However, at the same time it is significantly more complex than L2, which matches how Kueue already works today.
L3 requires some non-trivial refactoring. It has to be done, but for the first step I would suggest delivering the most basic functionality to make sure the thing works, more or less, end-to-end, and THEN focusing on the refactoring to enable the most valuable use cases.
Ok, please update user stories and design accordingly.
sgtm, the "Support for ReplicatedJobs in JobSet" section focuses on L3, but it still has some open questions.
The options I see:

1. figure out the open questions there before Alpha
2. move the section to alternatives or notes (as a backup for the work already done) and add a graduation point about determining the remaining design points then

I would lean towards (2); any preference, @alculquicondor?
(2) seems fine. But leave notes for whatever JobSet work is going to happen in Alpha.
Ok, I moved the API work started so far to alternatives. I don't think we need any special work for JobSet in Alpha then, except for this: https://github.com/kubernetes-sigs/kueue/pull/2725/files#diff-40027534eab1ffd244e3c960acbf91b55b9556bebd8cd56077ba5f3270648284R163-R167, which is already released, so we may consider dropping the note.

> Ok, please update user stories and design accordingly.
Done, PTAL.
/cc @tenzen-y @alculquicondor @KPostOffice @PBundyra I believe I have addressed all outstanding comments, PTAL.
@mimowo: GitHub didn't allow me to request PR reviews from the following users: KPostOffice. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Basically, lgtm!
I left some small questions.
// +listType=atomic
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:MaxItems=5
Levels []TopologyLevel `json:"levels,omitempty"`
When we use the "podset-preferred-topology" policy, are these levels evaluated in order? Or is there some other strategy for evaluating them?
Yes, they get evaluated in order: from the level specified, going up to the top. At each step we check whether it is possible to satisfy the resource demand within a single topology domain. Say the user specifies rack: we first check rack; if that is not possible, we fall back to block; if that is not possible either, we admit the workload using multiple blocks.
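A toy sketch of that fallback order (the `freeByLevel` type, level names, and function are illustrative assumptions, not Kueue's implementation):

```go
package main

import "fmt"

// freeByLevel is a toy view of free pod slots per topology domain, keyed by
// level name and then by domain name.
type freeByLevel map[string]map[string]int

// assignPreferred walks up the hierarchy from the requested level (levels are
// ordered narrowest to broadest here), looking for a single domain that fits
// all pods; the final fallback is admitting across multiple top-level domains.
func assignPreferred(levels []string, requested string, pods int, capacity freeByLevel) string {
	start := 0
	for i, level := range levels {
		if level == requested {
			start = i
			break
		}
	}
	for _, level := range levels[start:] {
		for domain, free := range capacity[level] {
			if free >= pods {
				return fmt.Sprintf("fits within %s %q", level, domain)
			}
		}
	}
	return "admitted across multiple top-level domains"
}

func main() {
	capacity := freeByLevel{
		"rack":  {"rack-a": 4, "rack-b": 6},
		"block": {"block-1": 12},
	}
	// 10 pods fit in no single rack, so the assignment falls back to a block.
	fmt.Println(assignPreferred([]string{"rack", "block"}, "rack", 10, capacity))
}
```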
That makes sense. It would be great to mention the 3-step evaluation mechanism in an API comment or a dedicated section.
Additionally, I'm wondering if we need to evolve the internal PodSetInfo mechanism to support TAS. Maybe we need to have nodeSelector information for every Pod instead of per PodSet, right?
Lines 42 to 49 in ac77dea
type PodSetInfo struct {
	Name         string
	Count        int32
	Annotations  map[string]string
	Labels       map[string]string
	NodeSelector map[string]string
	Tolerations  []corev1.Toleration
}
I don't think we need to extend the PodSetInfo mechanism. This mechanism is used to restore the nodeSelectors (and other properties of a Job) on suspend/eviction. However, we can restore them to the same values regardless of the assignment made during admission.
Also, the PodSetInfo is an internal structure rather than part of the API, so I think we can figure it out during the implementation phase, but I don't expect a need to change it.
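To make the restore path concrete, a hypothetical helper (not Kueue's actual code; it assumes the PodSetInfo type quoted above, with `corev1` being `k8s.io/api/core/v1`):

```go
// restorePodSet reapplies the PodSetInfo captured before admission, so any
// TAS-injected node selectors are dropped without TAS-specific handling.
func restorePodSet(template *corev1.PodTemplateSpec, saved PodSetInfo) {
	template.Annotations = saved.Annotations
	template.Labels = saved.Labels
	template.Spec.NodeSelector = saved.NodeSelector
	template.Spec.Tolerations = saved.Tolerations
}
```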
> It would be great to mention the 3-step evaluation mechanism in an API comment or a dedicated section.

Good point, I have added an explanation in the documentation for the "preferred" annotation. PTAL.
> I don't think we need to extend the PodSetInfo mechanism. This mechanism is used to restore the nodeSelectors (and other properties of a Job) on suspend/eviction. However, we can restore them to the same values regardless of the assignment made during admission.

That makes sense. Thank you for the clarifications.

> Also, the PodSetInfo is an internal structure rather than part of the API, so I think we can figure it out during the implementation phase, but I don't expect a need to change it.

Yeah, I think so too; I just wondered what we will do with PodSetInfo after we introduce this feature. I don't think we need to mention it here, either.
> > It would be great to mention the 3-step evaluation mechanism in an API comment or a dedicated section.
>
> Good point, I have added an explanation in the documentation for the "preferred" annotation. PTAL.

Great to see. Thank you!
`maxPrefixLength` in `workload_names.go`. However, we can also facilitate
fast lookups based on labels if we index the workloads.
### Implement it in ClusterAutoscaler or kube-scheduler |
This section resolves my core questions. Thank you!
Also, both of the components don't have a concept of quota, which we want to be
supported with this feature. So, if Kueue admits a Job with "required: rack"
just based on quota, the Pods might be created, but the components wouldn't
be able to schedule the Pods if there is not enough capacity in the topology
domain.
This is just a question. This means that kube-scheduler and cluster-autoscaler do not have a retry mechanism when a Pod cannot be scheduled based on its PodGroup, right?
They do have a retry mechanism. AFAIK the components don't have a native concept of a Pod group. You can simulate Pod groups using Volcano with kube-scheduler, but kube-scheduler won't be able to provision nodes by itself if they don't exist.
That makes sense. Maybe we can delegate TAS scheduling to kube-scheduler after we support the native PodGroup feature: kubernetes/enhancements#4671.
But that work is not progressing in k/k right now.
Even then, I think there are other blockers, like the lack of an API to support topology. I think it makes sense to incubate such an API in CRDs before moving it to the core.
Yes, I agree with you.
As an alternative, we could provide a kube-scheduler plugin for TAS on the Kueue side instead of watching Node resources in Kueue, but I guess a custom Kueue kube-scheduler plugin might make it harder for batch admins to install Kueue.
So, I add +1 to this node-watch mechanism on the Kueue side for now.
But if we add more and more features depending on node information, I wonder if we should revisit this and discuss whether we should provide a custom kube-scheduler plugin for Kueue.
// +kubebuilder:validation:MaxLength=316
// +kubebuilder:validation:Pattern=`^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$`
NodeLabel string `json:"nodeLabel,omitempty"`
}
Will we want to support tolerations here based on user feedback?
I agree with supporting only nodeLabel during the alpha stage.
I think having a different set of tolerations depending on the hierarchy level would be a complication. I don't see it as necessary, but yes, if there are use cases we will have a place to introduce it.
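A purely speculative sketch of where such a knob could live if use cases appear; this field is not part of the proposal (`corev1` is `k8s.io/api/core/v1`):

```go
type TopologyLevel struct {
	NodeLabel string `json:"nodeLabel,omitempty"`
	// Tolerations that pods assigned within this level would need to carry.
	// Hypothetical extension point only, not proposed in this KEP.
	Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
}
```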
That makes sense. Thanks for the clarifications.
Update keps/2724-topology-aware-schedling/README.md Co-authored-by: Yuki Iwai <[email protected]> Co-authored-by: Aldo Culquicondor <[email protected]>
Thank you for driving this significantly valuable feature!
I'm looking forward to releasing this feature!
/lgtm
/approve
LGTM label has been added. Git tree hash: 7bf5553fb927b2cf1562c60a784833cc7bb27639
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mimowo, tenzen-y. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
* Topology Aware Scheduling KEP * Review remarks Update keps/2724-topology-aware-schedling/README.md Co-authored-by: Yuki Iwai <[email protected]> Co-authored-by: Aldo Culquicondor <[email protected]> --------- Co-authored-by: Yuki Iwai <[email protected]> Co-authored-by: Aldo Culquicondor <[email protected]>
What type of PR is this?
/kind documentation
What this PR does / why we need it:
To propose the design for TAS.
Which issue(s) this PR fixes:
KEP for #2724
Special notes for your reviewer:
Does this PR introduce a user-facing change?