
Topology Aware Scheduling KEP #2725

Merged

2 commits merged into kubernetes-sigs:main from the topology-aware-scheduling branch on Sep 18, 2024

Conversation

@mimowo (Contributor) commented Jul 30, 2024:

What type of PR is this?

/kind documentation

What this PR does / why we need it:

To propose the design for TAS.

Which issue(s) this PR fixes:

KEP for #2724

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot added the release-note-none and kind/documentation labels on Jul 30, 2024
@k8s-ci-robot requested review from denkensk and trasc on Jul 30, 2024 11:40
@k8s-ci-robot added the approved and cncf-cla: yes labels on Jul 30, 2024
netlify bot commented Jul 30, 2024:

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 2ab5bb3
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66eab5b41a7cf5000881a4f8

@k8s-ci-robot added the size/L label on Jul 30, 2024
@mimowo (Contributor, Author) commented Jul 30, 2024:

/cc @PBundyra @mwielgus @tenzen-y

@mimowo force-pushed the topology-aware-scheduling branch 2 times, most recently from 1ab3cb9 to 5d7847b on July 30, 2024 13:50
Tracking of nodes by Kueue will increase its code complexity and memory
consumption.

## Alternatives
Contributor:

Could you add some comments about how this would interface with scheduler- or kubelet-based policies?

https://scheduler-plugins.sigs.k8s.io/docs/kep/119-node-resource-topology-aware-scheduling/readme/

At the end of the day, users can specify this at the Kueue level, but if the scheduler or kubelet is not set up accordingly, I don't see how you can guarantee this at the Kueue level.

Contributor (Author):

Maybe there is an unnecessary naming coincidence, but I don't think the feature in Kueue is related to the plugin.
We don't require it to run. Also, this feature is not related to NUMA nodes. Let me know if I'm missing something.

The aim of this feature is to provide compact placement of pods on nodes by using information about node placement in the data center.

@kannon92 (Contributor) commented Jul 31, 2024:

So this feature is not really about topology on a node (which Topology Manager or node-resource topology-aware scheduling would take into account). You are proposing inter-node topology-aware scheduling (node-to-node placement). I guess that is the distinction.

Wearing my Kueue follower hat, I think this is a neat feature.

Wearing my OpenShift hat, I am concerned that this feature supports a particular cloud vendor without really documenting how one can label these nodes so that the feature could be used.

Are you planning to propose official labels/annotations in k8s that walk other cloud providers (or on-prem admins) through how to actually set up their clusters so that this feature can be used?

Contributor:

Don't take this as blocking, but I would like to see some work done to formalize how a cluster should be labeled. Otherwise I don't know how other vendors or on-prem clusters could use this feature.

Contributor:

I attended a wg-serving meeting earlier today where someone was talking about the types of ways data centers are laid out and what is necessary for higher speed interconnects. And I think the point of that meeting was to start to define what kinds of labels/information would be necessary.

I think this is a hard problem.

Contributor (Author):

Networking in data centers is complex, for sure. Agreed, this proposal may not reflect every nuance of data center networking, especially since the networking details and the depth of the hierarchy will differ between data centers of different cloud providers. The details may even differ between multiple data centers of the same cloud provider, and, I believe, between machine types: TPU vs. GPU.

This is the first iteration of support for Node-node Topology Aware Scheduling in Kueue. Every journey needs to start with a first step.

Thus, the proposed model is deliberately simplified so that it could work as a common denominator between different cloud providers.

Note that the label names cloud-provider.com/topology-block and cloud-provider.com/topology-rack used in "Hierarchy organization" are just examples; they can be customized in the admin-facing API for a ResourceFlavor.

The only thing that matters for the model is that there is some hierarchy of nodes in a data center, which is almost always the case. Since almost every data center has some concept of racks and blocks, I use such labels to appeal to intuition, but the scheme is flexible. The mechanics of exposing this information as labels are left to cloud providers.
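
To make the hierarchy concrete, here is a minimal, self-contained Go sketch of the kind of level list being discussed. The type and the label keys are placeholders mirroring the examples above, not a finalized Kueue API:

```go
package main

import "fmt"

// topologyLevel mirrors the shape of the `levels` entries proposed in this
// KEP: each level is identified by a node label key. The names below are
// placeholders copied from the discussion, not a settled API.
type topologyLevel struct {
	nodeLabel string
}

func main() {
	// An example hierarchy, ordered from the broadest domain to the narrowest.
	// Each cloud provider (or on-prem admin) substitutes whatever labels its
	// nodes actually carry; only the ordering matters to the model.
	levels := []topologyLevel{
		{nodeLabel: "cloud-provider.com/topology-block"},
		{nodeLabel: "cloud-provider.com/topology-rack"},
	}
	for i, l := range levels {
		fmt.Printf("level %d: %s\n", i, l.nodeLabel)
	}
}
```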

@PBundyra (Contributor) commented:

LGTM

@mimowo force-pushed the topology-aware-scheduling branch 3 times, most recently from 6dc1f7b to 00022a6 on September 5, 2024 11:22
@k8s-ci-robot added the size/XL label and removed the size/L label on Sep 5, 2024
@mimowo force-pushed the topology-aware-scheduling branch 4 times, most recently from fb7a6d5 to 42048d9 on September 5, 2024 12:41
@mimowo (Contributor, Author) commented Sep 5, 2024:

FYI to all reviewers: I have updated a couple of sections of the KEP based on the offline review and discussions with @alculquicondor and @mwielgus. PTAL.

keps/2724-topology-aware-schedling/README.md: 5 outdated review threads (resolved)
```
Then, we compute the assignments per pod set corresponding to ReplicatedJobs,
rather than for the entire PodSet. Finally, the `PodSetUngater`
ungates pods per ReplicatedJob assignment.
Contributor:

Don't we also need changes to the TopologyAssignment?

How can the user specify that they want each replicated job to be in a particular topology domain?

Contributor (Author):

Don't we also need changes to the TopologyAssignment?

I don't think this is needed, as the TopologyAssignment contains a list of PodSetAssignments. If there are 2 ReplicatedJobs, there would be two items in the list. An alternative API would be to have a ReplicatedJobAssignment, which would contain the ReplicatedJobKey and the PodSetAssignment. However, we would need to do it this way from day 0, so I proposed the incremental approach.

How can the user specify that they want each replicated job to be in a particular topology domain?

A user can use "require: rack" or "prefer: rack". With "require: rack" we require that each instance of a ReplicatedJob is fully contained within a rack, but two ReplicatedJobs might land on the same rack. With "prefer", a single ReplicatedJob may span 2 or more racks, but we try to make the placement as compact as possible.

Contributor:

So, in any other job, if I say require: rack, it means that the entire job will land on one rack.

But in JobSet, it means that each replica will land on one rack. Or is the annotation only going to be in the Job templates? That's not clearly explained.

@mimowo (Contributor, Author) commented Sep 12, 2024:

This is related to the other question about the scope of topology. There are actually 3 possible levels where we want to require/prefer a topology domain:
L1. Workload level (comprises multiple PodSets for MPIJob, JobSet and others)
L2. PodSet (comprises multiple ReplicatedJobs in case of JobSet)
L3. ReplicatedJob

I discussed with @mwielgus that all are valid use cases we may want to do at some point. The open questions are:
Q1: what are the APIs to request topology at every such level
Q2: which we prioritize for Alpha and which we can defer for Beta

Regarding Q1 I think we can have:
L1: "preferred-topology / required-topology" annotations at the workload level
L2: "replicaset-preferred-topology / replicaset-required-topology" annotations at the PodTemplate level
L3: "replica-preferred-topology / replica-required-topology" annotations at the PodTemplate level

I'm open to other/better names. Marcin suggests starting with whatever is simplest to deliver. Looking at how the flavorassigner works in Kueue, I think L2 is the simplest, even if not the most functional. WDYT @alculquicondor @tenzen-y? I will update the KEP accordingly.
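
For illustration, a small sketch of where each of the three proposed annotation scopes would sit. The annotation keys are the names floated in the comment above and the values point at a node topology label, following the label examples used earlier in the thread; both keys and values are discussion proposals, not a settled API, and a real implementation would presumably namespace the keys:

```go
package main

import "fmt"

func main() {
	// L1: annotations on the whole Workload (e.g. a JobSet or MPIJob).
	workload := map[string]string{
		"required-topology": "cloud-provider.com/topology-block",
	}
	// L2: annotations on the PodTemplate, constraining the whole PodSet.
	podSet := map[string]string{
		"replicaset-preferred-topology": "cloud-provider.com/topology-rack",
	}
	// L3: annotations on the PodTemplate, constraining each ReplicatedJob
	// of a JobSet independently.
	replica := map[string]string{
		"replica-required-topology": "cloud-provider.com/topology-rack",
	}
	fmt.Println(workload, podSet, replica)
}
```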

Contributor:

I think we should start with the most useful scenario, one that we know is already in use. And that's L3.

Contributor:

I agree that L3 is probably the most useful. At the same time, however, it is significantly more complex than L2, which matches how Kueue already works right now.

L3 requires some non-trivial refactoring. It has to be done, but for the first step I would suggest implementing the most basic functionality, to make sure the whole thing works, more or less, end to end, and THEN focusing on the refactoring that enables the most valuable use cases.

Contributor:

Ok, please update user stories and design accordingly.

Contributor (Author):

sgtm, the "Support for ReplicatedJobs in JobSet" section focuses on L3, but it still has some open questions.

The options I see:

  1. figure out the open questions there before Alpha
  2. move the section to alternatives or notes (as a backup for the work already done) and add a graduation point about determining the remaining design points then

I would lean towards (2); any preference, @alculquicondor?

Contributor:

(2) seems fine. But leave notes for whatever JobSet work is going to happen in alpha.

@mimowo (Contributor, Author) commented Sep 16, 2024:

OK, I moved the API work started so far to Alternatives. I don't think we need any special work for JobSet in alpha then, except for this: https://github.com/kubernetes-sigs/kueue/pull/2725/files#diff-40027534eab1ffd244e3c960acbf91b55b9556bebd8cd56077ba5f3270648284R163-R167, which is already released, so we may consider dropping the note.

Ok, please update user stories and design accordingly.

Done, PTAL.

@mimowo force-pushed the topology-aware-scheduling branch 11 times, most recently from 82e6c19 to fd5ac7c on September 16, 2024 15:28
@mimowo (Contributor, Author) commented Sep 16, 2024:

/cc @tenzen-y @alculquicondor @KPostOffice @PBundyra

I believe I have addressed all outstanding comments, PTAL.

@k8s-ci-robot (Contributor) commented:

@mimowo: GitHub didn't allow me to request PR reviews from the following users: KPostOffice.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @tenzen-y @alculquicondor @KPostOffice @PBundyra

I believe I have addressed all outstanding comments, PTAL.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@tenzen-y (Member) left a comment:

Basically, lgtm!
I left some small questions.

// +listType=atomic
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:MaxItems=5
Levels []TopologyLevel `json:"levels,omitempty"`
Member:

When we use the "podset-preferred-topology" policy, are these levels evaluated in order? Or is there some other strategy for evaluating these levels?

Contributor (Author):

Yes, they get evaluated in order: from the level specified, going up to the top. At each step we check whether it is possible to satisfy the resource demand within a single topology domain. Say a user specifies rack. Then we check rack; if that is not possible we fall back to block; if that is not possible either, we admit using multiple blocks.
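
A minimal sketch of the fallback evaluation described above, assuming a hypothetical fitsInSingleDomain helper that stands in for Kueue's capacity check against the topology domains:

```go
package main

import "fmt"

// findAssignmentLevel walks from the requested level up to the top of the
// hierarchy and returns the first level at which the whole PodSet fits in a
// single topology domain, or -1 if none does (in which case, for a
// "preferred" constraint, the workload is admitted across multiple domains).
// fitsInSingleDomain is a hypothetical stand-in for Kueue's capacity check.
func findAssignmentLevel(levels []string, requested int, fitsInSingleDomain func(level string) bool) int {
	for i := requested; i >= 0; i-- { // levels[0] is the top level (e.g. block)
		if fitsInSingleDomain(levels[i]) {
			return i
		}
	}
	return -1
}

func main() {
	levels := []string{"block", "rack"} // ordered from broadest to narrowest
	// Pretend a single rack is too small, but a single block is large enough.
	fits := func(level string) bool { return level == "block" }
	fmt.Println(findAssignmentLevel(levels, 1, fits)) // prints 0: falls back to block
}
```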

@tenzen-y (Member) commented Sep 17, 2024:

That makes sense. It would be great to mention the 3-step evaluation mechanism in the API comment or a dedicated section.

Additionally, I'm wondering if we need to evolve the internal PodSetInfo mechanism to support TAS.
Maybe we need to have nodeSelector information for every Pod instead of per PodSet, right?

type PodSetInfo struct {
	Name         string
	Count        int32
	Annotations  map[string]string
	Labels       map[string]string
	NodeSelector map[string]string
	Tolerations  []corev1.Toleration
}

Contributor (Author):

I don't think we need to extend the PodSetInfo mechanism. This mechanism is used to restore the nodeSelectors (and other properties of a Job) on suspend / eviction. However, we can restore them to the same values regardless of the assignment during admission.

Also, the PodSetInfo is an internal structure, rather than part of the API so I think we can figure it out during the implementation phase, but I don't expect a need to change it.

Contributor (Author):

It would be great to mention the 3-step evaluation mechanism in the API comment or a dedicated section.

Good point, I have added an explanation in the documentation for the "preferred" annotation. PTAL.

Member:

I don't think we need to extend the PodSetInfo mechanism. This mechanism is used to restore the nodeSelectors (and other properties of a Job) on suspend / eviction. However, we can restore them to the same values regardless of the assignment during admission.

That makes sense. Thank you for the clarifications.

Also, the PodSetInfo is an internal structure, rather than part of the API so I think we can figure it out during the implementation phase, but I don't expect a need to change it.

Yeah, I think so too; I was just wondering what we will do with PodSetInfo after we introduce this feature. I don't think we need to mention that here, though.

Member:

It would be great to mention the 3-step evaluation mechanism in the API comment or a dedicated section.

Good point, I have added an explanation in the documentation for the "preferred" annotation. PTAL.

Great to see. Thank you!

`maxPrefixLength` in `workload_names.go`. However, we can also facilitate
fast lookups based on labels if we index the workloads.

### Implement it in ClusterAutoscaler or kube-scheduler
Member:

This section can resolve my core questions. Thank you!

Comment on lines +697 to +707
Also, neither of the components has a concept of quota, which we want to support
with this feature. So, if Kueue admits a Job with "required: rack"
just based on quota, the Pods might be created, but the components wouldn't
be able to schedule the Pods if there is not enough capacity in the topology
domain.
Member:

This is just a question: does this mean that the kube-scheduler and cluster-autoscaler do not have a retry mechanism when a Pod cannot be scheduled based on its PodGroup?

Contributor (Author):

They do have a retry mechanism. AFAIK the components don't have a native concept of a Pod group. You can simulate Pod groups using Volcano with the kube-scheduler, but the kube-scheduler won't be able to provision nodes by itself if they don't exist.

Member:

That makes sense. Maybe we can delegate TAS scheduling to kube-scheduler after we support the native PodGroup feature: kubernetes/enhancements#4671

But there is no progress on that in k/k right now.

Contributor (Author):

Even then, I think there are other blockers, like the lack of an API to support topology. I think it makes sense to incubate such an API in CRDs before moving it to the core.

@tenzen-y (Member) commented Sep 17, 2024:

Yes, I agree with you.
As an alternative, we could provide a kube-scheduler plugin for TAS on the Kueue side instead of watching Node resources in Kueue, but I guess a custom Kueue kube-scheduler plugin might make it harder for batch admins to install Kueue.

So, I add +1 to this node-watch mechanism on the Kueue side for now.

But if we add more and more features that depend on node information, I wonder whether we should revisit this and discuss providing a custom kube-scheduler plugin for Kueue.

// +kubebuilder:validation:MaxLength=316
// +kubebuilder:validation:Pattern=`^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$`
NodeLabel string `json:"nodeLabel,omitempty"`
}
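
For reference, the validation pattern quoted in the hunk above can be exercised directly. This standalone snippet just compiles that regular expression and checks the example label keys used elsewhere in the thread; it is not part of the Kueue codebase:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// The validation pattern from the hunk above (a qualified label key).
	pattern := regexp.MustCompile(`^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$`)
	for _, key := range []string{
		"cloud-provider.com/topology-block",
		"cloud-provider.com/topology-rack",
		"not a valid/label key", // spaces are rejected
	} {
		fmt.Printf("%-40s valid=%v (len<=316: %v)\n", key, pattern.MatchString(key), len(key) <= 316)
	}
}
```
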
Member:

Will we want to support tolerations here based on user feedback?
I agree with supporting only nodeLabel during the alpha stage.

Contributor (Author):

I think having a different set of tolerations depending on the hierarchy level would be a complication. I don't see it as necessary, but yes, if there are use cases we will have a place to introduce it.

Member:

That makes sense. Thanks for the clarifications.

@mimowo force-pushed the topology-aware-scheduling branch from fd5ac7c to c487be6 on September 18, 2024 10:04
Update keps/2724-topology-aware-schedling/README.md

Co-authored-by: Yuki Iwai <[email protected]>
Co-authored-by: Aldo Culquicondor <[email protected]>
@mimowo force-pushed the topology-aware-scheduling branch from c487be6 to 2ab5bb3 on September 18, 2024 11:12
@tenzen-y (Member) left a comment:

Thank you for driving this significantly valuable feature!
I'm looking forward to releasing this feature!

/lgtm
/approve


@k8s-ci-robot added the lgtm label on Sep 18, 2024
@k8s-ci-robot (Contributor) commented:

LGTM label has been added.

Git tree hash: 7bf5553fb927b2cf1562c60a784833cc7bb27639

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 4231a8f into kubernetes-sigs:main on Sep 18, 2024
7 checks passed
@k8s-ci-robot added this to the v0.9 milestone on Sep 18, 2024
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
* Topology Aware Scheduling KEP

* Review remarks

Update keps/2724-topology-aware-schedling/README.md

Co-authored-by: Yuki Iwai <[email protected]>
Co-authored-by: Aldo Culquicondor <[email protected]>

---------

Co-authored-by: Yuki Iwai <[email protected]>
Co-authored-by: Aldo Culquicondor <[email protected]>