Enhance scheduling capabilities for a group of pods #162

Open · 3 tasks
vie-serendipity opened this issue Jun 10, 2024 · 19 comments · May be fixed by #168
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@vie-serendipity
Contributor

vie-serendipity commented Jun 10, 2024

What would you like to be added:
Add a field such as ScheduleMode to give stricter control when scheduling a group of pods. When deploying a distributed inference service, a group of pods (a head plus several workers) should be scheduled to neighboring nodes to reduce the communication cost between them.
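
A rough sketch of what such a field might look like (the field name, type, and values below are hypothetical, not an existing or proposed API):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: inference-service
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 4
    # Hypothetical field: require each group (head + workers) to land on
    # nodes sharing the given topology domain, without excluding other
    # groups from that domain.
    scheduleMode:
      type: SameTopology
      topologyKey: topology.kubernetes.io/zone
```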
Why is this needed:
A k8s cluster contains many types of resources, and if placement is constrained only by the requests field, the resulting distributed inference performance may be poor.
For example, a group of pods may need to be dispatched to nodes that share a special interconnect, such as NVLink. As another example, the nodes of a (non-standard) k8s cluster may span data centers, and a group of pods should then be dispatched to a single data center.
Completion requirements:
I'm not sure what the eventual API changes should be; this is just an enhancement request. I'm also not sure whether it is a reasonable enhancement, but I'd be happy to contribute if it is.
This enhancement requires the following artifacts:
This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@vie-serendipity vie-serendipity added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 10, 2024
@googs1025
Member

I'm a bit curious: can such scenarios generally be resolved using node selectors or affinity? What are the shortcomings of that approach?

@vie-serendipity
Contributor Author

@googs1025

For example, a group of pods may need to be dispatched to nodes that share a special interconnect, such as NVLink. As another example, the nodes of a (non-standard) k8s cluster may span data centers, and a group of pods should then be dispatched to a single data center.

In this scenario every group of pods uses the same template, so all of the pods belonging to the lws would be dispatched to the same data center. However, because the GPUs are spread across different data centers, different groups under the lws should be scheduled to different data centers. So I don't think affinity or a selector can fulfill this requirement.

@googs1025
Member

Do the different data centers belong to the same cluster, or to different clusters? If they are within the same cluster, it might still be feasible to use node selectors or affinity. If they are spread across different clusters, however, there might be concerns about communication overhead.

@vie-serendipity
Contributor Author

If they are within the same cluster, it might still be feasible to use node selectors or affinity.

They belong to the same cluster, but how can affinity schedule the different groups of an lws across multiple data centers while keeping each group inside a single data center? I can't figure out a way to do this.

@googs1025
Member

So, if I understand correctly, you want to schedule multiple pods under the workerTemplate to nodes in different data centers, is that correct? I cannot determine whether the desired functionality is achievable using node selectors or affinity. I also cannot assess whether it is a reasonable design for the higher-level workload to make scheduling decisions.

@vie-serendipity
Contributor Author

The higher-level workload doesn't need to make any scheduling decisions; it just needs to make sure the workers are scheduled in the same data center as the head. The head's scheduling is entirely decided by the scheduler.

One possible implementation could be to wait until the head is scheduled to a node, then retrieve a label from that node (the label's key is predefined in the lws yaml). Afterwards, the controller would add matching affinities to the workers, ensuring they get scheduled to nearby nodes.
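
As an illustrative sketch (the topology key and value here are assumptions for the example): suppose the lws predefines the key topology.kubernetes.io/zone and the head lands on a node labeled topology.kubernetes.io/zone=dc-1; the controller would then inject something like this into each worker pod:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - dc-1   # copied from the node the head was scheduled to
```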

@kerthcet
Contributor

Thanks for bringing this to the community. Is this what you need? https://github.com/kubernetes-sigs/lws/blob/main/docs/examples/sample/README.md#exclusive-placement
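
For reference, the linked README enables this with an annotation along these lines (the topology key value below is just an example):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: exclusive-lws
  annotations:
    # Each group is placed 1:1 onto one domain of this topology key.
    leaderworkerset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
```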

@kerthcet
Contributor

But this is exclusive, which means two groups cannot be located in the same topology domain.

@vie-serendipity
Contributor Author

Thanks, that's what I'm looking for.

But this is exclusive, which means two groups cannot be located in the same topology domain.

LeaderWorkerSet supports exclusive placement through pod affinity/anti-affinity where pods in the same group will be scheduled on the same accelerator island (such as a TPU slice or a GPU clique), but on different nodes. This ensures 1:1 LWS replica to accelerator island placement.

But I have a question: in the example there are GPU and TPU accelerator islands. Does that mean I can only have two groups of pods? I feel it should allow multiple groups of pods within the same topology domain. Would it be more reasonable to just ensure that the head and workers share the same topology key?

@kerthcet
Contributor

in the example there are GPU and TPU accelerator islands. Does that mean I can only have two groups of pods?

Accelerator islands are only used for integrations with cloud providers, like TPU on Google Cloud.

So what you need is slightly different from exclusive placement. Do you have a real use case on your side?

@vie-serendipity
Contributor Author

My usage scenario is a k8s cluster with many nodes, some belonging to the user and some to the cloud vendor; to keep it simple, say there are two data centers, one for the user and one for the cloud provider. (This is not a standard k8s cluster, but I also wonder whether there is a need to schedule a group of pods onto nodes of the same GPU type when a cluster has many kinds of GPU resources.)

I want to use lws to deploy inference services to both the user's data center and the one on the cloud. Although the two data centers' networks are interoperable, communication between them is more expensive and may go over the public network, which is unstable.

So I want to make sure that each group of pods is dispatched to a single data center, where the pods can communicate cheaply. Multiple groups of pods should be supported within one data center, and business peaks also require scaling.

@kerthcet
Contributor

Makes sense to me: a group of Pods should be located in the same topology domain, and the number of groups per domain should not be limited.

cc @ahg-g @liurupeng thoughts?

@ahg-g
Contributor

ahg-g commented Jun 12, 2024

Yes, we can support that by simply removing the exclusive anti-affinity term that is currently being added. But we need to come up with a proper API first, similar to kubernetes-sigs/jobset#75
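
To illustrate the mechanics (a simplified sketch, not the exact terms LWS generates; the label key is assumed for illustration): exclusive placement injects both a podAffinity term that pulls a group's pods into one topology domain and a podAntiAffinity term that keeps other groups out of that domain. Dropping the anti-affinity half would give the non-exclusive co-location requested here:

```yaml
affinity:
  podAffinity:       # keep: co-locates a group within one topology domain
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: leaderworkerset.sigs.k8s.io/group-key   # assumed label key
              operator: In
              values: ["<group-hash>"]
        topologyKey: cloud.google.com/gke-nodepool
  podAntiAffinity:   # remove: this term is what makes placement exclusive
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: leaderworkerset.sigs.k8s.io/group-key
              operator: Exists
            - key: leaderworkerset.sigs.k8s.io/group-key
              operator: NotIn
              values: ["<group-hash>"]
        topologyKey: cloud.google.com/gke-nodepool
```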

@vie-serendipity
Contributor Author

@ahg-g I would like to contribute to this feature. I can propose a KEP later. Is this good for you?

@liurupeng
Collaborator

@vie-serendipity could you start the KEP so that we can begin the review?

@vie-serendipity
Contributor Author

@liurupeng Okay, I will propose a KEP soon.

@vie-serendipity vie-serendipity linked a pull request Jul 1, 2024 that will close this issue
@dims
Member

dims commented Sep 16, 2024

/subscribe

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 15, 2024
@kerthcet
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 16, 2024