LeaderWorkerSet doesn't support gang-scheduling #167
Comments
@xgchena sorry for the late reply; I was on vacation last week. It's expected that when you have 2 replicas of size 4 and only 4 nodes, both leaders can be scheduled, leaving some workers unschedulable. In this case it's recommended to scale the pod group based on the number of available nodes, or to use the cluster autoscaler to provision new nodes automatically when there are unscheduled pods. We could improve scheduling to accommodate as many pod groups as possible and avoid the case you described, but it wouldn't be simple, and we'd need a real use case showing that we must add gang-scheduling support. Since there is a workaround, I will add a feature label and wait for a use case to prioritize.
Generally, gang scheduling needs support from the scheduler. Upstream has the co-scheduling plugin and an ongoing proposal about gang scheduling, kubernetes/enhancements#4671, which I will try to push next release.
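For reference, the co-scheduling plugin mentioned above groups pods via a `PodGroup` custom resource plus a label on the pods; a minimal setup looks roughly like this (the group name and `minMember` value are illustrative, not from the vllm example):

```yaml
# Illustrative PodGroup for the scheduler-plugins co-scheduling plugin.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: vllm-group
spec:
  minMember: 4   # none of the group's pods are bound until 4 can be placed
---
# Pods opt in to the group via this label on their template:
#   scheduling.x-k8s.io/pod-group: vllm-group
```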
Thank you both for the responses.
Hi Rupeng, regarding your comments,
Multi-host inference is often used when a model is too large to be deployed on a single instance, even on the most advanced instance types (like those with 8 GPUs). In the real world there are capacity constraints on advanced instance types.
Both are real use cases.
Hi Kante, thank you for sharing, and glad to know there is already a solution on the way. Regarding the co-scheduling plugin, I have actually tried it; copied from the issue description:
Based on the vllm example, see the screenshot below. The problem with that approach is that only one PodGroup can be defined and used to group all the pods. By "next release" I guess you mean the next release of Kubernetes. Before it is available, I'm wondering if it would be feasible to fork the lws controller and create a PodGroup for each replica at runtime, as a short-term workaround.
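The per-replica workaround would look roughly like this (a sketch, assuming size 2; the forked controller would create one PodGroup per LWS replica and label that replica's pods accordingly — names are illustrative):

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: vllm-0        # one PodGroup per LWS replica (illustrative name)
spec:
  minMember: 2        # leader + workers of this replica only (size 2)
---
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: vllm-1
spec:
  minMember: 2
# The forked controller would then set
#   scheduling.x-k8s.io/pod-group: vllm-<i>
# on all pods belonging to replica i, so each group gangs independently.
```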
Thanks for your feedback, it's very valuable.
Yes, there are still some gaps to fix.
I guess this is one available approach, because in the co-scheduling design the PodGroup needs to be created manually. However, we still don't work quite smoothly with the co-scheduling plugin: we have features like startup policy and exclusive placement that require creating the worker pods only once the leader pod is ready, and this leads to a deadlock with gang scheduling, because the leader pod will never be scheduled if minMember is not met. This is a valid use case for the gang-scheduling design.
@kerthcet are you still thinking of making progress on the kubernetes/enhancements#4671 KEP?
Yes, but maybe in the next Kubernetes release cycle. I'm rushing a new milestone for my project, which may take two or three weeks, so I'm sure I'll miss the KEP code-freeze deadline. 🥶
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
What happened:
It seems that LeaderWorkerSet doesn't support gang-scheduling of a group of pods. If multiple replicas are scheduled at the same time and there is not enough capacity to host them all, the scheduler may prioritize scheduling the leader pods and leave their worker pods pending forever.
What you expected to happen:
LeaderWorkerSet should support gang-scheduling, i.e. the pods of a group are either all scheduled together, or none of them are.
How to reproduce it (as minimally and precisely as possible):
I tried the vllm example on an EKS cluster with 4 nodes; each node has 1 GPU and enough resources to meet the pods' requests. The example manifest uses size 2 and replicas 2, 4 pods in total.
The expected behavior is that the first two groups should be scheduled (pods vllm-0, vllm-0-1, vllm-1, and vllm-1-1).
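For context, the shape of the LeaderWorkerSet used in the vllm example is roughly the following (a sketch; container details are omitted and the image is a placeholder):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2                  # 2 groups
  leaderWorkerTemplate:
    size: 2                    # 1 leader + 1 worker per group => 4 pods total
    workerTemplate:
      spec:
        containers:
        - name: vllm
          image: <vllm-image>  # placeholder, see the vllm example manifest
          resources:
            limits:
              nvidia.com/gpu: "1"
```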
Anything else we need to know?:
I also tried the co-scheduling plugin, but grouping all the Pods under the same static pod-group label is equivalent to having no PodGroup at all.
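Concretely, what I tried amounts to a single PodGroup covering every pod (names and numbers are from my setup and only illustrative):

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: vllm          # one static group shared by ALL pods of ALL replicas
spec:
  minMember: 4        # spans both replicas, so there is no per-replica gang:
                      # the scheduler can still bind both leaders first
```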
Environment:
- Kubernetes version (use `kubectl version`): v1.29.3
- LWS version (use `git describe --tags --dirty --always`): v0.3.0-8-ga4c468e
- OS (e.g. from `cat /etc/os-release`): Amazon Linux 2
- Kernel (use `uname -a`): 5.10.218