Admission Check controller to verify node capacity before admitting workloads #3211
Comments
This looks very similar to what we do in Topology Aware Scheduling. IIUC this could be a special case of a topology with a single level corresponding to an individual node (the hostname label). PTAL and let us know if there are some relevant differences / gaps.
Thank you @mimowo, we'll take a look at the KEP and let you know.
My only concern with expanding and extending the TAS feature is that TAS depends on PodSchedulingGates. So, before extending TAS or graduating it to beta, I would like to seek another solution instead of PodSchedulingGates.
After going through #2724 and its KEP PR #2725, we think that Topology Aware Scheduling (TAS) could be used to offer functionality similar to the proposed AdmissionCheck controller. There is a small difference which, depending on the use case, can be insignificant. My interpretation of TAS is that it aims to control where the workload pods are placed so as to improve some metric (e.g. throughput, due to increased bandwidth between pods scheduled close together rather than on distant nodes). Our proposed AdmissionCheck controller doesn't necessarily decide where the pods should be placed. It aims to increase the resource utilization of a cluster by rejecting the creation of pods which simply cannot be scheduled on any available node, thus increasing the chances that the quota of a ClusterQueue is reserved for workloads which have a chance of being scheduled onto some node by the Kubernetes scheduler (or another scheduler). Without delving into too many implementation details, we were thinking of the following 3-step algorithm:

We could extend this basic algorithm by propagating the scheduling decisions of the controller to the Kubernetes scheduler, but that could be an optional action, perhaps controlled by some kind of flag. What we find interesting is that this slight difference effectively means that our proposed AdmissionCheck controller is simpler than TAS. There is also a bit of overlap between the two proposals: figuring out the allocatable capacity of nodes. We are thinking that we could use the proposed AdmissionCheck controller as a way of figuring out a good-enough recipe for computing the available allocatable resources of nodes, which TAS could re-use down the line.

Another way to describe the proposed AdmissionCheck controller is that it provides similar functionality to the existing provisioning admission check controller (https://github.com/kubernetes-sigs/kueue/tree/main/pkg/controller/admissionchecks/provisioning), except that this new admission check controller will use an alternate (simpler) mechanism to track available capacity and will not assume that a cluster autoscaler is enabled.

So to summarize, we're proposing to contribute the AdmissionCheck controller under the experimental controllers (in path: https://github.com/kubernetes-sigs/kueue/tree/main/cmd/experimental). We're planning on architecting the solution so that it makes no assumptions about optional cluster features and so that other solutions (e.g. TAS) could make use of the code we implement.
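As a rough illustration of the kind of per-node fit check being discussed, a sketch could look like the following. This is illustrative only; the helper names `freeCapacity` and `fitsOnSomeNode` and the per-node pod bookkeeping are assumptions, not existing Kueue code:

```go
// Illustrative sketch only, not Kueue code: admit a pod-sized request
// only if at least one node has enough free allocatable capacity.
package capacitycheck

import (
	corev1 "k8s.io/api/core/v1"
)

// freeCapacity returns the node's allocatable capacity minus the requests
// of the pods already assigned to it. Which pods to count (e.g. whether to
// exclude succeeded/failed pods) is one of the details a real controller
// would need to settle.
func freeCapacity(node *corev1.Node, podsOnNode []*corev1.Pod) corev1.ResourceList {
	free := node.Status.Allocatable.DeepCopy()
	for _, p := range podsOnNode {
		for _, c := range p.Spec.Containers {
			for name, req := range c.Resources.Requests {
				if cur, ok := free[name]; ok {
					cur.Sub(req)
					free[name] = cur
				}
			}
		}
	}
	return free
}

// fitsOnSomeNode reports whether at least one node can fit the given
// per-pod resource requests without oversubscription.
func fitsOnSomeNode(podRequests corev1.ResourceList, nodes []*corev1.Node, podsByNode map[string][]*corev1.Pod) bool {
	for _, n := range nodes {
		free := freeCapacity(n, podsByNode[n.Name])
		fits := true
		for name, req := range podRequests {
			avail, ok := free[name]
			if !ok || avail.Cmp(req) < 0 {
				fits = false
				break
			}
		}
		if fits {
			return true
		}
	}
	return false
}
```

Init containers, pod overhead, and similar details are ignored in this sketch; deciding how to account for them is part of the "good-enough recipe" for node capacity mentioned above.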
It looks like you are trying to provide the same functionality as the cluster-autoscaler, but with a different solution.
https://kueue.sigs.k8s.io/docs/overview/#why-use-kueue

TAS was accepted because the cluster-autoscaler and kube-scheduler cannot decide the scheduling Nodes based on the Quota, as we described in https://github.com/mimowo/kueue/tree/2ab5bb3a6b0af6be4fe11e2d36a06b8c43e105fa/keps/2724-topology-aware-schedling#implement-it-in-clusterautoscaler-or-kube-scheduler

So, my root question is why the existing components like kube-scheduler and cluster-autoscaler cannot satisfy this feature.
I see, so the new proposed AdmissionCheck is somewhere between TAS and check-capacity in CA.

Differences with TAS:

Differences with ProvReq check-capacity in CA:
I think that while it is a valid point that differences exist and may matter for some use cases (for example, users who are ultra-sensitive about scheduling time and cannot use TAS, or cannot use CA for some reason; what would that be?), these seem like edge cases, not enough to justify a new controller. Note that the controller seems easy on the surface, but when you need to deal with negative scenarios (like retries) it becomes complex to maintain, as demonstrated by #3375. I believe sooner or later we would need to solve the analogous problems again. So, I'm leaning towards supporting @tenzen-y's PoV and sticking to improving the two mechanisms we have.
Hi @mimowo, I have been following this issue because I want pods in a PyTorchJob/MPIJob to be scheduled only when there are nodes with sufficient capacity. As you mentioned above, I tested TAS released in v0.9.0 with this single-level topology and noticed that for PyTorchJob/MPIJob (possibly also other integrations that have multiple pods), the topology configuration applies to all the pods within the job. That said, if we use a single-level topology with the hostname label for a PyTorchJob/MPIJob, the job will not be admitted unless there is a node that can schedule all of its pods. Is there a plan to extend TAS to handle this scenario? Thank you very much.
@hy00nc thank you for being one of the early testers of TAS. I think what you want is already achievable if you use the 'preferred' annotation on your PodSet. Then TAS will try to fit all the pods on a single node, but if that's not possible it will try to optimize the number of nodes used. It may also be helpful for you to have a deeper topology with the lowest level being a node. Please also consult the new docs we have for TAS. Let me know if this helps, if you encounter more issues, or if I didn't exactly understand your use case.
@mimowo, thank you for the quick response!
If done with 'preferred', I think that means the pod will be admitted even if no node has sufficient capacity to schedule it, which doesn't seem to align with the original purpose? The original intention is to never let a pod be admitted if there is no node that can schedule it. For example, there are two nodes with 4 GPUs each, but a pod is requesting 8 GPUs. With a normal Kueue setup, this pod is admitted because we have sufficient quota, but it will never get scheduled. This is the problem that we want to solve.
I also tried this, however it also does not meet the original intention: even if we satisfy a higher level of the topology (e.g. topology-rack), this doesn't mean there is a node that can schedule a pod, because capacity is summed up at the higher topology levels, which doesn't necessarily reflect the actual individual node capacity. Only the hostname topology seems able to do this. Please let me know if I'm missing something.
"preferred" still means you need to have enough capacity, it just means that multiple nodes can be used (but their number will remain optimized). TAS operates on pod counts more than on just additive capacity, So, in your example Kueue would see 0 pods fits in Node1, and 0 pods fits in Node2, and thus it sees 0 pods can fit on Node1+Node2. So it would not admit the workload. More precisely the setup I mean (based on the example in docs): levels:
- nodeLabel: "cloud.provider.com/topology-block"
- nodeLabel: "cloud.provider.com/topology-rack"
- nodeLabel: "kubernetes.io/hostname"

Then, the user sends the workload using the 'preferred' annotation pointing at the "kubernetes.io/hostname" level. If this works, we definitely need to improve the documentation.
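For intuition on the pod-counting behaviour described here, the following is a toy calculation of how many whole pods fit per node, using the two-nodes-with-4-GPUs / 8-GPU-pod example from earlier in the thread. It is illustrative only and not Kueue's actual TAS algorithm:

```go
// Toy illustration of counting whole pods per node rather than summing
// capacity across nodes; not Kueue's actual TAS implementation.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podsFitting returns how many pods with the given per-pod request fit
// into the free capacity of a single node.
func podsFitting(free, perPod corev1.ResourceList) int64 {
	fit := int64(-1)
	for name, req := range perPod {
		if req.IsZero() {
			continue
		}
		avail, ok := free[name]
		if !ok {
			return 0 // the node has none of a requested resource
		}
		// Integer division on milli-scale values: how many whole
		// requests fit into the available amount of this resource.
		n := avail.MilliValue() / req.MilliValue()
		if fit == -1 || n < fit {
			fit = n
		}
	}
	if fit < 0 {
		return 0 // no positive requests; treat as 0 for this toy
	}
	return fit
}

func main() {
	perPod := corev1.ResourceList{"nvidia.com/gpu": resource.MustParse("8")}
	node1 := corev1.ResourceList{"nvidia.com/gpu": resource.MustParse("4")}
	node2 := corev1.ResourceList{"nvidia.com/gpu": resource.MustParse("4")}

	total := podsFitting(node1, perPod) + podsFitting(node2, perPod)
	// Prints 0: no single node fits an 8-GPU pod, even though the nodes
	// together have 8 GPUs, so the workload would not be admitted.
	fmt.Println(total)
}
```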
I see, thanks for the clarification. I think I had misunderstood the term empirically, based on the tests I conducted using PyTorchJob, which looks to have a bug in its integration with TAS. To describe it here shortly: when the cluster queue has no workloads at all, I could see that pods still go to different hosts even when using the 'preferred' annotation. I can't bring specific manifests for some reason, but I could see an event when I describe the PyTorchJob workload (2 worker pods with 4 GPUs each, and there are nodes in the cluster each having 4 GPUs left) which is not correct. I also see some ungating bug in the MPIJob integration. It'd be good for me to create an issue, but I'm afraid I can't raise specific issues on GitHub at the moment.
Cool, this -3 is obviously a bug. I will try to repro and fix it for 0.9.1, but in the meantime any additional repro steps will be very helpful for us. Is this just bad logging, or do you think the workloads should get admitted? Same for the ungating bug. Any details you can share would be greatly useful and could shorten the time to a fix.
Could you open another issue?
+1 for opening an issue. I understand the full YAMLs might be confidential, but you could probably share some of the most important bits of information for repro:
Circling back to this point by @mimowo
As open source developers, we tend to think "just upgrade to the new version that has the feature you want". But the reality of production use is quite different. The CA feature that is needed (the check-capacity ProvisioningRequest) is not yet available on the Kubernetes versions many production clusters are still running.

That leaves TAS, which may be able to fill the need even though it isn't really its intended design point. But features such as preemption and cohorts are also required and not yet supported by TAS. TAS may also be overkill when the "topology" is totally flat (just nodes) and all that is desired is to prevent the churn/waste of useless admission of workloads that contain individual pods that can't possibly be scheduled because no node has the requested capacity.

Perhaps there is a simple version where the TAS node monitoring controller is enabled, but with a degenerate "scheduling" case that just does a capacity check on individual pods as an admission check and doesn't activate the whole pod gating machinery.
Thank you for the summary. I totally understand the production reality where you need to support users on 1.29 whilst it is not yet supported by CA check-capacity. In principle TAS should be enough in the degenerate single-level case, but making it compatible with preemption and cohorts is non-trivial and will probably take until Q2 2025. So, I can see some options:
I would be leaning towards (2.), as it would also serve as an example of creating an external AC. It would also let us learn what the obstacles are when building an external AC, and we could reference it in the docs, so win-win IMO. In fact, we did something similar in the past for very similar reasons: we developed the Pod integration based on taints and tolerations, which was needed quickly for our users on 1.25 (IIRC, prior to the introduction of scheduling gates). WDYT @tenzen-y?
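For context on what an external AC of this kind might involve, here is a minimal sketch wired up with controller-runtime against the kueue v1beta1 API. The check name `node-capacity`, the `fitsSomewhere` stub, and the direct status update are assumptions and simplifications; a real controller would use Kueue's workload helpers, watch Nodes/Pods, and handle retries:

```go
// Minimal sketch of an external AdmissionCheck controller built with
// controller-runtime against the kueue v1beta1 API; not production code.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

const checkName = "node-capacity" // assumed AdmissionCheck name

type reconciler struct{ client.Client }

// fitsSomewhere is a stand-in for a real per-node capacity check, such as
// the fitsOnSomeNode sketch earlier in this thread.
func fitsSomewhere(ps *kueue.PodSet) bool { return true }

func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var wl kueue.Workload
	if err := r.Get(ctx, req.NamespacedName, &wl); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Only act on workloads that actually carry this admission check;
	// Kueue initializes the state to Pending after quota reservation.
	for i := range wl.Status.AdmissionChecks {
		ac := &wl.Status.AdmissionChecks[i]
		if ac.Name != checkName || ac.State != kueue.CheckStatePending {
			continue
		}
		state := kueue.CheckStateReady
		for j := range wl.Spec.PodSets {
			if !fitsSomewhere(&wl.Spec.PodSets[j]) {
				state = kueue.CheckStateRetry // or Rejected, depending on policy
				break
			}
		}
		ac.State = state
		ac.LastTransitionTime = metav1.Now()
		// A real controller would use Kueue's status helpers and handle
		// conflicts; a plain status update keeps the sketch short.
		return ctrl.Result{}, r.Status().Update(ctx, &wl)
	}
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	if err := kueue.AddToScheme(mgr.GetScheme()); err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&kueue.Workload{}).
		Complete(&reconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```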
What would you like to be added:
I would like to use an Admission Check Controller which admits a job only if there is at least 1 Node with enough available resources to support the execution of the workload without oversubscribing the node. This admission check could be made optional.
Why is this needed:
By default, Kueue does not check whether there are nodes with enough available resources for incoming workloads, which can lead to Kueue jobs taking up ClusterQueue quota even though they cannot actually run on the cluster. The proposed Admission Check Controller could help increase hardware utilization by only admitting workloads that actually have a chance of being scheduled on a Node.
Here is an example scenario that motivates the use of the proposed AdmissionCheck Controller:
The current system capacity is:
At this point a user creates a new workload for a Job with a single Pod that requests 3 cores.
Currently, Kueue will admit this workload because the available capacity of the ClusterQueue is 4 cores, which is greater than the CPU request of the Job. However, no node has enough available cores for this workload.
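To make the gap concrete, here is a toy comparison of the quota-level check Kueue performs today and the node-level check being requested. The per-node free capacities of 2 and 2 cores are assumed purely for illustration; only the 4 cores of available quota and the 3-core request come from the scenario above:

```go
// Quota-level vs node-level admission, with assumed per-node numbers.
package main

import "fmt"

func main() {
	const (
		quotaAvailableCores = 4 // unused ClusterQueue quota (from the scenario)
		podRequestCores     = 3 // the new Job's single pod
	)
	// Assumed per-node free capacity; the scenario only states that no
	// single node has 3 cores free.
	nodeFreeCores := []int{2, 2}

	quotaOK := podRequestCores <= quotaAvailableCores
	nodeOK := false
	for _, free := range nodeFreeCores {
		if podRequestCores <= free {
			nodeOK = true
		}
	}
	// Prints: quota check passes=true, node fit check passes=false,
	// i.e. the workload consumes quota today but can never be scheduled.
	fmt.Printf("quota check passes=%v, node fit check passes=%v\n", quotaOK, nodeOK)
}
```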
Even though pod preemption by the kube-scheduler is a possibility, the user would typically not want the order of execution to change after submitting workloads. This ACC would serve as a guard rail to ensure that.
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.