Add GPU Optimization Proposal #1562
Conversation
/sig aws
@Jeffwan: GitHub didn't allow me to request PR reviews from the following users: seh, ringtail. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
### Proposed Solution
1. Either move the `utils/gpu` logic into the cloud provider, or add a new command-line option that tells CA which GPU node label the cloud provider uses.
2. The scale-down case is trickier: training and serving are different use cases with conflicting requirements, and fitting GPU resources into the utilization formula doesn't solve this. Instead, I think it's better to have a flag indicating whether GPU nodes can be scaled down.
Why doesn't using GPU in the resource utilization formula solve the issue? The problem I'm aware of is high CPU or memory utilization preventing scale-down of nodes with GPUs. That can be solved by changing the utilization calculation to only care about GPU, as described in #1367 (comment).
Note that the utilization check is just one condition for scale-down (and one that we should ideally get rid of at some point anyway). Having low utilization doesn't mean a node will actually be deleted. We have plenty of mechanisms for preventing scale-down if you don't want a particular node removed (the scale-down-disabled annotation) or a specific pod restarted (the safe-to-evict: false annotation and PDBs, to name just a few).
Unless there is a use case that can't be addressed with existing mechanisms, I'd rather we improve documentation and recommend an existing solution than add another flag.
For GPU nodes, in the training case, the GPU utilization rate is not helpful for making the decision.
For example, take an 8-GPU machine running 2 tasks, one using 6 GPUs and the other using 2 GPUs. Once the first task finishes and its GPUs are released, the GPU utilization rate is 25%. But the node cannot be killed, because failing the remaining task would fail the entire training job. It's a true-or-false question, not a comparison against a threshold.
That's why I think a utilization formula that ignores CPU and memory cannot resolve this issue.
The GPU utilization rate will definitely help in the model inference case. Most of those workloads are stateless and can easily be migrated to other nodes to reduce cost.
I looked at the comment, and it's reasonable to create a PDB and annotate the node. The catch is that most batch/ML workloads are short-lived, so it gets tedious to do that when jobs are created frequently.
I think maybe we can make the training case more generic, e.g. a flag indicating whether a node can be removed immediately (killing in-flight jobs) or only once its workloads are done. It looks like this is not supported yet, but it's important for batch jobs.
"Note that utilization check is just one condition for scale-down (and one that we should ideally get rid of at some point anyway)." Do you mean in the future CA has plan to add some other triggers? I think entire cluster node utilization might be a reasonable one. That tradeoff here is scale up speed vs cost. Do you think this feature can be added?
Utilization is not a trigger, it's only the first step in the scale-down decision, a precondition. The other part is the 'drain simulation' that calculates whether all pods are safe to move (whether they fit elsewhere in the cluster, whether they're owned by a controller that will restart them, whether PDBs allow moving them, all the stuff mentioned in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node).
The drain part is the important one; the utilization part is there to limit the number of evictions (don't restart pods if the gain is relatively small) and for performance reasons (avoid the expensive drain simulation if it's unlikely the node can be deleted).
Regarding your use case: it seems like putting "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" on your pods is a simple solution.
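For illustration only (not claiming this is how anyone's workload is set up), a minimal sketch of a pod definition carrying that annotation, written with the Go client types; the pod name, image, and GPU count are placeholders:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// trainingPod builds a GPU training pod that Cluster Autoscaler will not
// evict during scale-down, thanks to the safe-to-evict annotation.
func trainingPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "training-worker", // placeholder name
			Annotations: map[string]string{
				"cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
			},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "example.com/training:latest", // placeholder image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						"nvidia.com/gpu": resource.MustParse("2"), // placeholder GPU count
					},
				},
			}},
		},
	}
}
```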
+1 to defining this per job, not per cluster. Users may want to run diverse workloads, that's what Kubernetes is for :)
Agreed. Using a per-job label for fine-grained control makes sense. At the beginning I thought about covering both cases with a cluster-level flag, which would have made things a bit confusing and somewhat contradictory.
GKE uses the GPU label `cloud.google.com/gke-accelerator` to check whether a node is an accelerator node; other cloud providers don't define their labels yet. For CA to behave normally in a GPU cluster, it's essential to have this GPU label.
#### Scale too much without GPU label support
Pending pods will trigger a scale-up decision to increase the number of nodes in a node group. When a new node is added, CA knows it is `upcoming` and that some pods will go onto that node once it boots up, so it won't trigger another scale-up for those pods. However, the device plugin only starts after the node becomes ready, and the extended resource the pods require shows up in the node spec even later. CA will see that it was wrong and that the pods for which it triggered scale-up can't actually go onto the node it added for them (because the node doesn't have the special resource yet and CA doesn't know it will be added later). So CA will go and create more nodes.
CA will see that it was wrong and that the pods for which it triggered scale-up can't actually go onto the node it added for them (because the node doesn't have the special resource yet and CA doesn't know it will be added later). So CA will go and create more nodes.
Not really: CA will look at the template node for a node group to verify whether the node is expected to have GPUs. If so, it'll wait for them and not add surplus nodes.
@aleksandra-malinowska Yeah, those are different cases or phases.
What you mentioned covers two cases:
1. The node group is not a GPU family but the job requires GPUs; scaling up is not helpful.
2. The node group is a GPU family and the job requires GPUs; CA will check upcoming nodes and not add surplus nodes until the node is ready.
The situation I hit is the second one. The problem is that GPUs take time to become allocatable: the newly created node is ready, but the GPU resource is not ready yet, so the autoscaler thinks the pods are unschedulable and creates more nodes. This is hacked around in GKE but not in other cloud providers yet.
The node group is not a GPU family but the job requires GPUs; scaling up is not helpful.
This shouldn't happen if the GPU resource of the nodes (or the node template, if a group is empty) is set correctly. EDIT: do you mean that without the label, a node with unready GPUs can masquerade as a non-GPU node?
This is hacked around in GKE but not in other cloud providers yet.
To be fair, I think there were successful user reports of applying the GKE label to make it work elsewhere ;) But I agree it's not perfect; we could probably move it to a constant in the cloud provider or something.
The node group is not a GPU family but the job requires GPUs; scaling up is not helpful.
This shouldn't happen if the GPU resource of the nodes (or the node template, if a group is empty) is set correctly. EDIT: do you mean that without the label, a node with unready GPUs can masquerade as a non-GPU node?
No. As you said, this shouldn't happen. I mean the case where a user schedules GPU workloads on a non-GPU cluster; CA already avoids scaling up there because the pods can't fit into that kind of node group.
This is hacked around in GKE but not in other cloud providers yet.
To be fair, I think there were successful user reports of applying the GKE label to make it work elsewhere ;) But I agree it's not perfect; we could probably move it to a constant in the cloud provider or something.
Yes, definitely. I think every cloud provider can use this hack, but each has its own GPU label. Either support a custom GPU label via arguments, or move it into the cloud providers (preferred).
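To illustrate the "move it to the cloud providers" direction, here is a rough sketch (not an agreed interface) of what a per-provider GPU label hook could look like; the method name and the non-GKE label value are assumptions:

```go
package cloudprovider

// gpuLabelProvider is a hypothetical hook each cloud provider could implement
// so that CA no longer hard-codes the GKE label.
type gpuLabelProvider interface {
	// GPULabel returns the node label this provider uses to mark GPU nodes.
	GPULabel() string
}

type gkeCloudProvider struct{}

func (gkeCloudProvider) GPULabel() string {
	return "cloud.google.com/gke-accelerator" // label mentioned in this thread
}

type awsCloudProvider struct{}

func (awsCloudProvider) GPULabel() string {
	return "k8s.amazonaws.com/accelerator" // illustrative value for this sketch
}
```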
#### GPU Resource is not given consideration in scaling down
CA filters candidates by low node CPU and memory utilization in every loop.

* If CPU/memory utilization is high, scale-down will not be triggered even when all GPUs are idle.
Most users add taints to their GPU node groups that prevent non-GPU pods from running there. Note that even without autoscaling, if you don't do that, your GPU job may not run at all if too many non-GPU pods are scheduled on an unprotected node with idle GPUs.
If we're talking only about GPU workloads that could be spread across other nodes blocking scale-down, that can be solved by raising the utilization threshold.
- Agreed, this is what users normally do for training jobs. There's one case I'm not sure is common enough: in TensorFlow distributed training, the parameter servers and workers are deployed in the same GPU cluster, but the PS pods mostly use CPU and memory while the workers use GPUs.
- "If we're talking only about GPU workloads that could be spread across other nodes blocking scale-down": that's the inference case, where workloads may be mixed together on some economical GPU nodes.
The problem with "economic" solution of mixing GPU and non-GPU pods on the same node is that unless you very carefully control scheduling, there's always a risk that GPU will end up idle, which is a rather expensive way of running non-GPU pods.
Either way, I still don't see what is the problem here that couldn't be solved with just safe-to-evict and raising utilization threshold?
I agree with the safe-to-evict solution for training jobs.
However, GPU resources are not even considered in the utilization calculation here, so raising the utilization threshold won't help this case:
autoscaler/cluster-autoscaler/simulator/cluster.go (lines 155 to 164 in 14ed6ec):

```go
func CalculateUtilization(node *apiv1.Node, nodeInfo *schedulercache.NodeInfo, skipDaemonSetPods, skipMirrorPods bool) (utilInfo UtilizationInfo, err error) {
	cpu, err := calculateUtilizationOfResource(node, nodeInfo, apiv1.ResourceCPU, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	mem, err := calculateUtilizationOfResource(node, nodeInfo, apiv1.ResourceMemory, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	return UtilizationInfo{CpuUtil: cpu, MemUtil: mem, Utilization: math.Max(cpu, mem)}, nil
}
```
I think we can make some improvements to this part.
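As a rough sketch of the kind of improvement meant here (assuming it sits in the same package as the snippet above; the `nvidia.com/gpu` resource name and the GPU-only result for GPU nodes are assumptions, not a concrete proposal), the calculation could branch on whether the node advertises GPUs:

```go
// calculateGPUAwareUtilization is a sketch: for nodes that advertise GPUs it
// reuses calculateUtilizationOfResource on the GPU resource and ignores
// CPU/memory; other nodes fall back to the existing CalculateUtilization.
func calculateGPUAwareUtilization(node *apiv1.Node, nodeInfo *schedulercache.NodeInfo,
	skipDaemonSetPods, skipMirrorPods bool) (UtilizationInfo, error) {

	gpuResource := apiv1.ResourceName("nvidia.com/gpu") // assumed resource name
	if _, hasGPU := node.Status.Allocatable[gpuResource]; hasGPU {
		gpuUtil, err := calculateUtilizationOfResource(node, nodeInfo, gpuResource, skipDaemonSetPods, skipMirrorPods)
		if err != nil {
			return UtilizationInfo{}, err
		}
		// Judge GPU nodes only on their GPU usage.
		return UtilizationInfo{Utilization: gpuUtil}, nil
	}
	// Non-GPU nodes keep the current max(cpu, mem) behaviour.
	return CalculateUtilization(node, nodeInfo, skipDaemonSetPods, skipMirrorPods)
}
```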
To be clear: this comment refers to the section of the doc concerned with idle GPUs plus other pods using up CPU/memory above the threshold. I'm fairly certain raising the threshold will help with that.
I'm not sure what benefit is to be gained from adding GPU utilization into the mix, though, other than consistency?
@seh Your argument applies to any workload, regardless of whether it uses GPUs or not. It's the autoscaler's job to kick pods around to limit resource usage. An application may not agree to be restarted, in which case it must specify that somehow.
Why should the pod author need to learn about how to convince this particular autoscaler
A PDB is the generic Kubernetes way of telling all automation (including CA) not to touch a given pod. The annotation exists because many people requested a way to specifically prevent CA from touching a pod while still allowing other automation to restart it.
The machine has one GPU, and the pod is using one GPU
Which means GPU utilization for the node is 100%. Once we add the GPU utilization check described in the comments above, CA will not touch that node. You'll also be able to set the threshold to 0, in which case no node will be removed if any of its GPUs are in use.
OK, how should the autoscaler determine which pods to move and which not to move? [...] For you, just using GPUs may mean you don't want the pod evicted, ever - but this may simply not work for other users.
The autoscaler normally doesn't "move" pods unless it thinks they're not utilizing a machine's resources adequately. That situation does not—or should not—apply here.
In my example, we have a machine with one GPU, a pod running there that requests and is using one GPU, and no other nodes advertising available GPUs. Why would an autoscaler consider this pod as a candidate to move? By analogy, if a machine has one CPU and a pod there is "using" (really, requested) the whole CPU, the autoscaler wouldn't consider moving that pod. Why do we trust use of a CPU but not use of a GPU?
it can probably be addressed similarly to adding a toleration you described - by admission controller or other automation, which can add annotations or set up PDBs (maxUnavailable: 0 should have the same effect).
The toleration I mentioned is a convenience provided by the "ExtendedResourceToleration" admission control plugin; we used to write them manually.
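For reference, a minimal sketch of the `maxUnavailable: 0` PDB mentioned above, expressed with the Go policy/v1 types; the object name and label selector are placeholders:

```go
package example

import (
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// gpuJobPDB blocks all voluntary disruptions (including CA-driven eviction)
// for pods matching the selector while the job runs.
func gpuJobPDB() *policyv1.PodDisruptionBudget {
	zero := intstr.FromInt(0)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-training-pdb"}, // placeholder name
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &zero,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"job": "gpu-training"}, // placeholder selector
			},
		},
	}
}
```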
Which means GPU utilization for the node is 100%. Once we add the GPU utilization check described in the comments above, CA will not touch that node.
Well, that's great news! That's what I'm asking for here. If we're going to count GPU utilization and not move pods that are utilizing the machine's resources adequately, then I have no further complaint.
no other nodes advertising available GPUs
In this case the pod won't be moved. Only if you had another node with an unused GPU (e.g. a machine with 4 GPUs, 1 of them unused) might it be moved; after the change discussed here, it won't be moved in that case either. Is that clearer?
Yes, that helps. Thank you for clarifying.
It seems like there are 3 problems mentioned:
Note that (1) and (2) are contradictory. Rather than the proposed per-cluster flag, fine-grained control can be achieved with a combination of the safe-to-evict annotation and the configurable utilization threshold (both already supported). (3) seems easy enough to solve (move the label constant to the cloud provider), unless I'm missing something. @MaciekPytel WDYT?
I will update the documentation and concentrate on two things.
@MaciekPytel @aleksandra-malinowska Please let me know if the updated plan makes sense to you. I will make the corresponding changes in this doc.
agree!
@ringtail Right now this is a piece of common code. I will make it as generic as I can as the GPU-related code moves to the cloud provider.
Having the cloud provider supply the GPU label seems like the right thing to do.
I don't think we should expose knobs in the scale-down heuristics to the cloud provider. If there is a good use case for keeping the CPU/memory formula for nodes with GPUs (I'm not aware of one, but I'd like to learn about it), it should be controlled with a flag.
@aleksandra-malinowska Sounds good, that covers everything. I will clean up the doc and work on fixes for the two issues. A PR will come soon.
@Jeffwan When the PR comes out, please cc the maintainers of the different cloud providers. Thanks a lot.
A couple of basic tweaks to grammar and spelling.
@Jeffwan - looks like we are waiting for you to fix the grammar/typos.
Ping
Thanks for the reminder. I will update the doc today. @mwielgus
The changes proposed here were already implemented by @Jeffwan and merged, no need to keep this open.
Proposal for GPU Optimization in Cluster Autoscaler.
I noticed there are several issues (#1367, #1135) discussing problems with GPU support, and I'd like to make some changes to better support this use case. These issues have been discussed in the original tickets, but an implementation is still missing.
Please feel free to leave comments and let me know if this makes sense.