Add GPU Optimization Proposal #1562
Conversation
/sig aws
@Jeffwan: GitHub didn't allow me to request PR reviews from the following users: seh, ringtail. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
### Proposed Solution
1. Either move the `utils/gpu` logic into the cloud provider, or add a new command-line option that tells CA which GPU node label the cloud provider uses.
2. The scale-down case is trickier: training and serving are different use cases with conflicting requirements, and fitting GPU resources into the utilization formula doesn't solve this. Instead, I think it's better to have a flag indicating whether GPU nodes can be scaled down.
Why doesn't using GPU in the resource utilization formula solve the issue? The problem I'm aware of is high CPU or memory utilization preventing scale-down of nodes with GPUs. That can be solved by changing the utilization calculation to only care about GPU, as described in #1367 (comment).
Note that the utilization check is just one condition for scale-down (and one that we should ideally get rid of at some point anyway). Having low utilization doesn't mean a node will actually be deleted. We have plenty of mechanisms for preventing scale-down if you don't want a particular node removed (the scale-down-disabled annotation) or a specific pod restarted (the safe-to-evict: false annotation and PDBs, to name just a few).
Unless there is a use case that can't be addressed with existing mechanisms, I'd rather we improve documentation and recommend an existing solution than add another flag.
For GPU nodes, in the training case, the GPU utilization rate is not helpful for making the decision.
For example, take an 8-GPU machine running 2 tasks, one using 6 GPUs and the other using 2 GPUs. Once the first task finishes and its GPUs are released, the GPU utilization rate is 25%. But the node cannot be killed, because failing the remaining task would fail the entire training job. It's a true-or-false question, not a comparison against a threshold.
That's why I think a utilization formula that ignores CPU and memory cannot resolve this issue.
The GPU utilization rate will definitely help in the model inference case. Most of those workloads are stateless and can easily be migrated to other nodes to reduce cost.
I looked at the comment, and it's reasonable to create a PDB and annotate the node. The catch is that most batch/ML workloads are short-lived, so it gets tedious to do that when jobs are created frequently.
I think maybe we can make the training case more generic, e.g. a flag indicating whether a node can be removed immediately (killing in-flight jobs) or only once its workloads are done. It looks like this is not supported yet, but it's important for batch jobs.
"Note that utilization check is just one condition for scale-down (and one that we should ideally get rid of at some point anyway)." Do you mean in the future CA has plan to add some other triggers? I think entire cluster node utilization might be a reasonable one. That tradeoff here is scale up speed vs cost. Do you think this feature can be added?
Utilization is not a trigger, it's only the first step in the scale-down decision, a precondition. The other part is the 'drain simulation' that calculates whether all pods are safe to move (whether they fit elsewhere in the cluster, whether they're owned by a controller that will restart them, whether PDBs allow moving them, all the stuff mentioned in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node).
The drain part is the important one; the utilization part is there to limit the number of evictions (don't restart pods if the gain is relatively small) and for performance reasons (avoid the expensive drain simulation if it's unlikely the node can be deleted).
Regarding your use case: it seems like putting "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" on your pods is a simple solution.
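For illustration only (not claiming this is how anyone's workload is set up), a minimal sketch of a pod definition carrying that annotation, written with the Go client types; the pod name, image, and GPU count are placeholders:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// trainingPod builds a GPU training pod that Cluster Autoscaler will not
// evict during scale-down, thanks to the safe-to-evict annotation.
func trainingPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "training-worker", // placeholder name
			Annotations: map[string]string{
				"cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
			},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "example.com/training:latest", // placeholder image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						"nvidia.com/gpu": resource.MustParse("2"), // placeholder GPU count
					},
				},
			}},
		},
	}
}
```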
+1 to defining this per job, not per cluster. Users may want to run diverse workloads, that's what Kubernetes is for :)
Agreed. Using a per-job label for fine-grained control makes sense. At the beginning I thought about covering both cases with a cluster-level flag, which would have made things a bit confusing and somewhat contradictory.
GKE uses the GPU label `cloud.google.com/gke-accelerator` to check whether a node is an accelerator node; other cloud providers don't define their labels yet. For CA to behave normally in a GPU cluster, it's essential to have this GPU label.
#### Scale too much without GPU label support
Pending pods will trigger a scale-up decision to increase the number of nodes in a node group. When a new node is added, CA knows it is `upcoming` and that some pods will go onto that node once it boots up, so it won't trigger another scale-up for those pods. However, the device plugin only starts after the node becomes ready, and the extended resource the pods require shows up in the node spec even later. CA will see that it was wrong and that the pods for which it triggered scale-up can't actually go onto the node it added for them (because the node doesn't have the special resource yet and CA doesn't know it will be added later). So CA will go and create more nodes.
CA will see that it was wrong and that the pods for which it triggered scale-up can't actually go onto the node it added for them (because the node doesn't have the special resource yet and CA doesn't know it will be added later). So CA will go and create more nodes.
Not really: CA will look at the template node for a node group to verify whether the node is expected to have GPUs. If so, it'll wait for them and not add surplus nodes.
@aleksandra-malinowska Yeah, those are different cases or phases.
What you mentioned covers two cases:
1. The node group is not a GPU family but the job requires GPUs; scaling up is not helpful.
2. The node group is a GPU family and the job requires GPUs; CA will check upcoming nodes and not add surplus nodes until the node is ready.
The situation I hit is the second one. The problem is that GPUs take time to become allocatable: the newly created node is ready, but the GPU resource is not ready yet, so the autoscaler thinks the pods are unschedulable and creates more nodes. This is hacked around in GKE but not in other cloud providers yet.
The node group is not a GPU family but the job requires GPUs; scaling up is not helpful.
This shouldn't happen if the GPU resource of the nodes (or the node template, if a group is empty) is set correctly. EDIT: do you mean that without the label, a node with unready GPUs can masquerade as a non-GPU node?
This is hacked around in GKE but not in other cloud providers yet.
To be fair, I think there were successful user reports of applying the GKE label to make it work elsewhere ;) But I agree it's not perfect; we could probably move it to a constant in the cloud provider or something.
The node group is not a GPU family but the job requires GPUs; scaling up is not helpful.
This shouldn't happen if the GPU resource of the nodes (or the node template, if a group is empty) is set correctly. EDIT: do you mean that without the label, a node with unready GPUs can masquerade as a non-GPU node?
No. As you said, this shouldn't happen. I mean the case where a user schedules GPU workloads on a non-GPU cluster; CA already avoids scaling up there because the pods can't fit into that kind of node group.
This is hacked around in GKE but not in other cloud providers yet.
To be fair, I think there were successful user reports of applying the GKE label to make it work elsewhere ;) But I agree it's not perfect; we could probably move it to a constant in the cloud provider or something.
Yes, definitely. I think every cloud provider can use this hack, but each has its own GPU label. Either support a custom GPU label via arguments, or move it into the cloud providers (preferred).
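To illustrate the "move it to the cloud providers" direction, here is a rough sketch (not an agreed interface) of what a per-provider GPU label hook could look like; the method name and the non-GKE label value are assumptions:

```go
package cloudprovider

// gpuLabelProvider is a hypothetical hook each cloud provider could implement
// so that CA no longer hard-codes the GKE label.
type gpuLabelProvider interface {
	// GPULabel returns the node label this provider uses to mark GPU nodes.
	GPULabel() string
}

type gkeCloudProvider struct{}

func (gkeCloudProvider) GPULabel() string {
	return "cloud.google.com/gke-accelerator" // label mentioned in this thread
}

type awsCloudProvider struct{}

func (awsCloudProvider) GPULabel() string {
	return "k8s.amazonaws.com/accelerator" // illustrative value for this sketch
}
```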
#### GPU Resource is not given consideration in scaling down
CA filters candidates by low node CPU and memory utilization in every loop.

* If CPU/memory utilization is high, scale-down will not be triggered even when all GPUs are idle.
Most users add taints to their GPU node groups that prevent non-GPU pods from running there. Note that even without autoscaling, if you don't do that, your GPU job may not run at all if too many non-GPU pods are scheduled on an unprotected node with idle GPUs.
If we're talking only about GPU workloads that could be spread across other nodes blocking scale-down, that can be solved by raising the utilization threshold.
- Agreed, this is what users normally do for training jobs. There's one case I'm not sure is common enough: in TensorFlow distributed training, the parameter servers and workers are deployed in the same GPU cluster, but the PS pods mostly use CPU and memory while the workers use GPUs.
- "If we're talking only about GPU workloads that could be spread across other nodes blocking scale-down": that's the inference case, where workloads may be mixed together on some economical GPU nodes.
The problem with "economic" solution of mixing GPU and non-GPU pods on the same node is that unless you very carefully control scheduling, there's always a risk that GPU will end up idle, which is a rather expensive way of running non-GPU pods.
Either way, I still don't see what is the problem here that couldn't be solved with just safe-to-evict and raising utilization threshold?
I agree with the safe-to-evict solution for training jobs.
However, GPU resources are not even considered in the utilization calculation here, so raising the utilization threshold won't help this case:
autoscaler/cluster-autoscaler/simulator/cluster.go (lines 155 to 164 in 14ed6ec):

```go
func CalculateUtilization(node *apiv1.Node, nodeInfo *schedulercache.NodeInfo, skipDaemonSetPods, skipMirrorPods bool) (utilInfo UtilizationInfo, err error) {
	cpu, err := calculateUtilizationOfResource(node, nodeInfo, apiv1.ResourceCPU, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	mem, err := calculateUtilizationOfResource(node, nodeInfo, apiv1.ResourceMemory, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	return UtilizationInfo{CpuUtil: cpu, MemUtil: mem, Utilization: math.Max(cpu, mem)}, nil
}
```
I think we can make some improvements to this part.
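As a rough sketch of the kind of improvement meant here (assuming it sits in the same package as the snippet above; the `nvidia.com/gpu` resource name and the GPU-only result for GPU nodes are assumptions, not a concrete proposal), the calculation could branch on whether the node advertises GPUs:

```go
// calculateGPUAwareUtilization is a sketch: for nodes that advertise GPUs it
// reuses calculateUtilizationOfResource on the GPU resource and ignores
// CPU/memory; other nodes fall back to the existing CalculateUtilization.
func calculateGPUAwareUtilization(node *apiv1.Node, nodeInfo *schedulercache.NodeInfo,
	skipDaemonSetPods, skipMirrorPods bool) (UtilizationInfo, error) {

	gpuResource := apiv1.ResourceName("nvidia.com/gpu") // assumed resource name
	if _, hasGPU := node.Status.Allocatable[gpuResource]; hasGPU {
		gpuUtil, err := calculateUtilizationOfResource(node, nodeInfo, gpuResource, skipDaemonSetPods, skipMirrorPods)
		if err != nil {
			return UtilizationInfo{}, err
		}
		// Judge GPU nodes only on their GPU usage.
		return UtilizationInfo{Utilization: gpuUtil}, nil
	}
	// Non-GPU nodes keep the current max(cpu, mem) behaviour.
	return CalculateUtilization(node, nodeInfo, skipDaemonSetPods, skipMirrorPods)
}
```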
To be clear: this comment refers to the section of the doc concerned with idle GPUs plus other pods using up CPU/memory above the threshold. I'm fairly certain raising the threshold will help with that.
I'm not sure what benefit is to be gained from adding GPU utilization into the mix, though, other than consistency?
@seh Your argument applies to any workload, regardless of whether it uses GPUs or not. It's the autoscaler's job to kick pods around to limit resource usage. An application may not agree to be restarted, in which case it must specify that somehow.
Why should the pod author need to learn about how to convince this particular autoscaler
A PDB is the generic Kubernetes way of telling all automation (including CA) not to touch a given pod. The annotation exists because many people requested a way to specifically prevent CA from touching a pod while still allowing other automation to restart it.
The machine has one GPU, and the pod is using one GPU
Which means GPU utilization for the node is 100%. Once we add the GPU utilization check described in the comments above, CA will not touch that node. You'll also be able to set the threshold to 0, in which case no node will be removed if any of its GPUs are in use.
OK, how should the autoscaler determine which pods to move and which not to move? [...] For you, just using GPUs may mean you don't want the pod evicted, ever - but this may simply not work for other users.
The autoscaler normally doesn't "move" pods unless it thinks they're not utilizing a machine's resources adequately. That situation does not—or should not—apply here.
In my example, we have a machine with one GPU, a pod running there that requests and is using one GPU, and no other nodes advertising available GPUs. Why would an autoscaler consider this pod as a candidate to move? By analogy, if a machine has one CPU and a pod there is "using" (really, requested) the whole CPU, the autoscaler wouldn't consider moving that pod. Why do we trust use of a CPU but not use of a GPU?
it can probably be addressed similarly to adding a toleration you described - by admission controller or other automation, which can add annotations or set up PDBs (maxUnavailable: 0 should have the same effect).
The toleration I mentioned is a convenience provided by the "ExtendedResourceToleration" admission control plugin; we used to write them manually.
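For reference, a minimal sketch of the `maxUnavailable: 0` PDB mentioned above, expressed with the Go policy/v1 types; the object name and label selector are placeholders:

```go
package example

import (
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// gpuJobPDB blocks all voluntary disruptions (including CA-driven eviction)
// for pods matching the selector while the job runs.
func gpuJobPDB() *policyv1.PodDisruptionBudget {
	zero := intstr.FromInt(0)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-training-pdb"}, // placeholder name
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &zero,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"job": "gpu-training"}, // placeholder selector
			},
		},
	}
}
```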
Which means GPU utilization for the node is 100%. Once we add the GPU utilization check described in the comments above, CA will not touch that node.
Well, that's great news! That's what I'm asking for here. If we're going to count GPU utilization and not move pods that are utilizing the machine's resources adequately, then I have no further complaint.
no other nodes advertising available GPUs
In this case the pod won't be moved. Only if you had another node with an unused GPU (e.g. a machine with 4 GPUs, 1 of them unused) might it be moved; after the change discussed here, it won't be moved in that case either. Is that clearer?
Yes, that helps. Thank you for clarifying.
It seems like there are 3 problems mentioned:
Note that (1) and (2) are contradictory. Rather than the proposed per-cluster flag, fine-grained control can be achieved with a combination of the safe-to-evict annotation and the configurable utilization threshold (both already supported). (3) seems easy enough to solve (move the label constant to the cloud provider), unless I'm missing something. @MaciekPytel WDYT?
I will update the documentation and concentrate on two things.
@MaciekPytel @aleksandra-malinowska Please let me know if the updated plan makes sense to you. I will make the corresponding changes in this doc.
agree!
@ringtail Right now this is a piece of common code. I will make it as generic as I can as the GPU-related code moves to the cloud provider.
Having the cloud provider supply the GPU label seems like the right thing to do.
I don't think we should expose knobs in the scale-down heuristics to the cloud provider. If there is a good use case for keeping the CPU/memory formula for nodes with GPUs (I'm not aware of one, but I'd like to learn about it), it should be controlled with a flag.
@aleksandra-malinowska Sounds good, that covers everything. I will clean up the doc and work on fixes for the two issues. A PR will come soon.
@Jeffwan When the PR comes out, please cc the maintainers of the different cloud providers. Thanks a lot.
A couple of basic tweaks to grammar and spelling.
@Jeffwan - looks like we are waiting for you to fix the grammar/typos.
Ping
Thanks for the reminder. I will update the doc today. @mwielgus
The changes proposed here were already implemented by @Jeffwan and merged, no need to keep this open.
Proposal for GPU Optimization in Cluster Autoscaler.
I noticed there are several issues (#1367, #1135) discussing problems with GPU support, and I'd like to make some changes to better support this use case. These issues have been discussed in the original tickets, but an implementation is still missing.
Please feel free to leave comments and let me know if this makes sense.