Add GPU Optimization Proposal #1562

Closed

Conversation

@Jeffwan (Contributor) commented Jan 7, 2019

Proposal for GPU Optimization in Cluster Autoscaler.

I noticed there are several issues (#1367, #1135) discussing problems with GPU support, and I'd like to make some changes to better support this use case. The issues have been discussed in the original tickets but lack an implementation.

Please feel free to leave comments and let me know if this makes sense.

@k8s-ci-robot added labels cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and size/M (denotes a PR that changes 30-99 lines, ignoring generated files) on Jan 7, 2019
@k8s-ci-robot requested review from feiskyer and piosz on January 7, 2019 18:31
@d-nishi commented Jan 7, 2019

/sig aws

@Jeffwan (Contributor, Author) commented Jan 7, 2019

/cc @MaciekPytel @seh @ringtail

k8s-ci-robot (Contributor) replied:
@Jeffwan: GitHub didn't allow me to request PR reviews from the following users: seh, ringtail.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @MaciekPytel @seh @ringtail

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


### Proposed Solution
1. Either move the `utils/gpu` logic into the cloud provider, or add a new option passed from the command line to indicate the GPU node label each cloud provider wants to use (see the sketch after this list).
2. The scale-down case is trickier, because training and serving are different use cases with conflicting requirements. Fitting the GPU resource into the utilization formula doesn't solve this by itself. Instead, I think it's better to have a flag indicating whether GPU nodes can be scaled down.
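
As an illustration of item 1, here is a minimal sketch (not the actual Cluster Autoscaler API) of what exposing the GPU label through the cloud provider abstraction could look like. The `GPULabeler` interface, the provider structs, and the AWS label value are assumptions made for the example.

```go
// Sketch only: assumes a GPULabel() method added to the cloud provider
// abstraction so that common CA code no longer hard-codes the GKE label.
package cloudprovider

// GPULabeler is a hypothetical extension of the CloudProvider interface:
// each provider reports the node label it uses to mark GPU nodes.
type GPULabeler interface {
	// GPULabel returns the label key added to nodes that have (or will have) GPUs.
	GPULabel() string
}

type gceProvider struct{}

func (p *gceProvider) GPULabel() string { return "cloud.google.com/gke-accelerator" }

type awsProvider struct{}

// Label key is illustrative; AWS had no agreed-upon GPU label at the time.
func (p *awsProvider) GPULabel() string { return "k8s.amazonaws.com/accelerator" }

// NodeHasGpuLabel replaces the hard-coded check in utils/gpu: common code asks
// the provider for its label instead of assuming the GKE one.
func NodeHasGpuLabel(labels map[string]string, provider GPULabeler) bool {
	_, found := labels[provider.GPULabel()]
	return found
}
```
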
Contributor:

Why doesn't using GPU in the resource utilization formula solve the issue? The problem I'm aware of is high CPU or memory utilization preventing scale-down of nodes with GPUs. This can be solved by changing the utilization calculation to only consider GPU, as described in #1367 (comment).

Note that the utilization check is just one condition for scale-down (and one that we should ideally get rid of at some point anyway). Having low utilization doesn't mean a node will actually be deleted. We have plenty of mechanisms for preventing scale-down if you don't want a particular node removed (the scale-down-disabled annotation) or a specific pod restarted (the safe-to-evict: false annotation and PDBs, to name just a few).
Unless there is a use case that can't be addressed with existing mechanisms, I'd rather we improve documentation and recommend an existing solution rather than add another flag.

@Jeffwan (Contributor, Author) replied Jan 7, 2019:

For a GPU node in the training case, the GPU utilization rate is not helpful for making the decision.
For example, given an 8-GPU machine running 2 tasks, one taking 6 GPUs and the other taking 2 GPUs: once the first task finishes and its GPUs are released, the GPU utilization rate is 25%. But the node cannot be killed, because failing the remaining task would fail the entire training job. It's a true-or-false question, not a comparison against a threshold.

That's why I think ignoring CPU and memory in the utilization formula cannot resolve this issue.
The GPU utilization rate will definitely help in the model inference case, where most workloads are stateless and can easily be migrated to other nodes to reduce cost.

@Jeffwan (Contributor, Author) replied:

I looked at the comment, and creating a PDB and annotating the node is reasonable. What's special here is that most batch/ML workloads are short-lived, so doing that becomes tedious if jobs are created frequently.

@Jeffwan (Contributor, Author) replied Jan 7, 2019:

I think maybe we can make the training case more generic, e.g. a flag to indicate whether we can remove a node immediately (killing in-flight jobs) or should wait until its workloads are done. This doesn't look supported yet, but it is important for batch jobs.

@Jeffwan (Contributor, Author) replied:

"Note that utilization check is just one condition for scale-down (and one that we should ideally get rid of at some point anyway)." Do you mean in the future CA has plan to add some other triggers? I think entire cluster node utilization might be a reasonable one. That tradeoff here is scale up speed vs cost. Do you think this feature can be added?

Contributor:

Utilization is not a trigger, it's only the first step in the scale-down decision, a precondition. The other part is the 'drain simulation', which calculates whether all pods are safe to move (whether they fit elsewhere in the cluster, whether they're owned by a controller that will restart them, whether PDBs allow moving them, all the stuff mentioned in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node).

The drain part is the important one; the utilization part is there to limit the number of evictions (don't restart the pods if the gain is relatively small) and for performance reasons (avoid expensive drain simulation if it's unlikely the node can be deleted).

Regarding your use-case: it seems like putting "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" on your pods is a simple solution.

Contributor:

+1 to defining this per job, not per cluster. Users may want to run diverse workloads, that's what Kubernetes is for :)

@Jeffwan (Contributor, Author) replied:

Agreed. Using a per-job label for fine-grained control makes sense. At the beginning I thought of covering both cases with a cluster-level flag, which would have made it a bit confusing and somewhat contradictory.

GKE uses the GPU label `cloud.google.com/gke-accelerator` to check whether a node is an accelerator node; other cloud providers haven't defined their labels yet. For CA to behave correctly in a GPU cluster, this GPU label is essential.

#### Scaling up too much without GPU label support
Pending pods trigger a scale-up decision to increase the number of nodes in a node group. When a new node is added, CA knows it is `upcoming` and that some pods will land on that node once it boots up, so it won't trigger another scale-up for those pods. However, the device plugin only starts after the node becomes Ready, and the extended resource requested by the pods shows up in the node spec even later. CA will then conclude that it was wrong and that the pods for which it triggered the scale-up can't actually go onto the node it added for them (because the node doesn't have the special resource yet and CA doesn't know it will be added later), so CA will go and create more nodes.
Contributor:
CA will see that it was wrong and the pods for which it triggered scale-up can't actually go to a node it added for them (because it doesn't have the special resource yet and CA doesn't know it will be added later). So CA will go and create more nodes.

Not really; CA will look at the template node for a node group to verify whether the node is expected to have GPUs. If so, it'll wait for them and not add surplus nodes.

@Jeffwan (Contributor, Author) replied Jan 8, 2019:

@aleksandra-malinowska Yeah, those are different cases/phases. What you mentioned covers two cases:

1. The nodeGroup is not a GPU family but the job requires GPU; scaling up is not helpful.
2. The nodeGroup is a GPU family and the job requires GPU; CA checks upcoming nodes and doesn't add surplus nodes until the node is ready.

The situation I'm hitting is the second one. The problem is that GPUs take time to become allocatable: when the newly created node is Ready but the GPU resource is not yet available, the autoscaler thinks the pending pods are unschedulable and creates new nodes. This is hacked around in GKE but not in other cloud providers yet.
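
For reference, that GKE-style workaround could be generalized along these lines once the GPU label comes from the cloud provider. This is only a sketch under that assumption; the function name and the simplified ready/unready split are illustrative, not the merged implementation. The idea: a node that carries the GPU label but does not yet advertise allocatable GPUs keeps being treated as not ready, so CA counts it as upcoming instead of scaling up again.

```go
package gpu

import (
	apiv1 "k8s.io/api/core/v1"
)

// ResourceNvidiaGPU is the extended resource name advertised by the NVIDIA device plugin.
const ResourceNvidiaGPU = "nvidia.com/gpu"

// FilterOutNodesWithUnreadyGpus splits nodes into ready and unready sets:
// a node labeled as a GPU node whose allocatable GPU count is still zero is
// reported as unready, so the autoscaler keeps waiting for it instead of
// adding surplus nodes for the still-pending GPU pods.
func FilterOutNodesWithUnreadyGpus(gpuLabel string, nodes []*apiv1.Node) (ready, unready []*apiv1.Node) {
	for _, node := range nodes {
		if _, hasGpuLabel := node.Labels[gpuLabel]; !hasGpuLabel {
			// Not a GPU node; nothing special to wait for.
			ready = append(ready, node)
			continue
		}
		allocatable, found := node.Status.Allocatable[ResourceNvidiaGPU]
		if !found || allocatable.IsZero() {
			// GPU label present, but the device plugin hasn't exposed the resource yet.
			unready = append(unready, node)
			continue
		}
		ready = append(ready, node)
	}
	return ready, unready
}
```
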

@aleksandra-malinowska (Contributor) replied Jan 8, 2019:

nodeGroup is not GPU family but job requires GPU, scaling up is not helpful.

This shouldn't happen if the GPU resource of the nodes (or the node template, if a group is empty) is set correctly. EDIT: do you mean that without the label, a node with an unready GPU can masquerade as a non-GPU node?

This is hacked in GKE but not other cloud providers yet.

To be fair, I think there were successful user reports of applying the GKE label to make it work elsewhere ;) But I agree it's not perfect; we could probably move it to a constant in the cloud provider or something.

@Jeffwan (Contributor, Author) replied Jan 9, 2019:

nodeGroup is not GPU family but job requires GPU, scaling up is not helpful.

This shouldn't happen if GPU resource of nodes (or node template, if a group is empty) is set correctly. EDIT: do you mean that without the label, node with unready GPU can masquerade as non-GPU node?

No. As you said, this shouldn't happen. I mean the case where a user schedules GPU workloads on a non-GPU cluster; CA already avoids scaling up there because the pods can't fit into that kind of nodeGroup.

This is hacked in GKE but not other cloud providers yet.

To be fair, I think there were successful user reports of applying GKE label to make it work elsewhere ;) But I agree it's not perfect, we could probably move it to a constant in cloud provider or sth.

Yes, definitely. I think all cloud providers can use this hack, but each has its own GPU label. We should either support a custom GPU label via arguments or move it into the cloud providers (preferred).

#### GPU resources are not considered in scale-down
In every loop, CA filters scale-down candidates by low node CPU and memory utilization.

* If CPU or memory utilization is high, scale-down will not be triggered even if all the GPUs are idle.
Contributor:
Most users add taints to their GPU node groups that prevent non-GPU pods from running there. Note that even without autoscaling, if you don't do that, your GPU job may not run at all if too many non-GPU pods are scheduled on an unprotected node with idle GPUs.

If we're talking only about GPU workloads that could be spread across other nodes blocking scale-down, it can be solved by raising utilization threshold.

@Jeffwan (Contributor, Author) replied:

1. Agreed, this is what users normally do for training jobs. There is one case I'm not sure is common enough: for TensorFlow distributed training, parameter servers and workers are deployed on the same GPU cluster, but the PS pods mostly use CPU and memory while the workers use GPU.

2. "If we're talking only about GPU workloads that could be spread across other nodes blocking scale-down": that's the inference case, where workloads could be mixed together on some economical GPU nodes.

Contributor:

The problem with the "economical" solution of mixing GPU and non-GPU pods on the same node is that unless you very carefully control scheduling, there's always a risk that a GPU will end up idle, which is a rather expensive way of running non-GPU pods.

Either way, I still don't see a problem here that couldn't be solved with just safe-to-evict and raising the utilization threshold.

@Jeffwan (Contributor, Author) replied:

I agree with the safe-to-evict solution for training jobs.

But GPU resources are not even considered in the utilization calculation here, so raising the utilization threshold won't help this case:

```go
// Current implementation: only CPU and memory enter the utilization formula.
func CalculateUtilization(node *apiv1.Node, nodeInfo *schedulercache.NodeInfo, skipDaemonSetPods, skipMirrorPods bool) (utilInfo UtilizationInfo, err error) {
	cpu, err := calculateUtilizationOfResource(node, nodeInfo, apiv1.ResourceCPU, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	mem, err := calculateUtilizationOfResource(node, nodeInfo, apiv1.ResourceMemory, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	// GPU (or any other extended resource) is never considered.
	return UtilizationInfo{CpuUtil: cpu, MemUtil: mem, Utilization: math.Max(cpu, mem)}, nil
}
```
I think we can improve this part.
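
One possible direction, sketched against the snippet above. This assumes a `GpuUtil` field on `UtilizationInfo` and reuses `calculateUtilizationOfResource` for the `nvidia.com/gpu` extended resource; it is an illustration of the idea, not the change that was eventually merged.

```go
// Sketch: GPU-aware utilization. For nodes carrying the provider's GPU label,
// base the scale-down utilization on GPU alone; otherwise keep max(cpu, mem).
func CalculateUtilizationWithGpu(node *apiv1.Node, nodeInfo *schedulercache.NodeInfo,
	skipDaemonSetPods, skipMirrorPods bool, gpuLabel string) (UtilizationInfo, error) {

	if _, isGpuNode := node.Labels[gpuLabel]; isGpuNode {
		gpuUtil, err := calculateUtilizationOfResource(node, nodeInfo, "nvidia.com/gpu",
			skipDaemonSetPods, skipMirrorPods)
		if err != nil {
			return UtilizationInfo{}, err
		}
		// CPU and memory are deliberately ignored for GPU nodes: an idle GPU is
		// what should make the node a scale-down candidate.
		return UtilizationInfo{GpuUtil: gpuUtil, Utilization: gpuUtil}, nil
	}

	// Non-GPU nodes keep the existing behaviour.
	return CalculateUtilization(node, nodeInfo, skipDaemonSetPods, skipMirrorPods)
}
```

A separate GPU threshold flag (e.g. a hypothetical ScaleDownGpuUtilizationThreshold) could then gate candidacy the same way ScaleDownUtilizationThreshold does today.
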

@aleksandra-malinowska (Contributor) replied Jan 9, 2019:

To be clear: this comment refers to a section of the doc concerned with idle GPUs plus other pods using up CPU/memory above the threshold. I'm fairly certain raising the threshold will help with that.

I'm not sure what benefit is to be gained from adding GPU utilization into the mix, though, other than consistency.

Contributor:

@seh Your argument applies to any workload, regardless of whether it uses GPU or not. It's the autoscaler's job to kick pods around to limit resource usage. An application may not agree to be restarted, in which case it must specify that somehow.

Why should the pod author need to learn about how to convince this particular autoscaler

A PDB is a generic Kubernetes way of telling all automation (including CA) not to touch a given pod. The annotation exists because many people requested a way to specifically prevent CA from touching a pod while allowing other automation to restart it.

The machine has one GPU, and the pod is using one GPU

Which means GPU utilization for the node is 100%. Once we add the GPU utilization check as described in the comments above, CA will not touch that node. You'll also be able to set the threshold to 0, in which case no node will be removed if any of its GPUs are in use.

Reply:

OK, how should the autoscaler determine which pods to move and which not to move? [...] For you, just using GPUs may mean you don't want the pod evicted, ever - but this may simply not work for other users.

The autoscaler normally doesn't "move" pods unless it thinks they're not utilizing a machine's resources adequately. That situation does not—or should not—apply here.

In my example, we have a machine with one GPU, a pod running there that requests and is using one GPU, and no other nodes advertising available GPUs. Why would an autoscaler consider this pod as a candidate to move? By analogy, if a machine has one CPU and a pod there is "using" (really, requested) the whole CPU, the autoscaler wouldn't consider moving that pod. Why do we trust use of a CPU but not use of a GPU?

it can probably be addressed similarly to adding a toleration you described - by admission controller or other automation, which can add annotations or set up PDBs (maxUnavailable: 0 should have the same effect).

The toleration I mentioned is a convenience provided by the "ExtendedResourceToleration" admission control plugin; we used to write them manually.

Reply:

Which means GPU utilization for node is 100%. Once we add GPU utilization check as described in comments above CA will not touch that node.

Well, that's great news! That's what I'm asking for here. If we're going to count GPU utilization and not move pods that are utilizing the machine's resources adequately, then I have no further complaint.

Contributor:

no other nodes advertising available GPUs

In this case the pod won't be moved. Only if you had another node with an unused GPU (e.g. a machine with 4 GPUs, 1 of them unused) might it be moved. After the change discussed here, it won't be moved in that case either. Is that clearer?

Reply:

Yes, that helps. Thank you for clarifying.

@aleksandra-malinowska (Contributor) commented:
It seems like there are 3 problems mentioned:

1. GPU and non-GPU pods blocking scale-down of GPU nodes based on utilization,
2. GPU pods not blocking scale-down,
3. GKE-specific label used to identify nodes.

Note that (1) and (2) are contradictory. Rather than the proposed per-cluster flag, fine-grained control can be achieved by a combination of the safe-to-evict annotation and a configurable utilization threshold (both already supported).

(3) seems to be easy enough to solve (move label constant to cloud provider), unless I'm missing something. @MaciekPytel WDYT?

@Jeffwan (Contributor, Author) commented Jan 8, 2019

I will update the documentation and concentrate on two things:

1. Try to move GPULabel and GPU-specific code to the cloud provider.
2. Update the utilization rate formula to consider GPU in the scale-down case. The logic would be: if a node has GPU, ignore the existing CPU & memory formula and calculate the GPU utilization rate instead; if it's under a threshold, choose the node as a candidate. Do you think it's reasonable to have another threshold flag for GPU? Right now CA only has ScaleDownUtilizationThreshold.

@MaciekPytel @aleksandra-malinowska Please let me know if the updated plan makes sense to you. I will make the corresponding changes in this doc.

@ringtail (Contributor) commented Jan 9, 2019

(quoting @Jeffwan's plan above)

Agree!

1. Different cloud providers do have similar labels: (Azure) @feiskyer, (Alibaba) @ringtail.
2. "If a node has GPU, simply ignore the existing CPU & memory formula and calculate the GPU utilization rate" may not be a bad idea, but should this choice be delegated to the cloud provider? That could be more flexible for more common cases.

@Jeffwan (Contributor, Author) commented Jan 9, 2019

@ringtail Right now this is a piece of common code. I will make it as generic as I can as the GPU-related code moves to the cloud provider.

@MaciekPytel (Contributor) commented:

Try to move GPULabel and GPU-specific code to the cloud provider.
Different cloud providers do have similar labels. (Azure) @feiskyer (Alibaba) @ringtail

Having the cloud provider supply the GPU label seems like the right thing to do.

if a node has GPU, simply ignore the existing CPU & memory formula and calculate the GPU utilization rate may not be a bad idea, but should this choice be delegated to the cloud provider? That could be more flexible for more common cases.

I don't think we should expose knobs in scale-down heuristics to the cloud provider. If there is a good use case for keeping the CPU/memory formula for nodes with GPU (I'm not aware of one, but I'd like to learn about it), it should be controlled with a flag.

@aleksandra-malinowska (Contributor) commented:
Utilization based on just GPU on GPU nodes is tracked in #1367

Making GPU label either configurable or hard-coded in cloud provider (or both) is tracked in #1135

Is there anything else here that I'm missing?

@Jeffwan (Contributor, Author) commented Jan 9, 2019

@aleksandra-malinowska You've got it, that's everything. I will clean up the doc and work on fixes for the two issues. A PR will come soon.

@ringtail (Contributor) commented Jan 10, 2019


@Jeffwan When the PR comes out, please cc the different cloud providers' maintainers. Thanks a lot.

@christopherhein (Member) left a comment:

A couple of basic tweaks to grammar and spelling.

Review comments on cluster-autoscaler/proposals/gpu_optimization.md (outdated, resolved).
@mwielgus (Contributor) commented Mar 1, 2019

@Jeffwan - looks like we are waiting for you to fix the grammar/typos.

@mwielgus (Contributor) commented Mar 6, 2019

Ping

@Jeffwan (Contributor, Author) commented Mar 6, 2019

Thanks for the reminder. I will update the doc today. @mwielgus

@Jeffwan force-pushed the gpu_optimization_doc branch from c8e7f9e to b4b3b38 on March 7, 2019 08:18
k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: losipiuk

If they are not already assigned, you can assign the PR to them by writing /assign @losipiuk in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MaciekPytel (Contributor) commented:
The changes proposed here were already implemented by @Jeffwan and merged, no need to keep this open.

@MaciekPytel closed this on Apr 5, 2019
@Jeffwan deleted the gpu_optimization_doc branch on November 6, 2019 05:17
yaroslava-serdiuk pushed a commit to yaroslava-serdiuk/autoscaler that referenced this pull request on Feb 22, 2024.