
Add GpuConfig to cloud provider. Use GpuConfig in utilization calculations. #5459

Merged
merged 2 commits into from
Feb 15, 2023

Conversation


@hbostan hbostan commented Feb 1, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Extends the GPU notion to all GPU-like accelerators. It adds the GpuConfig struct and a GetNodeGpuConfig method to the cloud provider. These are used in utilization calculations for scale-downs, which previously depended on a single GPU label and resource name. With GpuConfig, different labels and resource names are supported.
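As a rough sketch of the idea (a minimal, self-contained approximation: the real GpuConfig lives in the cluster-autoscaler's cloudprovider package, operates on *apiv1.Node, and uses apiv1.ResourceName; the label and resource name below are NVIDIA/GKE-style examples, not a quote of any provider's implementation):

```go
package main

import "fmt"

// GpuConfig bundles everything the autoscaler needs to know about a
// GPU-like accelerator on a node, instead of hard-coding one GPU label
// and resource name.
type GpuConfig struct {
	Label        string // node label that marks the accelerator
	Type         string // accelerator type (typically the label's value)
	ResourceName string // resource name used in allocatable/requests
}

// Node is a minimal stand-in for *apiv1.Node, for illustration only.
type Node struct {
	Name   string
	Labels map[string]string
}

// GetNodeGpuConfig mimics the method this PR adds to the cloud provider:
// return a GpuConfig for nodes that carry the provider's accelerator
// label, and nil for nodes without accelerators.
func GetNodeGpuConfig(node *Node) *GpuConfig {
	const gpuLabel = "cloud.google.com/gke-accelerator"
	gpuType, ok := node.Labels[gpuLabel]
	if !ok {
		return nil
	}
	return &GpuConfig{Label: gpuLabel, Type: gpuType, ResourceName: "nvidia.com/gpu"}
}

func main() {
	gpuNode := &Node{Name: "gpu-node", Labels: map[string]string{"cloud.google.com/gke-accelerator": "nvidia-tesla-t4"}}
	cpuNode := &Node{Name: "cpu-node"}
	fmt.Println(GetNodeGpuConfig(gpuNode).ResourceName) // prints "nvidia.com/gpu"
	fmt.Println(GetNodeGpuConfig(cpuNode) == nil)       // prints "true"
}
```

Because callers get the label and resource name from the returned config, per-provider accelerator differences stay inside the cloud provider implementation.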

Which issue(s) this PR fixes:

This PR doesn't completely fix #5448, but it is a starting point.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/cluster-autoscaler labels Feb 1, 2023
@hbostan hbostan force-pushed the master branch 2 times, most recently from 5119159 to acd34be Compare February 1, 2023 13:27
@hbostan (Contributor, Author) commented Feb 1, 2023

I messed up my local branch and commits. Everything should be okay now, except the labels: the helm-charts and vertical-pod-autoscaler labels seem wrong and were added while I was trying to fix the commit history. Sorry for the confusion :)

@BigDarkClown

/assign

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 6, 2023
@hbostan hbostan requested review from BigDarkClown and removed request for apricote and andrewsykim February 6, 2023 13:38
@BigDarkClown

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2023
@towca towca left a comment

I know this PR wasn't aiming at covering everything, but just wanted to make sure the following changes are at least planned:

  • The price expander has a GPU-specific penalty in its scoring function - this should probably be adapted to Accelerators now.
  • We already have a concept of "CustomResources" that Accelerators fit into, managed by CustomResourcesProcessor. The processor is used for 2 purposes - hacking node readiness for when a custom resource is in fact unready, and calculating cluster-wide resource limits. Currently the only such processor in this repo is GPU-related - it hacks the nodes to be unready when the GPU resource is not in allocatable, and handles GPU resource limits by looking at the GPU label. Do we expect all Accelerators to follow these semantics? The processor should probably be adapted to Accelerators.
  • In general, do we expect any GPU-specific handling after the whole migration is complete, or is the intention to just replace GPU with Accelerator? Just replacing makes the most sense to me, and in that case we should probably deprecate, and eventually remove the GPU-related methods from CloudProvider.

// Before:
klog.V(3).Infof("node %s has unready GPU", nodeInfo.Node().Name)
// Return 0 if GPU is unready. This will guarantee we can still scale down a node with unready GPU.
return Info{GpuUtil: 0, ResourceName: gpu.ResourceNvidiaGPU, Utilization: 0}, nil

// After:
klog.V(3).Infof("node %s has unready accelerator", nodeInfo.Node().Name)
Collaborator:

nit: Maybe specify which type? It was clear before, it isn't now.

Contributor Author:

Also added the resource name to the log message.

// Before:
// GetScaleDownGpuUtilizationThreshold returns ScaleDownGpuUtilizationThreshold value that should be used for a given NodeGroup.
func (p *DelegatingNodeGroupConfigProcessor) GetScaleDownGpuUtilizationThreshold(context *context.AutoscalingContext, nodeGroup cloudprovider.NodeGroup) (float64, error) {

// After:
// GetScaleDownAcceleratorUtilizationThreshold returns the accelerator utilization threshold value that should be used for a given NodeGroup
func (p *DelegatingNodeGroupConfigProcessor) GetScaleDownAcceleratorUtilizationThreshold(context *context.AutoscalingContext, nodeGroup cloudprovider.NodeGroup) (float64, error) {
Collaborator:

If we're renaming everything, shouldn't we rename the corresponding option (ScaleDownGpuUtilizationThreshold) and flag (scale-down-gpu-utilization-threshold) as well?

Contributor Author:

Renamed the option ScaleDownGpuUtilizationThreshold to ScaleDownAcceleratorUtilizationThreshold and the flag scale-down-gpu-utilization-threshold to scale-down-accelerator-utilization-threshold. Also generated proto files with variables using these new names.

@towca

towca commented Feb 10, 2023

/assign @towca

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 13, 2023
@hbostan hbostan closed this Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 14, 2023
@hbostan hbostan changed the title Add methods for accelerators to cloud provider. Use accelerators in utilization calculations. Add GpuConfig to cloud provider. Use GpuConfig in utilization calculations. Feb 14, 2023
@hbostan

hbostan commented Feb 14, 2023

Ditched the "Accelerator" name and employed the suggested GpuConfig approach.

@hbostan hbostan reopened this Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 14, 2023
* Added GetNodeGpuConfig to cloud provider which returns a GpuConfig
  struct containing the gpu label, type and resource name if the node
  has a GPU.
* Added an initial implementation of GetNodeGpuConfig to all cloud
  providers.
@hbostan hbostan force-pushed the master branch 2 times, most recently from c4c5e24 to 4fbd207 Compare February 14, 2023 16:04
@towca

towca commented Feb 14, 2023

/lgtm
/approve
/hold

The changes LGTM, just one more nit - feel free to unhold if you prefer not to fix.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 14, 2023
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hbostan, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2023
* Changed the `utilization.Calculate()` function to use GpuConfig
  instead of GPU label.
* Started using GpuConfig in utilization threshold calculations.
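The change these commits describe can be sketched as follows. This is a hypothetical, simplified stand-in for `utilization.Calculate()` (the real function works on node/pod objects and resource quantities): the point is that the resource name now comes from GpuConfig rather than being hard-coded to a single GPU resource, and an unready accelerator still yields 0 utilization so the node can be scaled down.

```go
package main

import (
	"errors"
	"fmt"
)

// Info mirrors the shape of the utilization result quoted in the review
// diff (GpuUtil, ResourceName, Utilization), simplified for illustration.
type Info struct {
	GpuUtil      float64
	ResourceName string
	Utilization  float64
}

// calculateGpuUtil is a hypothetical sketch of the GpuConfig-based
// calculation: the resource name is a parameter taken from the node's
// GpuConfig, so any GPU-like accelerator resource can be measured.
func calculateGpuUtil(resourceName string, requested, allocatable int64) (Info, error) {
	if allocatable == 0 {
		// Unready accelerator (nothing allocatable yet): report 0 so the
		// node can still be considered for scale-down.
		return Info{GpuUtil: 0, ResourceName: resourceName, Utilization: 0}, nil
	}
	if requested < 0 || requested > allocatable {
		return Info{}, errors.New("invalid requested accelerator count")
	}
	util := float64(requested) / float64(allocatable)
	return Info{GpuUtil: util, ResourceName: resourceName, Utilization: util}, nil
}

func main() {
	info, err := calculateGpuUtil("amd.com/gpu", 1, 4)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s utilization: %.2f\n", info.ResourceName, info.Utilization) // prints "amd.com/gpu utilization: 0.25"
}
```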
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 15, 2023
@towca

towca commented Feb 15, 2023

/lgtm
/unhold

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 15, 2023
@k8s-ci-robot k8s-ci-robot merged commit 7cba0a0 into kubernetes:master Feb 15, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler area/helm-charts cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
5 participants