Add GpuConfig to cloud provider. Use GpuConfig in utilization calculations. #5459
Conversation
Force-pushed from 5119159 to acd34be
I messed up my local branch and commits. I think everything should be okay now except the labels. The helm-charts and vertical-pod-autoscaler labels look wrong; they were added while I was trying to fix the commit history. Sorry for the confusion :)
/assign
cluster-autoscaler/cloudprovider/externalgrpc/externalgrpc_cloud_provider.go
cluster-autoscaler/processors/nodegroupconfig/node_group_config_processor.go
/lgtm
I know this PR wasn't aiming at covering everything, but just wanted to make sure the following changes are at least planned:
- The price expander has a GPU-specific penalty in its scoring function - this should probably be adapted to Accelerators now.
- We already have a concept of "CustomResources" that Accelerators fit into, managed by CustomResourcesProcessor. The processor is used for 2 purposes - hacking node readiness for when a custom resource is in fact unready, and calculating cluster-wide resource limits. Currently the only such processor in this repo is GPU-related - it hacks the nodes to be unready when the GPU resource is not in allocatable, and handles GPU resource limits by looking at the GPU label. Do we expect all Accelerators to follow these semantics? The processor should probably be adapted to Accelerators.
- In general, do we expect any GPU-specific handling after the whole migration is complete, or is the intention to just replace GPU with Accelerator? Just replacing makes the most sense to me, and in that case we should probably deprecate, and eventually remove, the GPU-related methods from CloudProvider.
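For readers following along, here is a rough, hedged sketch of the processor shape being discussed; the method names and signatures are assumptions inferred from the two purposes described above, not copied from the repository.
// Hedged sketch of a CustomResourcesProcessor-style interface (names are assumptions).
// apiv1 refers to "k8s.io/api/core/v1".
type CustomResourceTarget struct {
	ResourceType  string // e.g. an extended resource name such as a GPU resource
	ResourceCount int64
}
type CustomResourcesProcessor interface {
	// Marks nodes as unready when a custom resource (today only GPUs) is expected
	// on the node but not yet present in its allocatable.
	FilterOutNodesWithUnreadyResources(allNodes, readyNodes []*apiv1.Node) (newAllNodes, newReadyNodes []*apiv1.Node)
	// Reports per-node custom resource amounts, used for cluster-wide resource limits.
	GetNodeResourceTargets(node *apiv1.Node) []CustomResourceTarget
	CleanUp()
}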
klog.V(3).Infof("node %s has unready GPU", nodeInfo.Node().Name)
// Return 0 if GPU is unready. This will guarantee we can still scale down a node with unready GPU.
return Info{GpuUtil: 0, ResourceName: gpu.ResourceNvidiaGPU, Utilization: 0}, nil
klog.V(3).Infof("node %s has unready accelerator", nodeInfo.Node().Name)
nit: Maybe specify which type? It was clear before, it isn't now.
Also added the resource name to the log message.
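A minimal sketch of what the adjusted log line could look like with the resource name included; the gpuConfig variable and its ResourceName field are assumptions based on this PR's description, not the final diff.
// Sketch only: assumes a gpuConfig with a ResourceName field is in scope here.
klog.V(3).Infof("node %s has unready accelerator %s", nodeInfo.Node().Name, gpuConfig.ResourceName)
// Return 0 so a node with an unready accelerator can still be scaled down.
return Info{GpuUtil: 0, ResourceName: gpuConfig.ResourceName, Utilization: 0}, nil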
// GetScaleDownGpuUtilizationThreshold returns ScaleDownGpuUtilizationThreshold value that should be used for a given NodeGroup.
func (p *DelegatingNodeGroupConfigProcessor) GetScaleDownGpuUtilizationThreshold(context *context.AutoscalingContext, nodeGroup cloudprovider.NodeGroup) (float64, error) {
// GetScaleDownAcceleratorUtilizationThreshold returns the accelerator utilization threshold value that should be used for a given NodeGroup.
func (p *DelegatingNodeGroupConfigProcessor) GetScaleDownAcceleratorUtilizationThreshold(context *context.AutoscalingContext, nodeGroup cloudprovider.NodeGroup) (float64, error) {
If we're renaming everything, shouldn't we rename the corresponding option (ScaleDownGpuUtilizationThreshold) and flag (scale-down-gpu-utilization-threshold) as well?
Renamed the option ScaleDownGpuUtilizationThreshold to ScaleDownAcceleratorUtilizationThreshold and the flag scale-down-gpu-utilization-threshold to scale-down-accelerator-utilization-threshold. Also generated proto files with variables using these new names.
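For illustration only, the flag rename described above might be wired up roughly like the standard Go flag definition below; the default value and help text are assumptions, not the PR's actual code.
// Illustrative sketch; the 0.5 default and description are assumed, not taken from the PR.
scaleDownAcceleratorUtilizationThreshold = flag.Float64("scale-down-accelerator-utilization-threshold", 0.5,
	"Utilization threshold for nodes with accelerators (e.g. GPUs), below which a node may be considered for scale down")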
/assign @towca
Ditched the "Accelerator" name and employed the suggested GpuConfig approach instead.
* Added GetNodeGpuConfig to cloud provider, which returns a GpuConfig struct containing the GPU label, type and resource name if the node has a GPU.
* Added initial implementation of GetNodeGpuConfig to all cloud providers.
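Based on that commit message, the new struct could look roughly like the sketch below; apart from the label, type and resource name mentioned above, the exact field names are assumptions.
// Hedged sketch of the GpuConfig struct described in the commit message.
// apiv1 refers to "k8s.io/api/core/v1".
type GpuConfig struct {
	Label        string             // node label that marks the accelerator
	Type         string             // accelerator type, i.e. the label's value
	ResourceName apiv1.ResourceName // extended resource name the accelerator exposes
}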
Force-pushed from c4c5e24 to 4fbd207
/lgtm The changes LGTM, just one more nit - feel free to unhold if you prefer not to fix.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hbostan, towca
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
* Changed the `utilization.Calculate()` function to use GpuConfig instead of the GPU label.
* Started using GpuConfig in utilization threshold calculations.
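Conceptually, the change described in that commit might look like the hedged sketch below, where the hard-coded GPU label/resource lookup is replaced by whatever GpuConfig the cloud provider reports; acceleratorUtilization is a hypothetical helper, not the PR's exact code.
// Illustrative only: gpuConfig comes from the cloud provider, not a fixed GPU label.
gpuConfig := cloudProvider.GetNodeGpuConfig(nodeInfo.Node())
if gpuConfig != nil {
	gpuUtil, err := acceleratorUtilization(nodeInfo, gpuConfig.ResourceName) // hypothetical helper
	if err != nil {
		return Info{}, err
	}
	return Info{GpuUtil: gpuUtil, ResourceName: gpuConfig.ResourceName, Utilization: gpuUtil}, nil
}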
/lgtm
What type of PR is this?
/kind feature
What this PR does / why we need it:
Extends the GPU notion to all GPU-like accelerators. It adds the GpuConfig struct and the GetNodeGpuConfig method to the cloud provider for this. These are used in the utilization calculations for scale-down, which previously depended on a single GPU label and resource name. With GpuConfig, different labels and resource names are supported.
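As one hedged example of how a cloud provider could implement GetNodeGpuConfig as described above, it might map its accelerator label to a GpuConfig roughly like this; the provider type, label value and returned resource name are placeholders, not any provider's real values.
// Placeholder sketch; exampleCloudProvider and the label value are hypothetical.
func (p *exampleCloudProvider) GetNodeGpuConfig(node *apiv1.Node) *cloudprovider.GpuConfig {
	const gpuLabel = "example.com/accelerator" // hypothetical provider-specific GPU label
	gpuType, found := node.Labels[gpuLabel]
	if !found {
		return nil // node has no GPU-like accelerator
	}
	return &cloudprovider.GpuConfig{
		Label:        gpuLabel,
		Type:         gpuType,
		ResourceName: gpu.ResourceNvidiaGPU, // resource name exposed by the device plugin
	}
}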
Which issue(s) this PR fixes:
This PR doesn't completely fix issue #5448, but it is a starting point.
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: