
Add GpuConfig to cloud provider. Use GpuConfig in utilization calculations. #5459

Merged
merged 2 commits into from
Feb 15, 2023

Conversation


@hbostan hbostan commented Feb 1, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Extends the GPU notion to all GPU-like accelerators. It adds the GpuConfig struct and a GetNodeGpuConfig method to the cloud provider. These are used in utilization calculations for scale-downs, which previously depended on a single GPU label and resource name. With GpuConfig, different labels and resource names are supported.
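As a rough sketch of the idea (a minimal, self-contained approximation: the real GpuConfig lives in the cluster-autoscaler's cloudprovider package, operates on *apiv1.Node, and uses apiv1.ResourceName; the label and resource name below are NVIDIA/GKE-style examples, not a quote of any provider's implementation):

```go
package main

import "fmt"

// GpuConfig bundles everything the autoscaler needs to know about a
// GPU-like accelerator on a node, instead of hard-coding one GPU label
// and resource name.
type GpuConfig struct {
	Label        string // node label that marks the accelerator
	Type         string // accelerator type (typically the label's value)
	ResourceName string // resource name used in allocatable/requests
}

// Node is a minimal stand-in for *apiv1.Node, for illustration only.
type Node struct {
	Name   string
	Labels map[string]string
}

// GetNodeGpuConfig mimics the method this PR adds to the cloud provider:
// return a GpuConfig for nodes that carry the provider's accelerator
// label, and nil for nodes without accelerators.
func GetNodeGpuConfig(node *Node) *GpuConfig {
	const gpuLabel = "cloud.google.com/gke-accelerator"
	gpuType, ok := node.Labels[gpuLabel]
	if !ok {
		return nil
	}
	return &GpuConfig{Label: gpuLabel, Type: gpuType, ResourceName: "nvidia.com/gpu"}
}

func main() {
	gpuNode := &Node{Name: "gpu-node", Labels: map[string]string{"cloud.google.com/gke-accelerator": "nvidia-tesla-t4"}}
	cpuNode := &Node{Name: "cpu-node"}
	fmt.Println(GetNodeGpuConfig(gpuNode).ResourceName) // prints "nvidia.com/gpu"
	fmt.Println(GetNodeGpuConfig(cpuNode) == nil)       // prints "true"
}
```

Because callers get the label and resource name from the returned config, per-provider accelerator differences stay inside the cloud provider implementation.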

Which issue(s) this PR fixes:

This PR doesn't completely fix #5448, but it is a starting point.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/cluster-autoscaler labels Feb 1, 2023
@hbostan hbostan force-pushed the master branch 2 times, most recently from 5119159 to acd34be Compare February 1, 2023 13:27
@hbostan (Contributor, Author) commented Feb 1, 2023

I messed up my local branch and commits. Everything should be okay now, except the labels: the helm-charts and vertical-pod-autoscaler labels seem wrong and were added while I was trying to fix the commit history. Sorry for the confusion :)

@BigDarkClown

/assign

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 6, 2023
@hbostan hbostan requested review from BigDarkClown and removed request for apricote and andrewsykim February 6, 2023 13:38
@BigDarkClown

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2023
@towca towca left a comment

I know this PR wasn't aiming at covering everything, but just wanted to make sure the following changes are at least planned:

  • The price expander has a GPU-specific penalty in its scoring function - this should probably be adapted to Accelerators now.
  • We already have a concept of "CustomResources" that Accelerators fit into, managed by CustomResourcesProcessor. The processor is used for 2 purposes - hacking node readiness for when a custom resource is in fact unready, and calculating cluster-wide resource limits. Currently the only such processor in this repo is GPU-related - it hacks the nodes to be unready when the GPU resource is not in allocatable, and handles GPU resource limits by looking at the GPU label. Do we expect all Accelerators to follow these semantics? The processor should probably be adapted to Accelerators.
  • In general, do we expect any GPU-specific handling after the whole migration is complete, or is the intention to just replace GPU with Accelerator? Just replacing makes the most sense to me, and in that case we should probably deprecate, and eventually remove the GPU-related methods from CloudProvider.

// Before:
klog.V(3).Infof("node %s has unready GPU", nodeInfo.Node().Name)
// Return 0 if GPU is unready. This will guarantee we can still scale down a node with unready GPU.
return Info{GpuUtil: 0, ResourceName: gpu.ResourceNvidiaGPU, Utilization: 0}, nil

// After:
klog.V(3).Infof("node %s has unready accelerator", nodeInfo.Node().Name)
Collaborator:

nit: Maybe specify which type? It was clear before, it isn't now.

Contributor Author:

Also added the resource name to the log message.

// Before:
// GetScaleDownGpuUtilizationThreshold returns ScaleDownGpuUtilizationThreshold value that should be used for a given NodeGroup.
func (p *DelegatingNodeGroupConfigProcessor) GetScaleDownGpuUtilizationThreshold(context *context.AutoscalingContext, nodeGroup cloudprovider.NodeGroup) (float64, error) {

// After:
// GetScaleDownAcceleratorUtilizationThreshold returns the accelerator utilization threshold value that should be used for a given NodeGroup
func (p *DelegatingNodeGroupConfigProcessor) GetScaleDownAcceleratorUtilizationThreshold(context *context.AutoscalingContext, nodeGroup cloudprovider.NodeGroup) (float64, error) {
Collaborator:

If we're renaming everything, shouldn't we rename the corresponding option (ScaleDownGpuUtilizationThreshold) and flag (scale-down-gpu-utilization-threshold) as well?

Contributor Author:

Renamed the option ScaleDownGpuUtilizationThreshold to ScaleDownAcceleratorUtilizationThreshold and the flag scale-down-gpu-utilization-threshold to scale-down-accelerator-utilization-threshold. Also generated proto files with variables using these new names.

@towca

towca commented Feb 10, 2023

/assign @towca

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 13, 2023
@hbostan hbostan closed this Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 14, 2023
@hbostan hbostan changed the title Add methods for accelerators to cloud provider. Use accelerators in utilization calculations. Add GpuConfig to cloud provider. Use GpuConfig in utilization calculations. Feb 14, 2023
@hbostan

hbostan commented Feb 14, 2023

Ditched the "Accelerator" name and employed the suggested GpuConfig approach.

@hbostan hbostan reopened this Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 14, 2023
* Added GetNodeGpuConfig to cloud provider which returns a GpuConfig
  struct containing the gpu label, type and resource name if the node
  has a GPU.
* Added an initial implementation of GetNodeGpuConfig to all cloud
  providers.
@hbostan hbostan force-pushed the master branch 2 times, most recently from c4c5e24 to 4fbd207 Compare February 14, 2023 16:04
@towca

towca commented Feb 14, 2023

/lgtm
/approve
/hold

The changes LGTM, just one more nit - feel free to unhold if you prefer not to fix.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 14, 2023
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hbostan, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2023
* Changed the `utilization.Calculate()` function to use GpuConfig
  instead of GPU label.
* Started using GpuConfig in utilization threshold calculations.
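The change these commits describe can be sketched as follows. This is a hypothetical, simplified stand-in for `utilization.Calculate()` (the real function works on node/pod objects and resource quantities): the point is that the resource name now comes from GpuConfig rather than being hard-coded to a single GPU resource, and an unready accelerator still yields 0 utilization so the node can be scaled down.

```go
package main

import (
	"errors"
	"fmt"
)

// Info mirrors the shape of the utilization result quoted in the review
// diff (GpuUtil, ResourceName, Utilization), simplified for illustration.
type Info struct {
	GpuUtil      float64
	ResourceName string
	Utilization  float64
}

// calculateGpuUtil is a hypothetical sketch of the GpuConfig-based
// calculation: the resource name is a parameter taken from the node's
// GpuConfig, so any GPU-like accelerator resource can be measured.
func calculateGpuUtil(resourceName string, requested, allocatable int64) (Info, error) {
	if allocatable == 0 {
		// Unready accelerator (nothing allocatable yet): report 0 so the
		// node can still be considered for scale-down.
		return Info{GpuUtil: 0, ResourceName: resourceName, Utilization: 0}, nil
	}
	if requested < 0 || requested > allocatable {
		return Info{}, errors.New("invalid requested accelerator count")
	}
	util := float64(requested) / float64(allocatable)
	return Info{GpuUtil: util, ResourceName: resourceName, Utilization: util}, nil
}

func main() {
	info, err := calculateGpuUtil("amd.com/gpu", 1, 4)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s utilization: %.2f\n", info.ResourceName, info.Utilization) // prints "amd.com/gpu utilization: 0.25"
}
```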
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 15, 2023
@towca

towca commented Feb 15, 2023

/lgtm
/unhold

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 15, 2023
@k8s-ci-robot k8s-ci-robot merged commit 7cba0a0 into kubernetes:master Feb 15, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler area/helm-charts cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
5 participants