
clusterapi scale from zero support #4840

Merged · 3 commits · Aug 18, 2022

Conversation

@elmiko (Contributor) commented Apr 29, 2022

Which component does this PR apply to?

cluster-autoscaler

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds scale from zero capability to the clusterapi provider for cluster-autoscaler.

Which issue(s) this PR fixes:

Fixes #3150

Special notes for your reviewer:

The cluster-api-provider-kubemark has implemented the provider part of the opt-in API for scaling from zero. I recommend using this as a way to test with a live cluster.

Does this PR introduce a user-facing change?

Added support for scaling to and from zero nodes in the cluster autoscaler's Cluster API provider. Enabling this feature requires changes by the user; for instructions, please see the Cluster API (clusterapi) provider README file in the autoscaler repository.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Usage docs are included in the README file.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 29, 2022
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from enxebre and shysank April 29, 2022 18:29
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2022
@elmiko (Contributor, Author) commented Apr 29, 2022

i am placing a pre-emptive hold on this to allow time for reviews.
/hold for reviews

@enxebre ptal

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 29, 2022
@elmiko elmiko force-pushed the capi-scale-from-zero branch from a762e2f to 9450e75 Compare April 29, 2022 18:50
```diff
@@ -226,7 +234,64 @@ func (ng *nodegroup) Nodes() ([]cloudprovider.Instance, error) {
 // node by default, using manifest (most likely only kube-proxy).
 // Implementation optional.
 func (ng *nodegroup) TemplateNodeInfo() (*schedulerframework.NodeInfo, error) {
-	return nil, cloudprovider.ErrNotImplemented
+	if !ng.scalableResource.CanScaleFromZero() {
```
Review comment from a Contributor on the diff above:

💥
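
For context, scale-from-zero hinges on synthesizing a template node when the group has no real nodes to copy capacity from. Below is a minimal sketch of that idea, assuming a hypothetical helper `buildTemplateNode` and plain corev1 types; it is not the PR's actual code.

```go
// Illustrative sketch only: synthesize a template Node from
// annotation-derived capacity when a node group has zero nodes.
// buildTemplateNode is a hypothetical name, not the PR's code.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func buildTemplateNode(name string, cpu, memBytes, pods int64) *corev1.Node {
	node := &corev1.Node{}
	node.Name = name
	node.Status.Capacity = corev1.ResourceList{
		corev1.ResourceCPU:    *resource.NewQuantity(cpu, resource.DecimalSI),
		corev1.ResourceMemory: *resource.NewQuantity(memBytes, resource.BinarySI),
		corev1.ResourcePods:   *resource.NewQuantity(pods, resource.DecimalSI),
	}
	// With no real nodes to copy from, allocatable defaults to capacity.
	node.Status.Allocatable = node.Status.Capacity
	return node
}

func main() {
	n := buildTemplateNode("template-node", 2, 8<<30, 100)
	fmt.Println(n.Status.Capacity)
}
```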

@elmiko (Contributor, Author) commented May 2, 2022

updated to remove redundant checks for cpu and memory data

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 31, 2022
This allows a Machine{Set,Deployment} to scale up/down from 0,
provided the following annotations are set:

```yaml
apiVersion: v1
items:
- apiVersion: machine.openshift.io/v1beta1
  kind: MachineSet
  metadata:
    annotations:
      machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "0"
      machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "6"
      machine.openshift.io/vCPU: "2"
      machine.openshift.io/memoryMb: 8G
      machine.openshift.io/GPU: "1"
      machine.openshift.io/maxPods: "100"
```

Note that `machine.openshift.io/GPU` and `machine.openshift.io/maxPods`
are optional.

For autoscaling from zero, the autoscaler should convert the memory value
received in the annotation to bytes using powers of two, consistent with
other providers, and fail if the received format is not expected. This
gives robust behaviour consistent with cloud provider APIs and provider
implementations.

https://cloud.google.com/compute/all-pricing
https://www.iec.ch/si/binary.htm
https://github.com/openshift/kubernetes-autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L366
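
A minimal sketch of the conversion described above, assuming a hypothetical helper `memoryToBytes` that treats unit suffixes as binary multipliers and rejects anything unexpected (not the PR's actual implementation):

```go
// Illustrative only: convert a memory annotation such as "8G" to bytes
// using powers of two (8G -> 8 * 1024^3), failing on unexpected formats.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func memoryToBytes(v string) (int64, error) {
	multipliers := []struct {
		suffix string
		mult   int64
	}{
		{"T", 1 << 40},
		{"G", 1 << 30},
		{"M", 1 << 20},
	}
	for _, m := range multipliers {
		if strings.HasSuffix(v, m.suffix) {
			n, err := strconv.ParseInt(strings.TrimSuffix(v, m.suffix), 10, 64)
			if err != nil {
				return 0, fmt.Errorf("unexpected memory format %q: %v", v, err)
			}
			return n * m.mult, nil
		}
	}
	return 0, fmt.Errorf("unexpected memory format %q", v)
}

func main() {
	b, err := memoryToBytes("8G")
	if err != nil {
		panic(err)
	}
	fmt.Println(b) // 8589934592, i.e. 8 * 1024^3
}
```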

Co-authored-by:  Enxebre <[email protected]>
Co-authored-by:  Joel Speed <[email protected]>
Co-authored-by:  Michael McCune <[email protected]>
@elmiko elmiko force-pushed the capi-scale-from-zero branch from 57fffa1 to d3edc45 Compare July 18, 2022 19:24
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2022
@elmiko (Contributor, Author) commented Jul 18, 2022

update

  • rebased
  • removed infrastructure machine template cache
  • improved unit tests based on review comments

i am doing some manual testing now to make sure nothing broke, but i think this is ready for re-review

@elmiko (Contributor, Author) commented Jul 18, 2022

manual testing is working as expected against capi 1.0.5 and kubemark 0.3.0

@JoelSpeed (Contributor) commented:

I've given this a review again and it LGTM. Do we want to give others a few days to review and then try to get this merged? Is there anyone specific we want to get to review this before we label it?

@enxebre (Member) commented Jul 20, 2022

> manual testing is working as expected against capi 1.0.5 and kubemark 0.3.0

> I've given this a review again and it LGTM, do we want to give others a few days to review and then try to get this merged? Is there anyone specific we want to get to review this before we label it?

sgtm, though we definitely need to add some sort of smoke testing to validate PRs here.

@enxebre (Member) commented Jul 20, 2022

thanks a lot @elmiko, this looks great to me other than #4840 (comment)

@elmiko (Contributor, Author) commented Jul 20, 2022

fyi, i'm leaving this on hold so we don't merge early since it's already approved. i will remove the hold once we work out the last discussion item.

This commit is a combination of several commits. Significant details are
preserved below.

* update functions for resource annotations
  This change renames some of the functions that read resource usage from
  annotations so that their names indicate that usage. This makes room for
  allowing the infrastructure reference as an alternate source of the
  capacity information.

* migrate capacity logic into a single function
  This change moves the logic to collect the instance capacity from the
  TemplateNodeInfo function into a method of the
  unstructuredScalableResource named InstanceCapacity. This new function
  is created to house the logic that will decide between annotations and
  the infrastructure reference when calculating the capacity for the node.

* add ability to lookup infrastructure references
  This change supplements the annotation lookups by adding the logic to
  read the infrastructure reference if it exists. This is done to
  determine if the machine template exposes a capacity field in its
  status. For more information on how this mechanism works, please see the
  cluster-api enhancement[0].

* add documentation for capi scaling from zero

* improve tests for clusterapi scale from zero
  this change adds functionality to test the dynamic client behavior of
  getting the infrastructure machine templates.

* update README with information about rbac changes
  this adds more information about the rbac changes necessary for the
  scale from zero support to work.

* remove extra check for scaling from zero
  since the CanScaleFromZero function checks to see if both CPU and
  memory are present, there is no need to check a second time. This also
  adds some documentation to the CanScaleFromZero function to make it
  clearer what is happening.

* update unit test for capi scale from zero
  adding a few more cases and details to the scale from zero unit tests,
  including ensuring that the int-based annotations do not accept other
  unit types.

[0] https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20210310-opt-in-autoscaling-from-zero.md
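
A rough sketch of the infrastructure-reference lookup described above, reading the `status.capacity` map that the cluster-api opt-in proposal[0] defines on an infrastructure machine template; the helper name `capacityFromInfraTemplate` and the error handling are assumptions, not the PR's code:

```go
// Illustrative sketch: read status.capacity from an infrastructure
// machine template (per the cluster-api opt-in proposal) and convert
// it into a corev1.ResourceList.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func capacityFromInfraTemplate(u *unstructured.Unstructured) (corev1.ResourceList, error) {
	raw, found, err := unstructured.NestedStringMap(u.Object, "status", "capacity")
	if err != nil {
		return nil, err
	}
	if !found {
		// The template does not opt in to scale-from-zero.
		return nil, fmt.Errorf("no status.capacity on %s", u.GetName())
	}
	capacity := corev1.ResourceList{}
	for name, val := range raw {
		q, err := resource.ParseQuantity(val)
		if err != nil {
			return nil, fmt.Errorf("parsing %s=%q: %v", name, val, err)
		}
		capacity[corev1.ResourceName(name)] = q
	}
	return capacity, nil
}

func main() {
	u := &unstructured.Unstructured{Object: map[string]interface{}{
		"status": map[string]interface{}{
			"capacity": map[string]interface{}{"cpu": "2", "memory": "8Gi"},
		},
	}}
	c, err := capacityFromInfraTemplate(u)
	if err != nil {
		panic(err)
	}
	fmt.Println(c)
}
```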
@elmiko elmiko force-pushed the capi-scale-from-zero branch from d3edc45 to 1a65fde Compare July 23, 2022 00:25
@elmiko (Contributor, Author) commented Jul 23, 2022

update

  • removed all references to the custom cache
  • squashed down to 2 commits

@elmiko (Contributor, Author) commented Jul 26, 2022

i'm doing some digging around client-go and have found https://github.com/kubernetes/client-go/tree/master/tools/cache, but i think we need to add a little more code to make it work. i was under the false impression that the cache was automatic with client-go. hopefully i'll have another patch soon with the cache added.

@elmiko (Contributor, Author) commented Jul 26, 2022

i talked with @JoelSpeed a little about client-go internals. i think what i need to add here is the ability to create new informers in the client when we observe new infrastructure machine templates. i'm doing a little more digging to see how we can do that with our current setup.

this change adds logic to create informers for the infrastructure
machine templates that are discovered during the scale from zero checks.
it also adds tests and a slight change to the controller structure to
account for the dynamic informer creation.
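
The client-go pattern being described looks roughly like the sketch below, assuming a dynamic client and a stop channel already exist; the function `startInformerFor` is illustrative wiring, not the PR's exact code:

```go
// Illustrative sketch: lazily start an informer for a newly observed
// infrastructure machine template GVR using client-go's dynamic informers.
package informers

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
)

// Example GVR (hypothetical for this sketch):
// schema.GroupVersionResource{Group: "infrastructure.cluster.x-k8s.io",
//     Version: "v1beta1", Resource: "dockermachinetemplates"}
func startInformerFor(client dynamic.Interface, gvr schema.GroupVersionResource, stopCh <-chan struct{}) cache.SharedIndexInformer {
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(
		client, 10*time.Minute, metav1.NamespaceAll, nil)
	informer := factory.ForResource(gvr).Informer()
	go informer.Run(stopCh)
	// Block until the local cache has synced before serving lookups.
	cache.WaitForCacheSync(stopCh, informer.HasSynced)
	return informer
}
```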
@elmiko (Contributor, Author) commented Aug 17, 2022

ok, i have finally understood the magical incantations for client-go to get the dynamic informers created and running for the infrastructure templates. i think this is good to merge whenever folks are happy with the code quality.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 17, 2022
@JoelSpeed (Contributor) commented:

This looks good and I think we've addressed everyone's comments. I know Mike has been testing this and has seen it working. I think it's good to merge now.
/lgtm

Labels: approved · area/cluster-autoscaler · cncf-cla: yes · kind/feature · lgtm · size/XXL

Closes: Cluster Autoscaler CAPI provider should support scaling to and from zero nodes (#3150)