Merge pull request #4840 from elmiko/capi-scale-from-zero
clusterapi scale from zero support
k8s-ci-robot authored Aug 18, 2022
2 parents d22c7ac + f02c997 commit e478ee2
Showing 11 changed files with 1,386 additions and 151 deletions.
64 changes: 64 additions & 0 deletions cluster-autoscaler/cloudprovider/clusterapi/README.md
@@ -163,6 +163,70 @@ There are two annotations that control how a cluster resource should be scaled:
The autoscaler will monitor any `MachineSet` or `MachineDeployment` containing
both of these annotations.

### Scale from zero support

The Cluster API community has defined an opt-in method for infrastructure
providers to enable scaling from zero-sized node groups in the
[Opt-in Autoscaling from Zero enhancement](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20210310-opt-in-autoscaling-from-zero.md).
As defined in the enhancement, each provider may add support for scaling from
zero, but is not required to do so. If you expect built-in support for scaling
from zero, please check with the Cluster API infrastructure providers that you
are using.

If your Cluster API provider does not have support for scaling from zero, you
may still use this feature through the capacity annotations. You may add these
annotations to your MachineDeployments, or MachineSets if you are not using
MachineDeployments (it is not needed on both), to instruct the cluster
autoscaler about the sizing of the nodes in the node group. At a minimum,
you must specify the CPU and memory annotations. These annotations should
match the expected capacity of the nodes created from the infrastructure.

For example, if my MachineDeployment will create nodes that have "16000m" CPU,
"128G" memory, 2 NVIDIA GPUs, and can support 200 max pods, the following
annotations will instruct the autoscaler how to expand the node group from
zero replicas:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineDeployment
metadata:
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    capacity.cluster-autoscaler.kubernetes.io/memory: "128G"
    capacity.cluster-autoscaler.kubernetes.io/cpu: "16"
    capacity.cluster-autoscaler.kubernetes.io/gpu-type: "nvidia.com/gpu"
    capacity.cluster-autoscaler.kubernetes.io/gpu-count: "2"
    capacity.cluster-autoscaler.kubernetes.io/maxPods: "200"
```
*Note*: the `maxPods` annotation defaults to `110` if it is not supplied.
This value is inspired by the Kubernetes best practices
[Considerations for large clusters](https://kubernetes.io/docs/setup/best-practices/cluster-large/).
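
As a rough illustration of the rules above (CPU and memory are required, `maxPods` falls back to `110`), the following Go sketch reads the capacity annotations from a plain map. It is not the autoscaler's implementation: the `nodeCapacity` type and the `capacityFromAnnotations` function are hypothetical names, and the CPU and memory values are kept as strings rather than parsed into Kubernetes quantities so that the example needs only the standard library.

```go
package main

import (
	"fmt"
	"strconv"
)

// Annotation keys as documented in this README.
const (
	capacityPrefix = "capacity.cluster-autoscaler.kubernetes.io/"
	cpuKey         = capacityPrefix + "cpu"
	memoryKey      = capacityPrefix + "memory"
	gpuTypeKey     = capacityPrefix + "gpu-type"
	gpuCountKey    = capacityPrefix + "gpu-count"
	maxPodsKey     = capacityPrefix + "maxPods"
	defaultMaxPods = 110 // documented default when maxPods is not supplied
)

// nodeCapacity is a hypothetical holder for the annotation values.
type nodeCapacity struct {
	CPU      string
	Memory   string
	GPUType  string
	GPUCount int
	MaxPods  int
}

// capacityFromAnnotations (hypothetical helper) validates the required
// annotations and applies the maxPods default.
func capacityFromAnnotations(annotations map[string]string) (nodeCapacity, error) {
	nc := nodeCapacity{MaxPods: defaultMaxPods}
	for _, key := range []string{cpuKey, memoryKey} {
		if annotations[key] == "" {
			return nodeCapacity{}, fmt.Errorf("missing required annotation %q", key)
		}
	}
	nc.CPU = annotations[cpuKey]
	nc.Memory = annotations[memoryKey]
	nc.GPUType = annotations[gpuTypeKey]

	var err error
	if v, ok := annotations[gpuCountKey]; ok {
		if nc.GPUCount, err = strconv.Atoi(v); err != nil {
			return nodeCapacity{}, fmt.Errorf("invalid %q: %w", gpuCountKey, err)
		}
	}
	if v, ok := annotations[maxPodsKey]; ok {
		if nc.MaxPods, err = strconv.Atoi(v); err != nil {
			return nodeCapacity{}, fmt.Errorf("invalid %q: %w", maxPodsKey, err)
		}
	}
	return nc, nil
}

func main() {
	nc, err := capacityFromAnnotations(map[string]string{
		cpuKey:      "16",
		memoryKey:   "128G",
		gpuTypeKey:  "nvidia.com/gpu",
		gpuCountKey: "2",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", nc)
}
```

Because `maxPods` is omitted in `main`, the resulting `MaxPods` field holds the default of 110. Removing the CPU or memory entry makes the helper return an error instead.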

#### RBAC changes for scaling from zero

If you are using the opt-in support for scaling from zero as defined by the
Cluster API infrastructure provider, you will need to add the infrastructure
machine template types to your role permissions for the service account
associated with the cluster autoscaler deployment. The service account will
need permission to `get` and `list` the infrastructure machine templates for
your infrastructure provider.

For example, when using the [Kubemark provider](https://github.com/kubernetes-sigs/cluster-api-provider-kubemark)
you will need to set the following permissions:

```yaml
rules:
  - apiGroups:
    - infrastructure.cluster.x-k8s.io
    resources:
    - kubemarkmachinetemplates
    verbs:
    - get
    - list
```

## Specifying a Custom Resource Group

By default all Kubernetes resources consumed by the Cluster API provider will
@@ -204,17 +204,17 @@ func Test_allowedByAutoDiscoverySpec(t *testing.T) {
shouldMatch bool
}{{
name: "no clustername, namespace, or label selector specified should match any MachineSet",
testSpec: createTestSpec(RandomString(6), RandomString(6), RandomString(6), 1, false, nil),
testSpec: createTestSpec(RandomString(6), RandomString(6), RandomString(6), 1, false, nil, nil),
autoDiscoveryConfig: &clusterAPIAutoDiscoveryConfig{labelSelector: labels.NewSelector()},
shouldMatch: true,
}, {
name: "no clustername, namespace, or label selector specified should match any MachineDeployment",
testSpec: createTestSpec(RandomString(6), RandomString(6), RandomString(6), 1, true, nil),
testSpec: createTestSpec(RandomString(6), RandomString(6), RandomString(6), 1, true, nil, nil),
autoDiscoveryConfig: &clusterAPIAutoDiscoveryConfig{labelSelector: labels.NewSelector()},
shouldMatch: true,
}, {
name: "clustername specified does not match MachineSet, namespace matches, no labels specified",
testSpec: createTestSpec("default", RandomString(6), RandomString(6), 1, false, nil),
testSpec: createTestSpec("default", RandomString(6), RandomString(6), 1, false, nil, nil),
autoDiscoveryConfig: &clusterAPIAutoDiscoveryConfig{
clusterName: "foo",
namespace: "default",
@@ -223,7 +223,7 @@ func Test_allowedByAutoDiscoverySpec(t *testing.T) {
shouldMatch: false,
}, {
name: "clustername specified does not match MachineDeployment, namespace matches, no labels specified",
testSpec: createTestSpec("default", RandomString(6), RandomString(6), 1, true, nil),
testSpec: createTestSpec("default", RandomString(6), RandomString(6), 1, true, nil, nil),
autoDiscoveryConfig: &clusterAPIAutoDiscoveryConfig{
clusterName: "foo",
namespace: "default",
@@ -232,7 +232,7 @@ func Test_allowedByAutoDiscoverySpec(t *testing.T) {
shouldMatch: false,
}, {
name: "namespace specified does not match MachineSet, clusterName matches, no labels specified",
testSpec: createTestSpec(RandomString(6), "foo", RandomString(6), 1, false, nil),
testSpec: createTestSpec(RandomString(6), "foo", RandomString(6), 1, false, nil, nil),
autoDiscoveryConfig: &clusterAPIAutoDiscoveryConfig{
clusterName: "foo",
namespace: "default",
@@ -241,7 +241,7 @@ func Test_allowedByAutoDiscoverySpec(t *testing.T) {
shouldMatch: false,
}, {
name: "clustername specified does not match MachineDeployment, namespace matches, no labels specified",
testSpec: createTestSpec(RandomString(6), "foo", RandomString(6), 1, true, nil),
testSpec: createTestSpec(RandomString(6), "foo", RandomString(6), 1, true, nil, nil),
autoDiscoveryConfig: &clusterAPIAutoDiscoveryConfig{
clusterName: "foo",
namespace: "default",
@@ -250,7 +250,7 @@ func Test_allowedByAutoDiscoverySpec(t *testing.T) {
shouldMatch: false,
}, {
name: "namespace and clusterName matches MachineSet, no labels specified",
testSpec: createTestSpec("default", "foo", RandomString(6), 1, false, nil),
testSpec: createTestSpec("default", "foo", RandomString(6), 1, false, nil, nil),
autoDiscoveryConfig: &clusterAPIAutoDiscoveryConfig{
clusterName: "foo",
namespace: "default",
@@ -259,7 +259,7 @@ func Test_allowedByAutoDiscoverySpec(t *testing.T) {
shouldMatch: true,
}, {
name: "namespace and clusterName matches MachineDeployment, no labels specified",
testSpec: createTestSpec("default", "foo", RandomString(6), 1, true, nil),
testSpec: createTestSpec("default", "foo", RandomString(6), 1, true, nil, nil),
autoDiscoveryConfig: &clusterAPIAutoDiscoveryConfig{
clusterName: "foo",
namespace: "default",
@@ -268,7 +268,7 @@ func Test_allowedByAutoDiscoverySpec(t *testing.T) {
shouldMatch: true,
}, {
name: "namespace and clusterName matches MachineSet, does not match label selector",
testSpec: createTestSpec("default", "foo", RandomString(6), 1, false, nil),
testSpec: createTestSpec("default", "foo", RandomString(6), 1, false, nil, nil),
autoDiscoveryConfig: &clusterAPIAutoDiscoveryConfig{
clusterName: "foo",
namespace: "default",
@@ -277,7 +277,7 @@ func Test_allowedByAutoDiscoverySpec(t *testing.T) {
shouldMatch: false,
}, {
name: "namespace and clusterName matches MachineDeployment, does not match label selector",
testSpec: createTestSpec("default", "foo", RandomString(6), 1, true, nil),
testSpec: createTestSpec("default", "foo", RandomString(6), 1, true, nil, nil),
autoDiscoveryConfig: &clusterAPIAutoDiscoveryConfig{
clusterName: "foo",
namespace: "default",
@@ -286,7 +286,7 @@ func Test_allowedByAutoDiscoverySpec(t *testing.T) {
shouldMatch: false,
}, {
name: "namespace, clusterName, and label selector matches MachineSet",
testSpec: createTestSpec("default", "foo", RandomString(6), 1, false, nil),
testSpec: createTestSpec("default", "foo", RandomString(6), 1, false, nil, nil),
additionalLabels: map[string]string{"color": "green"},
autoDiscoveryConfig: &clusterAPIAutoDiscoveryConfig{
clusterName: "foo",
@@ -296,7 +296,7 @@ func Test_allowedByAutoDiscoverySpec(t *testing.T) {
shouldMatch: true,
}, {
name: "namespace, clusterName, and label selector matches MachineDeployment",
testSpec: createTestSpec("default", "foo", RandomString(6), 1, true, nil),
testSpec: createTestSpec("default", "foo", RandomString(6), 1, true, nil, nil),
additionalLabels: map[string]string{"color": "green"},
autoDiscoveryConfig: &clusterAPIAutoDiscoveryConfig{
clusterName: "foo",
@@ -53,6 +53,7 @@ const (
resourceNameMachineSet = "machinesets"
resourceNameMachineDeployment = "machinedeployments"
failedMachinePrefix = "failed-machine-"
machineTemplateKind = "MachineTemplate"
machineDeploymentKind = "MachineDeployment"
machineSetKind = "MachineSet"
machineKind = "Machine"
@@ -80,6 +81,10 @@ type machineController struct {
machineDeploymentsAvailable bool
accessLock sync.Mutex
autoDiscoverySpecs []*clusterAPIAutoDiscoveryConfig
// stopChannel is used for running the shared informers, and for starting
// informers associated with infrastructure machine templates that are
// discovered during operation.
stopChannel <-chan struct{}
}
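
The comment on `stopChannel` explains why the channel is now stored on the controller: informers started later, during operation, need the same stop signal as the shared informers started in `run`. The standalone sketch below illustrates that pattern with only the standard library. The `controller` type, `startWorker`, and `runDemo` are hypothetical stand-ins, not code from this commit; goroutines take the place of informers.

```go
package main

import (
	"fmt"
	"sync"
)

// controller is a hypothetical stand-in for machineController. It keeps a
// receive-only stop channel so that work started at any time can observe
// the same shutdown signal.
type controller struct {
	stopChannel <-chan struct{}

	wg      sync.WaitGroup
	mu      sync.Mutex
	stopped []string
}

// startWorker launches a goroutine that blocks until stopChannel is closed,
// then records that it has shut down.
func (c *controller) startWorker(name string) {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		<-c.stopChannel
		c.mu.Lock()
		c.stopped = append(c.stopped, name)
		c.mu.Unlock()
	}()
}

// runDemo starts one worker up front and one "discovered later", closes the
// shared channel once, and reports how many workers stopped.
func runDemo() int {
	stop := make(chan struct{})
	c := &controller{stopChannel: stop}

	c.startWorker("shared-informers")
	// Started later, but still bound to the same stop channel.
	c.startWorker("infrastructure-template-informer")

	close(stop) // a single close reaches every worker
	c.wg.Wait()
	return len(c.stopped)
}

func main() {
	fmt.Println("workers stopped:", runDemo())
}
```

Closing the channel once is enough to stop both workers, which is the property that lets a caller shut down informers it did not know about when the controller was created.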

func indexMachineByProviderID(obj interface{}) ([]string, error) {
@@ -170,9 +175,9 @@ func (c *machineController) findMachineSetOwner(machineSet *unstructured.Unstruc

// run starts shared informers and waits for the informer cache to
// synchronize.
func (c *machineController) run(stopCh <-chan struct{}) error {
c.workloadInformerFactory.Start(stopCh)
c.managementInformerFactory.Start(stopCh)
func (c *machineController) run() error {
c.workloadInformerFactory.Start(c.stopChannel)
c.managementInformerFactory.Start(c.stopChannel)

syncFuncs := []cache.InformerSynced{
c.nodeInformer.HasSynced,
@@ -184,7 +189,7 @@ func (c *machineController) run(stopCh <-chan struct{}) error {
}

klog.V(4).Infof("waiting for caches to sync")
if !cache.WaitForCacheSync(stopCh, syncFuncs...) {
if !cache.WaitForCacheSync(c.stopChannel, syncFuncs...) {
return fmt.Errorf("syncing caches failed")
}

@@ -327,6 +332,7 @@ func newMachineController(
managementDiscoveryClient discovery.DiscoveryInterface,
managementScaleClient scale.ScalesGetter,
discoveryOpts cloudprovider.NodeGroupDiscoveryOptions,
stopChannel chan struct{},
) (*machineController, error) {
workloadInformerFactory := kubeinformers.NewSharedInformerFactory(workloadClient, 0)

@@ -409,6 +415,7 @@ func newMachineController(
machineResource: gvrMachine,
machineDeploymentResource: gvrMachineDeployment,
machineDeploymentsAvailable: machineDeploymentAvailable,
stopChannel: stopChannel,
}, nil
}

@@ -708,3 +715,30 @@ func (c *machineController) allowedByAutoDiscoverySpecs(r *unstructured.Unstruct

return false
}

// Get an infrastructure machine template given its GVR, name, and namespace.
func (c *machineController) getInfrastructureResource(resource schema.GroupVersionResource, name string, namespace string) (*unstructured.Unstructured, error) {
// get an informer for this type, this will create the informer if it does not exist
informer := c.managementInformerFactory.ForResource(resource)
// since this may be a new informer, we need to restart the informer factory
c.managementInformerFactory.Start(c.stopChannel)
// wait for the informer to sync
klog.V(4).Infof("waiting for cache sync on infrastructure resource")
if !cache.WaitForCacheSync(c.stopChannel, informer.Informer().HasSynced) {
return nil, fmt.Errorf("syncing cache on infrastructure resource failed")
}
// use the informer to get the object we want, this will use the informer cache if possible
obj, err := informer.Lister().ByNamespace(namespace).Get(name)
if err != nil {
klog.V(4).Infof("Unable to read infrastructure reference from informer, error: %v", err)
return nil, err
}

infra, ok := obj.(*unstructured.Unstructured)
if !ok {
err := fmt.Errorf("Unable to convert infrastructure reference for %s/%s", namespace, name)
klog.V(4).Infof("%v", err)
return nil, err
}
return infra, err
}
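
`getInfrastructureResource` expects the caller to supply a GroupVersionResource for the infrastructure machine template. The visible part of this diff does not show how that value is built, so the sketch below is only one plausible approach, under stated assumptions: it splits an `apiVersion` string into group and version, and derives the resource name by lowercasing the kind and appending "s". The `infraRef` and `groupVersionResource` types and the `gvrFromRef` function are hypothetical, and the `v1alpha4` version in `main` is illustrative. Naive pluralization does not hold for every kind; a discovery-based mapping would be more robust.

```go
package main

import (
	"fmt"
	"strings"
)

// infraRef is a hypothetical, minimal view of an infrastructure reference.
type infraRef struct {
	APIVersion string
	Kind       string
}

// groupVersionResource mirrors the shape of schema.GroupVersionResource so
// this example needs no external modules.
type groupVersionResource struct {
	Group    string
	Version  string
	Resource string
}

// gvrFromRef (hypothetical) derives a resource identifier from a reference.
// Assumption: the resource name is the lowercased kind plus "s".
func gvrFromRef(ref infraRef) (groupVersionResource, error) {
	parts := strings.Split(ref.APIVersion, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return groupVersionResource{}, fmt.Errorf("unexpected apiVersion %q", ref.APIVersion)
	}
	if ref.Kind == "" {
		return groupVersionResource{}, fmt.Errorf("empty kind in infrastructure reference")
	}
	return groupVersionResource{
		Group:    parts[0],
		Version:  parts[1],
		Resource: strings.ToLower(ref.Kind) + "s",
	}, nil
}

func main() {
	gvr, err := gvrFromRef(infraRef{
		APIVersion: "infrastructure.cluster.x-k8s.io/v1alpha4", // illustrative version
		Kind:       "KubemarkMachineTemplate",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s/%s %s\n", gvr.Group, gvr.Version, gvr.Resource)
}
```

With the Kubemark kind, the derived resource name is `kubemarkmachinetemplates`, the same resource the README's RBAC example grants `get` and `list` on.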