Add --feature-gates flag to support scale up on volume limits (CSI migration enabled) #4539

Merged

ialidzhikov
Contributor

/kind bug
/sig autoscaling
/sig storage

What this PR does / why we need it:

This PR adds a --feature-gates flag to the cluster-autoscaler. For more details on why this is needed, see #4517.

Which issue(s) this PR fixes:

Fixes #4517

Special notes for your reviewer:

This approach has the small drawback that it adds all K8s feature gates as known ones (which makes the --help output verbose and misleading).
On the other hand, with this approach we receive the list of known feature gates and their default values from upstream (by vendoring cluster-autoscaler), so there is no need to manually maintain our own custom list of upstream feature gates and their defaults.
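
To illustrate the trade-off described above, here is a minimal sketch of a registry of known gates with default values that a `--feature-gates` flag can override. It uses only the Go standard library and is not the code in this PR: `gateSet`, `newGateSet`, and `Enabled` are hypothetical names, and the two gates and their defaults are copied from the `--help` output in this thread.

```go
package main

import (
	"flag"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// gateSet is a hypothetical, simplified stand-in for a feature-gate registry.
// It maps every known gate name to its current value.
type gateSet struct {
	values map[string]bool
}

// newGateSet copies the defaults so later overrides do not mutate the caller's map.
func newGateSet(defaults map[string]bool) *gateSet {
	values := make(map[string]bool, len(defaults))
	for name, enabled := range defaults {
		values[name] = enabled
	}
	return &gateSet{values: values}
}

// String implements flag.Value and renders the gates sorted by name.
func (g *gateSet) String() string {
	if g == nil {
		return ""
	}
	pairs := make([]string, 0, len(g.values))
	for name, enabled := range g.values {
		pairs = append(pairs, fmt.Sprintf("%s=%t", name, enabled))
	}
	sort.Strings(pairs)
	return strings.Join(pairs, ",")
}

// Set implements flag.Value. It accepts "Name=true,Other=false" and
// rejects gate names that are not in the known set.
func (g *gateSet) Set(raw string) error {
	for _, pair := range strings.Split(raw, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) != 2 {
			return fmt.Errorf("missing value for feature gate %q", pair)
		}
		name := strings.TrimSpace(parts[0])
		if _, known := g.values[name]; !known {
			return fmt.Errorf("unrecognized feature gate %q", name)
		}
		enabled, err := strconv.ParseBool(strings.TrimSpace(parts[1]))
		if err != nil {
			return fmt.Errorf("invalid value for feature gate %q: %v", name, err)
		}
		g.values[name] = enabled
	}
	return nil
}

// Enabled reports the current value of a gate; unknown gates report false.
func (g *gateSet) Enabled(name string) bool {
	return g.values[name]
}

func main() {
	// Defaults as listed in the --help output pasted in this thread.
	gates := newGateSet(map[string]bool{
		"CSIMigration":    true,
		"CSIMigrationAWS": false,
	})

	fs := flag.NewFlagSet("cluster-autoscaler", flag.ContinueOnError)
	fs.Var(gates, "feature-gates", "comma-separated key=value pairs of feature gates")

	if err := fs.Parse([]string{"--feature-gates=CSIMigrationAWS=true"}); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(gates)
}
```

In this sketch the known set is a hand-written map, which is exactly the maintenance burden the PR avoids by taking the list and its defaults from upstream.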

Does this PR introduce a user-facing change?

The cluster-autoscaler now supports a `--feature-gates` flag that allows enabling CSI migration related feature gates. This is required to support scale up on volume limits in Kubernetes clusters with CSI migration enabled.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

[KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/625-csi-migration/README.md
[Other doc]: https://kubernetes.io/blog/2021/12/10/storage-in-tree-to-csi-migration-status-update/

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 19, 2021
@ialidzhikov
Contributor Author

/cc @msau42 @jsafrane

@mwielgus
Contributor

This approach has the small drawback that it adds all K8s feature gates as known ones (which makes the --help output verbose and misleading).

Can you please paste what the --help output would look like with this PR?

@ialidzhikov
Contributor Author

Can you please paste what the --help output would look like with this PR?

Help output
Usage of ./cluster-autoscaler:
pflag: help requested
      --add-dir-header                                                     If true, adds the file directory to the header of the log messages
      --address string                                                     The address to expose prometheus metrics. (default ":8085")
      --alsologtostderr                                                    log to standard error as well as files
      --aws-use-static-instance-list                                       Should CA fetch instance types in runtime or use a static list. AWS only
      --balance-similar-node-groups                                        Detect similar node groups and balance the number of nodes between them
      --balancing-ignore-label MultiStringFlag                             Specifies a label to ignore in addition to the basic and cloud-provider set of labels when comparing if two node groups are similar (default [])
      --cloud-config string                                                The path to the cloud provider configuration file.  Empty string for no configuration file.
      --cloud-provider string                                              Cloud provider type. Available values: [aws,azure,gce,alicloud,baiducloud,magnum,digitalocean,huaweicloud,clusterapi,mcm] (default "gce")
      --cloud-provider-gce-l7lb-src-cidrs cidrs                            CIDRs opened in GCE firewall for L7 LB traffic proxy & health checks (default 130.211.0.0/22,35.191.0.0/16)
      --cloud-provider-gce-lb-src-cidrs cidrs                              CIDRs opened in GCE firewall for L4 LB traffic proxy & health checks (default 130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16)
      --cluster-name string                                                Autoscaled cluster name, if available
      --control-apiserver-burst int                                        Throttling burst configuration for the client to control cluster's apiserver. (default 10)
      --control-apiserver-qps float                                        Throttling QPS configuration for the client to control cluster's apiserver. (default 5)
      --cores-total string                                                 Minimum and maximum number of cores in cluster, in the format <min>:<max>. Cluster autoscaler will not scale the cluster beyond these numbers. (default "0:320000")
      --estimator string                                                   Type of resource estimator to be used in scale up. Available values: [binpacking] (default "binpacking")
      --expander string                                                    Type of node group expander to be used in scale up. Available values: [random,most-pods,least-waste,price,priority] (default "random")
      --expendable-pods-priority-cutoff int                                Pods with priority below cutoff will be expendable. They can be killed without any consideration during scale down and they don't cause scale up. Pods with null priority (PodPriority disabled) are non expendable. (default -10)
      --feature-gates mapStringBool                                        A set of key=value pairs that describe feature gates for alpha/experimental features. Options are:
                                                                           APIListChunking=true|false (BETA - default=true)
                                                                           APIPriorityAndFairness=true|false (ALPHA - default=false)
                                                                           APIResponseCompression=true|false (BETA - default=true)
                                                                           AllAlpha=true|false (ALPHA - default=false)
                                                                           AllBeta=true|false (BETA - default=false)
                                                                           AllowInsecureBackendProxy=true|false (BETA - default=true)
                                                                           AnyVolumeDataSource=true|false (ALPHA - default=false)
                                                                           AppArmor=true|false (BETA - default=true)
                                                                           BalanceAttachedNodeVolumes=true|false (ALPHA - default=false)
                                                                           BoundServiceAccountTokenVolume=true|false (ALPHA - default=false)
                                                                           CPUManager=true|false (BETA - default=true)
                                                                           CRIContainerLogRotation=true|false (BETA - default=true)
                                                                           CSIInlineVolume=true|false (BETA - default=true)
                                                                           CSIMigration=true|false (BETA - default=true)
                                                                           CSIMigrationAWS=true|false (BETA - default=false)
                                                                           CSIMigrationAWSComplete=true|false (ALPHA - default=false)
                                                                           CSIMigrationAzureDisk=true|false (BETA - default=false)
                                                                           CSIMigrationAzureDiskComplete=true|false (ALPHA - default=false)
                                                                           CSIMigrationAzureFile=true|false (ALPHA - default=false)
                                                                           CSIMigrationAzureFileComplete=true|false (ALPHA - default=false)
                                                                           CSIMigrationGCE=true|false (BETA - default=false)
                                                                           CSIMigrationGCEComplete=true|false (ALPHA - default=false)
                                                                           CSIMigrationOpenStack=true|false (BETA - default=false)
                                                                           CSIMigrationOpenStackComplete=true|false (ALPHA - default=false)
                                                                           CSIMigrationvSphere=true|false (BETA - default=false)
                                                                           CSIMigrationvSphereComplete=true|false (BETA - default=false)
                                                                           CSIStorageCapacity=true|false (ALPHA - default=false)
                                                                           CSIVolumeFSGroupPolicy=true|false (ALPHA - default=false)
                                                                           ConfigurableFSGroupPolicy=true|false (ALPHA - default=false)
                                                                           CustomCPUCFSQuotaPeriod=true|false (ALPHA - default=false)
                                                                           DefaultPodTopologySpread=true|false (ALPHA - default=false)
                                                                           DevicePlugins=true|false (BETA - default=true)
                                                                           DisableAcceleratorUsageMetrics=true|false (ALPHA - default=false)
                                                                           DynamicKubeletConfig=true|false (BETA - default=true)
                                                                           EndpointSlice=true|false (BETA - default=true)
                                                                           EndpointSliceProxying=true|false (BETA - default=true)
                                                                           EphemeralContainers=true|false (ALPHA - default=false)
                                                                           ExpandCSIVolumes=true|false (BETA - default=true)
                                                                           ExpandInUsePersistentVolumes=true|false (BETA - default=true)
                                                                           ExpandPersistentVolumes=true|false (BETA - default=true)
                                                                           ExperimentalHostUserNamespaceDefaulting=true|false (BETA - default=false)
                                                                           GenericEphemeralVolume=true|false (ALPHA - default=false)
                                                                           HPAScaleToZero=true|false (ALPHA - default=false)
                                                                           HugePageStorageMediumSize=true|false (BETA - default=true)
                                                                           HyperVContainer=true|false (ALPHA - default=false)
                                                                           IPv6DualStack=true|false (ALPHA - default=false)
                                                                           ImmutableEphemeralVolumes=true|false (BETA - default=true)
                                                                           KubeletPodResources=true|false (BETA - default=true)
                                                                           LegacyNodeRoleBehavior=true|false (BETA - default=true)
                                                                           LocalStorageCapacityIsolation=true|false (BETA - default=true)
                                                                           LocalStorageCapacityIsolationFSQuotaMonitoring=true|false (ALPHA - default=false)
                                                                           NodeDisruptionExclusion=true|false (BETA - default=true)
                                                                           NonPreemptingPriority=true|false (BETA - default=true)
                                                                           PodDisruptionBudget=true|false (BETA - default=true)
                                                                           PodOverhead=true|false (BETA - default=true)
                                                                           ProcMountType=true|false (ALPHA - default=false)
                                                                           QOSReserved=true|false (ALPHA - default=false)
                                                                           RemainingItemCount=true|false (BETA - default=true)
                                                                           RemoveSelfLink=true|false (ALPHA - default=false)
                                                                           RotateKubeletServerCertificate=true|false (BETA - default=true)
                                                                           RunAsGroup=true|false (BETA - default=true)
                                                                           RuntimeClass=true|false (BETA - default=true)
                                                                           SCTPSupport=true|false (BETA - default=true)
                                                                           SelectorIndex=true|false (BETA - default=true)
                                                                           ServerSideApply=true|false (BETA - default=true)
                                                                           ServiceAccountIssuerDiscovery=true|false (ALPHA - default=false)
                                                                           ServiceAppProtocol=true|false (BETA - default=true)
                                                                           ServiceNodeExclusion=true|false (BETA - default=true)
                                                                           ServiceTopology=true|false (ALPHA - default=false)
                                                                           SetHostnameAsFQDN=true|false (ALPHA - default=false)
                                                                           StartupProbe=true|false (BETA - default=true)
                                                                           StorageVersionHash=true|false (BETA - default=true)
                                                                           SupportNodePidsLimit=true|false (BETA - default=true)
                                                                           SupportPodPidsLimit=true|false (BETA - default=true)
                                                                           Sysctls=true|false (BETA - default=true)
                                                                           TTLAfterFinished=true|false (ALPHA - default=false)
                                                                           TokenRequest=true|false (BETA - default=true)
                                                                           TokenRequestProjection=true|false (BETA - default=true)
                                                                           TopologyManager=true|false (BETA - default=true)
                                                                           ValidateProxyRedirects=true|false (BETA - default=true)
                                                                           VolumeSnapshotDataSource=true|false (BETA - default=true)
                                                                           WarningHeaders=true|false (BETA - default=true)
                                                                           WinDSR=true|false (ALPHA - default=false)
                                                                           WinOverlay=true|false (ALPHA - default=false)
                                                                           WindowsEndpointSliceProxying=true|false (ALPHA - default=false)
      --gpu-total MultiStringFlag                                          Minimum and maximum number of different GPUs in cluster, in the format <gpu_type>:<min>:<max>. Cluster autoscaler will not scale the cluster beyond these numbers. Can be passed multiple times. CURRENTLY THIS FLAG ONLY WORKS ON GKE. (default [])
      --ignore-daemonsets-utilization                                      Should CA ignore DaemonSet pods when calculating resource utilization for scaling down
      --ignore-mirror-pods-utilization                                     Should CA ignore Mirror pods when calculating resource utilization for scaling down
      --ignore-taint MultiStringFlag                                       Specifies a taint to ignore in node templates when considering to scale a node group (default [])
      --kubeconfig string                                                  Path to kubeconfig file with authorization and master location information.
      --kubernetes string                                                  Kubernetes master location. Leave blank for default
      --leader-elect                                                       Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability. (default true)
      --leader-elect-lease-duration duration                               The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. (default 15s)
      --leader-elect-renew-deadline duration                               The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than or equal to the lease duration. This is only applicable if leader election is enabled. (default 10s)
      --leader-elect-resource-lock string                                  The type of resource object that is used for locking during leader election. Supported options are 'endpoints', 'configmaps', 'leases', 'endpointsleases' and 'configmapsleases'. (default "leases")
      --leader-elect-resource-name string                                  The name of resource object that is used for locking during leader election.
      --leader-elect-resource-namespace string                             The namespace of resource object that is used for locking during leader election.
      --leader-elect-retry-period duration                                 The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. (default 2s)
      --log-backtrace-at traceLocation                                     when logging hits line file:N, emit a stack trace (default :0)
      --log-dir string                                                     If non-empty, write log files in this directory
      --log-file string                                                    If non-empty, use this log file
      --log-file-max-size uint                                             Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --logtostderr                                                        log to standard error instead of files (default true)
      --max-autoprovisioned-node-group-count int                           The maximum number of autoprovisioned groups in the cluster. (default 15)
      --max-bulk-soft-taint-count int                                      Maximum number of nodes that can be tainted/untainted PreferNoSchedule at the same time. Set to 0 to turn off such tainting. (default 10)
      --max-bulk-soft-taint-time duration                                  Maximum duration of tainting/untainting nodes as PreferNoSchedule at the same time. (default 3s)
      --max-empty-bulk-delete int                                          Maximum number of empty nodes that can be deleted at the same time. (default 10)
      --max-failing-time duration                                          Maximum time from last recorded successful autoscaler run before automatic restart (default 15m0s)
      --max-graceful-termination-sec int                                   Maximum number of seconds CA waits for pod termination when trying to scale down a node. (default 600)
      --max-inactivity duration                                            Maximum time from last recorded autoscaler activity before automatic restart (default 10m0s)
      --max-node-provision-time duration                                   Maximum time CA waits for node to be provisioned (default 15m0s)
      --max-nodes-total int                                                Maximum number of nodes in all node groups. Cluster autoscaler will not grow the cluster beyond this number.
      --max-total-unready-percentage float                                 Maximum percentage of unready nodes in the cluster.  After this is exceeded, CA halts operations (default 45)
      --memory-total string                                                Minimum and maximum number of gigabytes of memory in cluster, in the format <min>:<max>. Cluster autoscaler will not scale the cluster beyond these numbers. (default "0:6400000")
      --min-replica-count int                                              Minimum number or replicas that a replica set or replication controller should have to allow their pods deletion in scale down
      --min-resync-period duration                                         The minimum resync period configured for the shared informers used by the MCM provider cached listers (default 1h0m0s)
      --namespace string                                                   Namespace in which cluster-autoscaler run. (default "kube-system")
      --new-pod-scale-up-delay duration                                    Pods less than this old will not be considered for scale-up. (default 0s)
      --node-autoprovisioning-enabled                                      Should CA autoprovision node groups when needed
      --node-deletion-delay-timeout duration                               Maximum time CA waits for removing delay-deletion.cluster-autoscaler.kubernetes.io/ annotations before deleting the node. (default 2m0s)
      --node-group-auto-discovery <name of discoverer>:[<key>[=<value>]]   One or more definition(s) of node group auto-discovery. A definition is expressed <name of discoverer>:[<key>[=<value>]]. The `aws` and `gce` cloud providers are currently supported. AWS matches by ASG tags, e.g. `asg:tag=tagKey,anotherTagKey`. GCE matches by IG name prefix, and requires you to specify min and max nodes per IG, e.g. `mig:namePrefix=pfx,min=0,max=10` Can be used multiple times. (default [])
      --nodes MultiStringFlag                                              sets min,max size and other configuration data for a node group in a format accepted by cloud provider. Can be used multiple times. Format: <min>:<max>:<other...> (default [])
      --ok-total-unready-count int                                         Number of allowed unready nodes, irrespective of max-total-unready-percentage (default 3)
      --profiling                                                          Is debug/pprof endpoint enabled
      --regional                                                           Cluster is regional.
      --scale-down-candidates-pool-min-count int                           Minimum number of nodes that are considered as additional non empty candidatesfor scale down when some candidates from previous iteration are no longer valid.When calculating the pool size for additional candidates we takemax(#nodes * scale-down-candidates-pool-ratio, scale-down-candidates-pool-min-count). (default 50)
      --scale-down-candidates-pool-ratio float                             A ratio of nodes that are considered as additional non empty candidates forscale down when some candidates from previous iteration are no longer valid.Lower value means better CA responsiveness but possible slower scale down latency.Higher value can affect CA performance with big clusters (hundreds of nodes).Set to 1.0 to turn this heuristics off - CA will take all nodes as additional candidates. (default 0.1)
      --scale-down-delay-after-add duration                                How long after scale up that scale down evaluation resumes (default 10m0s)
      --scale-down-delay-after-delete duration                             How long after node deletion that scale down evaluation resumes, defaults to scanInterval (default 0s)
      --scale-down-delay-after-failure duration                            How long after scale down failure that scale down evaluation resumes (default 3m0s)
      --scale-down-enabled                                                 Should CA scale down the cluster (default true)
      --scale-down-gpu-utilization-threshold float                         Sum of gpu requests of all pods running on the node divided by node's allocatable resource, below which a node can be considered for scale down.Utilization calculation only cares about gpu resource for accelerator node. cpu and memory utilization will be ignored. (default 0.5)
      --scale-down-non-empty-candidates-count int                          Maximum number of non empty nodes considered in one iteration as candidates for scale down with drain.Lower value means better CA responsiveness but possible slower scale down latency.Higher value can affect CA performance with big clusters (hundreds of nodes).Set to non positive value to turn this heuristic off - CA will not limit the number of nodes it considers. (default 30)
      --scale-down-unneeded-time duration                                  How long a node should be unneeded before it is eligible for scale down (default 10m0s)
      --scale-down-unready-time duration                                   How long an unready node should be unneeded before it is eligible for scale down (default 20m0s)
      --scale-down-utilization-threshold float                             Sum of cpu or memory of all pods running on the node divided by node's corresponding allocatable resource, below which a node can be considered for scale down (default 0.5)
      --scale-up-from-zero                                                 Should CA scale up when there 0 ready nodes. (default true)
      --scan-interval duration                                             How often cluster is reevaluated for scale up or down (default 10s)
      --skip-headers                                                       If true, avoid header prefixes in the log messages
      --skip-log-headers                                                   If true, avoid headers when opening log files
      --skip-nodes-with-local-storage                                      If true cluster autoscaler will never delete nodes with pods with local storage, e.g. EmptyDir or HostPath (default true)
      --skip-nodes-with-system-pods                                        If true cluster autoscaler will never delete nodes with pods from kube-system (except for DaemonSet or mirror pods) (default true)
      --stderrthreshold severity                                           logs at or above this threshold go to stderr (default 2)
      --target-apiserver-burst int                                         Throttling burst configuration for the client to target cluster's apiserver. (default 10)
      --target-apiserver-qps float                                         Throttling QPS configuration for the client to target cluster's apiserver. (default 5)
      --unremovable-node-recheck-timeout duration                          The timeout before we check again a node that couldn't be removed before (default 5m0s)
  -v, --v Level                                                            number for the log level verbosity
      --vmodule moduleSpec                                                 comma-separated list of pattern=N settings for file-filtered logging
      --write-status-configmap                                             Should CA write status information to a configmap (default true)

@MaciekPytel
Contributor

/lgtm
/approve
This was discussed at the SIG meeting today and got broad support, and it looks good to me. Thanks for the contribution!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 20, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ialidzhikov, MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 20, 2021
@k8s-ci-robot k8s-ci-robot merged commit 6c9b0e9 into kubernetes:master Dec 20, 2021
@marwanad
Member

@ialidzhikov would we want to cherry-pick this backwards to 1.21? I know we just released patches there but maybe for the next round of patches.

@ialidzhikov
Contributor Author

k8s-ci-robot added a commit that referenced this pull request Dec 22, 2021
…539-upstream-cluster-autoscaler-release-1.21

[release-1.21] Automated cherry pick of #4539: Add `--feature-gates` flag to support scale up on volume
k8s-ci-robot added a commit that referenced this pull request Dec 22, 2021
…539-upstream-cluster-autoscaler-release-1.22

[release-1.22] Automated cherry pick of #4539: Add `--feature-gates` flag to support scale up on volume
k8s-ci-robot added a commit that referenced this pull request Dec 22, 2021
…539-upstream-cluster-autoscaler-release-1.20

[release-1.20] Automated cherry pick of #4539: Add `--feature-gates` flag to support scale up on volume
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

cluster-autoscaler cannot count migrated PVs (CSI enabled) and cannot scale up on exceed max volume count
6 participants