Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* feat(*): add more metrics
* Added the RBAC Permission to Linode.
* CA - Correct Cloudprovider PR labelling to area/provider/<provider name>
* fix(volcengine): don't build all provider when volcengine tag exist
* chore: replace `github.com/ghodss/yaml` with `sigs.k8s.io/yaml`
  At the time of making this commit, the package `github.com/ghodss/yaml` is no longer actively maintained. `sigs.k8s.io/yaml` is a permanent fork of `ghodss/yaml` and is actively maintained by Kubernetes SIG.
  Signed-off-by: Eng Zer Jun <[email protected]>
* Fixed Typo and Trailing-whitespace
* Skip healthiness check for non-existing similar node groups
* BinpackingLimiter interface
* fix comment and list format
* add more logging for balancing similar node groups
  this change adds some logging at verbosity levels 2 and 3 to help diagnose why the cluster-autoscaler does not consider 2 or more node groups to be similar.
* Update VPA scripts to use v1.
* fix: don't clean `CriticalAddonsOnly` taint from template nodes
  - this taint leads to unexpected behavior
  - users expect CA to consider the taint when autoscaling
  Signed-off-by: vadasambar <[email protected]>
* Updated the owners of civo cloudprovider
  Signed-off-by: Vishal Anarse <[email protected]>
* Bump golang from 1.20.4 to 1.20.5 in /vertical-pod-autoscaler/builder
  Bumps golang from 1.20.4 to 1.20.5.
  ---
  updated-dependencies:
  - dependency-name: golang
    dependency-type: direct:production
    update-type: version-update:semver-patch
  ...
  Signed-off-by: dependabot[bot] <[email protected]>
* cluster-autoscaler: support Brightbox image pattern
  cluster-autoscaler/cloudprovider/brightbox
  Allow scaled workers to be built from an image name pattern as well as an image id. This deals with long running clusters where the official image is updated with security changes over time.
* brightbox: set default docker registry
  Only set the registry value in the local Makefile if it is missing
* Update oci-ip-cluster-autoscaler-w-config.yaml
* Update oci-ip-cluster-autoscaler-w-principals.yaml
* address comments
* golint fix
* Remove print condition for vpa-beta2-crd.
* Improvement: Modified the VPA content for the helm chart.
* Bump the Chart version to 9.29.1 and CA image to 1.27.2
* make no-op binpacking limiter as default + move mark nodegroups to its method
* Drop projected volumes for init containers
* fix zonal gce outage breaking CA when only some of the zones are failed
* Bump version to 0.14.0 as a preparation for release.
* Update vendor to Kubernetes 1.28.0-alpha.2
* Interface fixes after Kubernetes 1.28.0-alpha.2 vendor update
* Execute git commands to show the state of local clone of the repo.
* Clarify and simplify the "build and stage images" step.
* Mention logs from kubernetes#5862 in release instructions.
* addressed comments
* chore: remove unused func scaleFromZeroAnnotationsEnabled
  Signed-off-by: Dinesh B <[email protected]>
* add cluster-autoscaler name and version to the user agent
  This makes it easier to distinguish between various users of the Go SDK.
* Explicitly create and remove buildx builders
* Apply fixes to in place support VPA AEP
  Looks like the first PR merged a bit too early, while there were open comments
* Add voelzmo to VPA reviewers
  voelzmo meets [requirements](https://github.com/kubernetes/community/blob/9504ce87ec14cff9455e794fdcbc5088c52f9dd9/community-membership.md#requirements-1):
  - K8s org members since 2023-02: kubernetes/org#4015
  - Reviewer of 12 merged VPA PRs: https://github.com/kubernetes/autoscaler/pulls?q=is%3Apr+reviewed-by%3Avoelzmo+label%3Avertical-pod-autoscaler+is%3Amerged+
  - Sent 10 merged VPA PRs: https://github.com/kubernetes/autoscaler/pulls?q=is%3Apr+label%3Avertical-pod-autoscaler+author%3Avoelzmo+is%3Amerged
* Bump default VPA version to 0.14.0
* Minor tweaks after preparing VPA 0.14.0 release.
* fix: CA on fargate causing log flood
  - happens when CA tries to check if the unmanaged fargate node is a part of ASG (it isn't)
  - and keeps on logging error
  Signed-off-by: vadasambar <[email protected]>
* test: fix node names
  Signed-off-by: vadasambar <[email protected]>
* Sort nodegroups in order of their ID
* Move two util functions from actuator to delete_in_batch, where they are more appropriate
* Add support for atomic scale-down in node group options
* Extract cropNodesToBudgets function out of actuator file
* Support atomic scale-down option for node groups
* Respond to readability-related comments from the review
* Don't pass NodeGroup as a parameter to functions running asynchronously
* Add unit test for group_deletion_scheduler
* Use single AtomicScaling option for scale up and scale down
* address comments
* Address next set of comments
* update agnhost image to pull from registry.k8s.io
* Revert "Add subresource status for vpa"
  This reverts commit 1384c8b.
* Bugfix for budget cropping
  Previous "CropNodes" function of ScaleDownBudgetProcessor had an assumption that atomically-scaled node groups should be classified as "empty" or "drain" as a whole, however Cluster Autoscaler may classify some of the nodes from a single group as "empty" and others as "drain".
* Remove unneeded node groups regardless of scale down being in cooldown.
* Update VPA vendor
  Generated by running:
  ```
  go mod tidy
  go mod vendor
  ```
* Replace `BuildTestContainer` with use of builder
* Quote temp folder name parameter to avoid errors
* Include short unregistered nodes in calculation of incorrect node group sizes
* Add BigDarkClown to Cluster Autoscaler approvers
* Add support for scaling up ZeroToMaxNodesScaling node groups
* Use appropriate logging levels
* Remove unused field in expander and add comment about estimator
* Merge tests for ZeroToMaxNodesScaling into one table-driven test.
* Merged multiple tests into one single table driven test.
* Fixed some typos.
* Change handling of scale up options for ZeroToMaxNodeScaling in orchestrator
* Started handling scale up options for ZeroToMaxNodeScaling with the existing estimator
* Skip setting similar node groups for the node groups that use ZeroToMaxNodeScaling
* Renamed the autoscaling option from "AtomicScaleUp" to "AtomicScaling"
* Merged multiple tests into one single table driven test.
* Fixed some typos.
* Rename the autoscaling option
* Renamed the "AtomicScaling" autoscaling option to "ZeroOrMaxNodeScaling" to be more clear about the behavior.
* Record all vpa api versions in recommender metrics
  Change the tracking of APIVersion from a boolean indicating if the VPA is v1beta1 to the version string and make sure it gets exported in metrics.
  Add tests for the recommender metrics.
* Add subresource status for vpa
  Add status field in subresource on crd yaml and add new ClusterRole system:vpa-actor to patch /status subresource. The `metadata.generation` only increases on vpa spec update.
  Fix e2e test for patch and create vpa
* Implement threshold interface for use by threshold based limiter
  Add EstimationContext to take into account runtime state of the autoscaling for estimations
  Implement static threshold
  Implement cluster capacity threshold for Estimation Limiter
  Implement similar node groups capacity threshold for Estimation Limiter
  Set default estimation thresholds
* Fix tests
* Add ClusterStateRegistry to the AutoscalingContext.
  Due to the dependency of the MaxNodeProvisionTimeProvider on the context the provider was extracted to a dedicated package and injected to the ClusterStateRegistry after context creation.
* Make signature of GetDurationLimit uniformed with GetNodeLimit
  For SNG threshold include capacity of the currently estimated node group (as it is not part of SNG itself)
  Replaced direct calls with use of getters in cluster capacity threshold
  Renamed getters removing the verb Get
  Replace EstimationContext struct with interface
  Add support for negative threshold value in estimation limiter
* Add support for negative binpacking duration limit in threshold based estimation limiter
* update RBAC to only use verbs that exist for the resources
  Signed-off-by: Maximilian Rink <[email protected]>
* Move powerState to azure_util, change default to powerStateUnknown
  * renames all PowerState* consts to vmPowerState*
  * moves vmPowerState* consts and helper functions to azure_util.go
  * changes default vmPowerState to vmPowerStateUnknown instead of vmPowerStateStopped when a power state is not set.
* test: fix failing tests
  - remove non-relevant comment related to rescheduler
  Signed-off-by: vadasambar <[email protected]>
* feat: set `IgnoreDaemonSetsUtilization` per nodegroup
  Signed-off-by: vadasambar <[email protected]>
  fix: test cases failing for actuator and scaledown/eligibility
  - abstract default values into `config`
  Signed-off-by: vadasambar <[email protected]>
  refactor: rename global `IgnoreDaemonSetsUtilization` -> `GlobalIgnoreDaemonSetsUtilization` in code
  - there is no change in the flag name
  - rename `thresholdGetter` -> `configGetter` and tweak it to accommodate `GetIgnoreDaemonSetsUtilization`
  Signed-off-by: vadasambar <[email protected]>
  refactor: reset help text for `ignore-daemonsets-utilization` flag
  - because per nodegroup override is supported only for AWS ASG tags as of now
  Signed-off-by: vadasambar <[email protected]>
  docs: add info about overriding `--ignore-daemonsets-utilization` per ASG
  - in AWS cloud provider README
  Signed-off-by: vadasambar <[email protected]>
  refactor: use a limiting interface in actuator in place of `NodeGroupConfigProcessor` interface
  - to limit the functions that can be used
  - since we need it only for `GetIgnoreDaemonSetsUtilization`
  Signed-off-by: vadasambar <[email protected]>
  fix: tests failing for actuator
  - rename `staticNodeGroupConfigProcessor` -> `MockNodeGroupConfigGetter`
  - move `MockNodeGroupConfigGetter` to test/common so that it can be used in different tests
  Signed-off-by: vadasambar <[email protected]>
  fix: go lint errors for `MockNodeGroupConfigGetter`
  Signed-off-by: vadasambar <[email protected]>
  test: add tests for `IgnoreDaemonSetsUtilization` in cloud provider dir
  Signed-off-by: vadasambar <[email protected]>
  test: update node group config processor tests for `IgnoreDaemonSetsUtilization`
  Signed-off-by: vadasambar <[email protected]>
  test: update eligibility test cases for `IgnoreDaemonSetsUtilization`
  Signed-off-by: vadasambar <[email protected]>
  test: run actuation tests for 2 NGS
  - one with `IgnoreDaemonSetsUtilization`: `false`
  - one with `IgnoreDaemonSetsUtilization`: `true`
  Signed-off-by: vadasambar <[email protected]>
  test: add tests for `IgnoreDaemonSetsUtilization` in actuator
  - add helper to generate multiple ds pods dynamically
  - get rid of mock config processor because it is not required
  Signed-off-by: vadasambar <[email protected]>
  test: fix failing tests for actuator
  Signed-off-by: vadasambar <[email protected]>
  refactor: remove `GlobalIgnoreDaemonSetUtilization` autoscaling option
  - not required
  Signed-off-by: vadasambar <[email protected]>
  fix: warn message `DefaultScaleDownUnreadyTimeKey` -> `DefaultIgnoreDaemonSetsUtilizationKey`
  Signed-off-by: vadasambar <[email protected]>
  refactor: use `generateDsPods` instead of `generateDsPod`
  Signed-off-by: vadasambar <[email protected]>
  refactor: `globaIgnoreDaemonSetsUtilization` -> `ignoreDaemonSetsUtilization`
  Signed-off-by: vadasambar <[email protected]>
* test: fix merge conflicts in actuator tests
  Signed-off-by: vadasambar <[email protected]>
* refactor: use `actuatorNodeGroupConfigGetter` param in `NewActuator`
  - instead of passing all the processors (we only need `NodeGroupConfigProcessor`)
  Signed-off-by: vadasambar <[email protected]>
* test: refactor eligibility tests
  - add suffix to tests with `IgnoreDaemonSetsUtilization` set to `true` and `IgnoreDaemonSetsUtilization` set to `false`
  Signed-off-by: vadasambar <[email protected]>
* refactor: remove comment line (not relevant anymore)
  Signed-off-by: vadasambar <[email protected]>
* fix: dynamic assignment of the scale down threshold flags. Setting maxEmptyBulkDelete, and maxScaleDownParallelism to be the larger of the two flags in the case both are set
* Refactor autoscaler.go and static_autoscalar.go to move declaration of the NodeDeletion option to main.go
* Fixed go:build tags for ovhcloud
* Update the go:build tag for missing cloud providers.
* Adapt FAQ for Pods without controller
* Use strings instead of NodeGroups as map keys in budgets.go
* Delete dead code from budgets.go
* Re-introduce asynchronous node deletion and clean node deletion logic.
* feat: support custom scheduler config for in-tree scheduler plugins (without extenders)
  Signed-off-by: vadasambar <[email protected]>
  refactor: rename `--scheduler-config` -> `--scheduler-config-file` to avoid confusion
  Signed-off-by: vadasambar <[email protected]>
  fix: `goto` causing infinite loop
  - abstract out running extenders in a separate function
  Signed-off-by: vadasambar <[email protected]>
  refactor: remove code around extenders
  - we decided not to use scheduler extenders for checking if a pod would fit on a node
  Signed-off-by: vadasambar <[email protected]>
  refactor: move scheduler config to a `utils/scheduler` package
  - use default config as a fallback
  Signed-off-by: vadasambar <[email protected]>
  test: fix static_autoscaler test
  Signed-off-by: vadasambar <[email protected]>
  refactor: `GetSchedulerConfiguration` fn
  - remove falling back
  - add mechanism to detect if the scheduler config file flag was set
  Signed-off-by: vadasambar <[email protected]>
  test: wip add tests for `GetSchedulerConfig`
  - tests are failing now
  Signed-off-by: vadasambar <[email protected]>
  test: add tests for `GetSchedulerConfig`
  - abstract error messages so that we can use them in the tests
  - set api version explicitly (this is what upstream does as well)
  Signed-off-by: vadasambar <[email protected]>
  refactor: do a round of cleanup to make PR ready for review
  - make import names consistent
  Signed-off-by: vadasambar <[email protected]>
  fix: use `pflag` to check if the `--scheduler-config-file` flag was set
  Signed-off-by: vadasambar <[email protected]>
  docs: add comments for exported error constants
  Signed-off-by: vadasambar <[email protected]>
  refactor: don't export error messages
  - exporting is not needed
  Signed-off-by: vadasambar <[email protected]>
  fix: add underscore in test file name
  Signed-off-by: vadasambar <[email protected]>
  test: fix test failing because of no comment on exported `SchedulerConfigFileFlag`
  Signed-off-by: vadasambar <[email protected]>
  refactor: change name of flag variable `schedulerConfig` -> `schedulerConfigFile`
  - avoids confusion
  Signed-off-by: vadasambar <[email protected]>
  test: add extra test cases for predicate checker
  - where the predicate checker uses custom scheduler config
  Signed-off-by: vadasambar <[email protected]>
  refactor: remove `setFlags` variable
  - not needed anymore
  Signed-off-by: vadasambar <[email protected]>
  refactor: abstract custom scheduler configs into `config` package
  - make them constants
  Signed-off-by: vadasambar <[email protected]>
  test: fix linting error
  Signed-off-by: vadasambar <[email protected]>
  refactor: introduce a new custom test predicate checker
  - instead of adding a param to the current one
  - this is so that we don't have to pass `nil` to the existing test predicate checker in many places
  Signed-off-by: vadasambar <[email protected]>
  refactor: rename `NewCustomPredicateChecker` -> `NewTestPredicateCheckerWithCustomConfig`
  - latter narrows down meaning of the function better than former
  Signed-off-by: vadasambar <[email protected]>
  refactor: rename `GetSchedulerConfig` -> `ConfigFromPath`
  - `scheduler.ConfigFromPath` is shorter and feels less vague than `scheduler.GetSchedulerConfig`
  - move test config to a new package `test` under `config` package
  Signed-off-by: vadasambar <[email protected]>
  docs: add `TODO` for replacing code to parse scheduler config
  - with upstream function
  Signed-off-by: vadasambar <[email protected]>
* Use fixed version of golang image
* Fix TestBinpackingLimiter flake
* Bump golang from 1.20.5 to 1.20.6 in /vertical-pod-autoscaler/builder
  Bumps golang from 1.20.5 to 1.20.6.
  ---
  updated-dependencies:
  - dependency-name: golang
    dependency-type: direct:production
    update-type: version-update:semver-patch
  ...
  Signed-off-by: dependabot[bot] <[email protected]>
* Fix: Do not inject fakeNode for instance which has errors on create
* chore: add script to update vendored hcloud-go
* chore(deps): update vendored hcloud-go to 2.0.0
  Generated by:
  ```
  UPSTREAM_REF=v2.0.0 hack/update-vendor.sh
  ```
* fix: balancer RBAC permission to update balancer status
* CA - AWS Cloudprovider OWNERS Update
* Enable parallel drain by default.
* Add BigDarkClown to patch releases schedule
* Update Cluster Autoscaler vendor to K8s 1.28.0-beta.0
* Add EstimationAnalyserFunc to be run at the end of the estimation logic
* Remove ChangeRequirements with `OrEqual`
* Add EvictionRequirements to types
* Run `generate-crd-yaml.sh`
* Add metrics for improved observability:
  * pending_node_deletions
  * failed_gpu_scale_ups_total
* Add requirement for Custom Resources to VPA FAQ
* Clarify Eviction Control for Pods with multiple Containers
* Fix broken hyperlink
  Co-authored-by: Shubham <[email protected]>
* Update vertical-pod-autoscaler/FAQ.md
  Co-authored-by: Joachim <[email protected]>
* Update vertical-pod-autoscaler/FAQ.md
  Co-authored-by: Joachim <[email protected]>
* Reword AND/OR combinations for more clarity
* Fix nil pointer exception for case when node is nil while processing gpuInfo
* feat: add prometheus basic auth
  Signed-off-by: AhmedGrati <[email protected]>
* Add error code for invalid reservations to GCE client
* Bump golang from 1.20.6 to 1.20.7 in /vertical-pod-autoscaler/builder
  Bumps golang from 1.20.6 to 1.20.7.
  ---
  updated-dependencies:
  - dependency-name: golang
    dependency-type: direct:production
    update-type: version-update:semver-patch
  ...
  Signed-off-by: dependabot[bot] <[email protected]>
* Support ZeroOrMaxNodeScaling node groups when cleaning up unregistered nodes
* Don't pass nil nodes to GetGpuInfoForMetrics
* Revert "Fix nil pointer exception for case when node is nil while processing …"
* Clean up NodeGroupConfigProcessor interface
* docs: add kep to add fswatcher to nanny for automatic nanny configuration
  Signed-off-by: AhmedGrati <[email protected]>
* Allow using an external secret instead of using the one the Helm chart creates
* Remove the MaxNodeProvisioningTimeProvider interface
* Fixed the hyperlink for Node group auto discovery.
* Update ResourcePolicy description and limit control README
* s390x image support
* Bump golang from 1.20.7 to 1.21.0 in /vertical-pod-autoscaler/builder
  Bumps golang from 1.20.7 to 1.21.0.
  ---
  updated-dependencies:
  - dependency-name: golang
    dependency-type: direct:production
    update-type: version-update:semver-minor
  ...
  Signed-off-by: dependabot[bot] <[email protected]>
* test
* Set batch size to target size for atomically scaled groups
* a little extra validation
* test with 2 atomic groups
* don't block draining other groups when one group has some empty nodes
* fix: Broken links to testgrid dashboard
* fix: scale down broken for providers not implementing NodeGroup.GetOptions()
* feat(hetzner): use less requests while waiting for server create
  The default is to send a new request every 500ms, this will instead use an exponential backoff while waiting for the server to create.
* Update in-place updates AEP adding details to consider
* Fix Doc with External gRPC
  Signed-off-by: ZhengSheng0524 <[email protected]>
* Add fetch reservations in specific project
  GCE supports shared reservations where the reservation is in a different project than the project the cluster is in. Add GCE client method to get said reservations so autoscaling can support shared reservations.
* kep: add config file format and structure notes
  Signed-off-by: AhmedGrati <[email protected]>
* CA - 1.28.0 k/k Vendor Update
* Fix duplicate imports in IT
* re-add changes part of FORK-CHANGE
* Re added a fork change command and updated sync change notes
* Update cluster-autoscaler/SYNC-CHANGES/SYNC_CHANGES-1.28.md
  Co-authored-by: Rishabh Patel <[email protected]>

---------

Signed-off-by: Eng Zer Jun <[email protected]>
Signed-off-by: vadasambar <[email protected]>
Signed-off-by: Vishal Anarse <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Dinesh B <[email protected]>
Signed-off-by: Maximilian Rink <[email protected]>
Signed-off-by: vadasambar <[email protected]>
Signed-off-by: AhmedGrati <[email protected]>
Signed-off-by: ZhengSheng0524 <[email protected]>
Co-authored-by: qianlei.qianl <[email protected]>
Co-authored-by: shubham82 <[email protected]>
Co-authored-by: Guy Templeton <[email protected]>
Co-authored-by: Kubernetes Prow Robot <[email protected]>
Co-authored-by: Eng Zer Jun <[email protected]>
Co-authored-by: Bartłomiej Wróblewski <[email protected]>
Co-authored-by: Kushagra <[email protected]>
Co-authored-by: kei-gnu <[email protected]>
Co-authored-by: michael mccune <[email protected]>
Co-authored-by: vadasambar <[email protected]>
Co-authored-by: Vishal Anarse <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Neil Wilson <[email protected]>
Co-authored-by: Sourabh Gupta <[email protected]>
Co-authored-by: Artur Żyliński <[email protected]>
Co-authored-by: Damika Gamlath <[email protected]>
Co-authored-by: Karol Golab <[email protected]>
Co-authored-by: Dinesh B <[email protected]>
Co-authored-by: Todd Neal <[email protected]>
Co-authored-by: Marco Voelz <[email protected]>
Co-authored-by: Joachim Bartosik <[email protected]>
Co-authored-by: vadasambar <[email protected]>
Co-authored-by: Karol Wychowaniec <[email protected]>
Co-authored-by: Kevin Wiesmueller <[email protected]>
Co-authored-by: Aleksandra Gacek <[email protected]>
Co-authored-by: Hakan Bostan <[email protected]>
Co-authored-by: xiaoqing <[email protected]>
Co-authored-by: Yuriy Stryuchkov <[email protected]>
Co-authored-by: Daniel Gutowski <[email protected]>
Co-authored-by: Maximilian Rink <[email protected]>
Co-authored-by: dom.bozzuto <[email protected]>
Co-authored-by: bsoghigian <[email protected]>
Co-authored-by: Krzysztof Siedlecki <[email protected]>
Co-authored-by: Julian Tölle <[email protected]>
Co-authored-by: Amir Alavi <[email protected]>
Co-authored-by: Daniel Kłobuszewski <[email protected]>
Co-authored-by: droctothorpe <[email protected]>
Co-authored-by: Marco Voelz <[email protected]>
Co-authored-by: Jayant Jain <[email protected]>
Co-authored-by: AhmedGrati <[email protected]>
Co-authored-by: Mike Tougeron <[email protected]>
Co-authored-by: Sachin Tiptur <[email protected]>
Co-authored-by: Saripalli Lavanya <[email protected]>
Co-authored-by: Aleksandra Malinowska <[email protected]>
Co-authored-by: Yash Khare <[email protected]>
Co-authored-by: Piotr Betkier <[email protected]>
Co-authored-by: ZhengSheng0524 <[email protected]>
Co-authored-by: Jessica Chen <[email protected]>
Co-authored-by: Rishabh Patel <[email protected]>