Hetzner: no node labels #6715

Closed
yellowhat opened this issue Apr 16, 2024 · 7 comments
Labels
  • area/provider/hetzner: Issues or PRs related to Hetzner provider
  • kind/bug: Categorizes issue or PR as related to a bug.
  • kind/documentation: Categorizes issue or PR as related to documentation.

Comments

@yellowhat

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

1.29.0

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.3

What environment is this in?:

Hetzner

What did you expect to happen?:

I have HCLOUD_CLUSTER_CONFIG:

{
    "imagesForArch": {
        "amd64": "<id>"
    },
    "nodeConfigs": {
        "pool1": {
            "cloudInit": "<blah>",
            "labels": {
                "node.kubernetes.io/role": "autoscaler-node",
                "custom/role": "test"
            }
        }
    }
}

and the following args:

...
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
          imagePullPolicy: "Always"
          command:
            - ./cluster-autoscaler
            - --cloud-provider=hetzner
            - --v=4
            - --scale-down-enabled
            - --ignore-daemonsets-utilization
            - --ignore-mirror-pods-utilization
            - --scale-down-unneeded-time=30s
            - --scale-down-delay-after-add=30s
            - --nodes=0:10:CPX31:NBG1:pool1
...

I would have expected to see the additional labels:

  • "node.kubernetes.io/role": "autoscaler-node"
  • "custom/role": "test"

added to the node.
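
For context, here is a minimal sketch of how the HCLOUD_CLUSTER_CONFIG value shown above might be produced. It assumes the variable expects the JSON document base64-encoded, as the Hetzner provider README describes; verify this against the README for your version. The helper name `encode_cluster_config` is hypothetical and the placeholder values are copied from the issue.

```python
import base64
import json


def encode_cluster_config(config: dict) -> str:
    """Serialize the cluster config to JSON and base64-encode it.

    Assumption: HCLOUD_CLUSTER_CONFIG takes base64-encoded JSON.
    """
    raw = json.dumps(config, indent=4).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Placeholder values from the issue; replace "<id>" and "<blah>" before use.
cluster_config = {
    "imagesForArch": {"amd64": "<id>"},
    "nodeConfigs": {
        "pool1": {
            "cloudInit": "<blah>",
            "labels": {
                "node.kubernetes.io/role": "autoscaler-node",
                "custom/role": "test",
            },
        }
    },
}

encoded = encode_cluster_config(cluster_config)
print(encoded)
```

Decoding the printed value with `base64.b64decode` and `json.loads` should give back the same dictionary, which is a quick way to check the value before putting it into the deployment.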

What happened instead?:

$ kubectl describe nodes pool1-53477363a36052f3
Name:               pool1-53477363a36052f3
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=cpx31
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=nbg1
                    failure-domain.beta.kubernetes.io/zone=nbg1-dc3
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=pool1-53477363a36052f3
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=cpx31
                    topology.kubernetes.io/region=nbg1
                    topology.kubernetes.io/zone=nbg1-dc3
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
...

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

  • hetznercloud/hcloud-cloud-controller-manager:v1.19.0
@yellowhat added the kind/bug label Apr 16, 2024
@apricote
Member

/kind documentation
/area provider/hetzner

That is not well described in the README :(. The cluster-autoscaler is not responsible for setting any labels on your nodes. The kubelet creates the Node object through the Kubernetes API. The field you mentioned only tells cluster-autoscaler which labels the node is expected to have, so that it can figure out whether the pending pods could be scheduled onto a node from the node group.

You need to set these labels yourself. The easiest way to do that is in the cloud-init script you pass. The kubelet has a flag --node-labels that sets the labels on the newly created Node object. How you configure the kubelet arguments depends on your Kubernetes distribution.
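
One way to keep the labels declared for cluster-autoscaler and the labels actually applied by the kubelet in sync is to derive the `--node-labels` value from the same config. The sketch below assumes the `nodeConfigs` structure from the issue and the kubelet's comma-separated `key=value` format; the helper name `node_labels_arg` is hypothetical, and how the resulting string reaches the kubelet depends on your distribution.

```python
def node_labels_arg(cluster_config: dict, pool: str) -> str:
    """Render a pool's labels as a comma-separated key=value list.

    The result is meant to be used as the value of the kubelet's
    --node-labels flag inside the cloud-init script for that pool.
    """
    labels = cluster_config["nodeConfigs"][pool].get("labels", {})
    # Sort for a stable, reproducible flag value.
    return ",".join(f"{key}={value}" for key, value in sorted(labels.items()))


config = {
    "nodeConfigs": {
        "pool1": {
            "labels": {
                "node.kubernetes.io/role": "autoscaler-node",
                "custom/role": "test",
            }
        }
    }
}

print("--node-labels=" + node_labels_arg(config, "pool1"))
```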

@k8s-ci-robot added the kind/documentation and area/provider/hetzner labels Apr 17, 2024
@yellowhat
Author

Thanks for the reply, but if I set --node-labels to kubelet, HCCM will not apply its labels.

Am I missing something?

Is it the same behaviour with taints?

@yellowhat
Author

Also by default cluster-autoscaler is aware of these labels:

beta.kubernetes.io/instance-type:...
kubernetes.io/arch:...
topology.kubernetes.io/region:..
csi.hetzner.cloud/location:...
hcloud/node-group:...

It does not account for other "default" labels, like:

kubernetes.io/hostname=...

Therefore a deployment with:

...
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: app
...

will never trigger a scale-up, because the kubernetes.io/hostname label is unknown to cluster-autoscaler.

As a workaround, I have to pass:

{
    "imagesForArch": {
        "amd64": "<id>"
    },
    "nodeConfigs": {
        "pool1": {
            "cloudInit": "<blah>",
            "labels": {
                "kubernetes.io/hostname": ""
            }
        }
    }
}
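
The effect described in this comment can be illustrated with a simplified model. It is not cluster-autoscaler or scheduler code; it only assumes that a node lacking the label named by a `DoNotSchedule` constraint's `topologyKey` is treated as unsuitable for the pod. The function name and the label values are hypothetical.

```python
def missing_topology_keys(node_labels: dict, constraints: list) -> list:
    """Return the topology keys of hard constraints absent from the node labels."""
    return [
        constraint["topologyKey"]
        for constraint in constraints
        if constraint.get("whenUnsatisfiable") == "DoNotSchedule"
        and constraint["topologyKey"] not in node_labels
    ]


# Labels the provider reports for a template node (values are placeholders).
template_labels = {
    "beta.kubernetes.io/instance-type": "cpx31",
    "kubernetes.io/arch": "amd64",
    "topology.kubernetes.io/region": "nbg1",
    "csi.hetzner.cloud/location": "nbg1",
    "hcloud/node-group": "pool1",
}

constraints = [
    {
        "maxSkew": 1,
        "topologyKey": "kubernetes.io/hostname",
        "whenUnsatisfiable": "DoNotSchedule",
    }
]

# Without the hostname label the constraint cannot be evaluated.
print(missing_topology_keys(template_labels, constraints))

# Declaring the label in nodeConfigs, as in the workaround, closes the gap.
with_hostname = {**template_labels, "kubernetes.io/hostname": ""}
print(missing_topology_keys(with_hostname, constraints))
```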

apricote added a commit to hetznercloud/autoscaler that referenced this issue Apr 22, 2024
The Node Group info we currently return does not include the
`kubernetes.io/hostname` label, which is usually set on every node.

This causes issues when the user has an unscheduled pod with a
`topologySpreadConstraint` on `topologyKey: kubernetes.io/hostname`.
cluster-autoscaler is unable to fulfill this constraint and does not
scale up any of the node groups.

Related to kubernetes#6715
@apricote
Member

> Thanks for the reply, but if I set --node-labels to kubelet, HCCM will not apply its labels.

This sounds like the kubelet is missing --cloud-provider=external now. Can you verify in the kubelet logs that both flags are actually set?

This worked when I did a quick test with k3s and flags --kubelet-arg cloud-provider=external --kubelet-arg node-labels=foo=bar


I thought that cluster-autoscaler would automatically take kubernetes.io/hostname into consideration. It looks like a lot of other cloud providers add it manually. I have opened #6740 to add it to the Hetzner provider.

Do you have any other labels that we should always set? Looking at a node, I see kubernetes.io/os but we are not able to reliably predict this.

@yellowhat
Author

You are right: I was overwriting the args and thereby removing --cloud-provider=external.
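
A small sketch of the check suggested above: given the kubelet command line as it appears in the logs or the process list, confirm that both flags survived. The parsing is deliberately naive and the function names are hypothetical; only the two flag names come from this thread.

```python
import shlex


def parse_flags(command_line: str) -> dict:
    """Collect --name=value and --name value pairs from a command line."""
    flags = {}
    tokens = shlex.split(command_line)
    for index, token in enumerate(tokens):
        if not token.startswith("--"):
            continue
        name, separator, value = token[2:].partition("=")
        # Support the "--name value" form when no "=" is present.
        if not separator and index + 1 < len(tokens) and not tokens[index + 1].startswith("--"):
            value = tokens[index + 1]
        flags[name] = value
    return flags


def missing_kubelet_settings(command_line: str) -> list:
    """List the settings from this thread that the command line lacks."""
    flags = parse_flags(command_line)
    problems = []
    if flags.get("cloud-provider") != "external":
        problems.append("--cloud-provider=external")
    if not flags.get("node-labels"):
        problems.append("--node-labels")
    return problems


print(missing_kubelet_settings("kubelet --node-labels=foo=bar"))
print(missing_kubelet_settings("kubelet --cloud-provider=external --node-labels=foo=bar"))
```

An empty list means both settings are present; anything else names what was lost when the arguments were overwritten.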

@yellowhat
Author

Another question: if I set a pool with a minimum size:

...
          command:
            - ./cluster-autoscaler
            - --cloud-provider=hetzner
            - --v=4
            - --scale-down-enabled
            - --ignore-daemonsets-utilization
            - --ignore-mirror-pods-utilization
            - --scale-down-unneeded-time=30s
            - --scale-down-delay-after-add=30s
            - --nodes=0:10:CPX31:HEL1:pool1
            - --nodes=1:2:CPX11:HEL1:lb
...

I would have expected a new worker node to be added on startup; instead, it always waits for a new pod that cannot be scheduled.

Any suggestions?

Thanks

@apricote
Member

This is described in the general cluster-autoscaler docs: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#my-cluster-is-below-minimum--above-maximum-number-of-nodes-but-ca-did-not-fix-that-why
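
As a rough illustration of the behaviour discussed here, the toy model below assumes that scale-up is driven only by unschedulable pods and that the configured minimum is not enforced proactively unless explicitly requested. It is not the autoscaler's code. The `enforce_min` parameter is modeled on the `--enforce-node-group-min-size` flag that, to my understanding, newer cluster-autoscaler releases provide; check the FAQ for your version. All function and parameter names are hypothetical.

```python
def target_size(
    current: int,
    min_size: int,
    max_size: int,
    nodes_needed_for_pending_pods: int,
    enforce_min: bool = False,
) -> int:
    """Toy model: the node group grows only in response to pending pods."""
    target = min(current + nodes_needed_for_pending_pods, max_size)
    if enforce_min:
        # Only with explicit enforcement is the group raised to its minimum.
        target = max(target, min_size)
    return target


# The "lb" pool from the comment above: --nodes=1:2:CPX11:HEL1:lb
print(target_size(current=0, min_size=1, max_size=2, nodes_needed_for_pending_pods=0))
print(target_size(current=0, min_size=1, max_size=2, nodes_needed_for_pending_pods=0, enforce_min=True))
```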

apricote added a commit to hetznercloud/autoscaler that referenced this issue Apr 24, 2024
The Node Group info we currently return does not include the
`kubernetes.io/hostname` label, which is usually set on every node.

This causes issues when the user has an unscheduled pod with a
`topologySpreadConstraint` on `topologyKey: kubernetes.io/hostname`.
cluster-autoscaler is unable to fulfill this constraint and does not
scale up any of the node groups.

Related to kubernetes#6715
thomasstxyz pushed a commit to WhizUs/kubernetes-autoscaler that referenced this issue Apr 24, 2024
The Node Group info we currently return does not include the
`kubernetes.io/hostname` label, which is usually set on every node.

This causes issues when the user has an unscheduled pod with a
`topologySpreadConstraint` on `topologyKey: kubernetes.io/hostname`.
cluster-autoscaler is unable to fulfill this constraint and does not
scale up any of the node groups.

Related to kubernetes#6715
aaronfern pushed a commit to gardener/autoscaler that referenced this issue Jul 25, 2024
* Comment to explain why test is done on STS ownerRef

* add informer argument to clusterapi provider builder

This change adds the informer factory as an argument to the
`buildCloudProvider` function for clusterapi so that building with tags
will work properly.

* Add informer argument to the CloudProviders builder.

* clusterapi: add missing error check

* Add instanceType/region support in Helm chart for Hetzner cloud provider

* doc: cluster-autoscaler: Oracle provider: Add small security note

* doc: cluster-autoscaler: Oracle provider: Add small security note

* doc: cluster-autoscaler: Oracle provider: Add small security note

* Update charts/cluster-autoscaler/README.md

* Update Auto Labels of Subprojects

* check empty ProviderID in ali NodeGroupForNode

* add gce constructor with custom timeout

* update README.md.gotmpl and added Helm docs for Hetzner Cloud

* bump chart version

* use older helm-docs version and remove empty line in values comment

* add missing line breaks

* Update charts/cluster-autoscaler/Chart.yaml

Co-authored-by: Shubham <[email protected]>

* Reduce log spam in AtomicResizeFilteringProcessor

Also, introduce default per-node logging quotas. For now, identical to
the per-pod ones.

* Bump golang in /vertical-pod-autoscaler/pkg/updater

Bumps golang from 1.21.6 to 1.22.0.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/recommender

Bumps golang from 1.21.6 to 1.22.0.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/admission-controller

Bumps golang from 1.21.6 to 1.22.0.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

* Update Chart.yaml

* Move estimatorBuilder from AutoscalingContext to Orchestrator Init

* VPA: bump golang.org/x/net to fix CVE-2023-39325

The version of golang.org/x/net currently used is vulnerable to
https://avd.aquasec.com/nvd/2023/cve-2023-39325/, bump it to fix that.

* Bump go version.

* Fix e2e test setup

* helm: enable clusterapi namespace autodiscovery

* Fix expectedToRegister to respect instances with nil status

* add option to keep node group backoff on OutOfResource error

* remove changes to backoff interface

* attach errors to scale-up request and add comments

* revert optionally keeping node group backoff

* remove RemoveBackoff from updateScaleRequests

* Add ProvisioningRequestProcessor (kubernetes#6488)

* Add kube-env to MigInfoProvider

* CA: GCE: add pricing for new Z3 machines

* Introduce LocalSSDSizeProvider interface for GCE

* Use KubeEnv in gce/templates.go

* Add templateName to kube-env to ensure that correct value is cached

* Add unit-tests

* extract create group to function

* Merged PR 1379: added retry for creatingAzureManager in case of throttled requests

added retry for forceRefresh in case of throttled requests
ran tests
MallocNanoZone=0 go test -race k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure -- passed

and commented out unit test -- commented out as it takes 10 minutes to complete

func TestCreateAzureManagerWithRetryError(t *testing.T) {
	ctrl := gomock.NewController(t)
	defer ctrl.Finish()
	mockVMClient := mockvmclient.NewMockInterface(ctrl)
	mockVMSSClient := mockvmssclient.NewMockInterface(ctrl)
	mockVMSSClient.EXPECT().List(gomock.Any(), "fakeId").Return([]compute.VirtualMachineScaleSet{}, retry.NewError(true, errors.New("test"))).AnyTimes()
	mockAzClient := &azClient{
		virtualMachinesClient:         mockVMClient,
		virtualMachineScaleSetsClient: mockVMSSClient,
	}
	manager, err := createAzureManagerInternal(strings.NewReader(validAzureCfg), cloudprovider.NodeGroupDiscoveryOptions{}, config.AutoscalingOptions{}, mockAzClient)
	assert.Nil(t, manager)
	assert.NotNil(t, err)
}

* docs: update outdated/deprecated taints in the examples

Refactor references to taints & tolerations, replacing master key
with control-plane across all the example YAMLs.

Signed-off-by: Feruzjon Muyassarov <[email protected]>

* CA FAQ: clarify the point about scheduling constraints blocking scale-down

* Add warning about vendor removal to Makefile build target

Signed-off-by: Feruzjon Muyassarov <[email protected]>

* fix: add missing ephemeral-storage resource definition

* Add BuildTestNodeWithAllocatable test utility method.

* Add ProvisioningRequest injector (kubernetes#6529)

* Add ProvisioningRequests injector

* Add test case for Accepted conditions and add supported provreq classes list

* Use Passive clock

* Consider preemption policy for expandable pods

* Fix a bug where atomic scale-down failure could affect subsequent atomic scale-downs

* Update gce_price_info.go

* Migrate from satori/go.uuid to google/uuid

* Delay force refresh by DefaultInterval when OCI GetNodePool call returns 404

* CA: update dependencies to k8s v1.30.0-alpha.3, go1.21.8

* Bump golang in /vertical-pod-autoscaler/pkg/admission-controller

Bumps golang from 1.22.0 to 1.22.1.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/updater

Bumps golang from 1.22.0 to 1.22.1.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/recommender

Bumps golang from 1.22.0 to 1.22.1.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Update expander options for the AWS cloud provider README

* Remove shadow err variable in deleteCreatedNodesWithErros func

* fix memory leak in NodeDeleteTracker

* CA - Add 1.29 to version compatibility matrix

* ClusterAutoscaler: Put APIs in a separate go module

Signed-off-by: Yuki Iwai <[email protected]>

* Extend update-deps.sh so that we can automatically update k8s libraries in the apis pkg

Signed-off-by: Yuki Iwai <[email protected]>

* Clean up update-deps.sh

Signed-off-by: Yuki Iwai <[email protected]>

* Update apis version to v1.29.2

Signed-off-by: Yuki Iwai <[email protected]>

* Allow to override rancher provider settings

Currently it is only possible to set provider settings over yaml file.

This commit introduces env variables to override URL, token and cluster name.

If particular environment variable is set it overrides value supplied in yaml file.

Signed-off-by: Dinar Valeev <[email protected]>
Co-authored-by: Donovan Muller <[email protected]>

* Bump VPA version to 1.1.0

* Deprecate the Linode Cluster Autoscaler provider

Signed-off-by: Ondrej Kokes <[email protected]>

* add price info for n4

* update n4 price info format

* Set "pd-balanced" as DefaultBootDiskType

It is a default since v1.24
Ref: https://cloud.google.com/kubernetes-engine/docs/how-to/custom-boot-disks#specify

* Clarify VPA and HPA limitations

Signed-off-by: Luke Addison <[email protected]>

* Update ionos-cloud-sdk-go and mocks

* Update provider code

* Add cloud API request metrics.
* Fix and update README

* Ignore ionos-cloud-sdk-go spelling

* fix n4 price format

* Add listManagedInstancesResults to GceCache.

* [clusterapi] Do not skip nodegroups with minSize=maxSize

* [clusterapi] Update tests for nodegroups with minSize=maxSize

* add tests

* made changes to support MIGs that use regional instance templates

* modified current unit tests to support the new modifications

* added comment to InstanceTemplateNameType

* Ran hack/go-fmtupdate.h on mig_info_provider_test.go

* Use KubeEnv in gce/templates.go

* Add templateName to kube-env to ensure that correct value is cached

* rebased and resolved conflicts

* added fix for unit tests

* changed InstanceTemplateNameType to InstanceTemplateName

* separated url parser to its own function, created unit test for the function

* separated url parser to its own function, created unit test for the function

* added unit test with regional MIG

* Migrate GCE client to server side operation wait

* Track type of node group created/deleted in auto-provisioned group metrics.

* trigger tests

* fix comment

* Add AtomicScaleUp method to NodeGroup interface

* Add an option to Cluster Autoscaler that allows triggering new loops
more frequently: based on new unschedulable pods and every time a
previous iteration was productive.

* Refactor StartDeletion usage patterns and enforce periodic scaledown status processor calls.

* Bump golang to 1.22

* updated admission-controller to have adjustable --min-tls-version and --tls-ciphers

* CA: Move the ProvisioningRequest CRD to apis module

Signed-off-by: Yuki Iwai <[email protected]>

* Bump default VPA version to 1.1.0

As part of the 1.1.0 release: kubernetes#6388

* Format README

* Add chart versions

* Add script to update required chart versions in README

* Add chart version column in version matrix

* Move cluster-autoscaler update-chart-version-readme script to /hack

* Only check recent revisions when updating README

* Update min cluster-autoscaler chart for Kubernetes 1.29

* Remove unused NodeInfoProcessor

* Fix broken link in README.md to point to equinixmetal readme

* review comments - simplify retry logic

* CA: Before we perform go test, synchronizing go vendor

Signed-off-by: Yuki Iwai <[email protected]>

* Cleanup ProvReq wrapper

* Make the Estimate func accept pods grouped.

The grouping should be made by the schedulability equivalence
meaning we can introduce optimizations to the binpacking.

Introduce a benchmark that estimates capacity needed for 51k pods,
which can be grouped to two equivalence groups 50k and 1k.

* Update CAPI docs

Add a link to the sample manifest and update the image used in the
example.

Signed-off-by: Lennart Jern <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/updater

Bumps golang from 1.22.1 to 1.22.2.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/admission-controller

Bumps golang from 1.22.1 to 1.22.2.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/recommender

Bumps golang from 1.22.1 to 1.22.2.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Introduce binpacking optimization for similar pods.

The optimization uses the fact that pods which are equivalent do not
need to be checked multiple times against already filled nodes.
This changes the time complexity from O(pods*nodes) to O(pods).

* CA: Fix apis vendoring

* Add g6 EC2 instance type for AWS

* Copyright boilerplate

* Lower errors verbosity for kube-env label missing

* parentController may be nil when owner isn't scalable

* Update ProvisioningClass API Group

* Fix Autoscaling for worker nodes with invalid ProviderID

This change fixes a bug that arises when the user's cluster includes worker nodes not from Hetzner Cloud, such as a Hetzner Dedicated server or any server resource other than Hetzner. It also corrects the behavior when a server has been physically deleted from Hetzner Cloud.

Signed-off-by: Maksim Paskal <[email protected]>

* Add tests for Pods owner that doesn't implement /scale

* Add provreqOrchestrator that handle ProvReq classes (kubernetes#6627)

* Add provreqOrchestrator that handle ProvReq classes

* Review remarks

* Review remarks

* Cluster Autoscaler: Sync k8s.io dependencies to k/k v1.30.0, bump Go to 1.22.2

* [v1.30] fix(hetzner): hostname label is not considered

The Node Group info we currently return does not include the
`kubernetes.io/hostname` label, which is usually set on every node.

This causes issues when the user has an unscheduled pod with a
`topologySpreadConstraint` on `topologyKey: kubernetes.io/hostname`.
cluster-autoscaler is unable to fulfill this constraint and does not
scale up any of the node groups.

Related to kubernetes#6715

* Remove the flag for enabling ProvisioningRequests

The API is not stable yet, we don't want people to depend on the
current version.

* fix: scale up broken for providers not implementing NodeGroup.GetOptions()

Properly handle calls to `NodeGroup.GetOptions()` that return
`cloudprovider.ErrNotImplemented` in the scale up path.

* Add --enable-provisioning-requests flag

* [cluster-autoscaler-release-1.30] Fix ProvisioningRequest update (kubernetes#6825)

* Fix ProvisioningRequest update

* Review remarks

---------

Co-authored-by: Yaroslava Serdiuk <[email protected]>

* Update k/k vendor to 1.30.1 for CA 1.30

* sync changes

* added sync changes file

* golint fix

* update vpa vendor

* fixed volcengine

* ran gofmt

* synched azure

* synched azure

* synched IT

* removed IT log file

* addressed review comments

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Feruzjon Muyassarov <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Dinar Valeev <[email protected]>
Signed-off-by: Ondrej Kokes <[email protected]>
Signed-off-by: Luke Addison <[email protected]>
Signed-off-by: Lennart Jern <[email protected]>
Signed-off-by: Maksim Paskal <[email protected]>
Co-authored-by: Kubernetes Prow Robot <[email protected]>
Co-authored-by: David Benque <[email protected]>
Co-authored-by: michael mccune <[email protected]>
Co-authored-by: shubham82 <[email protected]>
Co-authored-by: Markus Lehtonen <[email protected]>
Co-authored-by: Niklas Rosenstein <[email protected]>
Co-authored-by: Ky-Anh Huynh <[email protected]>
Co-authored-by: Niklas Rosenstein <[email protected]>
Co-authored-by: Guy Templeton <[email protected]>
Co-authored-by: daimaxiaxie <[email protected]>
Co-authored-by: daimaxiaxie <[email protected]>
Co-authored-by: Michal Pitr <[email protected]>
Co-authored-by: Daniel Kłobuszewski <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Artur Żyliński <[email protected]>
Co-authored-by: Alvaro Aleman <[email protected]>
Co-authored-by: Marco Voelz <[email protected]>
Co-authored-by: Jack Francis <[email protected]>
Co-authored-by: Yarin Miran <[email protected]>
Co-authored-by: Will Bowers <[email protected]>
Co-authored-by: Yaroslava Serdiuk <[email protected]>
Co-authored-by: Bartłomiej Wróblewski <[email protected]>
Co-authored-by: Anish Shah <[email protected]>
Co-authored-by: Mahmoud Atwa <[email protected]>
Co-authored-by: pawel siwek <[email protected]>
Co-authored-by: Miranda Craghead <[email protected]>
Co-authored-by: Feruzjon Muyassarov <[email protected]>
Co-authored-by: Kuba Tużnik <[email protected]>
Co-authored-by: Johnnie Ho <[email protected]>
Co-authored-by: Walid Ghallab <[email protected]>
Co-authored-by: Karol Wychowaniec <[email protected]>
Co-authored-by: oksanabaza <[email protected]>
Co-authored-by: Vijay Bhargav Eshappa <[email protected]>
Co-authored-by: David <[email protected]>
Co-authored-by: Damika Gamlath <[email protected]>
Co-authored-by: Ashish Pani <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
Co-authored-by: Dinar Valeev <[email protected]>
Co-authored-by: Donovan Muller <[email protected]>
Co-authored-by: Luiz Antonio <[email protected]>
Co-authored-by: Ondrej Kokes <[email protected]>
Co-authored-by: Yuan <[email protected]>
Co-authored-by: Luke Addison <[email protected]>
Co-authored-by: Mario Valderrama <[email protected]>
Co-authored-by: Max Fedotov <[email protected]>
Co-authored-by: Daniel-Redeploy <[email protected]>
Co-authored-by: Edwinhr716 <[email protected]>
Co-authored-by: Maksym Fuhol <[email protected]>
Co-authored-by: Allen Mun <[email protected]>
Co-authored-by: mewa <[email protected]>
Co-authored-by: Aayush Rangwala <[email protected]>
Co-authored-by: prachigandhi <[email protected]>
Co-authored-by: Daniel Gutowski <[email protected]>
Co-authored-by: Lennart Jern <[email protected]>
Co-authored-by: mendelski <[email protected]>
Co-authored-by: ceuity <[email protected]>
Co-authored-by: Maksim Paskal <[email protected]>
Co-authored-by: Julian Tölle <[email protected]>
Co-authored-by: k8s-infra-cherrypick-robot <[email protected]>
rishabh-11 added a commit to gardener/autoscaler that referenced this issue Oct 23, 2024
* Update with make generate

* Add pdb filtering to remainingPdbTracker

* Convert replicated, system, not-safe-to-evict, and local storage pods to drainability rules

* Convert scale-down pdb check to drainability rule

* Pass DeleteOptions once during default rule creation

* Split out custom controller and common checks into separate drainability rules

* Filter out disabled drainability rules during creation

* Refactor GetPodsForDeletion logic and tests into simulator

* Fix custom controller drainability rule and add test coverage

* Add unit test for long-terminating pod past grace period

* Removed node drainer, kept node termination handler

* Add HasNodeGroupStartedScaleUp to cluster state registry.

- HasNodeGroupStartedScaleUp checks whether a scale up request exists
  without checking any upcoming nodes.

* Add kwiesmueller to OWNERS

jbartosik et al are transitioning off of workload autoscalers (incl vpa
and addon-resizer). kwiesmueller is on the new team and has agreed to
take on reviewer/approver responsibilities.

* Add information about provisioning-class-name annotation.

* Remove redundant if branch

* Add mechanism to override drainability status

* Log drainability override

* fix(cluster-autoscaler-chart): if secretKeyRefNameOverride is true, don't create secret

Signed-off-by: Jonathan Raymond <[email protected]>

* fix: correct version bump

Signed-off-by: Jonathan Raymond <[email protected]>

* Initialize default drainability rules

* feat: each node pool can now have different init configs

* ClusterAPI: Allow overriding the kubernetes.io/arch label set by the scale from zero method via environment variable

The architecture label in the build generic labels method of the cluster API (CAPI) provider is now populated using the GetDefaultScaleFromZeroArchitecture().Name() method.

The method allows CAPI users deploying the cluster-autoscaler to define the default architecture to be used by the cluster-autoscaler for scale up from zero via the env var CAPI_SCALE_ZERO_DEFAULT_ARCH. Amd64 is kept as a fallback for historical reasons.

The introduced changes will not take into account the case of nodes heterogeneous in architecture. The labels generation to infer properties like the cpu architecture from the node groups' features should be considered as a CAPI provider specific implementation.

* Update image builder to use Go 1.21.3

Some of Cluster Autoscaler code is now using features only available in Go 1.21.

* Add node-delete-delay-after-taint to FAQ

* Reports node taints.

* Add debugging-snapshot-enabled back

* Rename comments, logs, structs, and vars from packet to equinix metal

* Rename types

* fix: provider name to be used in builder to provide backward compatibility

Signed-off-by: Ayush Rangwala <[email protected]>

* Rename comments, logs, structs, and vars from packet to equinix metal

* Created a new env var for metal to replace/support packet env vars as usual

* Support backward compatibility for PACKET_MANAGER env var

Signed-off-by: Ayush Rangwala <[email protected]>

* fix: refactor cloud provider names

Signed-off-by: Ayush Rangwala <[email protected]>

* Documents startup/status/ignore node taints.

* Adding price info for c3d
(Price for preemptible instances is calculated as: (Spot price / On-demand price) * instance prices)
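
As a worked illustration of the formula in the note above, a minimal sketch is shown here. The function name and the numbers in `main` are made up for the example and are not real c3d prices.

```go
package main

import "fmt"

// preemptiblePrice applies the ratio from the commit note:
// (spot price / on-demand price) * instance price.
// It returns 0 when the on-demand price is not positive, to avoid
// dividing by zero; that guard is an assumption of this sketch.
func preemptiblePrice(spot, onDemand, instancePrice float64) float64 {
	if onDemand <= 0 {
		return 0
	}
	return spot / onDemand * instancePrice
}

func main() {
	// Illustrative values only.
	fmt.Println(preemptiblePrice(1, 4, 2))
}
```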

* Bump CA golang to 1.21.3

* cloudprovider/exoscale: update limits/quotas URL

https://portal.exoscale.com/account/limits has been deprecated in
favor of https://portal.exoscale.com/organization/quotas. Update
README accordingly.

* Add the AppVersion to cluster-autoscaler.labels as app.kubernetes.io/version

* Bump version in chart.yaml

* add note for CRD and RBAC handling for VPA (>=1.0.0)

* feat(helm): add support for exoscale provider

Signed-off-by: Thomas Stadler <[email protected]>

* Add TOC link in README for EvictionRequirement example

* Fix 'evictionRequirements.resources' to be plural in yaml

* Run 'hack/generate-crd-yamls.sh'

* Adapt AEP to have 'resources' in plural

* Remove deprecated dependency: gogo/protobuf

* Fix klog formatting directives in cluster-autoscaler package.

* Update kubernetes dependencies to 1.29.0-alpha.3.

* Change scheduler framework function names after recent refactor in
kubernetes scheduler.

* chore(helm): bump version of cluster-autoscaler

Signed-off-by: Thomas Stadler <[email protected]>

* chore(helm): docs, update README template

Signed-off-by: Thomas Stadler <[email protected]>

* Fix capacityType label in AWS ManagedNodeGroup

Fixes an issue where the capacityType label inferred from an empty
EKS ManagedNodeGroup does not match the same label on the node after it
is created and joins the cluster

* Cleanup: Remove deprecated github.com/golang/protobuf usage

- Regenerate cloudprovider/externalgrpc proto
- go mod tidy

* Remove maps.Copy usage.

* chore: upgrade vpa go and k8s dependencies

Signed-off-by: Amir Alavi <[email protected]>

* ScaleUp is only ever called when there are unscheduled pods

* Bump golang from 1.21.2 to 1.21.4 in /vertical-pod-autoscaler/builder

Bumps golang from 1.21.2 to 1.21.4.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Disambiguate the resource usage node removal eligibility messages

* Cleanup: Remove separate client for k8s events

Remove RateLimiting options - rely on APF for apiserver protection.
Details: kubernetes/kubernetes#111880

* Update Chart.yaml

* Remove gce-expander-ephemeral-storage-support flag

Always enable the feature

* Add min/max/current asg size to log

* Clarify that log line updates cache, now AWS

* Update README.md: Link to Cluster-API

Add Link to Cluster API.

* azure: add owner-jackfrancis

* Update OWNERS - typo

* Update README.md

* Template the autoDiscovery.clusterName variable in the Helm chart

* fix: Add revisionHistoryLimit override to cluster-autoscaler

Signed-off-by: Matt Dainty <[email protected]>

* allow users to avoid aws instance not found spam

* fix: alicloud: the NodeGroupForNode function is incorrect

* Update README.md

Fix error in text

* fix: handle error when listing machines

Signed-off-by: Cyrill Troxler <[email protected]>

* AWS: cache instance requirements

* fix: update node annotation used to limit log spam with valid key

* Removes unnecessary check

* Allow overriding domain suffix in GCE cloud provider.

* chore(deps): update vendored hcloud-go to 2.4.0

Generated by:

```
UPSTREAM_REF=v2.4.0 hack/update-vendor.sh
```

* Add new pod list processors for clearing TPU requests & filtering out
expendable pods

Treat not-yet-processed pods as unschedulable

* Fix multiple comments and update flags

* Add new test for new behaviour and revert changes made to other tests

* Allow users to specify which schedulers to ignore

* Update flags, Improve tests readability & use Bypass instead of ignore in naming

* Update static_autoscaler tests & handle pod list processors errors as warnings

* Fix: Include restartable init containers in Pod utilization calc

Reuse k/k resourcehelper func

* Implement ProvReq service

* Set Go versions to the same settings kubernetes/kubernetes uses

Looks like specifying the Go patch version in go.mod might've been
a mistake: kubernetes/kubernetes#121808.

* feat: implement kwok cloudprovider

feat: wip implement `CloudProvider` interface boilerplate for `kwok` provider
Signed-off-by: vadasambar <[email protected]>

feat: add builder for `kwok`
- add logic to scale up and scale down nodes in `kwok` provider
Signed-off-by: vadasambar <[email protected]>

feat: wip parse node templates from file
Signed-off-by: vadasambar <[email protected]>

docs: add short README
Signed-off-by: vadasambar <[email protected]>

feat: implement remaining things
- to get the provider in a somewhat working state
Signed-off-by: vadasambar <[email protected]>

docs: add in-cluster `kwok` as pre-requisite in the README
Signed-off-by: vadasambar <[email protected]>

fix: templates file not correctly marshalling into node list
Signed-off-by: vadasambar <[email protected]>

fix: `invalid leading UTF-8 octet` error during template parsing
- remove encoding using `gob`
- not required
Signed-off-by: vadasambar <[email protected]>

fix: use lister to get and list
- instead of uncached kube client
- add lister as a field on the provider and nodegroup struct
Signed-off-by: vadasambar <[email protected]>

fix: `did not find nodegroup annotation` error
- CA was thinking the annotation is not present even though it is
- fix a bug with parsing annotation
Signed-off-by: vadasambar <[email protected]>

fix: CA node recognizing fake nodegroups
- add provider ID to nodes in the format `kwok:<node-name>`
- fix invalid `KwokManagedAnnotation`
- sanitize template nodes (remove `resourceVersion` etc.)
- not sanitizing the node leads to error during creation of new nodes
- abstract code to get NG name into a separate function `getNGNameFromAnnotation`
Signed-off-by: vadasambar <[email protected]>
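
The `kwok:<node-name>` provider ID format mentioned above can be sketched as below. `getProviderID` is named in a later commit in this log, but its body here is an assumption, and `nodeNameFromProviderID` is a hypothetical inverse added only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// providerPrefix matches the `kwok:<node-name>` format from the commit message.
const providerPrefix = "kwok:"

// getProviderID builds a provider ID for a fake node (assumed implementation).
func getProviderID(nodeName string) string {
	return providerPrefix + nodeName
}

// nodeNameFromProviderID is a hypothetical helper that reverses getProviderID.
// The boolean is false when the ID does not belong to this provider.
func nodeNameFromProviderID(id string) (string, bool) {
	if !strings.HasPrefix(id, providerPrefix) {
		return "", false
	}
	return strings.TrimPrefix(id, providerPrefix), true
}

func main() {
	id := getProviderID("kind-worker-xxx")
	name, ok := nodeNameFromProviderID(id)
	fmt.Println(id, name, ok)
}
```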

fix: node not getting deleted
Signed-off-by: vadasambar <[email protected]>

test: add empty test file
Signed-off-by: vadasambar <[email protected]>

chore: add OWNERS file
Signed-off-by: vadasambar <[email protected]>

feat: wip kwok provider config
- add samples for static and dynamic template nodes
Signed-off-by: vadasambar <[email protected]>

feat: wip implement pulling node templates from cluster
- add status field to kwok provider config
- this is to capture how the nodes would be grouped by (can be annotation or label)
- use kwok provider config status to get ng name from the node template
Signed-off-by: vadasambar <[email protected]>

fix: syntax error in calling `loadNodeTemplatesFromCluster`
Signed-off-by: vadasambar <[email protected]>

feat: first draft of dynamic node templates
- this allows node templates to be pulled from the cluster
- instead of having to specify static templates manually
Signed-off-by: vadasambar <[email protected]>

fix: syntax error
Signed-off-by: vadasambar <[email protected]>

refactor: abstract out related code into separate files
- use named constants instead of hardcoded values
Signed-off-by: vadasambar <[email protected]>

feat: cleanup kwok nodes when CA is exiting
- so that the user doesn't have to cleanup the fake nodes themselves
Signed-off-by: vadasambar <[email protected]>

refactor: return `nil` instead of err for `HasInstance`
- because there is no underlying cloud provider (hence no reason to return `cloudprovider.ErrNotImplemented`)
Signed-off-by: vadasambar <[email protected]>

test: start working on tests for kwok provider config
Signed-off-by: vadasambar <[email protected]>

feat: add `gpuLabelKey` under `nodes` field in kwok provider config
- fix validation for kwok provider config
Signed-off-by: vadasambar <[email protected]>

docs: add motivation doc
- update README with more details
Signed-off-by: vadasambar <[email protected]>

feat: update kwok provider config example to support pulling gpu labels and types from existing providers
- still needs to be implemented in the code
Signed-off-by: vadasambar <[email protected]>

feat: wip update kwok provider config to get gpu label and available types
Signed-off-by: vadasambar <[email protected]>

feat: wip read gpu label and available types from specified provider
- add available gpu types in kwok provider config status
Signed-off-by: vadasambar <[email protected]>

feat: add validation for gpu fields in kwok provider config
- load gpu related fields in kwok provider config status
Signed-off-by: vadasambar <[email protected]>

feat: implement `GetAvailableGPUTypes`
Signed-off-by: vadasambar <[email protected]>

feat: add support to install and uninstall kwok
- add option to disable installation
- add option to manually specify kwok release tag
- add future scope in readme
Signed-off-by: vadasambar <[email protected]>

docs: add future scope 'evaluate adding support to check if kwok controller already exists'
Signed-off-by: vadasambar <[email protected]>

fix: vendor conflict and cyclic import
- remove support to get gpu config from the specified provider (can't be used because leads to cyclic import)
Signed-off-by: vadasambar <[email protected]>

docs: add a TODO 'get gpu config from other providers'
Signed-off-by: vadasambar <[email protected]>

refactor: rename `file` -> `configmap`
- load config and templates from configmap instead of file
- move `nodes` and `nodegroups` config to top level
- add helper to encode configmap data into `[]bytes`
- add helper to get current pod namespace
Signed-off-by: vadasambar <[email protected]>

feat: add new options to the kwok provider config
- auto install kwok only if the version is >= v0.4.0
- add test for `GPULabel()`
- use `kubectl apply` way of installing kwok instead of kustomize
- add test for kwok helpers
- add test for kwok config
- inject service account name in CA deployment
- add example configmap for node templates and kwok provider config in CA helm chart
- add permission to create `clusterrolebinding` (so that kwok provider can create a clusterrolebinding with `cluster-admin` role and create/delete upstream manifests)
- update kwok provider sample configs
- update `README`
Signed-off-by: vadasambar <[email protected]>

chore: update go.mod to use v1.28 packages
Signed-off-by: vadasambar <[email protected]>

chore: `go mod tidy` and `go mod vendor` (again)
Signed-off-by: vadasambar <[email protected]>

refactor: kwok installation code
- add functions to create and delete clusterrolebinding to create kwok resources
- refactor kwok install and uninstall fns
- delete manifests in the opposite order of install
- add cleaning up left-over kwok installation to future scope
Signed-off-by: vadasambar <[email protected]>

fix: nil ptr error
- add `TODO` in README for adding docs around kwok config fields
Signed-off-by: vadasambar <[email protected]>

refactor: remove code to automatically install and uninstall `kwok`
- installing/uninstalling requires strong permissions to be granted to `kwok`
- granting strong permissions to `kwok` means granting strong permissions to the entire CA codebase
- this can pose a security risk
- I have removed the code related to install and uninstall for now
- will proceed after discussion with the community
Signed-off-by: vadasambar <[email protected]>

chore: run `go mod tidy` and `go mod vendor`
Signed-off-by: vadasambar <[email protected]>

fix: add permission to create nodes
- to fix permissions error for kwok provider
Signed-off-by: vadasambar <[email protected]>

test: add more unit tests
- add tests for kwok helpers
- fix and update kwok config tests
- fix a bug where gpu label was getting assigned to `kwokConfig.status.key`
- expose `loadConfigFile` -> `LoadConfigFile`
- throw error if templates configmap does not have `templates` key (value of which is node templates)
- finish test for `GPULabel()`
- add tests for `NodeGroupForNode()`
- expose `loadNodeTemplatesFromConfigMap` -> `LoadNodeTemplatesFromConfigMap`
- fix `KwokCloudProvider`'s kwok config was empty (this caused `GPULabel()` to return empty)
Signed-off-by: vadasambar <[email protected]>

refactor: abstract provider ID code into `getProviderID` fn
- fix provider name in test `kwok` -> `kwok:kind-worker-xxx`
Signed-off-by: vadasambar <[email protected]>

chore: run `go mod vendor` and `go mod tidy`
Signed-off-by: vadasambar <[email protected]>

docs(cloudprovider/kwok): update info on creating nodegroups based on `hostname/label`
Signed-off-by: vadasambar <[email protected]>

refactor(charts): replace fromLabelKey value `"kubernetes.io/hostname"` -> `"kwok-nodegroup"`
- `"kubernetes.io/hostname"` leads to infinite scale-up
Signed-off-by: vadasambar <[email protected]>

feat: support running CA with kwok provider locally
Signed-off-by: vadasambar <[email protected]>

refactor: use global informer factory
Signed-off-by: vadasambar <[email protected]>

refactor: use `fromNodeLabelKey: "kwok-nodegroup"` in test templates
Signed-off-by: vadasambar <[email protected]>

refactor: `Cleanup()` logic
- clean up only nodes managed by the kwok provider
Signed-off-by: vadasambar <[email protected]>

fix/refactor: nodegroup creation logic
- fix issue where fake node was getting created which caused fatal error
- use ng annotation to keep track of nodegroups
- (when creating nodegroups) don't process nodes which don't have the right ng label
- suffix ng name with unix timestamp
Signed-off-by: vadasambar <[email protected]>

refactor/test(cloudprovider/kwok): write tests for `BuildKwokProvider` and `Cleanup`
- pass only the required node lister to cloud provider instead of the entire informer factory
- pass the required configmap name to `LoadNodeTemplatesFromConfigMap` instead of passing the entire kwok provider config
- implement fake node lister for testing
Signed-off-by: vadasambar <[email protected]>

test: add test case for dynamic templates in `TestNodeGroupForNode`
- remove non-required fields from template node
Signed-off-by: vadasambar <[email protected]>

test: add tests for `NodeGroups()`
- add extra node template without ng selector label to add more variability in the test
Signed-off-by: vadasambar <[email protected]>

test: write tests for `GetNodeGpuConfig()`
Signed-off-by: vadasambar <[email protected]>

test: add test for `GetAvailableGPUTypes`
Signed-off-by: vadasambar <[email protected]>

test: add test for `GetResourceLimiter()`
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): add tests for nodegroup's `IncreaseSize()`
- abstract error msgs into variables to use them in tests
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): add test for ng `DeleteNodes()` fn
- add check for deleting too many nodes
- rename err msg var names to make them consistent
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): add tests for ng `DecreaseTargetSize()`
- abstract error msgs into variables (for easy use in tests)
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): add test for ng `Nodes()`
- add extra test case for `DecreaseTargetSize()` to check lister error
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): add test for ng `TemplateNodeInfo`
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): improve tests for `BuildKwokProvider()`
- add more test cases
- refactor lister for `TestBuildKwokProvider()` and `TestCleanUp()`
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): add test for ng `GetOptions`
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): unset `KWOK_CONFIG_MAP_NAME` at the end of the test
- not doing so leads to failure in other tests
- remove `kwokRelease` field from kwok config (not used anymore) - this was causing the tests to fail
Signed-off-by: vadasambar <[email protected]>

chore: bump CA chart version
- this is because of changes made related to kwok
- fix typo `everwhere` -> `everywhere`
Signed-off-by: vadasambar <[email protected]>

chore: fix linting checks
Signed-off-by: vadasambar <[email protected]>

chore: address CI lint errors
Signed-off-by: vadasambar <[email protected]>

chore: generate helm docs for `kwokConfigMapName`
- remove `KWOK_CONFIG_MAP_KEY` (not being used in the code)
- bump helm chart version
Signed-off-by: vadasambar <[email protected]>

docs: revise the outline for README
- add AEP link to the motivation doc
Signed-off-by: vadasambar <[email protected]>

docs: wip create an outline for the README
- remove `kwok` field from examples (not needed right now)
Signed-off-by: vadasambar <[email protected]>

docs: add outline for ascii gifs
Signed-off-by: vadasambar <[email protected]>

refactor: rename env variable `KWOK_CONFIG_MAP_NAME` -> `KWOK_PROVIDER_CONFIGMAP`
Signed-off-by: vadasambar <[email protected]>

docs: update README with info around installation and benefits of using kwok provider
- add `Kwok` as a provider in main CA README
Signed-off-by: vadasambar <[email protected]>

chore: run `go mod vendor`
- remove TODOs that are not needed anymore
Signed-off-by: vadasambar <[email protected]>

docs: finish first draft of README
Signed-off-by: vadasambar <[email protected]>

fix: env variable in chart `KWOK_CONFIG_MAP_NAME` -> `KWOK_PROVIDER_CONFIGMAP`
Signed-off-by: vadasambar <[email protected]>

refactor: remove redundant/deprecated code
Signed-off-by: vadasambar <[email protected]>

chore: bump chart version `9.30.1` -> `9.30.2`
- because of kwok provider related changes
Signed-off-by: vadasambar <[email protected]>

chore: fix typo `offical` -> `official`
Signed-off-by: vadasambar <[email protected]>

chore: remove debug log msg
Signed-off-by: vadasambar <[email protected]>

docs: add links for getting help
Signed-off-by: vadasambar <[email protected]>

refactor: fix typo in log `external cluster` -> `cluster`
Signed-off-by: vadasambar <[email protected]>

chore: add newline in chart.yaml to fix CI lint
Signed-off-by: vadasambar <[email protected]>

docs: fix mistake `sig-kwok` -> `sig-scheduling`
- kwok is a part of sig-scheduling (there is no sig-kwok)
Signed-off-by: vadasambar <[email protected]>

docs: fix typo `release"` -> `"release"`
Signed-off-by: vadasambar <[email protected]>

refactor: pass informer instead of lister to cloud provider builder fn
Signed-off-by: vadasambar <[email protected]>

* add unit test for function getScalingInstancesByGroup

* Azure: Remove AKS vmType

Signed-off-by: Jack Francis <[email protected]>

* Implement TemplateNodeInfo for civo cloudprovider

Signed-off-by: Vishal Anarse <[email protected]>

* Add comment for type and function

Signed-off-by: Vishal Anarse <[email protected]>

* refactor(*): move getKubeClient to utils/kubernetes

(cherry picked from commit b9f636d)

Signed-off-by: qianlei.qianl <[email protected]>

refactor: move logic to create client to utils/kubernetes pkg
- expose `CreateKubeClient` as public function
- make `GetKubeConfig` into a private `getKubeConfig` function (can be exposed as a public function in the future if needed)
Signed-off-by: vadasambar <[email protected]>

fix: CI failing because cloudproviders were not updated to use new autoscaling option fields
Signed-off-by: vadasambar <[email protected]>

refactor: define errors as constants
Signed-off-by: vadasambar <[email protected]>

refactor: pass kube client options by value
Signed-off-by: vadasambar <[email protected]>

* Calculate real value for template using node group

Signed-off-by: Vishal Anarse <[email protected]>

* Fix lint error

* Fix tests

Signed-off-by: Vishal Anarse <[email protected]>

* Update aws-sdk-go to 1.48.7 via tarball
Remove *_test.go, models/, examples

* + Added SDK version in the log
+ Update version in README + command

* Switch to multistage build Dockerfiles for VPA

* Adding 33 instances types

* helm chart - update cluster-autoscaler to 1.28

* Bump builder images to go 1.21.5

* feat: add metrics to show target size of every node group

* deprecate unused node-autoprovisioning-enabled and max-autoprovisioned-node-group-count flags

Signed-off-by: Prashant Rewar <[email protected]>

* fix(hetzner): insufficient nodes when boot fails

The Hetzner Cloud API returns "Actions" for anything asynchronous that
happens inside the backend. When creating a new server multiple actions
are returned: `create_server`, `start_server`, `attach_to_network` (if set).

Our current code waits for the `create_server` and if it fails, it makes
sure to delete the server so cluster-autoscaler can create a new one
immediately to provide the required capacity. If one of the "follow up"
actions fails though, we do not handle this. This causes issues when the
server for whatever reason did not start properly on the first try, as
then the customer has a shutdown server, is paying for it, but does not
receive the additional capacity for their Kubernetes cluster.

This commit fixes the bug, by awaiting all actions returned by the
create server API call, and deleting the server if any of them fail.
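
The behaviour described above (wait for every action, delete the server if any of them fails) might look roughly like the following sketch. The `action` type, `awaitAll`, and `provision` are hypothetical stand-ins and do not reflect the hcloud-go API.

```go
package main

import (
	"errors"
	"fmt"
)

// errBoot is an illustrative failure used by the example in main.
var errBoot = errors.New("boot failed")

// action is a hypothetical stand-in for an asynchronous backend action such
// as create_server, start_server, or attach_to_network. A non-nil err means
// the action finished unsuccessfully.
type action struct {
	name string
	err  error
}

// awaitAll returns the first failure among the actions, or nil if all succeeded.
func awaitAll(actions []action) error {
	for _, a := range actions {
		if a.err != nil {
			return fmt.Errorf("action %s failed: %w", a.name, a.err)
		}
	}
	return nil
}

// provision checks every action returned by the create call and invokes
// deleteServer if any of them failed, so the capacity can be requested again.
func provision(actions []action, deleteServer func()) error {
	if err := awaitAll(actions); err != nil {
		deleteServer()
		return err
	}
	return nil
}

func main() {
	actions := []action{
		{name: "create_server"},
		{name: "start_server", err: errBoot},
	}
	deleted := false
	err := provision(actions, func() { deleted = true })
	fmt.Println(err, deleted)
}
```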

* Add VSCode workspace files to .gitignore

* Remove vpa/builder and switch dependabot updates to component Dockerfiles

* fix: updated readme for hetzner cloud provider

* Add error details to autoscaling backoff.

Change-Id: I3b5c62ba13c2e048ce2d7170016af07182c11eee

* Make backoff.Status.ErrorInfo non-pointer.

Change-Id: I1f812d4d6f42db97670ef7304fc0e895c837a13b

* allow specifying grpc timeout rather than hardcoded 5 seconds

Signed-off-by: lizhen <[email protected]>

* [GCE] Support paginated instance listing

* azure: fix chart bugs after AKS vmType deprecation

Signed-off-by: Jack Francis <[email protected]>

* Update VPA release README to reference 1.X VPA versions.

* implement priority based evictor and refactor drain logic

* Update dependencies to kubernetes 1.29.0

* [civo] Add Gpu count to node template

Signed-off-by: Vishal Anarase <[email protected]>
(cherry picked from commit 8703ff9)

* Restore flags for setting QPS limit in CA

Partially undo kubernetes#6274. I noticed that with this change CA gets rate limited and
slows down significantly (especially during large scale downs).

* Pass Burst and QPS client params to capi k8s clients

* Dependency update for CA 1.29.1

* feat: support `--scale-down-delay-after-*` per nodegroup
Signed-off-by: vadasambar <[email protected]>

feat: update scale down status after every scale up
- move scaledown delay status to cluster state/registry
- enable scale down if `ScaleDownDelayTypeLocal` is enabled
- add new funcs on cluster state to get and update scale down delay status
- use timestamp instead of booleans to track scale down delay status
Signed-off-by: vadasambar <[email protected]>

refactor: use existing fields on clusterstate
- uses `scaleUpRequests`, `scaleDownRequests` and `scaleUpFailures` instead of `ScaleUpDelayStatus`
- changed the above existing fields a little to make them more convenient for use
- moved initializing scale down delay processor to static autoscaler (because clusterstate is not available in main.go)
Signed-off-by: vadasambar <[email protected]>

refactor: remove note saying only `scale-down-after-add` is supported
- because we are supporting all the flags
Signed-off-by: vadasambar <[email protected]>

fix: evaluate `scaleDownInCooldown` the old way only if `ScaleDownDelayTypeLocal` is set to `false`
Signed-off-by: vadasambar <[email protected]>

refactor: remove line saying `--scale-down-delay-type-local` is only supported for `--scale-down-delay-after-add`
- because it is not true anymore
- we are supporting all `--scale-down-delay-after-*` flags per nodegroup
Signed-off-by: vadasambar <[email protected]>

test: fix clusterstate tests failing
Signed-off-by: vadasambar <[email protected]>

refactor: move processor initialization logic back from static autoscaler to main
- we don't want to initialize processors in static autoscaler because anyone implementing an alternative to static_autoscaler has to initialize the processors
- and initializing specific processors is making static autoscaler aware of an implementation detail which might not be the best practice
Signed-off-by: vadasambar <[email protected]>

refactor: revert changes related to `clusterstate`
- since I am going with observer pattern
Signed-off-by: vadasambar <[email protected]>

feat: add observer interface for state of scaling
- to implement observer pattern for tracking state of scale up/downs (as opposed to using clusterstate to do the same)
- refactor `ScaleDownCandidatesDelayProcessor` to use fields from the new observer
Signed-off-by: vadasambar <[email protected]>

refactor: remove params passed to `clearScaleUpFailures`
- not needed anymore
Signed-off-by: vadasambar <[email protected]>

refactor: revert clusterstate tests
- approach has changed
- I am not making any changes in clusterstate now
Signed-off-by: vadasambar <[email protected]>

refactor: add accidentally deleted lines for clusterstate test
Signed-off-by: vadasambar <[email protected]>

feat: implement `Add` fn for scale state observer
- to easily add new observers
- re-word comments
- remove redundant params from `NewDefaultScaleDownCandidatesProcessor`
Signed-off-by: vadasambar <[email protected]>

fix: CI complaining because no comments on fn definitions
Signed-off-by: vadasambar <[email protected]>

feat: initialize parent `ScaleDownCandidatesProcessor`
- instead of `ScaleDownCandidatesSortingProcessor` and `ScaleDownCandidatesDelayProcessor` separately
Signed-off-by: vadasambar <[email protected]>

refactor: add scale state notifier to list of default processors
- initialize processors for `NewDefaultScaleDownCandidatesProcessor` outside and pass them to the fn
- this allows more flexibility
Signed-off-by: vadasambar <[email protected]>

refactor: add observer interface
- create a separate observer directory
- implement `RegisterScaleUp` function in the clusterstate
- TODO: resolve syntax errors
Signed-off-by: vadasambar <[email protected]>

feat: use `scaleStateNotifier` in place of `clusterstate`
- delete leftover `scale_stateA_observer.go` (new one is already present in `observers` directory)
- register `clusterstate` with `scaleStateNotifier`
- use `Register` instead of `Add` function in `scaleStateNotifier`
- fix `go build`
- wip: fixing tests
Signed-off-by: vadasambar <[email protected]>

test: fix syntax errors
- add utils package `pointers` for converting `time` to pointer (without having to initialize a new variable)
Signed-off-by: vadasambar <[email protected]>

feat: wip track scale down failures along with scale up failures
- I was tracking scale up failures but not scale down failures
- fix copyright year 2017 -> 2023 for the new `pointers` package
Signed-off-by: vadasambar <[email protected]>

feat: register failed scale down with scale state notifier
- wip writing tests for `scale_down_candidates_delay_processor`
- fix CI lint errors
- remove test file for `scale_down_candidates_processor` (there is not much to test as of now)
Signed-off-by: vadasambar <[email protected]>

test: wip tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <[email protected]>

test: add unit tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <[email protected]>

refactor: don't track scale up failures in `ScaleDownCandidatesDelayProcessor`
- not needed
Signed-off-by: vadasambar <[email protected]>

test: better doc comments for `TestGetScaleDownCandidates`
Signed-off-by: vadasambar <[email protected]>

refactor: don't ignore error in `NGChangeObserver`
- return it instead and let the caller decide what to do with it
Signed-off-by: vadasambar <[email protected]>

refactor: change pointers to values in `NGChangeObserver` interface
- easier to work with
- remove `expectedAddTime` param from `RegisterScaleUp` (not needed for now)
- add tests for clusterstate's `RegisterScaleUp`
Signed-off-by: vadasambar <[email protected]>

refactor: conditions in `GetScaleDownCandidates`
- set scale down in cool down if the number of scale down candidates is 0
Signed-off-by: vadasambar <[email protected]>

test: use `ng1` instead of `ng2` in existing test
Signed-off-by: vadasambar <[email protected]>

feat: wip static autoscaler tests
Signed-off-by: vadasambar <[email protected]>

refactor: assign directly instead of using `sdProcessor` variable
- variable is not needed
Signed-off-by: vadasambar <[email protected]>

test: first working test for static autoscaler
Signed-off-by: vadasambar <[email protected]>

test: continue working on static autoscaler tests
Signed-off-by: vadasambar <[email protected]>

test: wip second static autoscaler test
Signed-off-by: vadasambar <[email protected]>

refactor: remove `Println` used for debugging
Signed-off-by: vadasambar <[email protected]>

test: add static_autoscaler tests for scale down delay per nodegroup flags
Signed-off-by: vadasambar <[email protected]>

chore: rebase off the latest `master`
- change scale state observer interface's `RegisterFailedScaleup` to reflect latest changes around clusterstate's `RegisterFailedScaleup` in `master`
Signed-off-by: vadasambar <[email protected]>

test: fix clusterstate test failing
Signed-off-by: vadasambar <[email protected]>

test: fix failing orchestrator test
Signed-off-by: vadasambar <[email protected]>

refactor: rename `defaultScaleDownCandidatesProcessor` -> `combinedScaleDownCandidatesProcessor`
- describes the processor better
Signed-off-by: vadasambar <[email protected]>

refactor: replace `NGChangeObserver` -> `NodeGroupChangeObserver`
- makes it easier to understand for someone not familiar with the codebase
Signed-off-by: vadasambar <[email protected]>

docs: reword code comment `after` -> `for which`
Signed-off-by: vadasambar <[email protected]>

refactor: don't return error from `RegisterScaleDown`
- not needed as of now (no implementer function returns a non-nil error for this function)
Signed-off-by: vadasambar <[email protected]>

refactor: address review comments around ng change observer interface
- change dir structure of nodegroup change observer package
- stop returning errors wherever it is not needed in the nodegroup change observer interface
- rename `NGChangeObserver` -> `NodeGroupChangeObserver` interface (makes it easier to understand)
Signed-off-by: vadasambar <[email protected]>

refactor: make nodegroupchange observer thread-safe
Signed-off-by: vadasambar <[email protected]>

docs: add TODO to consider using multiple mutexes in nodegroupchange observer
Signed-off-by: vadasambar <[email protected]>

refactor: use `time.Now()` directly instead of assigning a variable to it
Signed-off-by: vadasambar <[email protected]>

refactor: share code for checking if there was a recent scale-up/down/failure
Signed-off-by: vadasambar <[email protected]>

test: convert `ScaleDownCandidatesDelayProcessor` into table tests
Signed-off-by: vadasambar <[email protected]>

refactor: change scale state notifier's `Register()` -> `RegisterForNotifications()`
- makes it easier to understand what the function does
Signed-off-by: vadasambar <[email protected]>
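
The observer approach described in the commits above, combined with timestamp-based per-nodegroup cooldown tracking, could be sketched as follows. The names `NodeGroupChangeObserver`, `RegisterScaleUp`, `RegisterScaleDown`, and `RegisterForNotifications` come from the commit messages, but the signatures are simplified assumptions, and `delayTracker`, `inCooldown`, and `demo` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// NodeGroupChangeObserver is a simplified, assumed form of the interface.
type NodeGroupChangeObserver interface {
	RegisterScaleUp(nodeGroup string, at time.Time)
	RegisterScaleDown(nodeGroup string, at time.Time)
}

// scaleStateNotifier fans scale events out to all registered observers.
// A single mutex keeps registration and notification thread-safe.
type scaleStateNotifier struct {
	mu        sync.Mutex
	observers []NodeGroupChangeObserver
}

func (n *scaleStateNotifier) RegisterForNotifications(o NodeGroupChangeObserver) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.observers = append(n.observers, o)
}

func (n *scaleStateNotifier) RegisterScaleUp(nodeGroup string, at time.Time) {
	n.mu.Lock()
	defer n.mu.Unlock()
	for _, o := range n.observers {
		o.RegisterScaleUp(nodeGroup, at)
	}
}

func (n *scaleStateNotifier) RegisterScaleDown(nodeGroup string, at time.Time) {
	n.mu.Lock()
	defer n.mu.Unlock()
	for _, o := range n.observers {
		o.RegisterScaleDown(nodeGroup, at)
	}
}

// delayTracker is a hypothetical observer that stores the last scale-up
// time per node group as a timestamp rather than a boolean.
type delayTracker struct {
	mu          sync.Mutex
	lastScaleUp map[string]time.Time
}

func newDelayTracker() *delayTracker {
	return &delayTracker{lastScaleUp: map[string]time.Time{}}
}

func (d *delayTracker) RegisterScaleUp(nodeGroup string, at time.Time) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.lastScaleUp[nodeGroup] = at
}

// RegisterScaleDown is a no-op here; this sketch only tracks scale-ups.
func (d *delayTracker) RegisterScaleDown(string, time.Time) {}

// inCooldown reports whether the node group scaled up less than delay ago.
func (d *delayTracker) inCooldown(nodeGroup string, now time.Time, delay time.Duration) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	last, ok := d.lastScaleUp[nodeGroup]
	return ok && now.Sub(last) < delay
}

// demo scales up ng1 only and checks both groups 10s later with a 30s delay.
func demo() (ng1, ng2 bool) {
	notifier := &scaleStateNotifier{}
	tracker := newDelayTracker()
	notifier.RegisterForNotifications(tracker)
	start := time.Unix(0, 0)
	notifier.RegisterScaleUp("ng1", start)
	now := start.Add(10 * time.Second)
	return tracker.inCooldown("ng1", now, 30*time.Second), tracker.inCooldown("ng2", now, 30*time.Second)
}

func main() {
	fmt.Println(demo())
}
```

Because the cooldown is keyed by node group, a scale-up in `ng1` does not block scale-down in `ng2`, which is the point of the per-nodegroup `--scale-down-delay-after-*` behaviour.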

test: replace scale state notifier `Register` -> `RegisterForNotifications` in test
- to fix syntax errors since it is already renamed in the actual code
Signed-off-by: vadasambar <[email protected]>

refactor: remove `clusterStateRegistry` from `delete_in_batch` tests
- not needed anymore since we have `scaleStateNotifier`
Signed-off-by: vadasambar <[email protected]>

refactor: address PR review comments
Signed-off-by: vadasambar <[email protected]>

fix: add empty `RegisterFailedScaleDown` for clusterstate
- fix syntax error in static autoscaler test
Signed-off-by: vadasambar <[email protected]>
(cherry picked from commit 5de49a1)

* Backport kubernetes#6522 [CA] Bump go version into CA1.29

* Backport kubernetes#6491 and kubernetes#6494 [CA] Add informer argument to the CloudProviders builder into CA1.29

* Merge pull request kubernetes#6617 from ionos-cloud/update-ionos-sdk

ionoscloud: Update ionos-cloud sdk-go and add metrics

* CA - Update k/k vendor to 1.29.3

* [v1.29][Hetzner] Fix missing ephemeral storage definition

This fixes pods with ephemeral storage requests being rejected for insufficient ephemeral storage on the Hetzner provider.

Backport of kubernetes#6574 to `v1.29` branch.

* Use cache to track vms pools

* Fix

* Add UTs

* Fix boilerplate header

* Add const

* Rename vmsPoolSet

* [v1.29][Hetzner] Fix Autoscaling for worker nodes with invalid ProviderID

This change fixes a bug that arises when the user's cluster includes
worker nodes that are not Hetzner Cloud servers, such as a Hetzner
Dedicated server or a server from another provider. It also corrects
the behavior when a server has been deleted from Hetzner Cloud.

Signed-off-by: Maksim Paskal <[email protected]>
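
As a rough illustration of the ProviderID problem described above, the sketch below accepts only IDs that look like Hetzner Cloud servers and reports everything else as not belonging to the provider. The `hcloud://` scheme, the `hrobot://` example value, and the `parseServerID` helper are assumptions made for this example, not the provider's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// providerIDPrefix is the scheme assumed here for Hetzner Cloud servers.
const providerIDPrefix = "hcloud://"

// parseServerID is a hypothetical helper. It returns the numeric server ID
// and true when providerID looks like a Hetzner Cloud server, and false for
// anything else (empty IDs, other schemes, non-numeric IDs).
func parseServerID(providerID string) (int64, bool) {
	if !strings.HasPrefix(providerID, providerIDPrefix) {
		return 0, false
	}
	id, err := strconv.ParseInt(strings.TrimPrefix(providerID, providerIDPrefix), 10, 64)
	if err != nil || id <= 0 {
		return 0, false
	}
	return id, true
}

func main() {
	// Example values only: one cloud server, one foreign scheme, one empty ID.
	for _, pid := range []string{"hcloud://12345", "hrobot://678", ""} {
		id, ok := parseServerID(pid)
		fmt.Printf("%q -> id=%d belongs=%t\n", pid, id, ok)
	}
}
```

Returning a boolean instead of an error lets the caller skip foreign nodes without aborting the whole autoscaling loop.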

* [v1.29] fix(hetzner): hostname label is not considered

The Node Group info we currently return does not include the
`kubernetes.io/hostname` label, which is usually set on every node.

This causes issues when the user has an unscheduled pod with a
`topologySpreadConstraint` on `topologyKey: kubernetes.io/hostname`.
cluster-autoscaler is unable to fulfill this constraint and does not
scale up any of the node groups.

Related to kubernetes#6715
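
A minimal sketch of the idea behind this fix: when building the label set for a node group's template node, include `kubernetes.io/hostname` alongside the user-configured labels so that topology constraints keyed on it can be evaluated. The `buildTemplateLabels` helper and the node name used here are hypothetical and are not taken from the Hetzner provider's code.

```go
package main

import "fmt"

// hostnameLabel is the well-known label that is normally present on every node.
const hostnameLabel = "kubernetes.io/hostname"

// buildTemplateLabels is a hypothetical helper. It copies the user-configured
// labels and adds the hostname label, leaving the caller's map untouched.
func buildTemplateLabels(nodeName string, userLabels map[string]string) map[string]string {
	labels := make(map[string]string, len(userLabels)+1)
	for k, v := range userLabels {
		labels[k] = v
	}
	labels[hostnameLabel] = nodeName
	return labels
}

func main() {
	labels := buildTemplateLabels("pool1-template", map[string]string{
		"custom/role": "test",
	})
	fmt.Println(labels[hostnameLabel], labels["custom/role"])
}
```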

* Remove shadow err variable in deleteCreatedNodesWithErrors func

* fix: scale up broken for providers not implementing NodeGroup.GetOptions()

Properly handle calls to `NodeGroup.GetOptions()` that return
`cloudprovider.ErrNotImplemented` in the scale up path.
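
The handling described above can be sketched roughly as follows. This is a simplified, self-contained illustration: `errNotImplemented`, `options`, `nodeGroup`, and `resolveOptions` are stand-ins invented for the example, not the cluster-autoscaler types, and the real scale-up path may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotImplemented stands in for cloudprovider.ErrNotImplemented.
var errNotImplemented = errors.New("not implemented")

// options is a hypothetical, reduced set of per-node-group options.
type options struct {
	MaxNodeProvisionSeconds int
}

// nodeGroup is a hypothetical, reduced version of the NodeGroup interface.
type nodeGroup interface {
	GetOptions(defaults options) (*options, error)
}

// resolveOptions falls back to the defaults when the provider does not
// implement GetOptions, and propagates any other error unchanged.
func resolveOptions(ng nodeGroup, defaults options) (options, error) {
	opts, err := ng.GetOptions(defaults)
	if errors.Is(err, errNotImplemented) {
		return defaults, nil
	}
	if err != nil {
		return options{}, err
	}
	if opts == nil {
		return defaults, nil
	}
	return *opts, nil
}

// legacyGroup models a provider that has not implemented GetOptions.
type legacyGroup struct{}

func (legacyGroup) GetOptions(options) (*options, error) {
	return nil, errNotImplemented
}

func main() {
	opts, err := resolveOptions(legacyGroup{}, options{MaxNodeProvisionSeconds: 900})
	fmt.Println(opts.MaxNodeProvisionSeconds, err)
}
```

Treating "not implemented" as "use the defaults" keeps scale-up working for providers that never customized these options, while real failures still surface to the caller.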

* Update k/k vendor to 1.29.5 for CA 1.29

* Rebase

* Fix gomock

* Rename ARM_BASE_URL

* Backport kubernetes#6528 [CA] Fix expectedToRegister to respect instances with nil status into CA1.29

* Backport kubernetes#6750 [CA] fix(hetzner): missing error return in scale up/down into CA1.29

* PR#6911 Backport for 1.29: Fix/aws asg unsafe decommission kubernetes#5829

* CA - 1.29.4 Pre-release AWS Instance Types Update

* Update vendor to use k8s 1.29.6

* adjust logs

---------

Signed-off-by: Jonathan Raymond <[email protected]>
Signed-off-by: Ayush Rangwala <[email protected]>
Signed-off-by: Thomas Stadler <[email protected]>
Signed-off-by: Amir Alavi <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Matt Dainty <[email protected]>
Signed-off-by: Cyrill Troxler <[email protected]>
Signed-off-by: vadasambar <[email protected]>
Signed-off-by: Jack Francis <[email protected]>
Signed-off-by: Vishal Anarse <[email protected]>
Signed-off-by: Prashant Rewar <[email protected]>
Signed-off-by: lizhen <[email protected]>
Signed-off-by: Maksim Paskal <[email protected]>
Co-authored-by: Mathieu Bruneau <[email protected]>
Co-authored-by: Artem Minyaylov <[email protected]>
Co-authored-by: Kubernetes Prow Robot <[email protected]>
Co-authored-by: Dumlu Timuralp <[email protected]>
Co-authored-by: Hakan Bostan <[email protected]>
Co-authored-by: Rich Gowman <[email protected]>
Co-authored-by: Daniel Gutowski <[email protected]>
Co-authored-by: mikutas <[email protected]>
Co-authored-by: Jonathan Raymond <[email protected]>
Co-authored-by: Johnnie Ho <[email protected]>
Co-authored-by: aleskandro <[email protected]>
Co-authored-by: Kuba Tużnik <[email protected]>
Co-authored-by: lisenet <[email protected]>
Co-authored-by: Piotr Wrótniak <[email protected]>
Co-authored-by: Ayush Rangwala <[email protected]>
Co-authored-by: Dixita Narang <[email protected]>
Co-authored-by: Artur Żyliński <[email protected]>
Co-authored-by: Alexandros Afentoulis <[email protected]>
Co-authored-by: jw-maynard <[email protected]>
Co-authored-by: xiaoqing <[email protected]>
Co-authored-by: Thomas Stadler <[email protected]>
Co-authored-by: Marco Voelz <[email protected]>
Co-authored-by: Aleksandra Gacek <[email protected]>
Co-authored-by: Luis Ramirez <[email protected]>
Co-authored-by: piotrwrotniak <[email protected]>
Co-authored-by: Amir Alavi <[email protected]>
Co-authored-by: Michael Grosser <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: shapirus <[email protected]>
Co-authored-by: Guy Templeton <[email protected]>
Co-authored-by: Mads Hartmann <[email protected]>
Co-authored-by: Thomas Güttler <[email protected]>
Co-authored-by: Prachi Gandhi <[email protected]>
Co-authored-by: Mike Tougeron <[email protected]>
Co-authored-by: Matt Dainty <[email protected]>
Co-authored-by: Guo Peng <[email protected]>
Co-authored-by: Alex Serbul <[email protected]>
Co-authored-by: Cyrill Troxler <[email protected]>
Co-authored-by: alexanderConstantinescu <[email protected]>
Co-authored-by: Brydon Cheyney <[email protected]>
Co-authored-by: Julian Tölle <[email protected]>
Co-authored-by: Mahmoud Atwa <[email protected]>
Co-authored-by: Yaroslava Serdiuk <[email protected]>
Co-authored-by: vadasambar <[email protected]>
Co-authored-by: Jack Francis <[email protected]>
Co-authored-by: Vishal Anarse <[email protected]>
Co-authored-by: qianlei.qianl <[email protected]>
Co-authored-by: Andrea Scarpino <[email protected]>
Co-authored-by: Prashant Rewar <[email protected]>
Co-authored-by: Jont828 <[email protected]>
Co-authored-by: Pascal <[email protected]>
Co-authored-by: Walid Ghallab <[email protected]>
Co-authored-by: lizhen <[email protected]>
Co-authored-by: Daniel Kłobuszewski <[email protected]>
Co-authored-by: Luiz Antonio <[email protected]>
Co-authored-by: damikag <[email protected]>
Co-authored-by: Maciek Pytel <[email protected]>
Co-authored-by: Joachim Bartosik <[email protected]>
Co-authored-by: Kyle Weaver <[email protected]>
Co-authored-by: shubham82 <[email protected]>
Co-authored-by: Kubernetes Prow Robot <[email protected]>
Co-authored-by: wenxuanW <[email protected]>
Co-authored-by: Maksim Paskal <[email protected]>
Co-authored-by: Bartłomiej Wróblewski <[email protected]>
Co-authored-by: Krishna Sarabu <[email protected]>