
Cannot deploy cluster-autoscaler with Rancher RKE2 #5140

Closed
bennysp opened this issue Aug 29, 2022 · 14 comments · Fixed by #5361
Labels
area/cluster-autoscaler · kind/bug

Comments


bennysp commented Aug 29, 2022

Which component are you using?:
cluster-autoscaler / cluster-autoscaler-chart

What version of the component are you using?:

Component version:
cluster-autoscaler 1.23.1 / cluster-autoscaler-chart-9.20.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.8", GitCommit:"4a3b558c52eb6995b3c5c1db5e54111bd0645a64", GitTreeState:"clean", BuildDate:"2021-12-15T14:52:11Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6+rke2r2", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-28T19:13:01Z", GoVersion:"go1.17.9b7", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.23) exceeds the supported minor version skew of +/-1

What environment is this in?:
Dev

What did you expect to happen?:
Cluster Autoscaler to deploy via the Helm chart to my Rancher RKE2 cluster after the changes from PR #4975.

What happened instead?:
This error appears in the cluster-autoscaler pod logs:

F0829 00:36:25.149073 1 main.go:430] Failed to get nodes from apiserver: nodes is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "nodes" in API group "" at the cluster scope

How to reproduce it (as minimally and precisely as possible):

  1. Create a Rancher API key (in my case, with no restrictions)
  2. Create an Opaque Secret with your Rancher cloud-config:
apiVersion: v1
kind: Secret
metadata:
  name: cluster-autoscaler-cloud-config
  namespace: kube-system
type: Opaque
stringData:
  cloud-config: |
    # rancher server credentials
    url: https://rancher.domain.com
    token: [Redacted: token-*:*]
    # name and namespace of the clusters.provisioning.cattle.io resource on the
    # rancher server
    clusterName: my-cluster
    clusterNamespace: fleet-default
    # optional, will be auto-discovered if not specified
    #clusterAPIVersion: v1alpha4
  3. Use the Helm values below:
autoDiscovery:
  clusterName: my-cluster
  labels: []
  roles:
    - worker
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/{{ .Values.autoDiscovery.clusterName }}
cloudProvider: rancher
extraVolumeSecrets:
  cluster-autoscaler-cloud-config:
    mountPath: /config
    name: cluster-autoscaler-cloud-config
extraArgs:
  logtostderr: true
  stderrthreshold: info
  v: 4
  cloud-config: /config/cloud-config
  cluster-name: my-cluster
image:
  pullPolicy: IfNotPresent
  pullSecrets: []
  repository: k8s.gcr.io/autoscaling/cluster-autoscaler
  tag: v1.23.1

Anything else we need to know?:
On Rancher 2.6.7 with RKE2 1.23.x.
I tried deploying to both downstream and management clusters (both on RKE2 1.23.x).

I am wondering if something is going wrong when reading my deployed cloud-config?

bennysp added the kind/bug label on Aug 29, 2022

bennysp commented Aug 30, 2022

@ctrox Do you have any ideas from the above?


ctrox commented Aug 30, 2022

From the error in your logs, the autoscaler does not have permission to list nodes, so I'm assuming it is missing some RBAC permissions (on the downstream cluster). If the autoscaler is running on the downstream cluster, you need to make sure the service account you set in the deployment has these permissions, but that should happen automatically with the Helm chart default values.
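
For reference, a minimal RBAC sketch (my own illustration, not the chart's actual manifests) that would grant the nodes permission named in the error to the kube-system:default service account; the Helm chart normally creates a dedicated service account with a broader ClusterRole:

# Illustrative only: the Helm chart's own RBAC is broader than this.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler-nodes-read
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler-nodes-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler-nodes-read
subjects:
  - kind: ServiceAccount
    name: default        # the service account named in the error message
    namespace: kube-system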

Also please note that cluster-autoscaler 1.23.1, which you linked, does not contain the rancher provider; I'm assuming it will be included in the next minor release (1.25.0).


bennysp commented Aug 31, 2022

Thanks @ctrox. I was wondering about that 1.23.1 version. I will double-check the service account.

@jameswu2

Hi @ctrox, I'm deploying the cluster-autoscaler chart similarly to what's described in this issue, although my image.tag is v1.25.0 because the v1.23.1 image does not support the rancher cloud provider. From the autoscaler pod logs, I am seeing the following errors:
pre_filtering_processor.go:57] Node node_name should not be processed by cluster autoscaler (no node group config)
and
clusterstate.go:376] Failed to find readiness information for worker
clusterstate.go:438] Failed to find readiness information for worker

The API calls seem to be successful, since the autoscaler is able to read the node names in the cluster and discover the node group, based on the log message:
rancher_provider.go:228] scalable node group found: worker (2:6)

I'm on Kubernetes version v1.24.4+rke2r1, for reference, so I think it's possible that there's a version mismatch with autoscaler version 1.25.0, but I'm hoping someone can confirm whether that's the reason for the autoscaler failure or if there's something else going on.


nugzarg commented Nov 7, 2022

The same issue here: cluster-autoscaler version 1.25.0, installed via Helm; Rancher version 2.6.9; RKE2 version 1.24.4+rke2r1. Cluster type is Amazon EC2.
Autoscaler cmd:
./cluster-autoscaler --cloud-provider=rancher --namespace=kube-system --nodes=3:5:cpu-worker --cloud-config=/config/cloud-config --logtostderr=true --stderrthreshold=info --v=4

Cluster-autoscaler is terminating with the error "Failed to find readiness information for cpu-worker" (exit code 137). "cpu-worker" is the name of the pool in the Rancher cluster.

Please see the cluster-autoscaler log in the attached file.

cluster-autoscaler-rancher.log


ctrox commented Nov 7, 2022

Can you try without the --nodes flag? The node groups are discovered dynamically (using the annotations on the machinePool), so this flag should not be needed.
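
For illustration, a hedged sketch of how a machine pool on the clusters.provisioning.cattle.io resource might carry those annotations; the field and annotation names below are my assumptions based on the rancher provider docs and are not confirmed in this thread:

# Assumed annotation scheme; sized to match the "worker (2:6)" node group seen in earlier logs.
spec:
  rkeConfig:
    machinePools:
      - name: worker
        quantity: 2
        machineDeploymentAnnotations:
          cluster.provisioning.cattle.io/autoscaler-min-size: "2"
          cluster.provisioning.cattle.io/autoscaler-max-size: "6"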

Also can you tell me the ProviderID of one of the nodes in the pool cpu-worker?

$ kubectl describe node <node> | grep ProviderID

I just verified here that cluster-autoscaler v1.25.0 runs fine with an RKE2 cluster, even a much older version, v1.21.14+rke2r1. I'm also on Rancher 2.6.9.


nugzarg commented Nov 7, 2022

Hello @ctrox ,

Without the --nodes flag the result is the same. Here is a snippet from the cluster-autoscaler log:

I1107 17:39:19.455773 37 klogx.go:86] Pod ci-test/node-example-main-657d4bb7f4-fqwvn is unschedulable
I1107 17:39:19.455810 37 scale_up.go:375] Upcoming 3 nodes
W1107 17:39:19.455830 37 clusterstate.go:376] Failed to find readiness information for cpu-worker
W1107 17:39:19.455844 37 clusterstate.go:376] Failed to find readiness information for cpu-worker
W1107 17:39:19.455870 37 scale_up.go:395] Node group cpu-worker is not ready for scaleup - unhealthy
I1107 17:39:19.455889 37 scale_up.go:462] No expansion option

And the output of kubectl describe node i-0d33022be1ed6ac78.eu-central-1.compute.internal | grep ProviderID:

ProviderID: aws:///eu-central-1a/i-0d33022be1ed6ac78


ctrox commented Nov 8, 2022

ProviderID: aws:///eu-central-1a/i-0d33022be1ed6ac78

Aha, it makes sense now why it does not work with your EC2-backed cluster. This is a bit weird; it looks like I (wrongly) assumed Rancher would always set the ProviderID in a consistent way, no matter which backend node driver is used.

Just to be sure, you created your cluster with EC2 using RKE2 like so?

[screenshot]

Would you mind sharing a full node object with kubectl get node <node> -o yaml? For a potential fix, I need to see if there's a way to figure out the node pool name from the node object.

@eliaskoromilas

Q: The Rancher cloud provider is not yet supported in the Helm chart, right?


nugzarg commented Nov 9, 2022

Hello @ctrox ,

Yes, the cluster type is definitely RKE2 on EC2. Please see the screenshot.
[screenshot: Screenshot_20221109_093347]

Also, please see the node manifest YAML in the attached file. The file has a txt extension because attaching yaml files is not supported; simply change the extension to yml if you want.
node.txt


ctrox commented Nov 9, 2022

Q: Rancher cloud-provider is not yet supported in the Helm chart right?

I have not tested it, but I think it should work with the Helm chart. You just need to set a few values, like cloudProvider: rancher and cloudConfigPath.
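
For example, a minimal values sketch along those lines, assuming the cluster-autoscaler-cloud-config Secret from earlier in this thread and an image tag that actually ships the rancher provider:

# Minimal illustration, not a tested configuration.
cloudProvider: rancher
cloudConfigPath: /config/cloud-config
extraVolumeSecrets:
  cluster-autoscaler-cloud-config:
    mountPath: /config
    name: cluster-autoscaler-cloud-config
image:
  tag: v1.25.0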

Thanks @nugzarg, I can think of a possible fix, but I'm not yet sure when I will have time for that. I will look into it more on Friday.


nugzarg commented Nov 9, 2022

Thanks @ctrox.
Maybe this information is also relevant: if I let Rancher show the node in the API (via the older 2.5 Rancher UI, which is hidden but still present), the nodePoolId key is empty and there is no nodePoolName key.
[screenshot: Screenshot_20221109_175454]


jameswu2 commented Nov 9, 2022

Hey @ctrox, thanks for taking a look at this! What infrastructure provider are you using in your test environment, if any? Our clusters are being created on vSphere, and running kubectl describe node <node-name> | grep ProviderID gets us ProviderID: vsphere://<some-long-random-string>, which doesn't seem to follow any specific convention. It sounds like this issue will manifest if we use any cloud infrastructure provider?


ctrox commented Nov 11, 2022

I'm using a custom node driver which is not built in. My guess is that only the ones that don't have a Cloud Provider in Rancher get a ProviderID prefixed with rke2://. Anyway, it's clear to me that the autoscaler should not rely on the ProviderID anymore. I'm close to finishing up a PR; it will probably be done sometime next week.
