
Cannot deploy cluster-autoscaler with Rancher RKE2 #5140

Closed
bennysp opened this issue Aug 29, 2022 · 14 comments · Fixed by #5361
Labels
area/cluster-autoscaler · kind/bug

Comments


bennysp commented Aug 29, 2022

Which component are you using?:
cluster-autoscaler / cluster-autoscaler-chart

What version of the component are you using?:

Component version:
cluster-autoscaler 1.23.1 / cluster-autoscaler-chart-9.20.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.8", GitCommit:"4a3b558c52eb6995b3c5c1db5e54111bd0645a64", GitTreeState:"clean", BuildDate:"2021-12-15T14:52:11Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6+rke2r2", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-28T19:13:01Z", GoVersion:"go1.17.9b7", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.23) exceeds the supported minor version skew of +/-1

What environment is this in?:
Dev

What did you expect to happen?:
Cluster Autoscaler to deploy via the Helm chart to my Rancher RKE2 cluster after the changes from PR #4975.

What happened instead?:
This error appears in the cluster-autoscaler pod logs:

F0829 00:36:25.149073 1 main.go:430] Failed to get nodes from apiserver: nodes is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "nodes" in API group "" at the cluster scope

How to reproduce it (as minimally and precisely as possible):

  1. Create a Rancher API key (in my case, with no restrictions)
  2. Create an Opaque Secret with your Rancher cloud-config:
apiVersion: v1
kind: Secret
metadata:
  name: cluster-autoscaler-cloud-config
  namespace: kube-system
type: Opaque
stringData:
  cloud-config: |
    # rancher server credentials
    url: https://rancher.domain.com
    token: [Redacted: token-*:*]
    # name and namespace of the clusters.provisioning.cattle.io resource on the
    # rancher server
    clusterName: my-cluster
    clusterNamespace: fleet-default
    # optional, will be auto-discovered if not specified
    #clusterAPIVersion: v1alpha4
  3. Use the Helm values below:
autoDiscovery:
  clusterName: my-cluster
  labels: []
  roles:
    - worker
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/{{ .Values.autoDiscovery.clusterName }}
cloudProvider: rancher
extraVolumeSecrets:
  cluster-autoscaler-cloud-config:
    mountPath: /config
    name: cluster-autoscaler-cloud-config
extraArgs:
  logtostderr: true
  stderrthreshold: info
  v: 4
  cloud-config: /config/cloud-config
  cluster-name: my-cluster
image:
  pullPolicy: IfNotPresent
  pullSecrets: []
  repository: k8s.gcr.io/autoscaling/cluster-autoscaler
  tag: v1.23.1

Anything else we need to know?:
On Rancher 2.6.7 with RKE2 1.23.x.
I tried deploying to both downstream and management clusters (both on RKE2 1.23.x).

I am wondering if something is going wrong when reading my deployed cloud-config?

bennysp added the kind/bug label on Aug 29, 2022

bennysp commented Aug 30, 2022

@ctrox Do you have any ideas from the above?


ctrox commented Aug 30, 2022

From the error in your logs, the autoscaler does not have permission to list nodes, so I'm assuming it is missing some RBAC permissions (on the downstream cluster). If the autoscaler is running on the downstream cluster, you need to make sure the service account you set in the deployment has these permissions, but that should happen automatically with the Helm chart default values.
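
For reference, a minimal RBAC sketch (my own illustration, not the chart's actual manifests) that would grant the nodes permission named in the error to the kube-system:default service account; the Helm chart normally creates a dedicated service account with a broader ClusterRole:

# Illustrative only: the Helm chart's own RBAC is broader than this.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler-nodes-read
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler-nodes-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler-nodes-read
subjects:
  - kind: ServiceAccount
    name: default        # the service account named in the error message
    namespace: kube-system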

Also please note that cluster-autoscaler 1.23.1, which you linked, does not contain the rancher provider; I'm assuming it will be included in the next minor release (1.25.0).


bennysp commented Aug 31, 2022

Thanks @ctrox. I was wondering about that 1.23.1 version. I will double-check the service account.

@jameswu2

Hi @ctrox, I'm deploying the cluster-autoscaler chart similarly to what's described in this issue, although my image.tag is v1.25.0 because the v1.23.1 image does not support the rancher cloud provider. From the autoscaler pod logs, I am seeing the following errors:
pre_filtering_processor.go:57] Node node_name should not be processed by cluster autoscaler (no node group config)
and
clusterstate.go:376] Failed to find readiness information for worker
clusterstate.go:438] Failed to find readiness information for worker

The API calls seem to be successful, since the autoscaler is able to read the node names in the cluster and discover the node group, based on the log message:
rancher_provider.go:228] scalable node group found: worker (2:6)

I'm on Kubernetes version v1.24.4+rke2r1, for reference, so I think it's possible that there's a version mismatch with autoscaler version 1.25.0, but I'm hoping someone can confirm whether that's the reason for the autoscaler failure or if there's something else going on.


nugzarg commented Nov 7, 2022

The same issue here: cluster-autoscaler version 1.25.0, installed via Helm; Rancher version 2.6.9; RKE2 version 1.24.4+rke2r1. Cluster type is Amazon EC2.
Autoscaler cmd:
./cluster-autoscaler --cloud-provider=rancher --namespace=kube-system --nodes=3:5:cpu-worker --cloud-config=/config/cloud-config --logtostderr=true --stderrthreshold=info --v=4

Cluster-autoscaler is terminating with the error "Failed to find readiness information for cpu-worker" (exit code 137). "cpu-worker" is the name of the pool in the Rancher cluster.

Please see the cluster-autoscaler log in the attached file.

cluster-autoscaler-rancher.log


ctrox commented Nov 7, 2022

Can you try without the --nodes flag? The node groups are discovered dynamically (using the annotations on the machinePool), so this flag should not be needed.
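
For illustration, a hedged sketch of how a machine pool on the clusters.provisioning.cattle.io resource might carry those annotations; the field and annotation names below are my assumptions based on the rancher provider docs and are not confirmed in this thread:

# Assumed annotation scheme; sized to match the "worker (2:6)" node group seen in earlier logs.
spec:
  rkeConfig:
    machinePools:
      - name: worker
        quantity: 2
        machineDeploymentAnnotations:
          cluster.provisioning.cattle.io/autoscaler-min-size: "2"
          cluster.provisioning.cattle.io/autoscaler-max-size: "6"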

Also can you tell me the ProviderID of one of the nodes in the pool cpu-worker?

$ kubectl describe node <node> | grep ProviderID

I just verified here that cluster-autoscaler v1.25.0 runs fine with an RKE2 cluster, even a much older version, v1.21.14+rke2r1. I'm also on Rancher 2.6.9.


nugzarg commented Nov 7, 2022

Hello @ctrox ,

Without the --nodes flag the result is the same. Here is a snippet from the cluster-autoscaler log:

I1107 17:39:19.455773 37 klogx.go:86] Pod ci-test/node-example-main-657d4bb7f4-fqwvn is unschedulable
I1107 17:39:19.455810 37 scale_up.go:375] Upcoming 3 nodes
W1107 17:39:19.455830 37 clusterstate.go:376] Failed to find readiness information for cpu-worker
W1107 17:39:19.455844 37 clusterstate.go:376] Failed to find readiness information for cpu-worker
W1107 17:39:19.455870 37 scale_up.go:395] Node group cpu-worker is not ready for scaleup - unhealthy
I1107 17:39:19.455889 37 scale_up.go:462] No expansion option

And the output of kubectl describe node i-0d33022be1ed6ac78.eu-central-1.compute.internal | grep ProviderID:

ProviderID: aws:///eu-central-1a/i-0d33022be1ed6ac78


ctrox commented Nov 8, 2022

ProviderID: aws:///eu-central-1a/i-0d33022be1ed6ac78

Aha, it makes sense now why it does not work with your EC2-backed cluster. This is a bit weird; it looks like I (wrongly) assumed Rancher would always set the ProviderID in a consistent way, no matter which backend node driver is used.

Just to be sure, you created your cluster with EC2 using RKE2 like so?

[screenshot]

Would you mind sharing a full node object with kubectl get node <node> -o yaml? For a potential fix, I need to see if there's a way to figure out the node pool name from the node object.

@eliaskoromilas

Q: The Rancher cloud provider is not yet supported in the Helm chart, right?


nugzarg commented Nov 9, 2022

Hello @ctrox ,

Yes, the cluster type is definitely RKE2 on EC2. Please see the screenshot.
[screenshot: Screenshot_20221109_093347]

Also, please see the node manifest YAML in the attached file. The file has a txt extension because attaching yaml files is not supported; simply change the extension to yml if you want.
node.txt


ctrox commented Nov 9, 2022

Q: Rancher cloud-provider is not yet supported in the Helm chart right?

I have not tested it, but I think it should work with the Helm chart. You just need to set a few values, like cloudProvider: rancher and cloudConfigPath.
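
For example, a minimal values sketch along those lines, assuming the cluster-autoscaler-cloud-config Secret from earlier in this thread and an image tag that actually ships the rancher provider:

# Minimal illustration, not a tested configuration.
cloudProvider: rancher
cloudConfigPath: /config/cloud-config
extraVolumeSecrets:
  cluster-autoscaler-cloud-config:
    mountPath: /config
    name: cluster-autoscaler-cloud-config
image:
  tag: v1.25.0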

Thanks @nugzarg, I can think of a possible fix, but I'm not yet sure when I will have time for that. I will look into it more on Friday.


nugzarg commented Nov 9, 2022

Thanks @ctrox.
Maybe this information is also relevant: if I let Rancher show the node in the API (via the older 2.5 Rancher UI, which is hidden but still present), the nodePoolId key is empty and there is no nodePoolName key.
[screenshot: Screenshot_20221109_175454]


jameswu2 commented Nov 9, 2022

Hey @ctrox, thanks for taking a look at this! What infrastructure provider are you using in your test environment, if any? Our clusters are being created on vSphere, and running kubectl describe node <node-name> | grep ProviderID gets us ProviderID: vsphere://<some-long-random-string>, which doesn't seem to follow any specific convention. It sounds like this issue will manifest if we use any cloud infrastructure provider?


ctrox commented Nov 11, 2022

I'm using a custom node driver which is not built in. My guess is that only the ones that don't have a Cloud Provider in Rancher get a ProviderID prefixed with rke2://. Anyway, it's clear to me that the autoscaler should not rely on the ProviderID anymore. I'm close to finishing up a PR; it will probably be done sometime next week.
