
Cluster Autoscaler for AWS failing to get availability zone for ASG #5002

Closed · sarasensible opened this issue Jul 1, 2022 · 3 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

sarasensible commented Jul 1, 2022

Which component are you using?:
cluster-autoscaler for aws

What version of the component are you using?:
Helm chart version 9.19.2
App/Image version 1.23.0

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.2", GitCommit:"9d142434e3af351a628bffee3939e64c681afa4d", GitTreeState:"clean", BuildDate:"2022-01-19T17:27:51Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.12-eks-a64ea69", GitCommit:"d4336843ba36120e9ed1491fddff5f2fec33eb77", GitTreeState:"clean", BuildDate:"2022-05-12T18:29:27Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

EKS Cluster version: 1.21

What environment is this in?:
AWS EKS

What did you expect to happen?:
I deployed the cluster-autoscaler expecting to see it scale my nodes up and down.

What happened instead?:
No scaling happened, and I saw the following logs:

I0701 18:46:00.865156       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
E0701 18:46:00.865194       1 node_instances_cache.go:164] Failed to get cloud provider node instance for node group eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404, error error while looking for instances of ASG: {eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404}
I0701 18:46:00.865237       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 70.98µs
I0701 18:46:10.865729       1 static_autoscaler.go:230] Starting main loop
E0701 18:46:10.866834       1 mixed_nodeinfos_processor.go:130] Unable to build proper template node for eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404: unable to get first AvailabilityZone for ASG "eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404"
E0701 18:46:10.866855       1 static_autoscaler.go:285] Failed to get node infos for groups: unable to get first AvailabilityZone for ASG "eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404"

How to reproduce it (as minimally and precisely as possible):
I used eksctl to create a cluster with a managed node group specifying availability zones like so:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ${CLUSTER_NAME}
  region: ${CLUSTER_REGION}
  tags:
    environment: dev

vpc:
  id: "vpc-12345"
  subnets:
    private:
      us-east-2a:
        id: "subnet-aaaaaa"
      us-east-2b:
        id: "subnet-bbbbbb"
      us-east-2c:
        id: "subnet-cccccc"
    public:
      us-east-2a:
        id: "subnet-xxxxxx"
      us-east-2b:
        id: "subnet-yyyyyy"
      us-east-2c:
        id:  "subnet-zzzzz"

managedNodeGroups:
  - name: managed-m5n-1
    instanceType: m5n.xlarge
    minSize: 6
    desiredCapacity: 8
    maxSize: 10
    privateNetworking: true
    availabilityZones: ["${CLUSTER_REGION}a", "${CLUSTER_REGION}b", "${CLUSTER_REGION}c"]
    volumeSize: 20
    ssh:
      allow: true # uses ~/.ssh/id_rsa.pub as the default ssh key
    labels: { role: worker }
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudwatch: true
        ebs: true
        albIngress: true

Then I installed the chart with Helm values like the following:

autoscalingGroups:
  - name: managed-m5n-1
    maxSize: 10
    minSize: 6
extraArgs:
  skip-nodes-with-local-storage: false
  balance-similar-node-groups: false
  skip-nodes-with-system-pods: false
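
For reference, the install itself was roughly the following; the release name, namespace, and values file name are my own placeholders (the namespace matches where the status ConfigMap below ended up), and the chart comes from the upstream kubernetes/autoscaler Helm repo:

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace tools --version 9.19.2 -f values.yaml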

I attached a policy to the role referenced by the role annotation, and opened the policy up for testing purposes:

{
    "Statement": [
        {
            "Action": [
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/alpha.eksctl.io/cluster-name": "${CLUSTER_NAME}"
                }
            },
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "VisualEditor0"
        },
        {
            "Action": [
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeAutoScalingGroups",
                "ec2:DescribeLaunchTemplateVersions",
                "autoscaling:DescribeTags",
                "autoscaling:DescribeLaunchConfigurations",
                "ec2:DescribeInstanceTypes",
                "eks:DescribeNodegroup"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "VisualEditor1"
        }
    ],
    "Version": "2012-10-17"
}
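
For context, the role annotation mentioned above is the standard IRSA annotation on the chart's service account; in this chart it can be set through values roughly like this (the role ARN is a placeholder):

rbac:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/<AUTOSCALER_ROLE>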

Anything else we need to know?:
The launch template for the ASG does not have an availability zone set, but I am not sure whether that is because the node group has subnets configured, which already constrain it to specific availability zones. If that is the case, I would expect this to be a supported configuration.
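
For what it's worth, the availability zones that the ASG itself reports can be checked with something like this (ASG name taken from the log lines above):

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404 \
  --query 'AutoScalingGroups[0].AvailabilityZones'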

Thanks in advance.

sarasensible added the kind/bug label on Jul 1, 2022

sarasensible commented Jul 1, 2022

I worked around this issue by opting for autoDiscovery instead of static autoscalingGroups. However, now every node in my cluster is logged with Node ip-x-x-x-x.us-east-2.compute.internal should not be processed by cluster autoscaler (no node group config). I checked the tags, the policy, and the region, but I am still not sure why the nodes aren't being picked up.
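
The auto-discovery tags on the node group's ASG can be double-checked with something like this (the filter value is the ASG name from the earlier logs):

aws autoscaling describe-tags \
  --filters Name=auto-scaling-group,Values=eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404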

So now my policy looks like this:

{
    "Statement": [
        {
            "Action": [
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/k8s.io/cluster-autoscaler/${CLUSTER_NAME}": "owned"
                }
            },
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "VisualEditor0"
        },
        {
            "Action": [
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeAutoScalingGroups",
                "ec2:DescribeLaunchTemplateVersions",
                "autoscaling:DescribeTags",
                "autoscaling:DescribeLaunchConfigurations",
                "ec2:DescribeInstanceTypes",
                "eks:DescribeNodegroup"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "VisualEditor1"
        }
    ],
    "Version": "2012-10-17"
}

My Helm values:

awsRegion: ${region}

autoDiscovery:
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/${cluster_name}

extraArgs:
  skip-nodes-with-local-storage: false
  balance-similar-node-groups: false
  skip-nodes-with-system-pods: false

sarasensible commented

Cluster autoscaler status configmap:

apiVersion: v1
data:
  status: |
    Cluster-autoscaler status at 2022-07-01 21:26:24.438485899 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=13 unready=0 notStarted=0 longNotStarted=0 registered=13 longUnregistered=0)
                   LastProbeTime:      2022-07-01 21:26:24.43715327 +0000 UTC m=+309.991227471
                   LastTransitionTime: 2022-07-01 21:21:43.442210316 +0000 UTC m=+28.996284904
      ScaleUp:     NoActivity (ready=13 registered=13)
                   LastProbeTime:      2022-07-01 21:26:24.43715327 +0000 UTC m=+309.991227471
                   LastTransitionTime: 2022-07-01 21:21:43.442210316 +0000 UTC m=+28.996284904
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2022-07-01 21:26:24.43715327 +0000 UTC m=+309.991227471
                   LastTransitionTime: 2022-07-01 21:21:43.442210316 +0000 UTC m=+28.996284904
kind: ConfigMap
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/last-updated: 2022-07-01 21:26:24.438485899 +0000
      UTC
  creationTimestamp: "2022-07-01T21:21:31Z"
  name: cluster-autoscaler-status
  namespace: tools
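
This was pulled with the usual ConfigMap query, something like:

kubectl -n tools get configmap cluster-autoscaler-status -o yaml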

sarasensible commented

The (no node group config) message was due to the awsRegion option not being configured correctly. Once it was actually passed in properly, the nodes were registered correctly. I'm going to close this, since autoDiscovery seems to be the recommended way to deploy and that path is working as expected.
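
For anyone hitting the same message, the fix was simply to make sure the region actually reaches the chart, e.g. by setting it explicitly in the values (us-east-2 here matches the subnets above):

awsRegion: us-east-2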
