
Cluster Autoscaler for AWS failing to get availability zone for ASG #5002

Closed · sarasensible opened this issue Jul 1, 2022 · 3 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

sarasensible commented Jul 1, 2022

Which component are you using?:
cluster-autoscaler for aws

What version of the component are you using?:
Helm chart version 9.19.2
App/Image version 1.23.0

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.2", GitCommit:"9d142434e3af351a628bffee3939e64c681afa4d", GitTreeState:"clean", BuildDate:"2022-01-19T17:27:51Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.12-eks-a64ea69", GitCommit:"d4336843ba36120e9ed1491fddff5f2fec33eb77", GitTreeState:"clean", BuildDate:"2022-05-12T18:29:27Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

EKS Cluster version: 1.21

What environment is this in?:
AWS EKS

What did you expect to happen?:
I deployed the cluster-autoscaler expecting to see it scale my nodes up and down.

What happened instead?:
No scaling happened, and I saw the following logs:

I0701 18:46:00.865156       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
E0701 18:46:00.865194       1 node_instances_cache.go:164] Failed to get cloud provider node instance for node group eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404, error error while looking for instances of ASG: {eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404}
I0701 18:46:00.865237       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 70.98µs
I0701 18:46:10.865729       1 static_autoscaler.go:230] Starting main loop
E0701 18:46:10.866834       1 mixed_nodeinfos_processor.go:130] Unable to build proper template node for eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404: unable to get first AvailabilityZone for ASG "eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404"
E0701 18:46:10.866855       1 static_autoscaler.go:285] Failed to get node infos for groups: unable to get first AvailabilityZone for ASG "eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404"

How to reproduce it (as minimally and precisely as possible):
I used eksctl to create a cluster with a managed node group specifying availability zones like so:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ${CLUSTER_NAME}
  region: ${CLUSTER_REGION}
  tags:
    environment: dev

vpc:
  id: "vpc-12345"
  subnets:
    private:
      us-east-2a:
        id: "subnet-aaaaaa"
      us-east-2b:
        id: "subnet-bbbbbb"
      us-east-2c:
        id: "subnet-cccccc"
    public:
      us-east-2a:
        id: "subnet-xxxxxx"
      us-east-2b:
        id: "subnet-yyyyyy"
      us-east-2c:
        id:  "subnet-zzzzz"

managedNodeGroups:
  - name: managed-m5n-1
    instanceType: m5n.xlarge
    minSize: 6
    desiredCapacity: 8
    maxSize: 10
    privateNetworking: true
    availabilityZones: ["${CLUSTER_REGION}a", "${CLUSTER_REGION}b", "${CLUSTER_REGION}c"]
    volumeSize: 20
    ssh:
      allow: true # uses ~/.ssh/id_rsa.pub as the default ssh key
    labels: { role: worker }
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudwatch: true
        ebs: true
        albIngress: true

Then I installed the chart with Helm values like the following:

autoscalingGroups:
  - name: managed-m5n-1
    maxSize: 10
    minSize: 6
extraArgs:
  skip-nodes-with-local-storage: false
  balance-similar-node-groups: false
  skip-nodes-with-system-pods: false
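
For reference, the install itself was roughly the following; the release name, namespace, and values file name are my own placeholders (the namespace matches where the status ConfigMap below ended up), and the chart comes from the upstream kubernetes/autoscaler Helm repo:

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace tools --version 9.19.2 -f values.yaml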

I attached a policy to the role referenced by the role annotation, and opened the policy up for testing purposes:

{
    "Statement": [
        {
            "Action": [
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/alpha.eksctl.io/cluster-name": "${CLUSTER_NAME}"
                }
            },
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "VisualEditor0"
        },
        {
            "Action": [
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeAutoScalingGroups",
                "ec2:DescribeLaunchTemplateVersions",
                "autoscaling:DescribeTags",
                "autoscaling:DescribeLaunchConfigurations",
                "ec2:DescribeInstanceTypes",
                "eks:DescribeNodegroup"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "VisualEditor1"
        }
    ],
    "Version": "2012-10-17"
}
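
For context, the role annotation mentioned above is the standard IRSA annotation on the chart's service account; in this chart it can be set through values roughly like this (the role ARN is a placeholder):

rbac:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/<AUTOSCALER_ROLE>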

Anything else we need to know?:
The launch template for the ASG does not have an availability zone set, but I am not sure whether that is because the node group has subnets configured, which already constrain it to specific availability zones. If that is the case, I would expect this to be a supported configuration.
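
For what it's worth, the availability zones that the ASG itself reports can be checked with something like this (ASG name taken from the log lines above):

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404 \
  --query 'AutoScalingGroups[0].AvailabilityZones'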

Thanks in advance.

sarasensible added the kind/bug label on Jul 1, 2022

sarasensible commented Jul 1, 2022

I worked around this issue by opting for autoDiscovery instead of static autoscalingGroups. However, now every node in my cluster is logged with Node ip-x-x-x-x.us-east-2.compute.internal should not be processed by cluster autoscaler (no node group config). I checked the tags, the policy, and the region, but I am still not sure why the nodes aren't being picked up.
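
The auto-discovery tags on the node group's ASG can be double-checked with something like this (the filter value is the ASG name from the earlier logs):

aws autoscaling describe-tags \
  --filters Name=auto-scaling-group,Values=eks-managed-m5n-1-20bf96d0-320f-d275-b98f-1c56d77ce404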

So now my policy looks like this:

{
    "Statement": [
        {
            "Action": [
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/k8s.io/cluster-autoscaler/${CLUSTER_NAME}": "owned"
                }
            },
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "VisualEditor0"
        },
        {
            "Action": [
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeAutoScalingGroups",
                "ec2:DescribeLaunchTemplateVersions",
                "autoscaling:DescribeTags",
                "autoscaling:DescribeLaunchConfigurations",
                "ec2:DescribeInstanceTypes",
                "eks:DescribeNodegroup"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "VisualEditor1"
        }
    ],
    "Version": "2012-10-17"
}

My Helm values:

awsRegion: ${region}

autoDiscovery:
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/${cluster_name}

extraArgs:
  skip-nodes-with-local-storage: false
  balance-similar-node-groups: false
  skip-nodes-with-system-pods: false

sarasensible commented

Cluster autoscaler status configmap:

apiVersion: v1
data:
  status: |
    Cluster-autoscaler status at 2022-07-01 21:26:24.438485899 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=13 unready=0 notStarted=0 longNotStarted=0 registered=13 longUnregistered=0)
                   LastProbeTime:      2022-07-01 21:26:24.43715327 +0000 UTC m=+309.991227471
                   LastTransitionTime: 2022-07-01 21:21:43.442210316 +0000 UTC m=+28.996284904
      ScaleUp:     NoActivity (ready=13 registered=13)
                   LastProbeTime:      2022-07-01 21:26:24.43715327 +0000 UTC m=+309.991227471
                   LastTransitionTime: 2022-07-01 21:21:43.442210316 +0000 UTC m=+28.996284904
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2022-07-01 21:26:24.43715327 +0000 UTC m=+309.991227471
                   LastTransitionTime: 2022-07-01 21:21:43.442210316 +0000 UTC m=+28.996284904
kind: ConfigMap
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/last-updated: 2022-07-01 21:26:24.438485899 +0000
      UTC
  creationTimestamp: "2022-07-01T21:21:31Z"
  name: cluster-autoscaler-status
  namespace: tools
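
This was pulled with the usual ConfigMap query, something like:

kubectl -n tools get configmap cluster-autoscaler-status -o yaml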

sarasensible commented

The (no node group config) message was due to the awsRegion option not being configured correctly. Once it was actually passed in properly, the nodes were registered correctly. I'm going to close this, since autoDiscovery seems to be the recommended way to deploy and that path is working as expected.
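
For anyone hitting the same message, the fix was simply to make sure the region actually reaches the chart, e.g. by setting it explicitly in the values (us-east-2 here matches the subnets above):

awsRegion: us-east-2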
