
No ScaleUp action for a new Pod and existing PVC/PV (AWS EKS, EBS Volume) #4739

Closed
KashifSaadat opened this issue Mar 15, 2022 · 9 comments
Labels: area/cluster-autoscaler, kind/bug

Comments


KashifSaadat commented Mar 15, 2022

Which component are you using?: cluster-autoscaler, helm chart

What version of the component are you using?: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0, cluster-autoscaler-chart-9.9.2

What k8s version are you using (kubectl version)?: v1.21.5-eks-bc4871b

kubectl version Output
$ kubectl version

Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: Amazon Elastic Kubernetes Service (EKS)

What did you expect to happen?:

I created a PersistentVolumeClaim using the default gp2 StorageClass (EBS volume), and then a simple Deployment referencing the PVC. The Volume was created successfully and the Pod scheduled (cluster-autoscaler brought up a new Node to meet demand). I then scaled the Deployment down to 0, causing the cluster-autoscaler to remove 1 Node as it was no longer needed. On scaling the Deployment back up to 1 replica (whilst the current Nodepool was at max capacity), the cluster-autoscaler should have detected this and scaled up the Nodepool, so the Pod could be scheduled onto a Node and attach the PV.

What happened instead?:

The Pod is stuck in a Pending state.

kubectl get events Output
$ kubectl -n test get events

LAST SEEN TYPE REASON OBJECT MESSAGE
70s Warning FailedScheduling pod/nginx-6bdfccff8f-s5b4k 0/3 nodes are available: 1 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
66s Normal NotTriggerScaleUp pod/nginx-6bdfccff8f-s5b4k pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict
71s Normal SuccessfulCreate replicaset/nginx-6bdfccff8f Created pod: nginx-6bdfccff8f-s5b4k
71s Normal ScalingReplicaSet deployment/nginx Scaled up replica set nginx-6bdfccff8f to 1

kubectl logs -lapp.kubernetes.io/name=aws-cluster-autoscaler Output
I0315 16:44:07.705409       1 static_autoscaler.go:229] Starting main loop
I0315 16:44:07.705978       1 filter_out_schedulable.go:65] Filtering out schedulables
I0315 16:44:07.705999       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0315 16:44:07.706132       1 scheduler_binder.go:795] PersistentVolume "pvc-1d8afbd0-6c46-425d-ae75-25dfbd0077cf", Node "ip-10-0-191-168.eu-west-2.compute.internal" mismatch for Pod "test/nginx-6bdfccff8f-s5b4k": no matching NodeSelectorTerms
I0315 16:44:07.706156       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0315 16:44:07.706168       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0315 16:44:07.706185       1 filter_out_schedulable.go:82] No schedulable pods
I0315 16:44:07.706195       1 klogx.go:86] Pod test/nginx-6bdfccff8f-s5b4k is unschedulable
I0315 16:44:07.706220       1 scale_up.go:364] Upcoming 0 nodes
I0315 16:44:07.706300       1 scheduler_binder.go:775] Could not get a CSINode object for the node "template-node-for-eks-compute-8abebfcd-72a5-97e9-5082-4dcb9b7dbc11-5821563729444175575": csinode.storage.k8s.io "template-node-for-eks-compute-8abebfcd-72a5-97e9-5082-4dcb9b7dbc11-5821563729444175575" not found
I0315 16:44:07.706350       1 scheduler_binder.go:795] PersistentVolume "pvc-1d8afbd0-6c46-425d-ae75-25dfbd0077cf", Node "template-node-for-eks-compute-8abebfcd-72a5-97e9-5082-4dcb9b7dbc11-5821563729444175575" mismatch for Pod "test/nginx-6bdfccff8f-s5b4k": no matching NodeSelectorTerms
I0315 16:44:07.706365       1 scale_up.go:288] Pod nginx-6bdfccff8f-s5b4k can't be scheduled on eks-compute-8abebfcd-72a5-97e9-5082-4dcb9b7dbc11, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I0315 16:44:07.706384       1 scale_up.go:437] No pod can fit to eks-compute-8abebfcd-72a5-97e9-5082-4dcb9b7dbc11
I0315 16:44:07.706417       1 scale_up.go:441] No expansion options
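
(For anyone debugging the same thing: the topology keys the PV actually requires can be read straight off the PV object and compared against the template-node labels in the logs above. The command below is illustrative, using the PV name from these logs.)

$ kubectl get pv pvc-1d8afbd0-6c46-425d-ae75-25dfbd0077cf \
    -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}'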

How to reproduce it (as minimally and precisely as possible):

  1. Provision an AWS EKS cluster with the cluster-autoscaler deployed (versions as above)
  2. Ensure the test cluster is fully utilised so that no new workloads can be scheduled without scaling up a Nodepool (or cordon Nodes)
  3. Create a PVC (default gp2 StorageClass)
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-data
    spec:
      resources:
        requests:
          storage: 1Gi
      accessModes:
      - ReadWriteOnce
  4. Create a Deployment
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: nginx
      name: nginx
      namespace: test
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - image: nginxinc/nginx-unprivileged
            name: nginx-unprivileged
            volumeMounts:
            - name: data
              mountPath: /data
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: test-data
  5. Wait for the PV and Pod to be successfully created and Running
  6. Scale the Deployment down to 0 replicas, forcing the cluster-autoscaler to reduce the Nodepool size by 1
  7. Scale the Deployment back up to 1 replica, and observe the Pod state, namespace events, and cluster-autoscaler logs (see the example commands below)
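
For reference, steps 6 and 7 amount to the following commands:

$ kubectl -n test scale deployment/nginx --replicas=0
# wait for the cluster-autoscaler to remove the now-empty Node, then:
$ kubectl -n test scale deployment/nginx --replicas=1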

Anything else we need to know?:

I first thought this could be related to our EKS Cluster configuration, Nodepool configuration (tags are all there), cluster-autoscaler args (defaults, all standard) etc. However, if I simply scale the Nodepool manually by 1 instance (or create another workload causing a new Node to be created), the Pod is finally scheduled onto a Node where the EBS volume can be attached and used.
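
(For reference, the manual workaround is just a one-instance bump of the ASG's desired capacity; the ASG name and capacity below are placeholders.)

$ aws autoscaling set-desired-capacity \
    --auto-scaling-group-name <nodepool-asg-name> \
    --desired-capacity <current-desired-capacity + 1>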

Edit: I've also tested this with the latest cluster-autoscaler release for Kubernetes v1.21 (v1.21.1) and get the same issue.

KashifSaadat added the kind/bug label Mar 15, 2022

mmerrill3 commented Apr 9, 2022

@KashifSaadat, I see a similar issue on my clusters as well. In my case, I see PVs getting provisioned with the new topology keys on a k8s 1.21 cluster:

    nodeSelectorTerms:
    - matchExpressions:
      - key: topology.kubernetes.io/region
        operator: In
        values:
        - ca-central-1
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - ca-central-1a

However, the function in aws_manager in CA still uses the old labels when building template nodes for ASGs that are at zero size. These labels are:

LabelFailureDomainBetaZone   = "failure-domain.beta.kubernetes.io/zone"   // deprecated
LabelFailureDomainBetaRegion = "failure-domain.beta.kubernetes.io/region" // deprecated

You can see the labels getting applied for the hypothetical node here:

func buildGenericLabels(template *asgTemplate, nodeName string) map[string]string {

In my case, I see the error message "node(s) had volume node affinity conflict" because the labels don't match.

@mmerrill3

When nodes are created in EKS, they get both the old and the new topology labels. It's just the CA that still uses only the old ones when building a hypothetical node.

Should be an easy fix. Can you confirm this is what is happening to your cluster?
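
One way to check (an illustrative command, not from the original thread) is to list the nodes with both the deprecated and the GA zone labels as columns; real EKS nodes should show a value for both:

$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone,topology.kubernetes.io/zone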

@mmerrill3

aws_manager.go is updated to use the new topology labels in the 1.22 and 1.23 tags of CA, but not in 1.21.x. This commit would fix our issue: 8f11490


KashifSaadat commented Apr 11, 2022

Hey @mmerrill3, nice find, thank you! I increased the log verbosity to see the references to the deprecated labels and confirmed the discrepancy as you've described. I tried this out with cluster-autoscaler v1.22.2 and it did manage to scale up successfully!

I ran into another issue while validating this on my test cluster. I had a single ASG spanning 3 Availability Zones, so when it scaled up, the new Node happened to be in a different AZ from the volume. The Pod could not be scheduled on this Node, so it remained empty until a ScaleDown event, and no further ScaleUp actions were performed. This is already mentioned as a gotcha in the AWS-specific docs, with the recommended approach being one ASG per AZ.
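
(The per-AZ setup from the docs just means registering one ASG per zone with the autoscaler; a rough sketch, with made-up ASG names and sizes:)

./cluster-autoscaler \
    --cloud-provider=aws \
    --nodes=1:10:eks-compute-eu-west-2a \
    --nodes=1:10:eks-compute-eu-west-2b \
    --nodes=1:10:eks-compute-eu-west-2c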

Edit: I'll close this issue as it appears to be correctly resolved from cluster-autoscaler v1.22.0 onwards and there isn't any further work required as far as I'm aware.


Xyaren commented Apr 13, 2022

@KashifSaadat you mentioned you are using k8s 1.21.5.
Do you notice any problems running a 1.22 cluster-autoscaler with that version?

To the maintainers: would this be something that could be backported?

@KashifSaadat

Hey @Xyaren. I haven't noticed any issues myself, but generally prefer to run the component version in line with my cluster version.


debu99 commented Oct 19, 2022

Looks like there is a new label for gp3:

Node Affinity:
  Required Terms:
    Term 0:        topology.ebs.csi.aws.com/zone in [ap-southeast-1a]


just1900 commented Nov 8, 2022

Looks like there is a new label for gp3:

Node Affinity:
  Required Terms:
    Term 0:        topology.ebs.csi.aws.com/zone in [ap-southeast-1a]

Same here; the PV for the EBS volume contains the following nodeSelectorTerms:

      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.ebs.csi.aws.com/zone
          operator: In
          values:
          - us-west-2b

@youwalther65

@just1900 and @debu99: gp3 is only supported via the EBS CSI driver, which applies the new topology labels, unlike the in-tree EBS provisioner used for gp2.
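
For context, a minimal gp3 StorageClass backed by the EBS CSI driver looks roughly like this (illustrative values):

$ kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
EOF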
