Race condition with volume attach limits when new node joins the cluster leads to oversubscribed node #1278

Closed
steved opened this issue Jun 16, 2022 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

steved commented Jun 16, 2022

/kind bug

What happened?

I believe this is similar to #1163 (and perhaps even #1258), but distinct enough that I thought it warranted another issue. This may also be the wrong repository; I just encountered it specifically with EKS and the aws-ebs-csi-driver.

When the cluster is scaled up in response to new pods with PVCs and a custom --volume-attach-limit is set, the restriction is somehow not enforced, and all of the pods are scheduled onto the new node.

What you expected to happen?

Ideally, only as many pods as fit within the volume attach limit would be scheduled to the node. I'm unsure whether this is something that instead needs to be handled in the Kubernetes scheduler's NodeVolumeLimits filter, however.
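
For reference, the per-node limit that the NodeVolumeLimits filter consults comes from the allocatable count published on the node's CSINode object. A quick sketch for checking what each node actually advertises (the column expressions are mine; ebs.csi.aws.com is the name the EBS CSI driver registers under):

# Show the attachable-volume count each node's CSI drivers report to the scheduler.
# If pods are scheduled before this object is populated, the limit cannot be enforced.
kubectl get csinode -o custom-columns='NODE:.metadata.name,DRIVER:.spec.drivers[*].name,ALLOCATABLE:.spec.drivers[*].allocatable.count'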

How to reproduce it (as minimally and precisely as possible)?

  1. Create a cluster with the aws-ebs-csi-driver and cluster-autoscaler deployed, plus a node pool (labelled node-pool: my-node-pool) for the workload pods
  2. Set --volume-attach-limit=5 on the node driver (one way to pass this flag is sketched below the steps)
  3. Create a StatefulSet similar to:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vols-pv-test
spec:
  serviceName: vols-pv-test  # required by the StatefulSet API; the headless Service name is assumed here
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: vols-pv-test
  replicas: 10
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.5
        volumeMounts:
        - name: test-volume
          mountPath: /test
      nodeSelector:
        node-pool: "my-node-pool"
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "mystorageclass"
      resources:
        requests:
          storage: 1Gi
  4. All 10 pods are scheduled on the newly created node even with the attach limit
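
One way to pass the flag from step 2, assuming the aws-ebs-csi-driver Helm chart is in use (the chart value name node.volumeAttachLimit is my assumption; otherwise add --volume-attach-limit=5 directly to the ebs-csi-node container args):

# Sketch: configure the node plugin's attach limit via the Helm chart.
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --set node.volumeAttachLimit=5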

If I then delete all the pods, only 5 will reschedule to the same node and the other 5 will wait for a new autoscaled node. This issue still occurs even if the PVCs are already provisioned.
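
A rough way to confirm the oversubscription after the scale-up is to count VolumeAttachment objects per node; with --volume-attach-limit=5 no node should ever exceed five (a sketch using standard fields):

# Count attached volumes per node name.
kubectl get volumeattachment -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c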

Anything else we need to know?:
N/A

Environment

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.9-eks-a64ea69", GitCommit:"540410f9a2e24b7a2a870ebfacb3212744b5f878", GitTreeState:"clean", BuildDate:"2022-05-12T19:15:31Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version: 1.5.3
@k8s-ci-robot added the kind/bug label Jun 16, 2022
@jortkoopmans

My conclusion from the discussion in #1163 is that my situation is highly likely to be identical to this issue. Given the excellent reproducibility described by @steved, working this issue is probably easier (as opposed to what is seemingly a mix of issues in #1163).

My environment:

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.4", GitCommit:"95ee5ab382d64cfe6c28967f36b53970b8374491", GitTreeState:"clean", BuildDate:"2022-08-17T18:54:23Z", GoVersion:"go1.18.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.7-eks-4721010", GitCommit:"b77d9473a02fbfa834afa67d677fd12d690b195f", GitTreeState:"clean", BuildDate:"2022-06-27T22:19:07Z", GoVersion:"go1.17.10", Compiler:"gc", Platform:"linux/amd64"}

Driver version: 1.11.2 (and using backwards compatibility mode for the kubernetes.io/aws-ebs provisioner)

@steved changed the title from "Race condition with volume attach limits when new node joins the cluster leads oversubscribed node" to "Race condition with volume attach limits when new node joins the cluster leads to oversubscribed node" on Dec 6, 2022

steved commented Dec 6, 2022

This actually appears to be a duplicate of kubernetes/kubernetes#95911. Closing this for now unless there's further evidence it can be fixed here instead.
