cluster-autoscaler cannot count migrated PVs (CSI enabled) and cannot scale up on exceed max volume count
#4517
Comments
As discussed in the SIG meeting on Monday, I tried out the cluster-autoscaler with a manual hack that simulates feature flag enablement. I used the following diff:

diff --git a/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go b/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go
index c05b49cd8..f08ed5b6e 100644
--- a/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go
+++ b/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go
@@ -92,12 +92,14 @@ func (pl *CSILimits) Filter(ctx context.Context, _ *framework.CycleState, pod *v
// If the pod doesn't have any new CSI volumes, the predicate will always be true
if len(newVolumes) == 0 {
+ klog.V(5).Info("Early exit len(newVolumes) == 0")
return nil
}
// If the node doesn't have volume limits, the predicate will always be true
nodeVolumeLimits := getVolumeLimits(nodeInfo, csiNode)
if len(nodeVolumeLimits) == 0 {
+ klog.V(5).Info("Early exit len(nodeVolumeLimits) == 0")
return nil
}
@@ -125,6 +127,7 @@ func (pl *CSILimits) Filter(ctx context.Context, _ *framework.CycleState, pod *v
if ok {
currentVolumeCount := attachedVolumeCount[volumeLimitKey]
if currentVolumeCount+count > int(maxVolumeLimit) {
+ klog.V(5).Info("Pod is unschedulable.")
return framework.NewStatus(framework.Unschedulable, ErrReasonMaxVolumeCountExceeded)
}
}
diff --git a/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/utils.go b/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/utils.go
index 3fd98da14..9de43b175 100644
--- a/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/utils.go
+++ b/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/utils.go
@@ -19,7 +19,7 @@ package nodevolumelimits
import (
"strings"
- "k8s.io/api/core/v1"
+ v1 "k8s.io/api/core/v1"
storagev1 "k8s.io/api/storage/v1"
"k8s.io/apimachinery/pkg/util/sets"
utilfeature "k8s.io/apiserver/pkg/util/feature"
@@ -44,9 +44,7 @@ func isCSIMigrationOn(csiNode *storagev1.CSINode, pluginName string) bool {
switch pluginName {
case csilibplugins.AWSEBSInTreePluginName:
- if !utilfeature.DefaultFeatureGate.Enabled(features.CSIMigrationAWS) {
- return false
- }
+ return true
case csilibplugins.GCEPDInTreePluginName:
if !utilfeature.DefaultFeatureGate.Enabled(features.CSIMigrationGCE) {
return false

With this small hack cluster-autoscaler was able to successfully scale up. Logs:
@MaciekPytel to my understanding this is not a real concern as the
Having the above findings in mind, I guess this should be automatically fixed with the vendoring of K8s 1.23 in cluster-autoscaler. Unfortunately I cannot easily verify this, because we are using a fork of cluster-autoscaler that currently vendors K8s 1.18.
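A less invasive variant of the hack above might be to enable the gates at startup instead of hardcoding return true. This is a rough, untested sketch: it assumes the vendored DefaultMutableFeatureGate can simply be flipped before the scheduler framework plugins are constructed, and the startup wiring shown is not taken from the autoscaler code base.

package main

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/features"
)

// enableCSIMigrationGates turns on the CSI migration feature gates on the
// vendored DefaultMutableFeatureGate. DefaultFeatureGate (which the
// nodevolumelimits plugins read) is backed by the same gate, so the
// CSILimits filter would then count migrated in-tree EBS volumes.
func enableCSIMigrationGates() error {
	return utilfeature.DefaultMutableFeatureGate.SetFromMap(map[string]bool{
		string(features.CSIMigration):    true,
		string(features.CSIMigrationAWS): true,
	})
}

func main() {
	if err := enableCSIMigrationGates(); err != nil {
		klog.Fatalf("failed to enable CSI migration feature gates: %v", err)
	}
	// ... the rest of cluster-autoscaler startup would follow here (assumed) ...
}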
@MaciekPytel to tackle this issue for all K8s
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: v1.18.0
What k8s version are you using (kubectl version)?:
1.17 and 1.18
What environment is this in?:
Gardener
What did you expect to happen?:
cluster-autoscaler to properly count migrated PVs and to scale up appropriately on scheduling failures with reason exceed max volume count.
What happened instead?:
cluster-autoscaler cannot count migrated PVs when CSI is enabled -> cannot scale up on exceed max volume count. Pod(s) hang forever in Pending state.
How to reproduce it (as minimally and precisely as possible):
1. Create a single-node cluster with a K8s version that does not have CSI enabled (for example an AWS cluster with K8s 1.17).
2. For the machine type, select one that allows 25 volume attachments, for example m5.large.
3. Make sure that you have a single Node. Its allocatable volume attachments should be 25.
4. Create a dummy StatefulSet and scale it to 20 replicas (a rough manifest sketch in Go follows these steps). This will create 20 Pods and PVs (note that the PVs are created with the in-tree volume plugin).
5. Update to a K8s version with CSI enabled (for example an AWS cluster with K8s 1.18). After completing this step you should have 20 "migrated" PVs (PVs that were provisioned with the in-tree volume plugin).
6. Scale the StatefulSet to 26 replicas.
7. Make sure that the 26th replica (Pod web-25) fails to be scheduled (as expected) but cluster-autoscaler never triggers a scale up.
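For step 4, a rough sketch of the dummy StatefulSet, written with the client-go types of that era (the name web, the gp2 storage class, and the pause image are assumptions; any ReadWriteOnce, EBS-backed claim per replica works):

package repro

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// dummyStatefulSet builds a StatefulSet with one small PVC per replica, so 20
// replicas produce 20 in-tree (kubernetes.io/aws-ebs) PVs and the 26th replica
// exceeds the 25-attachment limit of an m5.large node.
func dummyStatefulSet(replicas int32) *appsv1.StatefulSet {
	labels := map[string]string{"app": "web"}
	storageClass := "gp2" // assumed in-tree-backed storage class
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "web"},
		Spec: appsv1.StatefulSetSpec{
			Replicas:    &replicas,
			ServiceName: "web",
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "web",
						Image: "registry.k8s.io/pause:3.9", // any lightweight image works
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "data",
							MountPath: "/data",
						}},
					}},
				},
			},
			VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
				ObjectMeta: metav1.ObjectMeta{Name: "data"},
				Spec: corev1.PersistentVolumeClaimSpec{
					AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
					StorageClassName: &storageClass,
					Resources: corev1.ResourceRequirements{
						Requests: corev1.ResourceList{
							corev1.ResourceStorage: resource.MustParse("1Gi"),
						},
					},
				},
			}},
		},
	}
}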
Logs of cluster-autoscaler:
Anything else we need to know?:
From what I managed to track in the autoscaler repository, the autoscaler creates a new scheduler framework and "simulates" whether the Pod is really unschedulable (most probably using default scheduling config).
The above log entry makes it clear that the Pod is unschedulable according to kube-scheduler (exceed max volume count), but the same Pod is schedulable according to cluster-autoscaler. The difference comes from the NodeVolumeLimits filter in the scheduler: kube-scheduler has the required CSI migration feature gates set and can correctly count migrated volumes, while cluster-autoscaler currently does not have any such config, hence it cannot count volumes with CSI enabled. (Note that kube-scheduler has the required CSI migration flags and CSI migration is enabled for AWS.)
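To make the difference concrete, here is a simplified illustration (not the vendored code verbatim) of the gate check that decides whether migrated EBS volumes get counted at all:

package illustration

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// countsMigratedEBSVolumes mirrors the feature-gate part of isCSIMigrationOn
// for the AWS in-tree plugin (the CSINode migration annotation check is
// omitted for brevity). kube-scheduler runs with these gates enabled and gets
// true, so the 25 attached volumes are counted and the 26th Pod is marked
// Unschedulable. cluster-autoscaler's embedded framework runs with the default
// (off) gates and gets false, so the migrated PVs are never translated to CSI
// volumes, the CSILimits filter sees no new CSI volumes, and the simulated
// scheduling succeeds. Hence no scale-up is triggered.
func countsMigratedEBSVolumes() bool {
	return utilfeature.DefaultFeatureGate.Enabled(features.CSIMigration) &&
		utilfeature.DefaultFeatureGate.Enabled(features.CSIMigrationAWS)
}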