cluster-autoscaler cannot count migrated PVs (CSI enabled) and cannot scale up on exceed max volume count
#4517
Comments
As discussed in the SIG meeting on Monday, I tried out the cluster-autoscaler with a manual hack that simulates feature flag enablement. I used the following diff:

diff --git a/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go b/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go
index c05b49cd8..f08ed5b6e 100644
--- a/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go
+++ b/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go
@@ -92,12 +92,14 @@ func (pl *CSILimits) Filter(ctx context.Context, _ *framework.CycleState, pod *v
// If the pod doesn't have any new CSI volumes, the predicate will always be true
if len(newVolumes) == 0 {
+ klog.V(5).Info("Early exit len(newVolumes) == 0")
return nil
}
// If the node doesn't have volume limits, the predicate will always be true
nodeVolumeLimits := getVolumeLimits(nodeInfo, csiNode)
if len(nodeVolumeLimits) == 0 {
+ klog.V(5).Info("Early exit len(nodeVolumeLimits) == 0")
return nil
}
@@ -125,6 +127,7 @@ func (pl *CSILimits) Filter(ctx context.Context, _ *framework.CycleState, pod *v
if ok {
currentVolumeCount := attachedVolumeCount[volumeLimitKey]
if currentVolumeCount+count > int(maxVolumeLimit) {
+ klog.V(5).Info("Pod is unschedulable.")
return framework.NewStatus(framework.Unschedulable, ErrReasonMaxVolumeCountExceeded)
}
}
diff --git a/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/utils.go b/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/utils.go
index 3fd98da14..9de43b175 100644
--- a/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/utils.go
+++ b/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/utils.go
@@ -19,7 +19,7 @@ package nodevolumelimits
import (
"strings"
- "k8s.io/api/core/v1"
+ v1 "k8s.io/api/core/v1"
storagev1 "k8s.io/api/storage/v1"
"k8s.io/apimachinery/pkg/util/sets"
utilfeature "k8s.io/apiserver/pkg/util/feature"
@@ -44,9 +44,7 @@ func isCSIMigrationOn(csiNode *storagev1.CSINode, pluginName string) bool {
switch pluginName {
case csilibplugins.AWSEBSInTreePluginName:
- if !utilfeature.DefaultFeatureGate.Enabled(features.CSIMigrationAWS) {
- return false
- }
+ return true
case csilibplugins.GCEPDInTreePluginName:
if !utilfeature.DefaultFeatureGate.Enabled(features.CSIMigrationGCE) {
return false

With this small hack cluster-autoscaler was able to successfully scale up. Logs:
@MaciekPytel to my understanding this is not a real concern as the
Having the above findings in mind, I guess this should be automatically fixed with the vendoring of K8s 1.23 in cluster-autoscaler. Unfortunately I cannot easily verify this, because we are using a fork of cluster-autoscaler that currently vendors K8s 1.18.
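A less invasive variant of the hack above might be to enable the gates at startup instead of hardcoding return true. This is a rough, untested sketch: it assumes the vendored DefaultMutableFeatureGate can simply be flipped before the scheduler framework plugins are constructed, and the startup wiring shown is not taken from the autoscaler code base.

package main

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/features"
)

// enableCSIMigrationGates turns on the CSI migration feature gates on the
// vendored DefaultMutableFeatureGate. DefaultFeatureGate (which the
// nodevolumelimits plugins read) is backed by the same gate, so the
// CSILimits filter would then count migrated in-tree EBS volumes.
func enableCSIMigrationGates() error {
	return utilfeature.DefaultMutableFeatureGate.SetFromMap(map[string]bool{
		string(features.CSIMigration):    true,
		string(features.CSIMigrationAWS): true,
	})
}

func main() {
	if err := enableCSIMigrationGates(); err != nil {
		klog.Fatalf("failed to enable CSI migration feature gates: %v", err)
	}
	// ... the rest of cluster-autoscaler startup would follow here (assumed) ...
}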
@MaciekPytel to tackle this issue for all K8s
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: v1.18.0
What k8s version are you using (kubectl version)?:
1.17 and 1.18
What environment is this in?:
Gardener
What did you expect to happen?:
cluster-autoscaler to properly count migrated PVs and to scale up appropriately on scheduling failures with reason exceed max volume count.
What happened instead?:
cluster-autoscaler cannot count migrated PVs when CSI is enabled -> cannot scale up on exceed max volume count. Pod(s) hang forever in Pending state.
How to reproduce it (as minimally and precisely as possible):
1. Create a single-node cluster with a K8s version that does not have CSI enabled (for example an AWS cluster with K8s 1.17).
2. For the machine type, select one that allows 25 volume attachments, for example m5.large.
3. Make sure that you have a single Node. Its allocatable volume attachments should be 25.
4. Create a dummy StatefulSet and scale it to 20 replicas (a rough manifest sketch in Go follows these steps). This will create 20 Pods and PVs (note that the PVs are created with the in-tree volume plugin).
5. Update to a K8s version with CSI enabled (for example an AWS cluster with K8s 1.18). After completing this step you should have 20 "migrated" PVs (PVs that were provisioned with the in-tree volume plugin).
6. Scale the StatefulSet to 26 replicas.
7. Make sure that the 26th replica (Pod web-25) fails to be scheduled (as expected) but cluster-autoscaler never triggers a scale up.
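For step 4, a rough sketch of the dummy StatefulSet, written with the client-go types of that era (the name web, the gp2 storage class, and the pause image are assumptions; any ReadWriteOnce, EBS-backed claim per replica works):

package repro

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// dummyStatefulSet builds a StatefulSet with one small PVC per replica, so 20
// replicas produce 20 in-tree (kubernetes.io/aws-ebs) PVs and the 26th replica
// exceeds the 25-attachment limit of an m5.large node.
func dummyStatefulSet(replicas int32) *appsv1.StatefulSet {
	labels := map[string]string{"app": "web"}
	storageClass := "gp2" // assumed in-tree-backed storage class
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "web"},
		Spec: appsv1.StatefulSetSpec{
			Replicas:    &replicas,
			ServiceName: "web",
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "web",
						Image: "registry.k8s.io/pause:3.9", // any lightweight image works
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "data",
							MountPath: "/data",
						}},
					}},
				},
			},
			VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
				ObjectMeta: metav1.ObjectMeta{Name: "data"},
				Spec: corev1.PersistentVolumeClaimSpec{
					AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
					StorageClassName: &storageClass,
					Resources: corev1.ResourceRequirements{
						Requests: corev1.ResourceList{
							corev1.ResourceStorage: resource.MustParse("1Gi"),
						},
					},
				},
			}},
		},
	}
}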
Logs of cluster-autoscaler:
Anything else we need to know?:
From what I managed to track in the autoscaler repository, the autoscaler creates a new scheduler framework and "simulates" whether the Pod is really unschedulable (most probably using default scheduling config).
The above log entry makes it clear that the Pod is unschedulable according to kube-scheduler (exceed max volume count), but the same Pod is schedulable according to cluster-autoscaler. The difference comes from the NodeVolumeLimits filter in the scheduler: kube-scheduler has the required CSI migration feature gates set and can correctly count migrated volumes, while cluster-autoscaler currently does not have any such config, hence it cannot count volumes with CSI enabled. (Note that kube-scheduler has the required CSI migration flags and CSI migration is enabled for AWS.)
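To make the difference concrete, here is a simplified illustration (not the vendored code verbatim) of the gate check that decides whether migrated EBS volumes get counted at all:

package illustration

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// countsMigratedEBSVolumes mirrors the feature-gate part of isCSIMigrationOn
// for the AWS in-tree plugin (the CSINode migration annotation check is
// omitted for brevity). kube-scheduler runs with these gates enabled and gets
// true, so the 25 attached volumes are counted and the 26th Pod is marked
// Unschedulable. cluster-autoscaler's embedded framework runs with the default
// (off) gates and gets false, so the migrated PVs are never translated to CSI
// volumes, the CSILimits filter sees no new CSI volumes, and the simulated
// scheduling succeeds. Hence no scale-up is triggered.
func countsMigratedEBSVolumes() bool {
	return utilfeature.DefaultFeatureGate.Enabled(features.CSIMigration) &&
		utilfeature.DefaultFeatureGate.Enabled(features.CSIMigrationAWS)
}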