
--balance-similar-node-groups and scale from zero don't work together for EBS volumes #4305

Closed
abatilo opened this issue Sep 4, 2021 · 11 comments
Labels: area/cluster-autoscaler, kind/bug, lifecycle/rotten

Comments

@abatilo

abatilo commented Sep 4, 2021

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0

Helm chart version 9.10.6

What k8s version are you using (kubectl version)?:

⇒  kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:12:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
AWS EKS

What did you expect to happen?:
If I have a StatefulSet that requires its pod to run in the same AZ as its EBS volume, and there are zero nodes available in that AZ, I'd expect scale-from-zero and --balance-similar-node-groups to work together. The pod is pinned to the zone with this nodeSelector:

nodeSelector:
  "topology.kubernetes.io/zone": "us-west-2b"

What happened instead?:
cluster-autoscaler just keeps repeating that no node matches the nodeSelector applied to my StatefulSet:

cluster-autoscaler-aws-cluster-autoscaler-799cbf746-2x85l aws-cluster-autoscaler I0904 18:29:05.113163       1 scale_up.go:300] Pod postgres-postgresql-0 can't be scheduled on eks-terraform-20210804153521783000000012-70bd88e7-075e-6ffd-8481-0f1e662e565d, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
cluster-autoscaler-aws-cluster-autoscaler-799cbf746-2x85l aws-cluster-autoscaler I0904 18:29:05.113197       1 scale_up.go:449] No pod can fit to eks-terraform-20210804153521783000000012-70bd88e7-075e-6ffd-8481-0f1e662e565d
cluster-autoscaler-aws-cluster-autoscaler-799cbf746-2x85l aws-cluster-autoscaler I0904 18:29:05.113288       1 scheduler_binder.go:803] Could not get a CSINode object for the node "template-node-for-eks-terraform-20210904152444542800000001-40bdd8b4-bd04-8214-bed5-77bee0376fd1-1708629525465421363": csinode.storage.k8s.io "template-node-for-eks-terraform-20210904152444542800000001-40bdd8b4-bd04-8214-bed5-77bee0376fd1-1708629525465421363" not found
cluster-autoscaler-aws-cluster-autoscaler-799cbf746-2x85l aws-cluster-autoscaler I0904 18:29:05.113324       1 scheduler_binder.go:823] PersistentVolume "pvc-9769973e-cdd6-425f-9fa7-c325c178d291", Node "template-node-for-eks-terraform-20210904152444542800000001-40bdd8b4-bd04-8214-bed5-77bee0376fd1-1708629525465421363" mismatch for Pod "chat/postgres-postgresql-0": no matching NodeSelectorTerms
abatilo added the kind/bug label on Sep 4, 2021
@ArchiFleKs

I think you are using the aws-ebs-csi driver: #3845

@abatilo
Author

abatilo commented Sep 7, 2021

I think it's similar but I'm using the following label instead: https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone

I just happened to include the CSINode log in my log snippet

What's interesting is that when I did some more testing on this, scale from zero worked if the StatefulSet pod was brand new. So if it was the first time deploying the StatefulSet, the scale-up would work. But if I had a working node in the EBS volume's AZ and then manually did something like a kubectl drain and deleted the node, cluster-autoscaler would never bring it back. I'm not sure I fully understand what was going on.

@bpineau
Contributor

bpineau commented Sep 10, 2021

Stable topology labels were recently added to cloudprovider/aws's template node builder. 1.22 should work, but older versions infer a failure-domain.beta.kubernetes.io/zone label from the ASG's zone instead; you can pass the new topology label with an ASG tag like this:

k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone: us-west-2b
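
For instance, if the node group happened to be managed with eksctl (just an illustration; the reporter's groups are Terraform-managed, and any tooling that puts the tag on the Auto Scaling group works), the nodegroup config could carry the tag like this:

# Illustrative eksctl nodegroup snippet; the tag key/value pair is what matters,
# however it ends up on the ASG.
nodeGroups:
  - name: ng-us-west-2b
    availabilityZones: ["us-west-2b"]
    minSize: 0
    tags:
      k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone: "us-west-2b"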

@ArchiFleKs

I think it's similar but I'm using the following label instead: https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone

I just happened to include the CSINode log in my log snippet

What's interesting is that when I did some more testing on this, scale from zero worked if the StatefulSet pod was brand new. So if it was the first time deploying the StatefulSet, the scale-up would work. But if I had a working node in the EBS volume's AZ and then manually did something like a kubectl drain and deleted the node, cluster-autoscaler would never bring it back. I'm not sure I fully understand what was going on.

Yes, I understand what you mean; I noticed the same issue, with cluster-autoscaler complaining about a missing CSINode.

My understanding is that it tries to read the CSINode object for the node template via the API, which does not exist because the node group is scaled to zero and the node template is not a real node. I didn't have time to do more debugging; if you have more information, I'd be happy to hear it.
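
For reference, on a real node the EBS CSI driver registers a CSINode object along these lines (node name and instance ID are made up); that is the object the volume binder cannot find for the synthetic template node:

apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: ip-10-0-42-17.us-west-2.compute.internal   # illustrative
spec:
  drivers:
    - name: ebs.csi.aws.com
      nodeID: i-0123456789abcdef0                  # illustrative
      topologyKeys:
        - topology.ebs.csi.aws.com/zone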

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Dec 14, 2021
@Noksa

Noksa commented Dec 14, 2021

any updates?

@carlosjgp

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Dec 24, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Mar 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Apr 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
