
--balance-similar-node-groups and scale from zero don't work together for EBS volumes #4305

Closed
abatilo opened this issue Sep 4, 2021 · 11 comments
Labels: area/cluster-autoscaler, kind/bug, lifecycle/rotten

Comments

@abatilo

abatilo commented Sep 4, 2021

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0

Helm chart version 9.10.6

What k8s version are you using (kubectl version)?:

⇒  kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:12:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
AWS EKS

What did you expect to happen?:
If I have a StatefulSet that requires its pod to run in the same AZ as its EBS volume, and there are zero nodes available in that AZ, I'd expect scale-from-zero and --balance-similar-node-groups to work together. The pod is pinned to the zone with this nodeSelector:

nodeSelector:
  "topology.kubernetes.io/zone": "us-west-2b"

What happened instead?:
cluster-autoscaler just keeps repeating that no node matches the nodeSelector applied to my StatefulSet:

cluster-autoscaler-aws-cluster-autoscaler-799cbf746-2x85l aws-cluster-autoscaler I0904 18:29:05.113163       1 scale_up.go:300] Pod postgres-postgresql-0 can't be scheduled on eks-terraform-20210804153521783000000012-70bd88e7-075e-6ffd-8481-0f1e662e565d, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
cluster-autoscaler-aws-cluster-autoscaler-799cbf746-2x85l aws-cluster-autoscaler I0904 18:29:05.113197       1 scale_up.go:449] No pod can fit to eks-terraform-20210804153521783000000012-70bd88e7-075e-6ffd-8481-0f1e662e565d
cluster-autoscaler-aws-cluster-autoscaler-799cbf746-2x85l aws-cluster-autoscaler I0904 18:29:05.113288       1 scheduler_binder.go:803] Could not get a CSINode object for the node "template-node-for-eks-terraform-20210904152444542800000001-40bdd8b4-bd04-8214-bed5-77bee0376fd1-1708629525465421363": csinode.storage.k8s.io "template-node-for-eks-terraform-20210904152444542800000001-40bdd8b4-bd04-8214-bed5-77bee0376fd1-1708629525465421363" not found
cluster-autoscaler-aws-cluster-autoscaler-799cbf746-2x85l aws-cluster-autoscaler I0904 18:29:05.113324       1 scheduler_binder.go:823] PersistentVolume "pvc-9769973e-cdd6-425f-9fa7-c325c178d291", Node "template-node-for-eks-terraform-20210904152444542800000001-40bdd8b4-bd04-8214-bed5-77bee0376fd1-1708629525465421363" mismatch for Pod "chat/postgres-postgresql-0": no matching NodeSelectorTerms
abatilo added the kind/bug label on Sep 4, 2021
@ArchiFleKs

I think you are using the aws-ebs-csi driver: #3845

@abatilo
Author

abatilo commented Sep 7, 2021

I think it's similar but I'm using the following label instead: https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone

I just happened to include the CSINode log in my log snippet

What's interesting is that when I did some more testing on this, scale from zero worked if the StatefulSet pod was brand new. So if it was the first time deploying the StatefulSet, the scale-up would work. But if I had a working node in the EBS volume's AZ and then manually did something like a kubectl drain and deleted the node, cluster-autoscaler would never bring it back. I'm not sure I fully understand what was going on.

@bpineau
Contributor

bpineau commented Sep 10, 2021

Stable topology labels were recently added to cloudprovider/aws's template node builder. 1.22 should work, but older versions infer a failure-domain.beta.kubernetes.io/zone label from the ASG's zone instead; you can pass the new topology label with an ASG tag like this:

k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone: us-west-2b
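
For instance, if the node group happened to be managed with eksctl (just an illustration; the reporter's groups are Terraform-managed, and any tooling that puts the tag on the Auto Scaling group works), the nodegroup config could carry the tag like this:

# Illustrative eksctl nodegroup snippet; the tag key/value pair is what matters,
# however it ends up on the ASG.
nodeGroups:
  - name: ng-us-west-2b
    availabilityZones: ["us-west-2b"]
    minSize: 0
    tags:
      k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone: "us-west-2b"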

@ArchiFleKs

I think it's similar but I'm using the following label instead: https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone

I just happened to include the CSINode log in my log snippet

What's interesting is that when I did some more testing on this, scale from zero worked if the StatefulSet pod was brand new. So if it was the first time deploying the StatefulSet, the scale-up would work. But if I had a working node in the EBS volume's AZ and then manually did something like a kubectl drain and deleted the node, cluster-autoscaler would never bring it back. I'm not sure I fully understand what was going on.

Yes, I understand what you mean; I noticed the same issue, with cluster-autoscaler complaining about a missing CSINode.

My understanding is that it tries to read the CSINode object for the node template via the API, which does not exist because the node group is scaled to zero and the node template is not a real node. I didn't have time to do more debugging; if you have more information, I'd be happy to hear it.
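
For reference, on a real node the EBS CSI driver registers a CSINode object along these lines (node name and instance ID are made up); that is the object the volume binder cannot find for the synthetic template node:

apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: ip-10-0-42-17.us-west-2.compute.internal   # illustrative
spec:
  drivers:
    - name: ebs.csi.aws.com
      nodeID: i-0123456789abcdef0                  # illustrative
      topologyKeys:
        - topology.ebs.csi.aws.com/zone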

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Dec 14, 2021
@Noksa

Noksa commented Dec 14, 2021

any updates?

@carlosjgp

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Dec 24, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Mar 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Apr 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
