Priority Expander sometimes picks all groups as equal #3956

Closed
cablespaghetti opened this issue Mar 19, 2021 · 6 comments
Labels: kind/bug, lifecycle/rotten

Comments

@cablespaghetti (Contributor) commented Mar 19, 2021:

Which component are you using?:
cluster-autoscaler with the Helm Chart

What version of the component are you using?:

Component version: v1.19.1

What k8s version are you using (kubectl version)?: v1.19.6-eks-49a6c0

kubectl version Output
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.6-eks-49a6c0", GitCommit:"49a6c0bf091506e7bafcdb1b142351b69363355a", GitTreeState:"clean", BuildDate:"2020-12-23T22:10:21Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
AWS EKS using Managed Node Groups

What did you expect to happen?:
With this configuration for priority expander:

data:
  priorities: |
    19:
    - .*
    20:
    - eks-88bc2311-7b73-c393-bc9f-eac7e9fda14d
    - eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6
    - eks-cebc2311-7b5f-3205-366e-9234f4ddf72a
    30:
    - eks-c0bc2311-7b89-23cc-8af4-af21ada330fd
    - eks-1cbc2311-7bde-88d7-5d71-9235bbdefa25
    - eks-b4bc2311-7b7a-daa7-bd9f-0aa2d1450f25
    40:
    - eks-b4bc2311-7b63-a611-77ac-8a38f308948a
    - eks-c4bc2311-7b67-0ee5-2ce6-b4ea06543894
    - eks-4abc2311-7b63-796d-f5e2-6a5c4bdec5e3
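
For reference, the priority expander reads these priorities from a ConfigMap. A minimal sketch of the full object, assuming the documented ConfigMap name cluster-autoscaler-priority-expander and the kube-system namespace used by the flags below, would look roughly like this (tiers abbreviated):

apiVersion: v1
kind: ConfigMap
metadata:
  # The priority expander looks up a ConfigMap with this name in the
  # namespace the cluster-autoscaler runs in (kube-system in this setup).
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |
    19:
    - .*
    20:
    - eks-88bc2311-7b73-c393-bc9f-eac7e9fda14d
    # ...tiers 30 and 40 as listed above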

And these flags:

    Command:
      ./cluster-autoscaler
      --cloud-provider=aws
      --namespace=kube-system
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/pulseplatform-prod
      --balance-similar-node-groups=true
      --expander=priority
      --logtostderr=true
      --max-node-provision-time=10m
      --scale-down-delay-after-delete=10m
      --skip-nodes-with-local-storage=false
      --skip-nodes-with-system-pods=false
      --stderrthreshold=info
      --v=4

I expected the groups with priority 30 to be treated as the highest priority, because the groups marked 40 are ARM nodes that are tainted (so not applicable to most of my workloads) and also have a fixed size, as you'll see in the logs.

What happened instead?:
Sometimes, although I can't figure out what triggers it, all 6 of the non-ARM ASGs are treated as the highest priority.

How to reproduce it (as minimally and precisely as possible):
It seems to be intermittent, but essentially: have 3 sets of ASGs (in this case created by managed node groups, although that shouldn't matter here) for ARM, Spot and On-Demand, and use my configuration above. Sometimes the Spot and On-Demand groups are considered to be the same priority for some reason.

Anything else we need to know?:
Here are the logs from an occurrence of the problem:

I0319 11:41:44.034420       1 aws_manager.go:269] Refreshed ASG list, next refresh after 2021-03-19 11:42:44.03440351 +0000 UTC m=+1234.584471812
I0319 11:41:44.105351       1 filter_out_schedulable.go:65] Filtering out schedulables
I0319 11:41:44.114934       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0319 11:41:44.116261       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0319 11:41:44.116339       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0319 11:41:44.116399       1 filter_out_schedulable.go:82] No schedulable pods
I0319 11:41:44.116454       1 klogx.go:86] Pod pulseplatform-production/football-data-read-service-66bf4c7c47-hd4mk is unschedulable
I0319 11:41:44.116769       1 scale_up.go:364] Upcoming 0 nodes
I0319 11:41:44.118827       1 scale_up.go:400] Skipping node group eks-4abc2311-7b63-796d-f5e2-6a5c4bdec5e3 - max size reached
I0319 11:41:44.128759       1 scale_up.go:400] Skipping node group eks-b4bc2311-7b63-a611-77ac-8a38f308948a - max size reached
I0319 11:41:44.130961       1 scale_up.go:400] Skipping node group eks-c4bc2311-7b67-0ee5-2ce6-b4ea06543894 - max size reached
I0319 11:41:44.132179       1 priority.go:118] Successfully loaded priority configuration from configmap.
I0319 11:41:44.132209       1 priority.go:167] priority expander: eks-1cbc2311-7bde-88d7-5d71-9235bbdefa25 chosen as the highest available
I0319 11:41:44.132218       1 priority.go:167] priority expander: eks-88bc2311-7b73-c393-bc9f-eac7e9fda14d chosen as the highest available
I0319 11:41:44.132225       1 priority.go:167] priority expander: eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6 chosen as the highest available
I0319 11:41:44.132232       1 priority.go:167] priority expander: eks-b4bc2311-7b7a-daa7-bd9f-0aa2d1450f25 chosen as the highest available
I0319 11:41:44.132239       1 priority.go:167] priority expander: eks-c0bc2311-7b89-23cc-8af4-af21ada330fd chosen as the highest available
I0319 11:41:44.132246       1 priority.go:167] priority expander: eks-cebc2311-7b5f-3205-366e-9234f4ddf72a chosen as the highest available
I0319 11:41:44.132262       1 scale_up.go:456] Best option to resize: eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6
I0319 11:41:44.132276       1 scale_up.go:460] Estimated 1 nodes needed in eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6
I0319 11:41:44.132497       1 scale_up.go:566] Splitting scale-up between 3 similar node groups: {eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6, eks-88bc2311-7b73-c393-bc9f-eac7e9fda14d, eks-cebc2311-7b5f-3205-366e-9234f4ddf72a}
I0319 11:41:44.132518       1 scale_up.go:574] Final scale-up plan: [{eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6 1->2 (max: 8)}]
I0319 11:41:44.132545       1 scale_up.go:663] Scale-up: setting group eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6 size to 2
I0319 11:41:44.132581       1 auto_scaling_groups.go:219] Setting asg eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6 size to 2

Mostly it picks the 3 groups I expect, but the configuration is identical between the runs where it picks all 6 and the runs where it correctly picks the 3 higher-priority groups and rules out the other 3 as lower priority.

cablespaghetti added the kind/bug label on Mar 19, 2021
@cablespaghetti (Contributor, Author) commented:

I think I have traced this to #3308 not being fixed on the 1.19 branch, because #3582 was merged to master but not to the 1.19 branch.

I will raise an additional PR.

@cablespaghetti (Contributor, Author) commented:

For anyone reading this: it only seems to be an issue when an ASG/node group name matches two priorities. Removing the .* entry from my configuration works around it if you don't want to compile your own autoscaler.
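
As a concrete sketch of that workaround, here is the same priorities block without the catch-all tier, so that each ASG name matches exactly one priority (note that with the catch-all gone, any ASG you still want the expander to consider needs an explicit entry):

data:
  priorities: |
    20:
    - eks-88bc2311-7b73-c393-bc9f-eac7e9fda14d
    - eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6
    - eks-cebc2311-7b5f-3205-366e-9234f4ddf72a
    30:
    - eks-c0bc2311-7b89-23cc-8af4-af21ada330fd
    - eks-1cbc2311-7bde-88d7-5d71-9235bbdefa25
    - eks-b4bc2311-7b7a-daa7-bd9f-0aa2d1450f25
    40:
    - eks-b4bc2311-7b63-a611-77ac-8a38f308948a
    - eks-c4bc2311-7b67-0ee5-2ce6-b4ea06543894
    - eks-4abc2311-7b63-796d-f5e2-6a5c4bdec5e3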

@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jun 18, 2021
@fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jul 18, 2021
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
