Priority Expander sometimes picks all groups as equal #3956

Closed
cablespaghetti opened this issue Mar 19, 2021 · 6 comments
Labels: kind/bug, lifecycle/rotten

Comments

@cablespaghetti (Contributor) commented Mar 19, 2021:

Which component are you using?:
cluster-autoscaler with the Helm Chart

What version of the component are you using?:

Component version: v1.19.1

What k8s version are you using (kubectl version)?: v1.19.6-eks-49a6c0

kubectl version Output
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.6-eks-49a6c0", GitCommit:"49a6c0bf091506e7bafcdb1b142351b69363355a", GitTreeState:"clean", BuildDate:"2020-12-23T22:10:21Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
AWS EKS using Managed Node Groups

What did you expect to happen?:
With this configuration for priority expander:

data:
  priorities: |
    19:
    - .*
    20:
    - eks-88bc2311-7b73-c393-bc9f-eac7e9fda14d
    - eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6
    - eks-cebc2311-7b5f-3205-366e-9234f4ddf72a
    30:
    - eks-c0bc2311-7b89-23cc-8af4-af21ada330fd
    - eks-1cbc2311-7bde-88d7-5d71-9235bbdefa25
    - eks-b4bc2311-7b7a-daa7-bd9f-0aa2d1450f25
    40:
    - eks-b4bc2311-7b63-a611-77ac-8a38f308948a
    - eks-c4bc2311-7b67-0ee5-2ce6-b4ea06543894
    - eks-4abc2311-7b63-796d-f5e2-6a5c4bdec5e3
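
For reference, the priority expander reads these priorities from a ConfigMap. A minimal sketch of the full object, assuming the documented ConfigMap name cluster-autoscaler-priority-expander and the kube-system namespace used by the flags below, would look roughly like this (tiers abbreviated):

apiVersion: v1
kind: ConfigMap
metadata:
  # The priority expander looks up a ConfigMap with this name in the
  # namespace the cluster-autoscaler runs in (kube-system in this setup).
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |
    19:
    - .*
    20:
    - eks-88bc2311-7b73-c393-bc9f-eac7e9fda14d
    # ...tiers 30 and 40 as listed above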

And these flags:

    Command:
      ./cluster-autoscaler
      --cloud-provider=aws
      --namespace=kube-system
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/pulseplatform-prod
      --balance-similar-node-groups=true
      --expander=priority
      --logtostderr=true
      --max-node-provision-time=10m
      --scale-down-delay-after-delete=10m
      --skip-nodes-with-local-storage=false
      --skip-nodes-with-system-pods=false
      --stderrthreshold=info
      --v=4

I expected the groups with priority 30 to be treated as the highest priority, because the groups marked 40 are ARM nodes that are tainted (so not applicable to most of my workloads) and also have a fixed size, as you'll see in the logs.

What happened instead?:
Sometimes, although I can't figure out what triggers it, all 6 of the non-ARM ASGs are treated as the highest priority.

How to reproduce it (as minimally and precisely as possible):
It seems to be intermittent, but essentially: have 3 sets of ASGs (in this case created by managed node groups, although that shouldn't matter here) for ARM, Spot and On-Demand, and use my configuration above. Sometimes the Spot and On-Demand groups are considered to be the same priority for some reason.

Anything else we need to know?:
Here are the logs from an occurrence of the problem:

I0319 11:41:44.034420       1 aws_manager.go:269] Refreshed ASG list, next refresh after 2021-03-19 11:42:44.03440351 +0000 UTC m=+1234.584471812
I0319 11:41:44.105351       1 filter_out_schedulable.go:65] Filtering out schedulables
I0319 11:41:44.114934       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0319 11:41:44.116261       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0319 11:41:44.116339       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0319 11:41:44.116399       1 filter_out_schedulable.go:82] No schedulable pods
I0319 11:41:44.116454       1 klogx.go:86] Pod pulseplatform-production/football-data-read-service-66bf4c7c47-hd4mk is unschedulable
I0319 11:41:44.116769       1 scale_up.go:364] Upcoming 0 nodes
I0319 11:41:44.118827       1 scale_up.go:400] Skipping node group eks-4abc2311-7b63-796d-f5e2-6a5c4bdec5e3 - max size reached
I0319 11:41:44.128759       1 scale_up.go:400] Skipping node group eks-b4bc2311-7b63-a611-77ac-8a38f308948a - max size reached
I0319 11:41:44.130961       1 scale_up.go:400] Skipping node group eks-c4bc2311-7b67-0ee5-2ce6-b4ea06543894 - max size reached
I0319 11:41:44.132179       1 priority.go:118] Successfully loaded priority configuration from configmap.
I0319 11:41:44.132209       1 priority.go:167] priority expander: eks-1cbc2311-7bde-88d7-5d71-9235bbdefa25 chosen as the highest available
I0319 11:41:44.132218       1 priority.go:167] priority expander: eks-88bc2311-7b73-c393-bc9f-eac7e9fda14d chosen as the highest available
I0319 11:41:44.132225       1 priority.go:167] priority expander: eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6 chosen as the highest available
I0319 11:41:44.132232       1 priority.go:167] priority expander: eks-b4bc2311-7b7a-daa7-bd9f-0aa2d1450f25 chosen as the highest available
I0319 11:41:44.132239       1 priority.go:167] priority expander: eks-c0bc2311-7b89-23cc-8af4-af21ada330fd chosen as the highest available
I0319 11:41:44.132246       1 priority.go:167] priority expander: eks-cebc2311-7b5f-3205-366e-9234f4ddf72a chosen as the highest available
I0319 11:41:44.132262       1 scale_up.go:456] Best option to resize: eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6
I0319 11:41:44.132276       1 scale_up.go:460] Estimated 1 nodes needed in eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6
I0319 11:41:44.132497       1 scale_up.go:566] Splitting scale-up between 3 similar node groups: {eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6, eks-88bc2311-7b73-c393-bc9f-eac7e9fda14d, eks-cebc2311-7b5f-3205-366e-9234f4ddf72a}
I0319 11:41:44.132518       1 scale_up.go:574] Final scale-up plan: [{eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6 1->2 (max: 8)}]
I0319 11:41:44.132545       1 scale_up.go:663] Scale-up: setting group eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6 size to 2
I0319 11:41:44.132581       1 auto_scaling_groups.go:219] Setting asg eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6 size to 2

Mostly it picks the 3 groups I expect, but the configuration is identical between the runs where it picks all 6 and the runs where it correctly picks the 3 higher-priority groups and rules out the other 3 as lower priority.

cablespaghetti added the kind/bug label on Mar 19, 2021
@cablespaghetti (Contributor, Author) commented:

I think I have traced this to #3308 not being fixed on the 1.19 branch, because #3582 was merged to master but not to the 1.19 branch.

I will raise an additional PR.

@cablespaghetti (Contributor, Author) commented:

For anyone reading this: it only seems to be an issue when an ASG/node group name matches two priorities. Removing the .* entry from my configuration works around it if you don't want to compile your own autoscaler.
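
As a concrete sketch of that workaround, here is the same priorities block without the catch-all tier, so that each ASG name matches exactly one priority (note that with the catch-all gone, any ASG you still want the expander to consider needs an explicit entry):

data:
  priorities: |
    20:
    - eks-88bc2311-7b73-c393-bc9f-eac7e9fda14d
    - eks-a0bc2311-7b5e-7c9f-8556-411634f5a6d6
    - eks-cebc2311-7b5f-3205-366e-9234f4ddf72a
    30:
    - eks-c0bc2311-7b89-23cc-8af4-af21ada330fd
    - eks-1cbc2311-7bde-88d7-5d71-9235bbdefa25
    - eks-b4bc2311-7b7a-daa7-bd9f-0aa2d1450f25
    40:
    - eks-b4bc2311-7b63-a611-77ac-8a38f308948a
    - eks-c4bc2311-7b67-0ee5-2ce6-b4ea06543894
    - eks-4abc2311-7b63-796d-f5e2-6a5c4bdec5e3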

@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jun 18, 2021
@fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jul 18, 2021
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
