CA estimator and PodTopologySpread #3721
Comments
Can you specify which CA version you're using and what k8s version is running in the cluster? The behavior is likely to differ between versions for this case. Also, just to confirm: by pod topology spreading you mean using whenUnsatisfiable: DoNotSchedule (ScheduleAnyway constraints should be completely ignored by CA)?

Assuming it's 1.19, I wouldn't expect your use-case to work, but I'd expect it to fail the other way round (with CA only being able to add 1 node). CA does binpacking for each ASG independently, meaning the estimator only considers a scenario where all nodes are added in the same zone (note: it's the same with a multi-AZ ASG; binpacking will randomly choose one zone). Based on that, I'd expect CA to only be able to schedule a single pod in its binpacking (maxSkew would not allow more than 1 pod to schedule given that we're binpacking into a single zone). That should result in scale-up of one node at a time.

My guess (very much a guess) is that what's happening in your case is that CA somehow doesn't see nodes already added in binpacking (some sort of bug in updating ClusterSnapshot during binpacking?), or it thinks each node in binpacking is somehow in a different zone (missing zonal label? that may be possible if there is a version mismatch between CA and the cluster, as I think the zonal labels were recently changed). Or you're just using CA <1.18, which doesn't have ClusterSnapshot at all :)

Bad news is that once whatever bug is happening is fixed, I think you'll start seeing one-node-at-a-time scale-up, both with 1 pod / node and with multiple pods / node, and there is little we can do to fix that problem. CA as-is can't work correctly with small values of maxSkew (I'd recommend using a maxSkew on the order of 20% of the total number of replicas).
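For reference, a minimal sketch of the two kinds of constraint discussed above (the label selector and maxSkew values are illustrative, not taken from this issue); per the recommendation, the hard zone constraint uses a maxSkew of roughly 20% of the replica count rather than 1:

```yaml
# Pod spec fragment (illustrative values, not from this issue).
topologySpreadConstraints:
# Hard constraint: the scheduler refuses placements that would exceed maxSkew,
# so CA has to respect it while binpacking.
- maxSkew: 20                                  # ~20% of ~100 replicas, per the advice above
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: overprovisioning                    # hypothetical label
# Soft constraint: best effort only; CA ignores these entirely.
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: overprovisioning
```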
Sorry, yes: it's Kubernetes v1.18.8 (EKS) and CA v1.18.2. Yes, whenUnsatisfiable: DoNotSchedule. Currently I have both the stable and beta topology labels on the nodes, and it does correctly pick the zone ASG. Yeah, this makes sense: one zone ASG at a time to adhere to the constraint. I'll update to CA v1.18.3 for a start; I thought I was actually on the latest with #3429 but I'm not. I'll iterate on the maxSkew; it only takes an integer, so I'll have to go for a conservatively high number to be able to keep scaling up as the cluster grows.
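For reference, the stable and beta zone labels mentioned above would both appear on each node roughly like this (the node name and zone value are hypothetical):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.eu-west-1.compute.internal        # hypothetical EKS node name
  labels:
    # Stable label, used by newer scheduler and CA versions.
    topology.kubernetes.io/zone: eu-west-1a
    # Older beta label, still present for compatibility on 1.18 clusters.
    failure-domain.beta.kubernetes.io/zone: eu-west-1a
```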
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Rotten issues close after 30d of inactivity. Send feedback to sig-contributor-experience at kubernetes/community.
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I just wanted to confirm that this is expected behaviour.
I have an overprovisioning deployment whose pods each request, let's say, 200m, scaled at 1 core per replica, giving roughly 20% overprovisioning. With an 8-core node I'm going to get 8 pods per node. This overprovisioning deployment also has pod topology spread constraints with a max skew of 1 across zones.
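For context, a minimal sketch of an overprovisioning deployment along these lines (the name, image, priority class, and replica count are assumptions; the 200m request and the maxSkew: 1 zone constraint follow the description above):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning                       # hypothetical name
spec:
  replicas: 100                                # roughly one replica per cluster core (illustrative)
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning      # assumed low-priority class so real workloads preempt these pods
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.2            # placeholder container commonly used for overprovisioning
        resources:
          requests:
            cpu: 200m                          # ~20% of each core, as described above
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: overprovisioning
```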
These overprovisioning pods in one zone then become unschedulable, and we need to schedule 100 of them in a specific zone. I was expecting CA to then scale up this zone (one ASG per zone, with balance-similar-node-groups), adding approximately 4 instances (100 * 200m / 8-core node); however, what actually happens is that the estimator estimates 100 nodes, which then gradually get scaled down. The least-waste expander also rightly reports waste close to 100%.
I'm not sure if #3429 changed the behaviour or whether it was always like this, as I only recently introduced the pod topology spread.
The current workaround is to use the node's core count as the cores per replica, so there is a single overprovisioning pod per node with proportionally higher requests, giving the same overall overprovisioning.
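A sketch of what that workaround could look like in the pod template, keeping the same ~20% reservation per node (the exact numbers are assumptions based on the 8-core example above):

```yaml
# Pod template resources fragment: one overprovisioning pod per 8-core node,
# requesting 8 x 200m = 1600m so the total reservation stays ~20%.
# The replica count would be reduced proportionally (e.g. one replica per node
# instead of one per core).
resources:
  requests:
    cpu: 1600m
```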