CA estimator and PodTopologySpread #3721

Closed
williamlord opened this issue Nov 25, 2020 · 6 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@williamlord

I just wanted to confirm that this is expected behaviour.

I have an overprovisioning deployment of, let's say, 200m with 1 core per replica, giving roughly 20% overprovisioning. If I use an 8-core node I'm going to get 8 pods per node. This overprovisioning deployment also has pod topology spread constraints with a max skew of 1 across zones.
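Roughly, the deployment looks like this (the names, image, replica count and priority class below are placeholders rather than my exact manifest):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning            # placeholder name
spec:
  replicas: 200                     # illustrative; sized to give roughly 20% headroom
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning   # low-priority class so real workloads preempt these pods
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.2          # placeholder image
        resources:
          requests:
            cpu: "1"                # 1 core per replica -> ~8 pods per 8-core node
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: overprovisioning
```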

These overprovisioning pods in one zone then become unschedulable, and we need to schedule 100 of them in a specific zone. I was expecting the CA to then scale up this zone (one ASG per zone with balance-similar-node-groups), adding approximately 4 instances (100 * 200m / 8-core node). However, what actually happens is that the estimator estimates 100 nodes, which then gradually get scaled down. The least-waste expander also rightly reports a waste close to 100%.

I'm not sure whether #3429 changed the behaviour or whether it was always like this, as I only recently introduced the pod topology spread constraints.

My current workaround is to request the node's full core count per replica, so there is a single overprovisioning pod per node with proportionally higher requests to get the same overprovisioning.
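In other words, something along these lines in the pod template (values illustrative, assuming 8-core nodes):

```yaml
# Workaround sketch: one overprovisioning pod per node by requesting
# (almost) a full node's CPU, so maxSkew: 1 effectively means at most
# one node's worth of imbalance per zone.
containers:
- name: pause
  image: k8s.gcr.io/pause:3.2   # placeholder image
  resources:
    requests:
      cpu: "7"                  # just under an 8-core node's allocatable CPU (illustrative)
```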

@MaciekPytel
Contributor

Can you specify which CA version you're using and what k8s version is running in the cluster? The behavior is likely to differ between versions for this case. Also, just to confirm: by pod topology spreading you mean using whenUnsatisfiable: DoNotSchedule (ScheduleAnyway constraints should be completely ignored by CA)?

Assuming it's 1.19 - I wouldn't expect your use case to work, but I'd expect it to fail the other way round (with CA only being able to add 1 node). CA does binpacking for each ASG independently, meaning the estimator only considers a scenario where all nodes are added in the same zone (note: it's the same with a multi-AZ ASG; binpacking will randomly choose one zone).
Balancing similar node groups happens after estimating: it checks whether the same set of pods can schedule in multiple ASGs and, if so, splits the number of nodes calculated for a single zone across multiple zones. This just doesn't work for zonal topology spreading (or anti-affinity).

Based on the above, I'd expect CA to only be able to schedule a single pod in its binpacking (maxSkew would not allow more than 1 pod to schedule given that we're binpacking into a single zone). That should result in scale-up of one node at a time. My guess (very much a guess) is that what's happening in your case is that CA somehow doesn't see nodes already added in binpacking (some sort of bug in updating ClusterSnapshot during binpacking?) or it thinks each node in binpacking is somehow in a different zone (missing zonal label? that may be possible if there is a version mismatch between CA and the cluster, as I think the zonal labels were recently changed). Or you're just using CA <1.18, which doesn't have ClusterSnapshot at all :)
Any of those would lead to CA only being able to schedule 1 pod per node due to the maxSkew constraint, but it would be able to schedule them all across multiple nodes, leading to exactly the behavior you observe. By using 1 pod per node you're successfully working around the problem.

The bad news is that once whatever bug is happening here gets fixed, I think you'll start seeing 1-node-at-a-time scale-up, both with 1 pod per node and with multiple pods per node, and there is little we can do to fix that problem. CA as-is can't work correctly with small values of maxSkew (I'd recommend a maxSkew on the order of 20% of the total number of replicas).
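For example, for ~100 replicas, something like this (illustrative values, the selector is a placeholder):

```yaml
topologySpreadConstraints:
- maxSkew: 20                       # ~20% of 100 replicas; a looser skew lets the
                                    # binpacking place more than one node's worth of
                                    # pods into a single zone per scale-up
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: overprovisioning         # placeholder selector
```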

@williamlord
Author

Sorry, yes, it's Kubernetes v1.18.8 (EKS) and CA v1.18.2. Yes, whenUnsatisfiable: DoNotSchedule. Currently I have both the stable and beta topology labels on the nodes, and it does correctly pick the zone ASG.

Yeah, this makes sense: one zone ASG at a time to adhere to the constraint.

I'll update to CA v1.18.3 for a start; I thought I was actually on the latest with #3429, but I'm not. I'll also iterate on the maxSkew; it only takes an integer, so I'll have to go for a conservatively high number to be able to keep scaling up as the cluster grows.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 23, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 25, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
