CA estimator and PodTopologySpread #3721
Comments
Can you specify which CA version you're using and what k8s version is running in the cluster? The behavior is likely to differ between versions for this case. Also, just to confirm: by pod topology spreading you mean using whenUnsatisfiable: DoNotSchedule (ScheduleAnyway constraints should be completely ignored by CA)?

Assuming it's 1.19, I wouldn't expect your use-case to work, but I'd expect it to fail the other way round (with CA only being able to add 1 node). CA does binpacking for each ASG independently, meaning the estimator only considers a scenario where all nodes are added in the same zone (note: it's the same with a multi-AZ ASG; binpacking will randomly choose one zone). Based on that, I'd expect CA to only be able to schedule a single pod in its binpacking (maxSkew would not allow more than 1 pod to schedule given that we're binpacking into a single zone). That should result in scale-up of one node at a time.

My guess (very much a guess) is that what's happening in your case is that CA somehow doesn't see nodes already added in binpacking (some sort of bug in updating ClusterSnapshot during binpacking?), or it thinks each node in binpacking is somehow in a different zone (missing zonal label? that may be possible if there is a version mismatch between CA and the cluster, as I think the zonal labels were recently changed). Or you're just using CA <1.18, which doesn't have ClusterSnapshot at all :)

Bad news is that once whatever bug is happening is fixed, I think you'll start seeing one-node-at-a-time scale-up, both with 1 pod / node and with multiple pods / node, and there is little we can do to fix that problem. CA as-is can't work correctly with small values of maxSkew (I'd recommend using a maxSkew on the order of 20% of the total number of replicas).
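For reference, a minimal sketch of the two kinds of constraint discussed above (the label selector and maxSkew values are illustrative, not taken from this issue); per the recommendation, the hard zone constraint uses a maxSkew of roughly 20% of the replica count rather than 1:

```yaml
# Pod spec fragment (illustrative values, not from this issue).
topologySpreadConstraints:
# Hard constraint: the scheduler refuses placements that would exceed maxSkew,
# so CA has to respect it while binpacking.
- maxSkew: 20                                  # ~20% of ~100 replicas, per the advice above
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: overprovisioning                    # hypothetical label
# Soft constraint: best effort only; CA ignores these entirely.
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: overprovisioning
```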
Sorry, yes: it's Kubernetes v1.18.8 (EKS) and CA v1.18.2. Yes, whenUnsatisfiable: DoNotSchedule. Currently I have both the stable and beta topology labels on the nodes, and it does correctly pick the zone ASG. Yeah, this makes sense: one zone ASG at a time to adhere to the constraint. I'll update to CA v1.18.3 for a start; I thought I was actually on the latest with #3429 but I'm not. I'll iterate on the maxSkew; it only takes an integer, so I'll have to go for a conservatively high number to be able to keep scaling up as the cluster grows.
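For reference, the stable and beta zone labels mentioned above would both appear on each node roughly like this (the node name and zone value are hypothetical):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.eu-west-1.compute.internal        # hypothetical EKS node name
  labels:
    # Stable label, used by newer scheduler and CA versions.
    topology.kubernetes.io/zone: eu-west-1a
    # Older beta label, still present for compatibility on 1.18 clusters.
    failure-domain.beta.kubernetes.io/zone: eu-west-1a
```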
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Rotten issues close after 30d of inactivity. Send feedback to sig-contributor-experience at kubernetes/community.
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I just wanted to confirm that this is expected behaviour.
I have an overprovisioning deployment whose pods each request, let's say, 200m, scaled at 1 core per replica, giving roughly 20% overprovisioning. With an 8-core node I'm going to get 8 pods per node. This overprovisioning deployment also has pod topology spread constraints with a max skew of 1 across zones.
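For context, a minimal sketch of an overprovisioning deployment along these lines (the name, image, priority class, and replica count are assumptions; the 200m request and the maxSkew: 1 zone constraint follow the description above):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning                       # hypothetical name
spec:
  replicas: 100                                # roughly one replica per cluster core (illustrative)
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning      # assumed low-priority class so real workloads preempt these pods
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.2            # placeholder container commonly used for overprovisioning
        resources:
          requests:
            cpu: 200m                          # ~20% of each core, as described above
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: overprovisioning
```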
These overprovisioning pods in one zone then become unschedulable, and we need to schedule 100 of them in a specific zone. I was expecting CA to then scale up this zone (one ASG per zone, with balance-similar-node-groups), adding approximately 4 instances (100 * 200m / 8-core node); however, what actually happens is that the estimator estimates 100 nodes, which then gradually get scaled down. The least-waste expander also rightly reports waste close to 100%.
I'm not sure if #3429 changed the behaviour or whether it was always like this, as I only recently introduced the pod topology spread.
The current workaround is to use the node's core count as the cores per replica, so there is a single overprovisioning pod per node with proportionally higher requests, giving the same overall overprovisioning.
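A sketch of what that workaround could look like in the pod template, keeping the same ~20% reservation per node (the exact numbers are assumptions based on the 8-core example above):

```yaml
# Pod template resources fragment: one overprovisioning pod per 8-core node,
# requesting 8 x 200m = 1600m so the total reservation stays ~20%.
# The replica count would be reduced proportionally (e.g. one replica per node
# instead of one per core).
resources:
  requests:
    cpu: 1600m
```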