Nodegroup min/max should respect both explicit setting and ASG api return #1559
Comments
I may not be 100% up-to-date on changes to the AWS cloud provider, but IIRC by design manual configuration (passed via a CA flag) overrides the ASG settings. The flag can be skipped entirely by using auto-discovery mode.
Yes, if the min and max are defined manually, those values will be used and not the actual ASG values. Invalid values could probably be sanitized during the reconcile loop when getting the ASGs from AWS, but I'm not sure what the expected behavior should be in those cases. Auto-correct and log, or log and crash? What is your use case for having defined values but still wanting to rely on the ASG boundaries?
I think the defined values should also respect the ASG boundaries. Otherwise it doesn't make sense and makes CA behave abnormally (scaling a node group up or down can't take effect). The range CA manages should be inside the range the ASG manages. What's the reason to allow invalid inputs?
Yes, but what would you expect to happen in this case? The input might also be only temporarily invalid, as the CA can't control whether an ASG is updated via the console or the API. It sounds like the autoscaler could log a warning in those cases and auto-correct the values temporarily, but that is more of an optimisation that avoids failed scale attempts without an API call, and it shouldn't have any other negative effects. Do you actually see any issues caused by this?
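To make the "log a warning and auto-correct temporarily" idea concrete, here is a minimal sketch of what such a per-loop correction could look like. It is not Cluster Autoscaler code: the function name `effective_bounds`, the logger name, and the choice to clamp (rather than reject) are all assumptions made for illustration.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("ca-bounds-sketch")


def effective_bounds(cfg_min, cfg_max, asg_min, asg_max):
    """Clamp a manually configured range into the range the ASG reports.

    Returns (min, max, corrected). The configured values themselves are
    not modified, so a later ASG update can make them valid again.
    Assumes cfg_min <= cfg_max and asg_min <= asg_max.
    """
    eff_min = min(max(cfg_min, asg_min), asg_max)
    eff_max = max(min(cfg_max, asg_max), asg_min)
    corrected = (eff_min, eff_max) != (cfg_min, cfg_max)
    if corrected:
        logger.warning(
            "configured range %d:%d is outside ASG range %d:%d, using %d:%d",
            cfg_min, cfg_max, asg_min, asg_max, eff_min, eff_max,
        )
    return eff_min, eff_max, corrected


if __name__ == "__main__":
    # ASG is 3/3; the configured range 2:5 is wider on both sides.
    print(effective_bounds(2, 5, 3, 3))
```

Because the correction is recomputed from the untouched configured values on every call, a temporarily invalid setting stops being corrected as soon as the ASG boundaries change to accommodate it.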
@johanneswuerbach True. It could be temporarily invalid, and that's also my concern. What I expect is that CA respects both the node group config and the ASG boundaries in every loop. For example, if the defined min/max is invalid in this round, it shouldn't waste resources trying to scale up or down. Otherwise it causes unnecessary errors on both the CA side and the cloud provider side. I found this issue because I carelessly set a minSize larger than my current node group size. CA then refused to scale down because minSize had been reached. One question: what's the best practice for using CA? Should people define min/max in CA and adjust the ASG boundaries to match? Or give the ASG wide boundaries and reset min/max in CA as needed?
At my company we rely purely on ASG auto-discovery to avoid having multiple, potentially different, configurations for the min/max values. If we want to change the values, we generally never touch the CA but modify the ASG values directly, which is reflected in CA after the next sync (max. 60s). Afaik this manual specification option comes from the GCE origin of the autoscaler, as Google Managed Instance Groups (MIGs) don't have a min/max value and therefore the CA values are the only way to configure boundaries. Also, auto-discovery was only added later, while manual configuration was always in place.
Thanks for the story. I like auto-discovery much better, and it doesn't come with this confusion.
Yeah, it was historically the only way to pass the limits. GKE doesn't use it either, but I believe it also makes it easier to implement support for a new cloud provider (which may not necessarily support any kind of scaling groups). As for the overriding use case, IIRC ASGs prevent the user from manually setting the size outside of the min/max autoscaling range. If true, users may prefer to use one set of limits for autoscaling and a different one in the cloud provider, so that they can adjust the size manually in case of emergency (e.g. a truly unexpected spike).
As the documentation change was merged, I'm happy with this for now and will close this issue.
Cluster Autoscaler supports an explicit node group min/max setting from the user, like the following,
The problem is that the user can try an arbitrary min/max combination here. I think we should also consider what the ASG API returns at the time the ASG is registered. But that brings another problem: once the ASG min/max is updated on the cloud provider side, the explicitly configured values might be wiped. Could someone help verify whether this is by design or whether we should fix it? If the change is reasonable, I can submit a PR to address this issue.
Let's assume the ASG min/max setting is 3/3 and the desired instance count is 3.
Option 1. --nodes=3:4:k8s-worker-asg-1. CA fails to scale up because it reaches the max of the ASG.
Option 2. --nodes=2:3:k8s-worker-asg-1. CA fails to scale down because it reaches the min of the ASG.
Option 3. --nodes=10:20:k8s-worker-asg-1. min > ASG.MaxSize
Option 4. --nodes=1:2:k8s-worker-asg-1. max < ASG.MinSize
Options 3 and 4 have the same issues as above on scaling up and scaling down.
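The four options above can be checked mechanically. The sketch below is illustrative only: `NodeGroupSpec`, `parse_nodes_value` and `find_conflicts` are hypothetical names, not the autoscaler's own parser, and the only thing taken from this issue is the `min:max:name` value format of `--nodes`.

```python
from typing import List, NamedTuple


class NodeGroupSpec(NamedTuple):
    min_size: int
    max_size: int
    name: str


def parse_nodes_value(value: str) -> NodeGroupSpec:
    """Parse the value of --nodes, e.g. '3:4:k8s-worker-asg-1'."""
    parts = value.split(":", 2)
    if len(parts) != 3:
        raise ValueError(f"expected min:max:name, got {value!r}")
    min_size, max_size = int(parts[0]), int(parts[1])
    if min_size > max_size:
        raise ValueError(f"min {min_size} is greater than max {max_size}")
    return NodeGroupSpec(min_size, max_size, parts[2])


def find_conflicts(spec: NodeGroupSpec, asg_min: int, asg_max: int) -> List[str]:
    """List the ways the configured range disagrees with the ASG range."""
    conflicts = []
    if spec.min_size > asg_max:
        conflicts.append("min > ASG.MaxSize")
    elif spec.min_size < asg_min:
        conflicts.append("min < ASG.MinSize")
    if spec.max_size < asg_min:
        conflicts.append("max < ASG.MinSize")
    elif spec.max_size > asg_max:
        conflicts.append("max > ASG.MaxSize")
    return conflicts


if __name__ == "__main__":
    # ASG min/max is 3/3, as assumed in the issue description.
    for value in ("3:4:k8s-worker-asg-1", "2:3:k8s-worker-asg-1",
                  "10:20:k8s-worker-asg-1", "1:2:k8s-worker-asg-1"):
        print(value, find_conflicts(parse_nodes_value(value), 3, 3))
```

An empty list means the configured range sits inside the ASG range. Whether a non-empty list should lead to a warning, a temporary correction, or a hard failure is the open question of this issue and is deliberately left out of the sketch.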