AWS: Fix nvidia gpu resource name, re-enable ScaleToZero #648
Thanks for contributing! @sethpollack @mumoshu can you PTAL?
lgtm @negz was there a reason
@sethpollack It's been a while, but I'm fairly sure AWS did not support scale to zero when I added that code.
@sethpollack Nope - my bad. I see now that it was enabled and I disabled it. My apologies - completely accidental on my part.
@negz thanks
/lgtm
Is this available in any released image?
@discordianfish would it be possible for you to also PR this against the 1.1.2 release for a 1.1.3 release? Everyone using Kops is still stuck on Kubernetes v1.9, which is bound to autoscaler 1.1.x. Shipping it to the 1.1.x branch would help a lot of people. I'm keen to help, but I've never done anything in Go and couldn't figure out how to do it myself (just cherry-picking didn't work 😉). See #903
Or perhaps @aleksandra-malinowska
I tried cluster-autoscaler 1.2.2 on a 1.9.9 cluster (created by kops 1.9.1). Although it's not officially supported, scaling up my GPU ASG from 0 seems to work.
@kaazoo A word of warning: CA may over scale nodes with GPU. This is because the current GPU workflow is:
The problem is that between steps 2 and 4 CA sees a node that has already booted but has no GPUs. The pod requesting a GPU is still pending and there are no upcoming nodes, so CA triggers another scale-up. Due to the details of how CA works, this issue will not happen when scaling up from 0, but otherwise it will happen ~100% of the time. There is a GCP/GKE-specific hack for this, but we don't have a generic solution yet.
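To make the over-scaling window concrete, here is a minimal sketch in Go. It is not the cluster-autoscaler code: the `Node` type, its `ExpectedGPU` field, and `needsScaleUp` are hypothetical names invented for illustration. It only models the decision described above, where a freshly booted node reports zero allocatable GPUs until the device plugin registers, and shows how counting such a node as upcoming capacity would avoid the extra scale-up.

```go
package main

import "fmt"

// Node is a simplified, hypothetical stand-in for a Kubernetes node.
type Node struct {
	Name           string
	ExpectedGPU    int64 // GPUs the node should eventually expose (assumed known, e.g. from its instance type)
	AllocatableGPU int64 // GPUs currently advertised; 0 until the device plugin registers
}

// needsScaleUp reports whether the GPUs visible to the autoscaler are
// insufficient for the pending request. When countPending is true, nodes
// that have booted but do not advertise GPUs yet are counted with their
// expected capacity instead of being treated as GPU-less.
func needsScaleUp(nodes []Node, requestedGPUs int64, countPending bool) bool {
	var usable int64
	for _, n := range nodes {
		switch {
		case n.AllocatableGPU > 0:
			usable += n.AllocatableGPU
		case countPending && n.ExpectedGPU > 0:
			usable += n.ExpectedGPU
		}
	}
	return usable < requestedGPUs
}

func main() {
	// One node has booted, but the device plugin has not registered its GPU yet.
	nodes := []Node{{Name: "gpu-node-1", ExpectedGPU: 1, AllocatableGPU: 0}}

	fmt.Println("naive check wants another node:", needsScaleUp(nodes, 1, false))
	fmt.Println("pending-aware check wants another node:", needsScaleUp(nodes, 1, true))
}
```

The naive check is the behaviour that leads to a second, unnecessary scale-up. The pending-aware variant is only one possible shape for a workaround and assumes the expected GPU count can be determined before the node advertises it.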
These are two somewhat separate changes, so let me know if I should split this PR:
I've tested the changes manually and scaling to and from 0 works on my cluster with the nvidia k8s device plugin.
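For readers wondering why the resource name matters when scaling from 0, the following Go sketch may help. It is an illustration under assumptions, not the code in this PR: `gpusPerInstanceType`, `templateCapacity`, and `fits` are hypothetical helpers, and the instance table is a hand-written sample. The only names taken as given are `nvidia.com/gpu`, which the NVIDIA device plugin advertises, and the older `alpha.kubernetes.io/nvidia-gpu` name. With an empty node group there is no real node to inspect, so the autoscaler has to predict a node's capacity, and a pod only fits if the predicted capacity uses the same resource name the pod requests.

```go
package main

import "fmt"

// Resource name advertised by the NVIDIA Kubernetes device plugin.
const resourceNvidiaGPU = "nvidia.com/gpu"

// Older alpha resource name, shown here only for comparison.
const legacyNvidiaGPU = "alpha.kubernetes.io/nvidia-gpu"

// gpusPerInstanceType is a hypothetical, hand-written sample table.
var gpusPerInstanceType = map[string]int64{
	"p2.xlarge":  1,
	"p3.8xlarge": 4,
}

// templateCapacity predicts the extended resources of a node that does not
// exist yet, based only on its instance type.
func templateCapacity(instanceType string) map[string]int64 {
	capacity := map[string]int64{}
	if n, ok := gpusPerInstanceType[instanceType]; ok {
		capacity[resourceNvidiaGPU] = n
	}
	return capacity
}

// fits reports whether every requested resource is covered by the capacity.
// A resource missing from the capacity map counts as zero.
func fits(capacity, requests map[string]int64) bool {
	for name, want := range requests {
		if capacity[name] < want {
			return false
		}
	}
	return true
}

func main() {
	capacity := templateCapacity("p2.xlarge")

	fmt.Println("request by device plugin name fits:",
		fits(capacity, map[string]int64{resourceNvidiaGPU: 1}))
	fmt.Println("request by legacy name fits:",
		fits(capacity, map[string]int64{legacyNvidiaGPU: 1}))
}
```

If the predicted capacity and the pod's request disagree on the resource name, the template node never appears to satisfy the pod, so an empty group would not be scaled up for it.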