
AWS: Fix nvidia gpu resource name, re-enable ScaleToZero #648

Conversation

discordianfish
Contributor

These are two somewhat separate changes, so let me know if I should split this PR:

  1. Rename the GPU resource name, the same change as Fix GPU resource name. #407, but for AWS.
  2. Set scaleToZero = true. It looks like this was changed (by accident?) in Support autodetection of GCE managed instance groups by name prefix #462.

I've tested the changes manually, and scaling to and from 0 works on my cluster with the NVIDIA k8s device plugin.
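
For readers trying to map this onto the code: here is a rough sketch of what the two changes amount to on the provider side. The constant and helper names below are illustrative, not the literal diff from this PR; the idea is that template nodes advertise GPUs under the `nvidia.com/gpu` resource name (matching what the device plugin registers) and the provider reports scale-to-zero as supported.

```go
// Sketch only: illustrative constant and helper names, not the literal diff.
package awsprovider

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

const (
	// The NVIDIA device plugin exposes GPUs under this resource name, so the
	// node templates used for scale-from-zero must advertise the same key.
	resourceNvidiaGPU apiv1.ResourceName = "nvidia.com/gpu"

	// Whether ASGs may be scaled down to, and back up from, zero instances.
	scaleToZeroSupported = true
)

// gpuCapacity builds the GPU portion of a template node's capacity for an
// instance type with gpuCount GPUs (hypothetical helper, for illustration).
func gpuCapacity(gpuCount int64) apiv1.ResourceList {
	return apiv1.ResourceList{
		resourceNvidiaGPU: *resource.NewQuantity(gpuCount, resource.DecimalSI),
	}
}
```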

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 13, 2018
@aleksandra-malinowska aleksandra-malinowska added area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider labels Feb 13, 2018
@aleksandra-malinowska
Contributor

Thanks for contributing! @sethpollack @mumoshu can you PTAL?

@sethpollack
Contributor

lgtm

@negz was there a reason scaleToZeroSupported was defaulted to false?

@negz
Contributor

negz commented Feb 13, 2018

@sethpollack It's been a while, but I'm fairly sure AWS did not support scale to zero when I added that code.

@negz
Contributor

negz commented Feb 13, 2018

https://github.com/kubernetes/autoscaler/pull/462/files#diff-34ecd32e36ab8898fff1637bb1b39c2cL367

@sethpollack Nope - my bad. I see now that it was enabled and I disabled it. My apologies - completely accidental on my part.

@sethpollack
Contributor

@negz thanks

@aleksandra-malinowska
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2018
@aleksandra-malinowska aleksandra-malinowska merged commit 8abe69b into kubernetes:master Feb 14, 2018
@bhack

bhack commented Feb 17, 2018

Is this available in any released image?

@alexnederlof

alexnederlof commented Jun 7, 2018

@discordianfish would it be possible for you to also PR this on the 1.1.2 release for a 1.1.3 release? Everyone using Kops is still stuck on Kubernetes v1.9, which is bound to autoscaler 1.1.x, so shipping it to the 1.1.x branch would help a lot of people. I'm keen to help, but I've never done anything in Go and couldn't figure out how to do it myself (just cherry-picking didn't work 😉). See #903

@alexnederlof

Or perhaps @aleksandra-malinowska

@kaazoo

kaazoo commented Jul 23, 2018

I tried cluster-autoscaler 1.2.2 on a 1.9.9 cluster (created by kops 1.9.1). Although it's not officially supported, scaling up my GPU ASG from 0 seems to work.

@MaciekPytel
Contributor

@kaazoo A word of warning: CA may over-scale nodes with GPUs. This is because the current GPU workflow is:

  1. CA notices a pending pod that requires a GPU and adds a node with a GPU.
  2. The node becomes ready. At that point the GPU drivers are not yet installed and the node doesn't have GPU capacity.
  3. The GPU driver-installing daemonset runs on the node.
  4. Once the drivers are installed, GPU capacity is added to the node.

The problem is that between steps 2 and 4, CA sees that the node has already booted up without GPUs. The pod requesting a GPU is still pending and there are no upcoming nodes, so CA triggers another scale-up. Due to the details of how CA works, this issue will not happen when scaling up from 0, but otherwise it will happen ~100% of the time.

There is a GCP/GKE-specific hack for this, but we don't have a generic solution yet.
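
For anyone hitting this, a rough sketch of the idea behind that kind of workaround; the helper and label names are hypothetical and this is not the actual cluster-autoscaler code. The trick is to treat a node that is labelled as a GPU node but does not yet report allocatable `nvidia.com/gpu` capacity as still upcoming, rather than as a ready node without GPUs:

```go
// Sketch of the idea behind such a workaround; helper and label names are
// hypothetical, this is not the actual cluster-autoscaler implementation.
package gpuready

import apiv1 "k8s.io/api/core/v1"

const (
	// Resource name exposed by the NVIDIA device plugin.
	resourceNvidiaGPU = "nvidia.com/gpu"
	// Hypothetical label marking nodes that are expected to carry GPUs.
	gpuNodeLabel = "example.com/expects-gpu"
)

// gpuStillPending reports whether a node that is supposed to have GPUs does
// not yet advertise any allocatable GPU capacity, i.e. the driver daemonset
// has not finished. Treating such nodes as "upcoming" rather than ready
// keeps the autoscaler from seeing a booted node without GPUs plus a pending
// GPU pod and triggering a second, unnecessary scale-up.
func gpuStillPending(node *apiv1.Node) bool {
	if _, expected := node.Labels[gpuNodeLabel]; !expected {
		return false // not a GPU node; normal readiness rules apply
	}
	gpus, found := node.Status.Allocatable[apiv1.ResourceName(resourceNvidiaGPU)]
	return !found || gpus.IsZero()
}
```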
