
AWS: Fix nvidia gpu resource name, re-enable ScaleToZero #648

Conversation

discordianfish
Contributor

These are two somewhat separate changes, so let me know if I should split this PR:

  1. Rename the GPU resource name, the same change as Fix GPU resource name. #407, but for AWS.
  2. Set scaleToZero = true. It looks like this was changed (by accident?) in Support autodetection of GCE managed instance groups by name prefix #462.

I've tested the changes manually, and scaling to and from 0 works on my cluster with the NVIDIA k8s device plugin.
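
For readers trying to map this onto the code: here is a rough sketch of what the two changes amount to on the provider side. The constant and helper names below are illustrative, not the literal diff from this PR; the idea is that template nodes advertise GPUs under the `nvidia.com/gpu` resource name (matching what the device plugin registers) and the provider reports scale-to-zero as supported.

```go
// Sketch only: illustrative constant and helper names, not the literal diff.
package awsprovider

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

const (
	// The NVIDIA device plugin exposes GPUs under this resource name, so the
	// node templates used for scale-from-zero must advertise the same key.
	resourceNvidiaGPU apiv1.ResourceName = "nvidia.com/gpu"

	// Whether ASGs may be scaled down to, and back up from, zero instances.
	scaleToZeroSupported = true
)

// gpuCapacity builds the GPU portion of a template node's capacity for an
// instance type with gpuCount GPUs (hypothetical helper, for illustration).
func gpuCapacity(gpuCount int64) apiv1.ResourceList {
	return apiv1.ResourceList{
		resourceNvidiaGPU: *resource.NewQuantity(gpuCount, resource.DecimalSI),
	}
}
```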

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 13, 2018
@aleksandra-malinowska aleksandra-malinowska added area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider labels Feb 13, 2018
@aleksandra-malinowska
Contributor

Thanks for contributing! @sethpollack @mumoshu can you PTAL?

@sethpollack
Contributor

lgtm

@negz was there a reason scaleToZeroSupported was defaulted to false?

@negz
Contributor

negz commented Feb 13, 2018

@sethpollack It's been a while, but I'm fairly sure AWS did not support scale to zero when I added that code.

@negz
Contributor

negz commented Feb 13, 2018

https://github.com/kubernetes/autoscaler/pull/462/files#diff-34ecd32e36ab8898fff1637bb1b39c2cL367

@sethpollack Nope - my bad. I see now that it was enabled and I disabled it. My apologies - completely accidental on my part.

@sethpollack
Contributor

@negz thanks

@aleksandra-malinowska
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2018
@aleksandra-malinowska aleksandra-malinowska merged commit 8abe69b into kubernetes:master Feb 14, 2018
@bhack

bhack commented Feb 17, 2018

Is this available in any released image?

@alexnederlof

alexnederlof commented Jun 7, 2018

@discordianfish would it be possible for you to also PR this on the 1.1.2 release for a 1.1.3 release? Everyone using Kops is still stuck on Kubernetes v1.9, which is bound to autoscaler 1.1.x, so shipping it to the 1.1.x branch would help a lot of people. I'm keen to help, but I've never done anything in Go and couldn't figure out how to do it myself (just cherry-picking didn't work 😉). See #903

@alexnederlof

Or perhaps @aleksandra-malinowska

@kaazoo

kaazoo commented Jul 23, 2018

I tried cluster-autoscaler 1.2.2 on a 1.9.9 cluster (created by kops 1.9.1). Although it's not officially supported, scaling up my GPU ASG from 0 seems to work.

@MaciekPytel
Contributor

@kaazoo A word of warning: CA may over-scale nodes with GPUs. This is because the current GPU workflow is:

  1. CA notices a pending pod that requires a GPU and adds a node with a GPU.
  2. The node becomes ready. At that point the GPU drivers are not yet installed and the node doesn't have GPU capacity.
  3. The GPU driver-installing daemonset runs on the node.
  4. Once the drivers are installed, GPU capacity is added to the node.

The problem is that between steps 2 and 4, CA sees that the node has already booted up without GPUs. The pod requesting a GPU is still pending and there are no upcoming nodes, so CA triggers another scale-up. Due to the details of how CA works, this issue will not happen when scaling up from 0, but otherwise it will happen ~100% of the time.

There is a GCP/GKE-specific hack for this, but we don't have a generic solution yet.
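
For anyone hitting this, a rough sketch of the idea behind that kind of workaround; the helper and label names are hypothetical and this is not the actual cluster-autoscaler code. The trick is to treat a node that is labelled as a GPU node but does not yet report allocatable `nvidia.com/gpu` capacity as still upcoming, rather than as a ready node without GPUs:

```go
// Sketch of the idea behind such a workaround; helper and label names are
// hypothetical, this is not the actual cluster-autoscaler implementation.
package gpuready

import apiv1 "k8s.io/api/core/v1"

const (
	// Resource name exposed by the NVIDIA device plugin.
	resourceNvidiaGPU = "nvidia.com/gpu"
	// Hypothetical label marking nodes that are expected to carry GPUs.
	gpuNodeLabel = "example.com/expects-gpu"
)

// gpuStillPending reports whether a node that is supposed to have GPUs does
// not yet advertise any allocatable GPU capacity, i.e. the driver daemonset
// has not finished. Treating such nodes as "upcoming" rather than ready
// keeps the autoscaler from seeing a booted node without GPUs plus a pending
// GPU pod and triggering a second, unnecessary scale-up.
func gpuStillPending(node *apiv1.Node) bool {
	if _, expected := node.Labels[gpuNodeLabel]; !expected {
		return false // not a GPU node; normal readiness rules apply
	}
	gpus, found := node.Status.Allocatable[apiv1.ResourceName(resourceNvidiaGPU)]
	return !found || gpus.IsZero()
}
```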
