
Unclear documentation on permissible alternatives for AWS ASG MixedInstancesPolicy #2786

Closed
ari-becker opened this issue Jan 30, 2020 · 7 comments

@ari-becker

The AWS README currently advises:

Note that the instance types should have the same amount of RAM and number of CPU cores, since this is fundamental to CA's scaling calculations. Using mismatched instances types can produce unintended results.

The README also provides an example:

Set LaunchTemplateOverrides to include the 'base' instance type r5.2xlarge and suitable alternatives, e.g. r5d.2xlarge, i3.2xlarge, r5a.2xlarge and r5ad.2xlarge.

This raises two questions for me:

a) While the r5.2xlarge has 64 GB of RAM, the i3.2xlarge has only 61 GB. Wouldn't those 3 fewer GB of RAM play havoc with CA's scaling calculations, as documented? Am I missing something here, or is i3.2xlarge erroneously included?

b) Would it be permissible to list as an alternative an instance type with slightly more CPU and/or RAM, accepting that the extra capacity will not be recognized or utilized by CA? For example, permitting the C5n family to be used as an alternative for the C5 family? If so, the documentation should be changed from "the same amount" and "mismatched" to language that makes it clear that larger alternatives are acceptable. If not, the documentation should clarify that larger alternatives are unacceptable, because from the naive perspective of somebody unfamiliar with the specifics of the scaling calculations, it seems as though they should be fine.
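For context, this is roughly what the configuration described in the README looks like. This is only a sketch using boto3; the group name, launch template name, subnets, and instances-distribution settings are placeholders I made up, not values from the README:

```python
# Sketch of an ASG MixedInstancesPolicy whose overrides list the 'base'
# r5.2xlarge plus the alternatives the README suggests.
# All names/subnets below are hypothetical; error handling is omitted.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="example-asg",                    # hypothetical name
    MinSize=0,
    MaxSize=10,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",   # placeholder subnets
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "example-launch-template",  # hypothetical
                "Version": "$Latest",
            },
            # The 'base' type first, then the alternatives from the README example.
            "Overrides": [
                {"InstanceType": "r5.2xlarge"},
                {"InstanceType": "r5d.2xlarge"},
                {"InstanceType": "i3.2xlarge"},
                {"InstanceType": "r5a.2xlarge"},
                {"InstanceType": "r5ad.2xlarge"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": 0,   # everything above base is Spot
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```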

@ari-becker
Author

Paging @drewhemm, who wrote that section of the documentation.

@drewhemm
Contributor

Hi @ari-becker,

Perhaps the i3 family is not the best example, but personally I have not had any issues with them. It depends on the use case in question, particularly around resource requests...

Typically CA will just add more nodes until the current requests are satisfied. However, there is a theoretical edge case: if a pod requests 62 GB of RAM and CA adds i3.2xlarge instances (61 GB), the request may never be satisfied. I have not personally tested this edge case, but i3.2xlarge remains a "permissible alternative" in many cases.
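To make that edge case concrete, here is a rough back-of-the-envelope sketch (advertised instance memory only; kubelet/system reservations would lower the real allocatable figures further):

```python
# Rough illustration of the mismatch; advertised memory, no reservations.
GIB = 1024 ** 3

base_type_memory = 64 * GIB   # r5.2xlarge, the 'base' type CA simulates scale-up against
launched_memory = 61 * GIB    # i3.2xlarge, what the ASG may actually launch
pod_request = 62 * GIB

# CA's simulation says the pod would fit on a new node...
assert pod_request <= base_type_memory
# ...but if the group launches an i3.2xlarge, the pod can never schedule on it.
assert pod_request > launched_memory
```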

Your point about the c5n instances is probably valid in the sense that more memory is almost certainly better than not enough. For this reason, I have some groups set up to use t3.xlarge and fall back to t3.2xlarge if necessary. I don't mind if there is some capacity wasted (especially burstable and/or spot), as long as the workloads get scheduled.

The "mismatched" was put in as CA was never originally designed to handle multiple instance types and therefore I think the developers wanted something like a disclaimer for those of us who really want to use that feature.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2020
@ari-becker
Author

/remove-lifecycle stale

@drewhemm's comment answered my question, but I see this issue as a call to improve the documentation in line with his comment.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2020
@otterley
Contributor

@ari-becker I'd love your feedback on #3198.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 10, 2020
@ari-becker
Author

This should have been closed automatically when #3198 merged.
