
Unclear documentation on permissible alternatives for AWS ASG MixedInstancesPolicy #2786

Closed
ari-becker opened this issue Jan 30, 2020 · 7 comments

@ari-becker

The AWS README currently advises:

Note that the instance types should have the same amount of RAM and number of CPU cores, since this is fundamental to CA's scaling calculations. Using mismatched instances types can produce unintended results.

The README also provides an example:

Set LaunchTemplateOverrides to include the 'base' instance type r5.2xlarge and suitable alternatives, e.g. r5d.2xlarge, i3.2xlarge, r5a.2xlarge and r5ad.2xlarge.

This raises two questions for me:

a) While the r5.2xlarge has 64 GB of RAM, the i3.2xlarge has only 61 GB. Wouldn't those 3 fewer GB of RAM play havoc with CA's scaling calculations, as documented? Am I missing something here, or is i3.2xlarge erroneously included?

b) Would it be permissible to list as an alternative an instance type with slightly more CPU and/or RAM, accepting that the extra capacity will not be recognized or utilized by CA? For example, permitting the C5n family to be used as an alternative for the C5 family? If so, the documentation should be changed from "the same amount" and "mismatched" to language that makes it clear that larger alternatives are acceptable. If not, the documentation should clarify that larger alternatives are unacceptable, because from the naive perspective of somebody unfamiliar with the specifics of the scaling calculations, it seems as though they should be fine.
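For context, this is roughly what the configuration described in the README looks like. This is only a sketch using boto3; the group name, launch template name, subnets, and instances-distribution settings are placeholders I made up, not values from the README:

```python
# Sketch of an ASG MixedInstancesPolicy whose overrides list the 'base'
# r5.2xlarge plus the alternatives the README suggests.
# All names/subnets below are hypothetical; error handling is omitted.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="example-asg",                    # hypothetical name
    MinSize=0,
    MaxSize=10,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",   # placeholder subnets
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "example-launch-template",  # hypothetical
                "Version": "$Latest",
            },
            # The 'base' type first, then the alternatives from the README example.
            "Overrides": [
                {"InstanceType": "r5.2xlarge"},
                {"InstanceType": "r5d.2xlarge"},
                {"InstanceType": "i3.2xlarge"},
                {"InstanceType": "r5a.2xlarge"},
                {"InstanceType": "r5ad.2xlarge"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": 0,   # everything above base is Spot
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```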

@ari-becker
Author

Paging @drewhemm, who wrote that section of the documentation.

@drewhemm
Contributor

Hi @ari-becker,

Perhaps the i3 family is not the best example, but personally I have not had any issues with them. It depends on the use case in question, particularly around resource requests...

Typically CA will just add more nodes until the current requests are satisfied. However, there is a theoretical edge case: if a pod requests 62 GB of RAM and CA adds i3.2xlarge instances (61 GB), the request may never be satisfied. I have not personally tested this edge case, but i3.2xlarge remains a "permissible alternative" in many cases.
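To make that edge case concrete, here is a rough back-of-the-envelope sketch (advertised instance memory only; kubelet/system reservations would lower the real allocatable figures further):

```python
# Rough illustration of the mismatch; advertised memory, no reservations.
GIB = 1024 ** 3

base_type_memory = 64 * GIB   # r5.2xlarge, the 'base' type CA simulates scale-up against
launched_memory = 61 * GIB    # i3.2xlarge, what the ASG may actually launch
pod_request = 62 * GIB

# CA's simulation says the pod would fit on a new node...
assert pod_request <= base_type_memory
# ...but if the group launches an i3.2xlarge, the pod can never schedule on it.
assert pod_request > launched_memory
```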

Your point about the c5n instances is probably valid in the sense that more memory is almost certainly better than not enough. For this reason, I have some groups set up to use t3.xlarge and fall back to t3.2xlarge if necessary. I don't mind if there is some capacity wasted (especially burstable and/or spot), as long as the workloads get scheduled.

The "mismatched" was put in as CA was never originally designed to handle multiple instance types and therefore I think the developers wanted something like a disclaimer for those of us who really want to use that feature.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2020
@ari-becker
Author

/remove-lifecycle stale

@drewhemm's comment answered my question, but I see this issue as a call to improve the documentation in line with his comment.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2020
@otterley
Contributor

@ari-becker I'd love your feedback on #3198.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 10, 2020
@ari-becker
Author

This should have been closed automatically when #3198 merged.
