Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS: Update documentation #3198

Merged
merged 1 commit into from
Jun 29, 2020
Merged

Conversation

otterley
Copy link
Contributor

@otterley otterley commented Jun 5, 2020

Update AWS README document for:

  1. Clarity
  2. Document fully the use of tags on Auto Scaling Groups to advertise resources to Cluster Autoscaler
  3. Emphasize the use of IAM Roles for Service accounts where available for security
  4. Fix some grammatical/spelling errors
  5. Wrap lines

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 5, 2020
@otterley
Copy link
Contributor Author

otterley commented Jun 5, 2020

/assign @jaypipes

@arhea
Copy link

arhea commented Jun 5, 2020

Is it worth mentioning how Cluster Autoscaler relates to Managed Node Groups (cluster autoscaler tags are pre-configured) and AWS Fargate (not needed)?

@otterley otterley force-pushed the update-aws-docs branch 3 times, most recently from 6455873 to 3c046e7 Compare June 5, 2020 19:38
@otterley
Copy link
Contributor Author

otterley commented Jun 5, 2020

/assign @Jeffwan

@Jeffwan
Copy link
Contributor

Jeffwan commented Jun 8, 2020

I was off last week and I will review it by end of day. Thanks for contribution!

**NOTE**: You can restrict the target resources for the autoscaling actions by
specifying autoscaling group ARNS. More information can be found
In addition, we also recommend adding `autoscaling:DescribeLaunchConfigurations`
(if you created your ASG using a Launch Configuration) and/or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing. usually, user just need LC or LT permission

@@ -71,153 +92,158 @@ env:
value: YOUR_AWS_REGION
```

## Deployment Specification
Auto-Discovery Setup is always preferred option to avoid multiple, potentially different configuration for min/max values. If you want to adjust minimum and maximum size of the group, please adjust size on ASG directly, CA will fetch latest change when talking to ASG.
## Auto-Discovery Setup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to make this change? Probably change to following makes more sense? They should belong to one header 2

Deployment Specification

  • Auto-Discovery
  • Manual Configuration
    • One ASG
    • Multiple ASG
  • Master Node Setup


## Scaling a node group to 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only being used by scale from 0 case. Why do you delete header?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous language suggested that if there were no pods scheduled on an ASG that there might be a reason for the ASG size not to scale down to zero, and that it could be corrected somehow by tagging. Or, that scaling up from zero was an important special case. Having re-read the documentation and the source code a number of times, I'm not sure that scaling down to zero or up from zero is a special case. Am I mistaken?


To run a cluster-autoscaler which auto-discovers ASGs with nodes use the `--node-group-auto-discovery` flag. For example, `--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>` will find the ASGs where those tag keys
_exist_. It does not matter what value the tags have.
Each Auto Scaling Group should be comprised of instance types that provide
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even mixed instance policy is supported. This is not something we recommend users to use. It never guarantee all the simulation is accurate because if instances ASG bring up are unknown to CA. I don't recommend we talk about this here. It would be better to move to MixedInstancePolicy section. The best practice is still one ASG -> one Launch Template -> One Instance type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that using a single ASG with a single instance type is the simplest case for most customers, and we can come up with verbiage to emphasize that. However, we do recommend using Mixed Instance policies under several circumstances: (1) customers using Spot Instances, and (2) very large customers who benefit from instance diversity to address node shortages in certain regions/AZs.

I disagree that a single ASG is "best practice" however, since we even support multiple ASGs with Managed Node Groups out of the box and encourage customers to use it where it makes sense for them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mixed Instance policy is not mature enough as a solution and user needs to take the risk and have to know all prerequisites to use it. It's just one presentation of spot, there're group of users using spot separately without MIP. I may not explain it clearly, my point is single instance per ASG is best practice, not single ASG. You can definitely has more ASGs/node groups.

]
}
```
Cluster Autoscaler < 1.15: `cloud.google.com/gke-accelerator=<gpu-type>`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to change line here.

- If you're running multiple ASGs, the `--expander` flag supports three options: `random`, `most-pods` and `least-waste`. `random` will expand a random ASG on scale up. `most-pods` will scale up the ASG that will schedule the most amount of pods. `least-waste` will expand the ASG that will waste the least amount of CPU/MEM resources. In the event of a tie, cluster autoscaler will fall back to `random`.
- If you're managing your own kubelets, they need to be started with the `--provider-id` flag. The provider id has the format `aws:///<availability-zone>/<instance-id>`, e.g. `aws:///us-east-1a/i-01234abcdef`.
- If you want to use regional STS endpoints (e.g. when using VPC endpoint for STS) the env `AWS_STS_REGIONAL_ENDPOINTS=regional` should be set.
* The `/etc/ssl/certs/ca-bundle.crt` should exist by default on ec2 instance in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of hard for me to check line by line. Can you summaries what's your change?
Basically, our principle is to minimize the changes..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be easier if you review the full README rather than review as a line by line patch, because I've completely reorganized the document for clarity. https://github.com/kubernetes/autoscaler/blob/3c046e7233d0d1dcb9cb46fd57f37c5dd189344e/cluster-autoscaler/cloudprovider/aws/README.md

cluster-autoscaler/cloudprovider/aws/README.md Outdated Show resolved Hide resolved
cluster-autoscaler/cloudprovider/aws/README.md Outdated Show resolved Hide resolved
cluster-autoscaler/cloudprovider/aws/README.md Outdated Show resolved Hide resolved
cluster-autoscaler/cloudprovider/aws/README.md Outdated Show resolved Hide resolved
cluster-autoscaler/cloudprovider/aws/README.md Outdated Show resolved Hide resolved
cluster-autoscaler/cloudprovider/aws/README.md Outdated Show resolved Hide resolved
cluster-autoscaler/cloudprovider/aws/README.md Outdated Show resolved Hide resolved

### Gotchas
`<gpu-type>` varies by instance type. On P2 instances, for example, the
value is `nvidia-tesla-k80`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the user find out what the supported GPU types are? Do we have a link to that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I lack this knowledge, personally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is reserved for future GPU optimization, user doesn't have to use nvidia-tesla-k80 at this moment, any identifier will work. While, we plan to ask MNG to apply these labels by default.

@otterley
Copy link
Contributor Author

Resolves #2786

@ari-becker
Copy link

Resolves #2786

Yeah, this PR would certainly do this. Thanks @otterley - looks fantastic to me.

Not that it matters so much for this PR, but I would draw your attention to the partial solution (partial because it doesn't cover every EC2 instance type, only the ones we use) we wrote in Dhall and open-sourced for the problem of trying to compute the correct list of instances that can be put into a Mixed Instance Policy. This was essential to allow downstream to specify a desired image type, without having to provide all of the potential alternatives as well.

https://github.com/coralogix/dhall-aws/blob/55db0fb33d95e8a5e99b943bda1f74d23c388bba/ec2/InstanceType.dhall#L301

It surprisingly turned out to be one of the more profitable uses of our time - we prefer to use a lowest-cost allocation policy, and we would've never considered adding instance families like m5dn as their on-demand prices are higher than corresponding m5 instances. Apparently, most of AWS's users suffer the same fallacy, and we found AWS launching m5dn spot instances which, surprisingly, were sometimes cheaper than m5 spot instances. So, I applaud explicitly pointing out that instances like m5d can be added alongside m5.

@Jeffwan
Copy link
Contributor

Jeffwan commented Jun 22, 2020

/lgtm

@jaypipes any other feedbacks or concerns?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 22, 2020
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 29, 2020
@Jeffwan
Copy link
Contributor

Jeffwan commented Jun 29, 2020

@otterley Can you check why you have code changes in this PR? There're lots of dependency conflicts.I assume this is only doc change.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 29, 2020
@Jeffwan
Copy link
Contributor

Jeffwan commented Jun 29, 2020

@otterley One more thing, could you squash 3 commits to 1 commit? This repo doesn't use squash based. One commit for one change makes commit history clean. Thanks!

@otterley
Copy link
Contributor Author

@Jeffwan Done!

Copy link
Contributor

@jaypipes jaypipes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@otterley really like these changes as a whole, thank you very much!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 29, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jaypipes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 29, 2020
@k8s-ci-robot k8s-ci-robot merged commit 323509a into kubernetes:master Jun 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants