
Adding troubleshooting docs items #1597

Merged · 6 commits merged into aws:main on Apr 1, 2022

Conversation

@chrisnegus (Member) commented Mar 30, 2022

1. Issue, if available:
Added troubleshooting items related to Issue #1084, Issue #1180, and Issue #607, as well as several Slack discussions.

2. Description of changes:
The following troubleshooting entries were added:

  • Daemonsets can result in deployment failures (@felix-zhe-huang, please review)
  • Unspecified resource requests cause scheduling/bin-pack failures (@tzneal, please review)
  • Missing discovery tags causing provisioning failures (@bwagner5, please review)

3. Does this change impact docs?

  • Yes, PR includes docs updates

@chrisnegus requested a review from a team as a code owner on March 30, 2022 at 15:58
netlify bot commented Mar 30, 2022

Deploy Preview for karpenter-docs-prod ready!

🔨 Latest commit: e5816c7
🔍 Latest deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/62474c7e3e81810009ba50a1
😎 Deploy Preview: https://deploy-preview-1597--karpenter-docs-prod.netlify.app


## Unspecified resource requests cause scheduling/bin-pack failures

Not setting Kubernetes [LimitRanges](https://kubernetes.io/docs/concepts/policy/limit-range/) on pods can cause Karpenter to fail to schedule or properly bin-pack pods.
Contributor

Suggested change
Not setting Kubernetes [LimitRanges](https://kubernetes.io/docs/concepts/policy/limit-range/) on pods can cause Karpenter to fail to schedule or properly bin-pack pods.
Not using the Kubernetes [LimitRanges](https://kubernetes.io/docs/concepts/policy/limit-range/) feature to enforce minimum resource request sizes will allow pods with very low or non-existent resource requests to be scheduled. This can cause issues as Karpenter bin-packs pods based on the resource requests. If the resource requests do not reflect the actual resource usage of the pod, Karpenter will place too many of these pods onto the same node resulting in the pods getting CPU throttled or terminated due to the OOM killer. This behavior is not unique to Karpenter and can also occur with the standard `kube-scheduler` with pods that don't have accurate resource requests.
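
For readers following along, here is a minimal sketch of a `LimitRange` that applies default and minimum container resource requests in a namespace, along the lines the suggestion describes; the name, namespace, and values are illustrative assumptions.

```yaml
# Illustrative only: give every container in the namespace a default request
# and reject requests below a floor, so Karpenter has real numbers to bin-pack on.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-requests   # hypothetical name
  namespace: default
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container omits resource requests
        cpu: 100m
        memory: 128Mi
      default:               # applied when a container omits resource limits
        cpu: "1"
        memory: 1Gi
      min:                   # containers requesting less than this are rejected at admission
        cpu: 50m
        memory: 64Mi
```

Applied with `kubectl apply -f <file>`, this ensures pods admitted to the namespace carry non-zero requests.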

Contributor

Do we have any best practice docs that recommend sizing of limits vs requests?

Member Author

@tzneal We have a general discussion on limits and pointers to Kubernetes docs on the subject in the Karpenter Best Practices guide. Do you have suggestions for how to set particular limits?

Member Author

I incorporated your other comments.

Contributor

I don't know a good limit-to-request ratio to recommend. It's probably worth punting on for now, since I can't find any recommendations from K8s or any existing EKS best practices either.

Contributor

Looks like most instance types here have a max cpu:memory ratio of 1:4. GPUs have 1:8. If we need a ratio, 1:4 would be a good place to start thinking about it.
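
As a rough, hypothetical illustration of that 1:4 cpu:memory shape (the values are made up; no limit-to-request ratio is implied):

```yaml
# Hypothetical container requests shaped 1:4 cpu:memory,
# matching most non-GPU instance types.
resources:
  requests:
    cpu: "1"       # 1 vCPU
    memory: 4Gi    # 4 GiB
```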

@chrisnegus added the `documentation` label (Improvements or additions to documentation) on Mar 30, 2022
```
Excluding instance type r3.8xlarge because there are not enough resources for daemons {"commit": "7e79a67", "provisioner": "default"}
```

One workaround is to set your provisioner to only use larger instance types.
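
As a sketch of that workaround, assuming the `karpenter.sh/v1alpha5` Provisioner API in use at the time; the instance types listed are arbitrary examples:

```yaml
# Illustrative only: restrict provisioning to larger instance types so daemonset
# overhead still leaves capacity for the pending pods.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.2xlarge", "m5.4xlarge", "r5.4xlarge"]   # hypothetical list
```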
Contributor

I suggest emphasizing that this workaround is for versions before v0.5.6.

Member Author

Good point. Done.

@chrisnegus (Member Author) left a comment

@felix-zhe-huang, @tzneal, @bwagner5 I've responded to all of the comments. Please let me know if further changes are required, or provide /lgtm if everything looks okay.

@bwagner5 (Contributor) left a comment

lgtm! nicely written!

@chrisnegus merged commit dac3c4c into aws:main on Apr 1, 2022
@suket22 mentioned this pull request on May 23, 2022