
Adding troubleshooting docs items #1597

Merged · 6 commits merged into aws:main on Apr 1, 2022

Conversation

@chrisnegus (Member) commented Mar 30, 2022

1. Issue, if available:
Added troubleshooting items related to Issue #1084, Issue #1180, and Issue #607, as well as several Slack discussions.

2. Description of changes:
The following troubleshooting entries were added:

  • Daemonsets can result in deployment failures (@felix-zhe-huang, please review)
  • Unspecified resource requests cause scheduling/bin-pack failures (@tzneal, please review)
  • Missing discovery tags causing provisioning failures (@bwagner5, please review)

3. Does this change impact docs?

  • Yes, PR includes docs updates

@chrisnegus requested a review from a team as a code owner on March 30, 2022 at 15:58
netlify bot commented Mar 30, 2022

Deploy Preview for karpenter-docs-prod ready!

🔨 Latest commit: e5816c7
🔍 Latest deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/62474c7e3e81810009ba50a1
😎 Deploy Preview: https://deploy-preview-1597--karpenter-docs-prod.netlify.app


## Unspecified resource requests cause scheduling/bin-pack failures

Not setting Kubernetes [LimitRanges](https://kubernetes.io/docs/concepts/policy/limit-range/) on pods can cause Karpenter to fail to schedule or properly bin-pack pods.
Contributor

Suggested change
Not setting Kubernetes [LimitRanges](https://kubernetes.io/docs/concepts/policy/limit-range/) on pods can cause Karpenter to fail to schedule or properly bin-pack pods.
Not using the Kubernetes [LimitRanges](https://kubernetes.io/docs/concepts/policy/limit-range/) feature to enforce minimum resource request sizes will allow pods with very low or non-existent resource requests to be scheduled. This can cause issues as Karpenter bin-packs pods based on the resource requests. If the resource requests do not reflect the actual resource usage of the pod, Karpenter will place too many of these pods onto the same node resulting in the pods getting CPU throttled or terminated due to the OOM killer. This behavior is not unique to Karpenter and can also occur with the standard `kube-scheduler` with pods that don't have accurate resource requests.
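
For readers following along, here is a minimal sketch of a `LimitRange` that applies default and minimum container resource requests in a namespace, along the lines the suggestion describes; the name, namespace, and values are illustrative assumptions.

```yaml
# Illustrative only: give every container in the namespace a default request
# and reject requests below a floor, so Karpenter has real numbers to bin-pack on.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-requests   # hypothetical name
  namespace: default
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container omits resource requests
        cpu: 100m
        memory: 128Mi
      default:               # applied when a container omits resource limits
        cpu: "1"
        memory: 1Gi
      min:                   # containers requesting less than this are rejected at admission
        cpu: 50m
        memory: 64Mi
```

Applied with `kubectl apply -f <file>`, this ensures pods admitted to the namespace carry non-zero requests.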

Contributor

Do we have any best practice docs that recommend sizing of limits vs requests?

Member Author

@tzneal We have a general discussion on limits and pointers to Kubernetes docs on the subject in the Karpenter Best Practices guide. Do you have suggestions for how to set particular limits?

Member Author

I incorporated your other comments.

Contributor

I don't know a good limit-to-request ratio to recommend. It's probably worth punting on for now, since I can't find any recommendations from K8s or any existing EKS best practices either.

Contributor

Looks like most instance types here have a max cpu:memory ratio of 1:4. GPUs have 1:8. If we need a ratio, 1:4 would be a good place to start thinking about it.
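
As a rough, hypothetical illustration of that 1:4 cpu:memory shape (the values are made up; no limit-to-request ratio is implied):

```yaml
# Hypothetical container requests shaped 1:4 cpu:memory,
# matching most non-GPU instance types.
resources:
  requests:
    cpu: "1"       # 1 vCPU
    memory: 4Gi    # 4 GiB
```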

@chrisnegus added the `documentation` label (Improvements or additions to documentation) on Mar 30, 2022
```
Excluding instance type r3.8xlarge because there are not enough resources for daemons {"commit": "7e79a67", "provisioner": "default"}
```

One workaround is to set your provisioner to only use larger instance types.
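
As a sketch of that workaround, assuming the `karpenter.sh/v1alpha5` Provisioner API in use at the time; the instance types listed are arbitrary examples:

```yaml
# Illustrative only: restrict provisioning to larger instance types so daemonset
# overhead still leaves capacity for the pending pods.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.2xlarge", "m5.4xlarge", "r5.4xlarge"]   # hypothetical list
```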
Contributor

I suggest emphasizing that this workaround is for versions before v0.5.6.

Member Author

Good point. Done.

@chrisnegus (Member Author) left a comment

@felix-zhe-huang, @tzneal, @bwagner5 I've responded to all of the comments. Please let me know if further changes are required, or provide /lgtm if everything looks okay.

@bwagner5 (Contributor) left a comment

lgtm! nicely written!

@chrisnegus merged commit dac3c4c into aws:main on Apr 1, 2022
@suket22 mentioned this pull request on May 23, 2022