Adding troubleshooting docs items #1597
Conversation
✅ Deploy Preview for karpenter-docs-prod ready!
## Unspecified resource requests cause scheduling/bin-pack failures

Not setting Kubernetes [LimitRanges](https://kubernetes.io/docs/concepts/policy/limit-range/) on pods can cause Karpenter to fail to schedule or properly bin-pack pods.
Suggested change: replace

> Not setting Kubernetes [LimitRanges](https://kubernetes.io/docs/concepts/policy/limit-range/) on pods can cause Karpenter to fail to schedule or properly bin-pack pods.

with

> Not using the Kubernetes [LimitRanges](https://kubernetes.io/docs/concepts/policy/limit-range/) feature to enforce minimum resource request sizes will allow pods with very low or non-existent resource requests to be scheduled. This can cause issues because Karpenter bin-packs pods based on their resource requests. If the requests do not reflect a pod's actual resource usage, Karpenter will place too many of these pods onto the same node, and they will end up CPU throttled or terminated by the OOM killer. This behavior is not unique to Karpenter; it can also occur with the standard `kube-scheduler` when pods lack accurate resource requests.
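To make the suggested guidance concrete, here is a minimal LimitRange sketch; the name and values are illustrative assumptions, not part of the PR. It enforces a floor on container requests and injects defaults for containers that omit them, so Karpenter has realistic numbers to bin-pack against:

```
apiVersion: v1
kind: LimitRange
metadata:
  name: min-request-sizes   # hypothetical name
  namespace: default
spec:
  limits:
  - type: Container
    # Reject containers whose requests or limits fall below this floor.
    min:
      cpu: 100m
      memory: 128Mi
    # Injected into containers that omit requests, so the scheduler
    # and Karpenter see a non-zero request to pack against.
    defaultRequest:
      cpu: 250m
      memory: 256Mi
    # Default limit for containers that omit one.
    default:
      cpu: 500m
      memory: 512Mi
```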
Do we have any best practice docs that recommend sizing of limits vs requests?
@tzneal We have a general discussion on limits and pointers to Kubernetes docs on the subject in the Karpenter Best Practices guide. Do you have suggestions for how to set particular limits?
I incorporated your other comments.
I don't know a good limit-to-request ratio to recommend; in that case it's probably worth punting for now, as I can't find any recommendations from K8s or in the existing EKS best practices either.
Looks like most instance types here have a max cpu:memory ratio of 1:4. GPUs have 1:8. If we need a ratio, 1:4 would be a good place to start thinking about it.
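If a ratio were ever recommended, LimitRange can already enforce one via its `maxLimitRequestRatio` field. A hedged sketch using the 1:4 starting point floated above (the value is a discussion starter, not a vetted recommendation, and the name is hypothetical):

```
apiVersion: v1
kind: LimitRange
metadata:
  name: limit-request-ratio   # hypothetical name
spec:
  limits:
  - type: Container
    # A container's limit may be at most 4x its request, which keeps
    # requests from drifting far below the usage the limit permits.
    maxLimitRequestRatio:
      cpu: "4"
      memory: "4"
```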
```
Excluding instance type r3.8xlarge because there are not enough resources for daemons {"commit": "7e79a67", "provisioner": "default"}
```
One workaround is to set your provisioner to use only larger instance types.
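For example, the provisioner's requirements can pin provisioning to specific larger instance types. This is a sketch against the `karpenter.sh/v1alpha5` Provisioner API in use at the time of this PR; the instance types listed are illustrative assumptions:

```
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
  # Restrict to larger instance types so daemonset overhead still
  # leaves room for workload pods.
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["m5.xlarge", "m5.2xlarge", "c5.2xlarge", "r5.2xlarge"]
```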
I'd suggest emphasizing that this workaround is for versions before v0.5.6.
Good point. Done.
@felix-zhe-huang, @tzneal, @bwagner5 I've responded to all of the provided comments. Please let me know if further changes are required, or provide /lgtm if everything looks okay.
lgtm! nicely written!
1. Issue, if available:
Added troubleshooting items related to Issue #1084, Issue #1180, and Issue #607, as well as several Slack discussions.
2. Description of changes:
The following troubleshooting entries were added:
4. Does this change impact docs?