Karpenter nodes get stuck in "NotReady" state #1415
Comments
Interesting -- it looks like your pods are consuming the ephemeral storage on the node. Can you list the pod specs applied to the node? Can you provide provisioner specs as well?
This is an example of one of the pod specs:
This is the provisioner:
Great. I see you're not using custom launch templates or Bottlerocket, so that simplifies some concerns I had about the disk. I'm focused on the log line …
@ellistarn I have run this workload with a Node Group before; the EBS volume used to be 200GB. Now with Karpenter it has only 20GB (I think this is the default size set by EKS). Hence I asked if there's a way to choose the volume size from the provisioner, or whether 20GB is unchangeable. I'd really rather not deal with launch templates.
@ellistarn Now there's a new message. All these pods used to work before within a Node Group, and the provisioner configuration is similar to what was used in the Node Group: "failed to garbage collect required amount of images. Wanted to free X bytes, but freed 0 bytes"
@ellistarn Do you have any insights?
Thank you. In the meantime I am running with my own launch template; now the disk size is 200GB. @ellistarn @bwagner5 Any idea why this keeps happening? This is how my cluster looks currently. I really need your advice because the env is down and other people are using it :(
The example pod spec you have listed above doesn't have any resource requests. Without that, I believe Karpenter will pack nodes until it reaches the ENI limit for pods, which varies based on instance type. Are you seeing lots of OOM errors, or does …
@tzneal I do see OOM errors. This is an example of one of the node descriptions: Allocated resources:
Should I add memory resource limits or requests to the provisioner itself?
Karpenter already has resource requests defined for itself in its Helm chart. In my experience, you really need memory resource requests on your containers or scheduling won't work well within Kubernetes, regardless of whether you use Karpenter or any other autoscaler. If you look at your node output above, it says that Kubernetes is only aware of 712Mi of memory requests, but you have Java processes getting OOM-killed by the kernel, so you are running out of physical memory on the node.
Setting resource requests is also a best practice listed here
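As a rough sketch of the kind of per-container requests and limits being suggested (the Deployment name, image, and values below are placeholders, not this workload's actual spec):

```yaml
# Hedged sketch: illustrative only. Name, image, and sizes are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: public.ecr.aws/docker/library/busybox:latest  # placeholder image
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"              # the scheduler reserves this much per replica
              ephemeral-storage: "2Gi"   # also counts toward node disk-pressure decisions
            limits:
              memory: "1Gi"              # the container is OOM-killed above this
```

With requests in place, the scheduler (and therefore Karpenter) stops packing pods onto a node once the node's allocatable memory is accounted for, instead of packing until the ENI pod limit is hit.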
@tzneal Thanks again for your attention.
So what is the meaning of the resource requests and limits in the provisioner's spec? (You can see my provisioner above for reference.)
Are you certain that adding resource requests and limits to my deployments will solve the issue?
The Block Device Mappings PR has been merged and should be released next week. You can check out the preview docs here: https://karpenter.sh/preview/aws/provisioning/#block-device-mappings. If you are fine with the other defaults Karpenter is currently providing, you should be able to use the following mapping once we do a release:
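A mapping along those lines, based on the linked preview docs, might look roughly like this (the device name, size, and volume type are illustrative assumptions rather than the exact released snippet):

```yaml
# Hedged sketch of a provisioner block device mapping; values are assumptions.
spec:
  provider:
    blockDeviceMappings:
      - deviceName: /dev/xvda        # root device for Amazon Linux 2 AMIs
        ebs:
          volumeSize: 200Gi
          volumeType: gp3
          deleteOnTermination: true
```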
Version 0.5.6

Karpenter has started creating new nodes and works as expected. After a while (approximately 40 min), some nodes switch from `Ready` to `NotReady` and stay like that for hours; nothing moves. It feels like it happens randomly. Most of the pods inside the `NotReady` nodes are in "Running" state and then move to "Terminating". The provisioner has `ttlSecondsAfterEmpty: 60`, and `ttlSecondsUntilExpired` isn't defined.

Posting the Node description Events:

![Screen Shot 2022-02-25 at 1 01 19](https://user-images.githubusercontent.com/99882210/155622023-e388f851-18ad-4a93-93c7-f5a3490149d5.png)

`NodeHasDiskPressure` - I think Karpenter nodes are starting with a 20GB disk. Is it possible to extend the disk size through the provisioner? It may help with this situation.

Here is another Node that has just become `NotReady`. This time I can't really understand why:

![Screen Shot 2022-02-25 at 1 25 36](https://user-images.githubusercontent.com/99882210/155624416-05cab285-9bef-4591-acd4-7ec3641db65e.png)
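For reference, a minimal Provisioner with the settings described above (`ttlSecondsAfterEmpty: 60` and no `ttlSecondsUntilExpired`) might look roughly like this; the name, requirements, and discovery selectors are placeholders rather than the actual configuration from this cluster:

```yaml
# Hedged sketch: not the reporter's actual provisioner. Only the ttl settings
# mirror the report; everything else is a placeholder.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  ttlSecondsAfterEmpty: 60        # empty nodes are removed 60s after the last pod leaves
  # ttlSecondsUntilExpired omitted, so nodes are never recycled based on age
  provider:
    subnetSelector:
      karpenter.sh/discovery: example-cluster
    securityGroupSelector:
      karpenter.sh/discovery: example-cluster
```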