"inconsistent state error adding volume, StorageClass.storage.k8s.io "nfs" not found, please file an issue" #1998
Comments
Additional problem with Karpenter when it tries to scale up nodes to match unschedulable pods. The pod I am testing with is sized so that 1 pod fits 1 node (the mem/cpu combo + allowed instance types are designed this way). Karpenter will spin up 9-11 NEW nodes just to match the 1 unschedulable pod, then schedule the pod and remove the extra nodes it added for no reason. In this case, Karpenter only needed to spin up TWO new nodes to fit all the pods, but spun up 11 instead.
Reverting to 0.10.1 appears to solve both the StorageClass and "too many new nodes" problems. I'm going to stay on that version until we figure out what I might be doing wrong.
The extra nodes problem should be solved by #1980. It's caused by the pod self-affinity in your spec. If you remove the pod affinity rule, do you still see the error relating to the storage class?
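For anyone following along, a pod self-affinity rule of the kind being referred to looks roughly like the sketch below. This is illustrative only and assumes a hypothetical `app: my-app` label; it is not the reporter's actual spec.

```yaml
# Illustrative only: a Deployment whose pods require affinity to pods carrying
# their own label ("self affinity"). All names and labels here are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-app            # matches the pod's own label
              topologyKey: kubernetes.io/hostname
      containers:
        - name: app
          image: nginx
```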
@tzneal - I'll check and report back shortly.
The NFS issue/storage class thing seems to go away when I remove:
The too many nodes issue persists, aggressively. My boss is going to kill me when he sees the way node counts were bursting in our account today. $$ I'm going to stay downgraded at 0.10.x for now.
Can you paste some of your Karpenter logs? I'm wondering if it's the short TTL time and the nodes are being deleted before the persistent volume binds.
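For reference, the TTL in question is the Provisioner's empty-node TTL. A short value like in the excerpt below (an illustrative sketch, not the reporter's actual config) would let Karpenter delete a freshly launched node before a slow-binding volume ever lands on it.

```yaml
# Illustrative Provisioner excerpt (karpenter.sh/v1alpha5 era); values are hypothetical.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Nodes with no pods are terminated this many seconds after becoming empty.
  # A very small value can race with slow pod/volume binding.
  ttlSecondsAfterEmpty: 30
  # Optional: maximum node lifetime before replacement.
  ttlSecondsUntilExpired: 2592000
```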
I saw this also in the logs with 0.13.1 - what helped was adding

```hcl
iam_role_additional_policies = [
  # Required by Karpenter
  "arn:${local.partition}:iam::aws:policy/AmazonSSMManagedInstanceCore",
  # NFS
  "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
]
```

This was also addressed in #1775 (comment). Maybe a FAQ candidate?
@tzneal - I can try and get you some logs.
@pat-s - I will look into this and see if it helps.
@tzneal - I'm still getting WAYYYYYYY too many new EC2 nodes when I try to scale up a single pod (the single pod is sized to fit a single EC2 instance of type m6.large; this is intentional). It results in a TON of new nodes. ^^ The above was triggered when I scaled one of my Deployments from
On Karpenter 0.10.1, this issue does not happen. Please let me know what other info I can provide to help debug. I'd be happy to cooperate! This is with Karpenter version 0.13.1.
@armenr Can you provide the provisioner and pod spec for the situation where Karpenter launches too many nodes?
@tzneal - I'll get that to you shortly! Thanks again for the strong Bias for Action & Customer Obsession!! :)
FYI, we saw a similar issue when using static PVs with a PVC referencing a StorageClass that doesn't exist. Karpenter would keep provisioning new nodes until the first node finally came up, then delete all the extra nodes. The error was the same "StorageClass not found" one reported here. Replacing it with a real StorageClass fixed the issue and node provisioning is working as expected. Karpenter v0.13.1.
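For anyone searching for this later, the shape of the setup being described is roughly the following. This is a minimal sketch with made-up names, volume handle, and driver; the key detail is that the `storageClassName` on both objects refers to a StorageClass that is never created.

```yaml
# Illustrative static PV/PVC pair; all names and identifiers are hypothetical.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-static-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  storageClassName: nfs              # no StorageClass object named "nfs" exists
  csi:
    driver: efs.csi.aws.com          # for static volumes, the driver name lives here
    volumeHandle: fs-12345678
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-static-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs              # matches the PV, but is not a real StorageClass
  volumeName: my-static-pv           # static binding to the pre-created PV
  resources:
    requests:
      storage: 10Gi
```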
@ace22b Thanks for that report, I believe that's the issue. We pull the CSI driver name off the storage class, but for static volumes we need to pull it from the volume itself.
@armenr Do you have an "nfs" storage class?
For static volumes, we pull the CSI driver name off of the PV after it's bound instead of from the SC named on the PVC. The SC may not even exist in these cases and is informational. Fixes aws#1998
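To make the commit message concrete: for dynamically provisioned volumes the driver name is the StorageClass's `provisioner` field, as in the illustrative sketch below (hypothetical name and driver). For static volumes that object may not exist at all, which is why the fix reads the driver from the bound PV's `spec.csi.driver` instead.

```yaml
# Illustrative only: where the CSI driver name comes from in the dynamic case.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com      # driver name for dynamically provisioned volumes
volumeBindingMode: Immediate
```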
@tzneal - We have the following:
- PV Manifest
- PVC Manifest
- Karpenter Provisioner
- Deployment Manifest (sample)
@tzneal - I'm also confused about something: in the Karpenter provisioner above, I used to also include t3 instances in sizes L + XL along with the array of other instance types, but Karpenter keeps provisioning only m5n.L and m5n.XL instances and nothing else, no matter what pod resources I schedule :-\
@armenr Can you file a separate issue? It's difficult to track multiple items in a single issue.
@armenr Thanks for the manifests; I've verified that the fix I've implemented solves this problem using those manifests as well.
@tzneal - Thank you so much for the quick turnaround and explanation/verification. What's the best way to pull down a nightly (or build from the main/master branch) to test the fix until a release is tagged?
Yes, this should work for you. 4c35c0f is the current latest commit in main.
Thanks so much @tzneal - I'll annoy you with my other ask (the instance sizes and which instance types get provisioned) on a separate issue. Enjoy the long weekend, friend! Thanks again from a (former) fellow Amazonian!
Do you know if this is from a PVC or an ephemeral volume? If it's a PVC, can you paste the spec? I tested with what was supplied above and can't reproduce the error. On a static PVC I'm expecting to see a volumeName, which causes us to ignore the storage class on the PVC and look it up from the volume instead.
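Concretely, the field being asked about looks like this on a statically bound claim (a minimal sketch with hypothetical names):

```yaml
# Illustrative PVC excerpt: spec.volumeName indicates a static binding, so the
# storage class named here is ignored and the driver is looked up from the PV.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-static-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs            # may not exist; informational for a static binding
  volumeName: my-static-pv         # binds directly to a pre-created PV
  resources:
    requests:
      storage: 10Gi
```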
Re-closing this; there may have been a comment deleted. This fix wasn't in the v0.13.2 release, which only resolved an issue related to pricing sync.
Version
Karpenter: v0.12.0
Kubernetes: v1.22.9-eks-a64ea69
Expected Behavior
Karpenter works as expected (no issues with NFS volumes)
Actual Behavior
Seeing the following in my Karpenter logs:

```
controller 2022-06-27T06:30:24.907Z ERROR controller.node-state inconsistent state error adding volume, StorageClass.storage.k8s.io "nfs" not found, please file an issue {"commit": "588d4c8", "node": "ip-XXX-XX-XX-XXX.us-west-2.compute.internal"}
```
Steps to Reproduce the Problem
Resource Specs and Logs
Provisioner:
Sample Deployment Template:
Sample PVC template:
Sample PV template: