A node was created with error and Karpenter doesn't terminate it #1111
Comments
Same as #1067, I think.
Yeah, looks the same. I think it may happen because the Private DNS Name hasn't been assigned to the instance yet (Ref: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_Instance.html), so Karpenter has no name to create the node object with.
EC2 can often take several seconds to assign a Private DNS Name. It's usually much quicker, but we've seen this happen. This will be resolved permanently when we move towards id-based naming in Kubernetes 1.23. Are you using a custom launch template? It must inject the `karpenter.sh/provisioner-name` label. This label allows Karpenter to manage the node, even if there was a loss of availability when the node came online. Further, Karpenter will inject the finalizer into the node, as long as the label exists. Some paths forward:
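For instance, a minimal sketch of what a custom launch template's user data might pass to kubelet so the node carries that label. The cluster name, provisioner name, and the use of an EKS-optimized AMI with `/etc/eks/bootstrap.sh` are assumptions, not values confirmed in this issue:

```sh
#!/bin/bash
# Hypothetical user data for a custom launch template (EKS-optimized AMI assumed).
# "my-cluster" and the provisioner name "default" are placeholders.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--node-labels=karpenter.sh/provisioner-name=default'
```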
Yeah, I'm using a custom LT.
I think if #1008 is implemented, I will be able to get rid of the custom LT.
@ellistarn As an example:
It looks like only the 'bad' node is without the provisioner label.
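A quick way to check this (the label key below is assumed from the discussion above, not confirmed by the issue):

```sh
# Show the provisioner label as a column; a 'bad' node would show an empty value.
kubectl get nodes -L karpenter.sh/provisioner-name
```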
When Karpenter is able to create the instance, it also creates the node object itself with the correct label. When it fails, it is kubelet that creates the node object, but kubelet has not been configured to add the label, thus it is missing.
@olemarkus Thanks for the clarification. I think the core problem is that Karpenter relies on the node object it creates itself. It would probably help if Karpenter could check/modify node objects created by kubelet when Karpenter failed to create them.
It would take some state maintenance that Karpenter doesn't do right now. The easier solution is to use the instance ID as the node name, which I added support for earlier and which kOps clusters already make use of.
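For illustration only (these node names are made up, not from this cluster), the difference looks roughly like this:

```sh
kubectl get nodes
# NAME                                       STATUS   AGE
# ip-10-0-1-23.eu-west-1.compute.internal    Ready    5m    <- Private-DNS-based name
# i-0123456789abcdef0                        Ready    5m    <- instance-id-based name
```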
Sure, I'm waiting for #1008; it satisfies almost all cases.
Version
Karpenter: v0.5.4
Kubernetes: v1.21.5
Expected Behavior
Karpenter deletes the 'bad' node, at the very least.
Actual Behavior
For some reason, one of the nodes was created with the following error in the logs:
After these messages, Karpenter just ignores this instance.
I mean this instance is actually up and running:
But Karpenter said that it couldn't create the 'node', and now this instance is not tracked by Karpenter.
Karpenter doesn't delete it even though it is empty.
At the same time this instance has joined a k8s cluster as node:
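For anyone hitting the same thing, a rough sketch of how to confirm the mismatch (the instance ID and node name below are placeholders, not values from this report):

```sh
# Confirm the instance is really running in EC2 (placeholder instance ID).
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].State.Name' \
  --output text

# Confirm the same machine registered itself as a node and inspect its labels (placeholder name).
kubectl get node ip-10-0-1-23.eu-west-1.compute.internal --show-labels
```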
What I did:
I scaled a StatefulSet from 0 to 3, and Karpenter created 4 instances instead of 3 (due to the error, I guess).
Steps to Reproduce the Problem
Actually, I don't know, because this is the first time I've encountered this issue.
Resource Specs and Logs
Can we have a workaround for such cases? I mean, Karpenter should track instances that were actually created, even if they are not registered as a 'node'.
PS: I tried restarting Karpenter, but it didn't find this instance/node and couldn't delete it.
I think this is because the finalizer was not added by Karpenter due to the error.
Maybe Karpenter should add the finalizer anyway, despite the fact that there were errors.
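In the meantime, a possible manual cleanup, sketched with placeholder names (not values from this report), is to delete the orphaned node object and terminate the instance directly:

```sh
# Delete the orphaned node object that Karpenter isn't tracking (placeholder name).
kubectl delete node ip-10-0-1-23.eu-west-1.compute.internal

# Terminate the underlying EC2 instance that Karpenter never registered (placeholder ID).
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```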