
Karpenter does not remove terminated nodes #3214

Closed
runningman84 opened this issue Jan 13, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@runningman84

Version

Karpenter Version: v0.22.0

Kubernetes Version: v1.22.0

Expected Behavior

If a given instance is terminated by user action or due to a hardware fault, Karpenter should remove the node from the node list. This should not be blocked by pods carrying a "do-not-evict" annotation: once the instance is terminated there is nothing left to wait for, and removing the node as soon as possible gives the pods a chance to be rescheduled elsewhere.

Actual Behavior

The node stays in the list and cannot be deleted, even with the force option.

Steps to Reproduce the Problem

Schedule some pods with the do-not-evict annotation, then terminate the backing nodes in the EC2 console.
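Concretely, the reproduction can be sketched with kubectl and the AWS CLI (pod name, namespace, and instance ID below are placeholders; in Karpenter v0.22 the eviction-blocking pod annotation is `karpenter.sh/do-not-evict`):

```shell
# Mark a pod as disruption-intolerant so Karpenter will not drain it
kubectl annotate pod my-pod -n my-namespace karpenter.sh/do-not-evict="true"

# Terminate the backing EC2 instance out-of-band, as if done via the console
# (i-0123456789abcdef0 is an illustrative instance ID)
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```

After this, the instance is gone but the node object lingers because the drain is blocked by the annotated pod.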

Resource Specs and Logs

2023-01-13T14:55:19.849Z	INFO	controller.interruption	deleted node from interruption message	{"commit": "038d219-dirty", "queue": "KarpenterInterruptions-preprod-example", "messageKind": "StateChangeKind", "node": "ip-10-8-90-128.eu-central-1.compute.internal", "action": "CordonAndDrain"}
2023-01-13T14:55:20.141Z	INFO	controller.termination	cordoned node	{"commit": "038d219-dirty", "node": "ip-10-8-90-128.eu-central-1.compute.internal"}
2023-01-13T14:55:20.431Z	INFO	controller.interruption	deleted node from interruption message	{"commit": "038d219-dirty", "queue": "KarpenterInterruptions-preprod-example", "messageKind": "StateChangeKind", "node": "ip-10-8-138-145.eu-central-1.compute.internal", "action": "CordonAndDrain"}
2023-01-13T14:55:20.830Z	INFO	controller.termination	cordoned node	{"commit": "038d219-dirty", "node": "ip-10-8-138-145.eu-central-1.compute.internal"}
2023-01-13T14:55:32.839Z	INFO	controller.inflightchecks	Inflight check failed for node, Can't drain node, pod wp-example/wp-example-wordpress-bedrock-wp-cron-27893695-bvh7x has do not evict annotation	{"commit": "038d219-dirty", "node": "ip-10-8-90-128.eu-central-1.compute.internal"}
2023-01-13T14:57:02.147Z	INFO	controller.interruption	deleted node from interruption message	{"commit": "038d219-dirty", "queue": "KarpenterInterruptions-preprod-example", "messageKind": "StateChangeKind", "node": "ip-10-8-138-145.eu-central-1.compute.internal", "action": "CordonAndDrain"}
2023-01-13T14:57:05.922Z	INFO	controller.interruption	deleted node from interruption message	{"commit": "038d219-dirty", "queue": "KarpenterInterruptions-preprod-example", "messageKind": "StateChangeKind", "node": "ip-10-8-90-128.eu-central-1.compute.internal", "action": "CordonAndDrain"}

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@runningman84 runningman84 added the bug Something isn't working label Jan 13, 2023
@njtran
Contributor

njtran commented Jan 13, 2023

Hey @runningman84, this is expected behavior with the do-not-evict annotation, as we treat those as disruption intolerant workloads. As I understand it, you're saying that even after the underlying instance has been terminated, since this do-not-evict pod is still around, it's blocking removal of the node object in kubernetes, right?

For Karpenter this signal would be the same. The instance is unreachable and the node is unhealthy. In this case, I believe this would be a duplicate of kubernetes-sigs/karpenter#750 where we're discussing node auto-repair. I'm going to close this in favor of that issue, as discussion is being tracked there.

In the meantime, if you're looking for a workaround, Karpenter creates kube-events when it is unable to drain a node. CodeRef. If you are able to watch for Kubernetes events with this message, you should be able to detect when this happens and de-annotate or remove that pod.
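A rough sketch of that workaround (the event reason filter and pod/namespace names here are assumptions, not a confirmed Karpenter contract; match the filtering to the events your cluster actually emits):

```shell
# Look for Karpenter's failed-drain events cluster-wide.
# "FailedDraining" is a guessed reason string; adjust to your cluster's output.
kubectl get events -A --field-selector reason=FailedDraining \
  -o custom-columns=NAMESPACE:.metadata.namespace,MESSAGE:.message

# Once the blocking pod is identified, strip the annotation (trailing "-"
# removes it) so the termination controller can finish draining the node
kubectl annotate pod <pod-name> -n <namespace> karpenter.sh/do-not-evict-
```

This could be run by a small controller or cron job until node auto-repair lands upstream.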

@njtran njtran closed this as completed Jan 13, 2023
@runningman84
Author

Yes, that is right. In our use case this leads to an unnecessary outage. If the node were removed immediately, the pod could respawn on another node.
