
Karpenter does not remove terminated nodes #3214

Closed
runningman84 opened this issue Jan 13, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@runningman84

Version

Karpenter Version: v0.22.0

Kubernetes Version: v1.22.0

Expected Behavior

If a given instance is terminated by user action or due to a hardware fault, Karpenter should remove the node from the node list. This should not be blocked by pods carrying a "do-not-evict" annotation: once the instance is terminated there is nothing left to wait for, and removing the node as soon as possible gives the pods a chance to be rescheduled elsewhere.

Actual Behavior

The node stays in the list and cannot be deleted, even with the force option.

Steps to Reproduce the Problem

Schedule some pods with the do-not-evict annotation, then terminate the backing nodes in the EC2 console.
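Concretely, the reproduction can be sketched with kubectl and the AWS CLI (pod name, namespace, and instance ID below are placeholders; in Karpenter v0.22 the eviction-blocking pod annotation is `karpenter.sh/do-not-evict`):

```shell
# Mark a pod as disruption-intolerant so Karpenter will not drain it
kubectl annotate pod my-pod -n my-namespace karpenter.sh/do-not-evict="true"

# Terminate the backing EC2 instance out-of-band, as if done via the console
# (i-0123456789abcdef0 is an illustrative instance ID)
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```

After this, the instance is gone but the node object lingers because the drain is blocked by the annotated pod.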

Resource Specs and Logs

2023-01-13T14:55:19.849Z	INFO	controller.interruption	deleted node from interruption message	{"commit": "038d219-dirty", "queue": "KarpenterInterruptions-preprod-example", "messageKind": "StateChangeKind", "node": "ip-10-8-90-128.eu-central-1.compute.internal", "action": "CordonAndDrain"}
2023-01-13T14:55:20.141Z	INFO	controller.termination	cordoned node	{"commit": "038d219-dirty", "node": "ip-10-8-90-128.eu-central-1.compute.internal"}
2023-01-13T14:55:20.431Z	INFO	controller.interruption	deleted node from interruption message	{"commit": "038d219-dirty", "queue": "KarpenterInterruptions-preprod-example", "messageKind": "StateChangeKind", "node": "ip-10-8-138-145.eu-central-1.compute.internal", "action": "CordonAndDrain"}
2023-01-13T14:55:20.830Z	INFO	controller.termination	cordoned node	{"commit": "038d219-dirty", "node": "ip-10-8-138-145.eu-central-1.compute.internal"}
2023-01-13T14:55:32.839Z	INFO	controller.inflightchecks	Inflight check failed for node, Can't drain node, pod wp-example/wp-example-wordpress-bedrock-wp-cron-27893695-bvh7x has do not evict annotation	{"commit": "038d219-dirty", "node": "ip-10-8-90-128.eu-central-1.compute.internal"}
2023-01-13T14:57:02.147Z	INFO	controller.interruption	deleted node from interruption message	{"commit": "038d219-dirty", "queue": "KarpenterInterruptions-preprod-example", "messageKind": "StateChangeKind", "node": "ip-10-8-138-145.eu-central-1.compute.internal", "action": "CordonAndDrain"}
2023-01-13T14:57:05.922Z	INFO	controller.interruption	deleted node from interruption message	{"commit": "038d219-dirty", "queue": "KarpenterInterruptions-preprod-example", "messageKind": "StateChangeKind", "node": "ip-10-8-90-128.eu-central-1.compute.internal", "action": "CordonAndDrain"}

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@runningman84 runningman84 added the bug Something isn't working label Jan 13, 2023
@njtran
Contributor

njtran commented Jan 13, 2023

Hey @runningman84, this is expected behavior with the do-not-evict annotation, as we treat those as disruption intolerant workloads. As I understand it, you're saying that even after the underlying instance has been terminated, since this do-not-evict pod is still around, it's blocking removal of the node object in kubernetes, right?

For Karpenter this signal would be the same. The instance is unreachable and the node is unhealthy. In this case, I believe this would be a duplicate of kubernetes-sigs/karpenter#750 where we're discussing node auto-repair. I'm going to close this in favor of that issue, as discussion is being tracked there.

In the meantime, if you're looking for a workaround, Karpenter creates kube-events when it is unable to drain a node. CodeRef. If you are able to watch for Kubernetes events with this message, you should be able to detect when this happens and de-annotate or remove that pod.
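A rough sketch of that workaround (the event reason filter and pod/namespace names here are assumptions, not a confirmed Karpenter contract; match the filtering to the events your cluster actually emits):

```shell
# Look for Karpenter's failed-drain events cluster-wide.
# "FailedDraining" is a guessed reason string; adjust to your cluster's output.
kubectl get events -A --field-selector reason=FailedDraining \
  -o custom-columns=NAMESPACE:.metadata.namespace,MESSAGE:.message

# Once the blocking pod is identified, strip the annotation (trailing "-"
# removes it) so the termination controller can finish draining the node
kubectl annotate pod <pod-name> -n <namespace> karpenter.sh/do-not-evict-
```

This could be run by a small controller or cron job until node auto-repair lands upstream.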

@njtran njtran closed this as completed Jan 13, 2023
@runningman84
Author

Yes, that is right. In our use case this leads to an unnecessary outage. If the node were removed immediately, the pod could respawn on another node.
