Expected Behavior
If a given instance is terminated by user action or due to a hardware fault, Karpenter should remove the node from the node list. This should not be affected by pods carrying a do-not-evict annotation: once the instance is terminated there is nothing left to wait for, and removing the node as soon as possible gives the workload a chance to be rescheduled.
Actual Behavior
The node stays in the list and cannot be deleted even if you use the force option.
Steps to Reproduce the Problem
Schedule some pods with the do-not-evict annotation, then terminate their nodes in the EC2 console.
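For reference, a minimal pod manifest carrying the annotation in question (the `karpenter.sh/do-not-evict` key is the one Karpenter v0.22 recognizes; the pod name and image below are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: do-not-evict-example
  annotations:
    # Tells Karpenter this pod must not be voluntarily evicted,
    # which blocks draining of its node.
    karpenter.sh/do-not-evict: "true"
spec:
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest
      command: ["sleep", "infinity"]
```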
Resource Specs and Logs
2023-01-13T14:55:19.849Z INFO controller.interruption deleted node from interruption message {"commit": "038d219-dirty", "queue": "KarpenterInterruptions-preprod-example", "messageKind": "StateChangeKind", "node": "ip-10-8-90-128.eu-central-1.compute.internal", "action": "CordonAndDrain"}
2023-01-13T14:55:20.141Z INFO controller.termination cordoned node {"commit": "038d219-dirty", "node": "ip-10-8-90-128.eu-central-1.compute.internal"}
2023-01-13T14:55:20.431Z INFO controller.interruption deleted node from interruption message {"commit": "038d219-dirty", "queue": "KarpenterInterruptions-preprod-example", "messageKind": "StateChangeKind", "node": "ip-10-8-138-145.eu-central-1.compute.internal", "action": "CordonAndDrain"}
2023-01-13T14:55:20.830Z INFO controller.termination cordoned node {"commit": "038d219-dirty", "node": "ip-10-8-138-145.eu-central-1.compute.internal"}
2023-01-13T14:55:32.839Z INFO controller.inflightchecks Inflight check failed for node, Can't drain node, pod wp-example/wp-example-wordpress-bedrock-wp-cron-27893695-bvh7x has do not evict annotation {"commit": "038d219-dirty", "node": "ip-10-8-90-128.eu-central-1.compute.internal"}
2023-01-13T14:57:02.147Z INFO controller.interruption deleted node from interruption message {"commit": "038d219-dirty", "queue": "KarpenterInterruptions-preprod-example", "messageKind": "StateChangeKind", "node": "ip-10-8-138-145.eu-central-1.compute.internal", "action": "CordonAndDrain"}
2023-01-13T14:57:05.922Z INFO controller.interruption deleted node from interruption message {"commit": "038d219-dirty", "queue": "KarpenterInterruptions-preprod-example", "messageKind": "StateChangeKind", "node": "ip-10-8-90-128.eu-central-1.compute.internal", "action": "CordonAndDrain"}
Community Note
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment
Hey @runningman84, this is expected behavior with the do-not-evict annotation, as we treat those as disruption intolerant workloads. As I understand it, you're saying that even after the underlying instance has been terminated, since this do-not-evict pod is still around, it's blocking removal of the node object in kubernetes, right?
For Karpenter this signal would be the same. The instance is unreachable and the node is unhealthy. In this case, I believe this would be a duplicate of kubernetes-sigs/karpenter#750 where we're discussing node auto-repair. I'm going to close this in favor of that issue, as discussion is being tracked there.
In the meantime, if you're looking for a workaround: Karpenter creates kube-events when it is unable to drain a node. CodeRef. If you watch for Kubernetes events with this message, you can detect when this happens and de-annotate or delete that pod.
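The workaround above could be sketched roughly as follows. This is a hypothetical example using the official `kubernetes` Python client; the event-message format is inferred from the log lines in this issue and may differ between Karpenter versions, so verify it against your deployment before relying on it:

```python
# Sketch: watch cluster events for Karpenter "can't drain" messages and
# strip the do-not-evict annotation from the blocking pod.
# ASSUMPTION: the message format below is taken from this issue's logs;
# it is not a stable API and may change between Karpenter versions.
import re

# Matches messages like:
#   "Can't drain node, pod <namespace>/<name> has do not evict annotation"
BLOCKED_DRAIN = re.compile(r"pod (\S+)/(\S+) has do not evict annotation")

def blocking_pod(event_message):
    """Return (namespace, name) of the pod blocking a drain, or None."""
    m = BLOCKED_DRAIN.search(event_message or "")
    return (m.group(1), m.group(2)) if m else None

def deannotate_blocking_pods():
    # Requires cluster access and `pip install kubernetes`.
    from kubernetes import client, config, watch
    config.load_kube_config()
    core = client.CoreV1Api()
    for ev in watch.Watch().stream(core.list_event_for_all_namespaces):
        hit = blocking_pod(ev["object"].message)
        if hit:
            ns, name = hit
            # Setting the annotation to None removes it via a strategic
            # merge patch, letting Karpenter finish draining the node.
            # `karpenter.sh/do-not-evict` is the key used by Karpenter v0.22.
            core.patch_namespaced_pod(
                name, ns,
                {"metadata": {"annotations": {"karpenter.sh/do-not-evict": None}}},
            )
```

Call `deannotate_blocking_pods()` from a long-running process with RBAC permissions to list events and patch pods cluster-wide.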
Yes, that is right. In our use case this leads to an unnecessary outage. If the node object were removed immediately, the pod could respawn on another node.
Version
Karpenter Version: v0.22.0
Kubernetes Version: v1.22.0