Karpenter Appears to Attempt to Delete Nodes, Hang on Finalizer #1166
Comments
Would you be able to share more information on your issue here? What exactly happened that wasn't expected? How long was the node up before the termination finalizer was removed? Was the termination finalizer never removed? |
If I understand the termination controller correctly it will try to drain and cordon nodes even if the underlying instance no longer exists. I think the first thing it should do is to check instance status and if it is not in some form of running state it should just remove the finalizer. |
IIUC the logic already attempts to do this, though perhaps we're missing a detail with the EC2 API. |
I'll see if I can dig into our log management tool to find the logs from the Karpenter controller. This is the 2nd time this has happened in our lab cluster - I actually went looking for it this morning to see if it had happened again, and indeed it had. I looked at the cluster with Lens, and I could see the same situation as last week - two nodes sitting in cordoned/no schedule mode, and some pods appearing to be stuck in "Terminating" status. Went into the eks console UI, and I could see the nodes there, Unknown status. Just to see what would happen, I uncordoned the nodes in Lens - Karpenter immediately cordoned them again. To get rid of them, I manually edited the nodes and removed the Karpenter finalizer. I'll keep an eye on our cluster and see if it happens again too - can try to snag the logs. |
Ok - Happened again in our env. ip-10-30-14-231.us-west-2.compute.internal is currently "stuck" - sitting there in cordoned state according to k8s, but gone from EC2. Here are the logs. Initial instance of karpenter-controller (looks like this pod restarted):
Looks like something upset the hnc pod, which blocked leader election on karpenter. 2nd instance of karpenter-controller (running now):
You can see where Karpenter cordoned ip-10-30-14-231.us-west-2.compute.internal, but there's no message indicating the instance was deleted even though it's gone from EC2. Same thing as before - If I manually uncordon ip-10-30-14-231.us-west-2.compute.internal, Karpenter immediately re-cordons it:
I'm gonna go ahead and unstick this env (remove the finalizer from this node so it fully deletes) since it seems like we can repro this fairly regularly - let me know what else I can provide. |
Adding :burning to this. We need to get this sorted out in the scale down logic. |
I've tried a couple things, but haven't been able to reproduce.
Could you help me with -
I could potentially fix this by adding a check that if the cloudprovider declares the instance as gone, we short-circuit and directly remove the finalizer from the node object without a cordon or drain. Having said that, I'm still curious as to what could have triggered this. |
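The short-circuit proposed above could look roughly like the following. This is a minimal sketch, not Karpenter's actual code: the `Node` struct, the `InstanceChecker` interface, the `fakeCloud` helper, and the finalizer string are all stand-ins assumed for illustration.

```go
package main

import "fmt"

// Node is a hypothetical, trimmed-down stand-in for a Kubernetes node object.
type Node struct {
	Name       string
	Finalizers []string
	Cordoned   bool
}

// InstanceChecker is a hypothetical cloud provider hook that reports
// whether the instance backing a node still exists.
type InstanceChecker interface {
	Exists(nodeName string) (bool, error)
}

// terminationFinalizer is assumed here for illustration.
const terminationFinalizer = "karpenter.sh/termination"

// removeFinalizer returns the finalizers with name filtered out.
func removeFinalizer(finalizers []string, name string) []string {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	return kept
}

// reconcileTermination skips cordon and drain when the instance is gone.
// It returns true when the finalizer was removed.
func reconcileTermination(node *Node, cloud InstanceChecker) (bool, error) {
	exists, err := cloud.Exists(node.Name)
	if err != nil {
		return false, fmt.Errorf("checking instance for %s: %w", node.Name, err)
	}
	if !exists {
		node.Finalizers = removeFinalizer(node.Finalizers, terminationFinalizer)
		return true, nil
	}
	// Normal path: cordon first; the drain itself is elided in this sketch.
	node.Cordoned = true
	return false, nil
}

// fakeCloud is an in-memory InstanceChecker used only for the example.
type fakeCloud map[string]bool

func (f fakeCloud) Exists(name string) (bool, error) { return f[name], nil }

func main() {
	node := &Node{
		Name:       "ip-10-30-14-231.us-west-2.compute.internal",
		Finalizers: []string{terminationFinalizer},
	}
	removed, err := reconcileTermination(node, fakeCloud{})
	fmt.Println(removed, err, node.Finalizers)
}
```

The point of the sketch is the ordering: the existence check runs before any cordon or drain, so a node whose instance has already disappeared never enters the drain loop.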
I haven't seen this happen since. I just migrated Karpenter to Fargate and rolled this to 2 more clusters in our lab env. Planning to let it marinate there for a bit to watch to see if it happens again, and can provide additional detail. I def like the sound of the check to short circuit and remove the finalizer if something else removes the EC2 instance - that might be helpful in this case, but could also be helpful in case of operator error or something else external to Karpenter killing off the node. |
I've been able to isolate this behaviour on my end. My scenario: Kubernetes 1.21, Karpenter 0.5.6, AWS Node Termination Handler working in SQS mode (probably unrelated). On manual termination of a Karpenter node from the AWS console, termination leaves a bunch of pods in Terminating state. Since kubelet is down at that point, they never recover from that state, and as a consequence, this Potentially, what we could do is something like |
Wait - Is it possible this is happening because we have spot instances and they're being "reclaimed" by AWS? One of the clusters I set up yesterday is in this state - let me dig into the logs to see what happened. |
@rayterrill might very well be, check if you have a pending pod in terminating state somewhere. |
Here's a list of pods "running" on one of the stuck nodes:
|
Yeah, try to |
Unfortunately losing those two nodes left me with zero nodes in my cluster other than the fargate nodes running Karpenter, so Karpenter is currently jacked up with no CoreDNS. Let me get that going so Fargate can use DNS without any pods. This makes me think it might be nice to have an option to request at least one on-demand node when we have both on-demand and spot instance types in our provisioner - not sure how common it is to set up the CoreDNS/Fargate piece. |
Yeah, I tried to use Karpenter on EKS Fargate but abandoned the idea pretty soon (it was not working well, and tbh I disliked the Fargate limitations a lot). Given that Fargate Spot is not supported on EKS, it costs the same to create a small ASG with only one node (i.e. t3.micro/small) during the bootstrap process and taint it with Karpenter installed with
|
That's exactly what I started with, but I flipped over to using Fargate because it felt "cleaner". Having second thoughts given the extra setup to get DNS working, which means whacking the coredns deployment to remove the ec2 annotation... |
It does indeed look like our instances were terminated due to spot pre-emption (which looks like it gets recorded as BidEvictedEvent, not TerminateInstance):
|
Once I got another node provisioned (which got CoreDNS back up), the nodes did indeed disappear after force deleting the pods stuck in terminating status. |
Same issue on another one of the clusters I added yesterday - spot pre-emption leaving the node in a stuck state. Removed the finalizer to let the node go. |
Hey @rayterrill, regardless of if an instance is terminated manually by an operator or "reclaimed" by AWS through spot interruption, the Kubernetes node controller should eventually realize the instance no longer exists, and delete the node, triggering the termination workflow you've mentioned. With regards to a previous comment:
This behavior sounds like it's acting as intended if the underlying instance has already been terminated in EC2, spot or not. The finalizer on the node acts as a signal for a pre-deletion hook, so if someone tries to delete a node, Karpenter will reconcile that node state to be cordoned, fully drained, then terminated. Any attempts to uncordon in that process will be futile, as the controller will cordon it again. As previously mentioned by @alekc, if any pods are unable to be evicted/terminated, then that will block deletion. Any pods that tolerate unschedulable or are daemonsets are ignored for this condition. W.r.t. pods that are stuck terminating even though we have deleted them, we've seen this if the node is of NotReady/Unknown status, where the kubelet is unable to register pods as terminated and remove them. Additionally in the same logs, it looks like Karpenter may have been scheduled to a node it had provisioned, evicting itself and eventually re-scheduling to another node, which may have caused a delay in your cluster.
Would you be able to describe these pods if they're unable to be deleted? How long are they deleting before you force terminate them? Are they owned by any replication controllers, and do you see more replicas being brought up in response to the eviction call? |
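The drain condition described above (DaemonSet pods and pods that tolerate the unschedulable taint do not block deletion) can be illustrated with a small filter. This is a minimal sketch under stated assumptions: the `Pod` struct and the function names are hypothetical and are not Karpenter's or Kubernetes' API types.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified view of a pod; it is not the Kubernetes API type.
type Pod struct {
	Name        string
	OwnerKind   string   // for example "DaemonSet" or "ReplicaSet"
	Tolerations []string // taint keys this pod tolerates
}

// unschedulableTaint is the taint key Kubernetes places on cordoned nodes.
const unschedulableTaint = "node.kubernetes.io/unschedulable"

// toleratesUnschedulable reports whether the pod tolerates the cordon taint.
func toleratesUnschedulable(p Pod) bool {
	for _, key := range p.Tolerations {
		if key == unschedulableTaint {
			return true
		}
	}
	return false
}

// podsBlockingDrain returns the pods that still have to be evicted before
// the node counts as drained. DaemonSet pods and pods that tolerate the
// unschedulable taint are ignored.
func podsBlockingDrain(pods []Pod) []Pod {
	var blocking []Pod
	for _, p := range pods {
		if p.OwnerKind == "DaemonSet" || toleratesUnschedulable(p) {
			continue
		}
		blocking = append(blocking, p)
	}
	return blocking
}

func main() {
	pods := []Pod{
		{Name: "aws-node-abc", OwnerKind: "DaemonSet"},
		{Name: "tolerant", OwnerKind: "ReplicaSet", Tolerations: []string{unschedulableTaint}},
		{Name: "web-1", OwnerKind: "ReplicaSet"},
	}
	for _, p := range podsBlockingDrain(pods) {
		fmt.Println("blocks drain:", p.Name)
	}
}
```

Any pod left in the returned slice keeps the node from being considered drained, which is where a pod that can never be evicted would hold the finalizer in place.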
I can try to gather that additional detail if it happens again, @njtran.
The node being in Unknown status is exactly what we're seeing from the EKS console - the behavior you're describing where the kubelet is unable to register pods as terminated seems to be exactly what's happening.
That's what we were thinking too, but that's not what we're seeing unfortunately. Maybe we're not waiting long enough? It seems to be happening overnight, and when I check on the clusters the following day, they're still stuck and require some manual intervention to unstick (it doesn't happen every night, but still fairly regularly). |
From what I've seen in my tests, it doesn't matter what kind of pods they are. As soon as an operator terminates the instance from the console, it doesn't wait for the node drain, and at some point kubelet is killed without reporting the state of the node. Pods are then stuck in Terminating state seemingly forever (in @rayterrill's case, as you can see above, over 16h). Normally EKS would be intelligent enough to remove the node from the cluster, thus forcing the scheduler to reallocate those pods, but since we have a finalizer in place it gets stuck. With an autoscaling group you have a lifecycle hook which helps with this situation; in our case I guess we don't, so if the node has not drained after x we would need to ask the cloud provider if that node still exists. |
Yes @alekc, that's exactly what I was thinking might be happening: the workflow @njtran is describing, where the node controller realizes a node is gone and should be removed from the cluster, is in fact running, but the Karpenter finalizer is blocking it from actually removing the node. It seems like Karpenter should handle this case and remove its finalizer if the node is gone. In fact, now that I'm thinking about it, that's likely exactly what I've simulated: if I manually remove the finalizer on the stuck node, the node controller indeed removes the node without any other intervention. |
When you do find a node in this state again, could you provide us a list of all the pods that are on it? During our drain process, we skip any pods that are stuck in Terminating so I'm curious to see why we're stuck in a cordon and drain loop. The Karpenter controller was otherwise healthy right? |
@suket22 I provided that today from a stuck node before I manually intervened: #1166 (comment) |
I could not retrigger the condition while debugging this morning, so I can't say which pod was marked as not drained (it might be a good idea to add a debug message there). Looking through the code, I noticed this bit:

    // 2. Separate pods as non-critical and critical
    // https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown
    for _, pod := range pods {
        if val := pod.Annotations[v1alpha5.DoNotEvictPodAnnotationKey]; val == "true" {
            logging.FromContext(ctx).Debugf("Unable to drain node, pod %s has do-not-evict annotation", pod.Name)
            return false, nil
        }
    }

Am I right to think that for a deleted node that had a pod with that annotation, we won't ever be able to safely delete the node as it is? (Just to be clear, this is NOT what's happening in either of our cases, because we would have seen that message in the debug log.) For now I've added some additional debugging to the drain function to see which pods are marked as not drained, and I will try to retrigger the scenario later on. |
The default rate limit for the EC2 API is 100/sec for non-mutating calls. It isn't that restrictive. I think that would only cause problems if a large amount of nodes were deleted, but that would be throttled by PDBs and other things anyway. |
Even something like a gate where you only check if a node has been "stuck" in unknown for something like 10m would cut down on the calls as well. Something like: |
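One way such a gate might look, as a minimal sketch only: the `ReadyCondition` struct and `shouldQueryCloudProvider` are hypothetical names, and the 10 minute threshold is simply the figure mentioned in the comment above.

```go
package main

import (
	"fmt"
	"time"
)

// stuckThreshold is the 10 minute window suggested in the comment above.
const stuckThreshold = 10 * time.Minute

// ReadyCondition is a hypothetical, simplified view of a node's Ready condition.
type ReadyCondition struct {
	Status             string // "True", "False" or "Unknown"
	LastTransitionTime time.Time
}

// shouldQueryCloudProvider reports whether the node has been in Unknown
// status for at least stuckThreshold, so that the cloud API is only called
// for nodes that look stuck.
func shouldQueryCloudProvider(c ReadyCondition, now time.Time) bool {
	if c.Status != "Unknown" {
		return false
	}
	return now.Sub(c.LastTransitionTime) >= stuckThreshold
}

func main() {
	now := time.Now()
	recent := ReadyCondition{Status: "Unknown", LastTransitionTime: now.Add(-2 * time.Minute)}
	stale := ReadyCondition{Status: "Unknown", LastTransitionTime: now.Add(-16 * time.Hour)}
	fmt.Println("recent:", shouldQueryCloudProvider(recent, now))
	fmt.Println("stale:", shouldQueryCloudProvider(stale, now))
}
```

With a gate like this, healthy nodes and nodes that only just lost contact never trigger a cloud API call, which keeps the request volume well below the rate limit discussed earlier.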
It could be an idea to directly delete nodes once the unreachable taint has been added. It isn't possible to (gracefully) terminate pods when the kubelet is not responding anyway. |
Is this possibly the root cause? #1257 |
@olemarkus please no, I've had the unreachable taint placed on nodes due to a spike in load average (kubelet was not able to report properly); that would be too drastic? @ellistarn it might be in some scenarios, but the underlying issue still persists (do we really care about pods on a terminated instance?). I've also created a PR (above) which adds additional logging, eliminating the black hole (where we are unable to understand why the node is not being terminated). |
@alekc I meant that the finalizer would be removed if that was the case. I.e someone would have had to try to delete the node and the unreachable taint would have to be there. I certainly did not intend that the unreachable taint would cause node deletion on its own. |
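The condition clarified above (both a deletion request and the unreachable taint must be present) can be written as a small predicate. This is a sketch under stated assumptions: the `Node` struct and function names are hypothetical, and only the taint key `node.kubernetes.io/unreachable` is the standard Kubernetes one.

```go
package main

import "fmt"

// unreachableTaint is the taint key the node controller adds when a node
// stops reporting.
const unreachableTaint = "node.kubernetes.io/unreachable"

// Node is a hypothetical, simplified stand-in for a Kubernetes node.
type Node struct {
	Name              string
	DeletionRequested bool     // true once a deletion timestamp has been set
	TaintKeys         []string // keys of the taints currently on the node
}

// hasTaint reports whether the node carries a taint with the given key.
func hasTaint(n Node, key string) bool {
	for _, k := range n.TaintKeys {
		if k == key {
			return true
		}
	}
	return false
}

// canDropFinalizerWithoutDrain is true only when someone has already asked
// for the node to be deleted AND the node is tainted unreachable. The taint
// on its own never triggers anything.
func canDropFinalizerWithoutDrain(n Node) bool {
	return n.DeletionRequested && hasTaint(n, unreachableTaint)
}

func main() {
	overloaded := Node{Name: "busy", TaintKeys: []string{unreachableTaint}}
	gone := Node{Name: "gone", DeletionRequested: true, TaintKeys: []string{unreachableTaint}}
	fmt.Println(overloaded.Name, canDropFinalizerWithoutDrain(overloaded))
	fmt.Println(gone.Name, canDropFinalizerWithoutDrain(gone))
}
```

A node that is merely overloaded and temporarily tainted unreachable is left alone, because nobody has requested its deletion.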
Oh, this is a very good point. If we remove the finalizer as soon as the node is tainted unreachable and put it back when it's reachable again, EKS should be able to clean everything up without Karpenter needing to do anything else. This solution won't require a change in the cloud provider interface either. I guess the only downside would be that in case (or better, when) we expand to other providers in the future, what if the control plane is not able to remove missing nodes? Should it be our responsibility? |
It's CCM's responsibility to remove nodes when the underlying instance is gone. It's a requirement of the CCM node controller interface. So if it is not able to remove missing nodes, the issue must be fixed there. Having Karpenter make up for a misbehaving CCM is not exactly ideal. |
I don't think removing the finalizer will help. That means that if a node is partitioned, Karpenter will delete the node without cordon/drain. This means that the node may still exist, as well as the pods running on it, causing a wide array of problems like violating stateful set guarantees. |
As far as we understand, logically, the current implementation does the right thing. We need to understand what our gap in understanding is before making changes, or we risk making the situation worse. |
@ellistarn Will Karpenter delete the node? Or will K8s itself delete the node if the Karpenter finalizer is removed? |
@ellistarn I did a quick poc (without testing) alekc-forks@dc24cc8 As far as I understand, that implementation would remove the finalizer in case there is a new taint assigned to the node (we are not talking about the deletion yet). I have yet to test what happens when the taint is removed and if there is a conflict with karpenter trying to reassign the finalizer, so as I said a minimal POC. In theory (and I might be very wrong), if the node is without finalizer (so it's in the not reachable state), then for it to be deleted, it would require either intervention of ccm or operator? (or the ttl expiry, potentially). As for
I don't think it does so 100% of the time, as I mentioned above. So either a check for the existence of the node in the cloud (requiring an interface change) or removal of the finalizer in case the node is unschedulable would probably be needed. |
Sorry for the delay -- getting back to this
Karpenter sends the deletion request which sets deletiontimestamp in the API Server, which causes cordon, drain, a delete call to EC2, and eventually removes the finalizer from the node. Kubernetes garbage collection deletes the node. |
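The sequence described above can be modelled as an ordered list of checks, which makes it easier to see where a node can hang. This is an illustrative sketch only: `TerminationState` and `nextStep` are hypothetical names, not part of Karpenter.

```go
package main

import "fmt"

// TerminationState is a hypothetical record of how far a node has
// progressed through the termination workflow.
type TerminationState struct {
	DeletionRequested  bool // deletion timestamp set in the API server
	Cordoned           bool
	Drained            bool
	InstanceTerminated bool // delete call to EC2 has succeeded
	FinalizerRemoved   bool
}

// nextStep returns the first step of the workflow that has not completed.
// Each step only runs once every earlier step has finished.
func nextStep(s TerminationState) string {
	switch {
	case !s.DeletionRequested:
		return "none"
	case !s.Cordoned:
		return "cordon"
	case !s.Drained:
		return "drain"
	case !s.InstanceTerminated:
		return "terminate instance"
	case !s.FinalizerRemoved:
		return "remove finalizer"
	default:
		return "garbage collection"
	}
}

func main() {
	// A node that has been cordoned but still has pods that cannot be
	// evicted never gets past the drain step.
	stuck := TerminationState{DeletionRequested: true, Cordoned: true}
	fmt.Println("next step:", nextStep(stuck))
}
```

Because finalizer removal sits after the drain in this ordering, a drain that never completes keeps the node object in the cluster even when the instance behind it is already gone.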
I think this may be the gap. Can you confirm that you're using the |
I'm able to reproduce one case. I don't know if it's your exact case, since you don't have debugging logs on, so you wouldn't have seen these lines.
Node stuck draining
Pod stuck terminating
Logs
|
@ellistarn we are not using the do-not-evict key. |
Ah bummer -- well it's a case, if not your case. Will keep digging on this. Have you been able to successfully reproduce? |
Just checked again today - we haven't had it happen in awhile now. I wonder if it's a combination of a couple of things - multiple spot instances recalled at once, PDBs on critical services, etc - basically just leaving things in a bad state. I know we have some pods that don't play nicely with disruptions. |
Feel free to reopen if this recurs. |
This is happening to me right now with version 0.13.2: after the node becomes empty, it hangs on the finalizer!
|
Version
Karpenter: v0.5.3
Kubernetes: v1.21
Expected Behavior
Nodes should be terminated and exit cleanly.
Actual Behavior
EC2 instance is terminated, node continues to remain in Kubernetes. Removing the karpenter finalizer allows the node to be fully deleted.
Steps to Reproduce the Problem
Unsure - This has happened a few times now. Upgrading to 0.5.4 to see if this is still an issue.
Resource Specs and Logs
Will include next time - unfortunately pod restarted and the logs were lost.