
Karpenter Appears to Attempt to Delete Nodes, Hang on Finalizer #1166

Closed
rayterrill opened this issue Jan 18, 2022 · 50 comments
Labels: bug (Something isn't working), burning (Time sensitive issues)

Comments

@rayterrill
Contributor

Version

Karpenter: v0.5.3

Kubernetes: v1.21

Expected Behavior

Nodes should be terminated and exit cleanly.

Actual Behavior

EC2 instance is terminated, node continues to remain in Kubernetes. Removing the karpenter finalizer allows the node to be fully deleted.

Steps to Reproduce the Problem

Unsure - This has happened a few times now. Upgrading to 0.5.4 to see if this is still an issue.

Resource Specs and Logs

Will include next time - unfortunately pod restarted and the logs were lost.

@rayterrill added the bug label on Jan 18, 2022
@njtran
Contributor

njtran commented Jan 18, 2022

Would you be able to share more information on your issue here? What exactly happened that wasn't expected? How long was the node up before the termination finalizer was removed? Was the termination finalizer never removed?

@olemarkus
Contributor

If I understand the termination controller correctly, it will try to cordon and drain nodes even if the underlying instance no longer exists. I think the first thing it should do is check the instance status, and if it is not in some form of running state it should just remove the finalizer.
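
To make this concrete, here is a rough sketch of what such a check against EC2 could look like. This is not Karpenter's actual cloud provider code; the helper name, the providerID parsing, and treating shutting-down/terminated/not-found instances as "gone" are all assumptions for illustration.

// Hypothetical helper, not Karpenter's cloud provider interface: given a node's
// spec.providerID such as "aws:///us-west-2a/i-0123456789abcdef0", ask EC2 whether
// the instance is already shutting down, terminated, or no longer known.
package sketch

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

func instanceIsGone(client ec2iface.EC2API, providerID string) (bool, error) {
	// The instance ID is the last path segment of the provider ID.
	parts := strings.Split(providerID, "/")
	instanceID := parts[len(parts)-1]

	out, err := client.DescribeInstances(&ec2.DescribeInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		// EC2 eventually forgets terminated instances, so "not found" also counts as gone.
		if aerr, ok := err.(awserr.Error); ok && aerr.Code() == "InvalidInstanceID.NotFound" {
			return true, nil
		}
		return false, fmt.Errorf("describing instance %s: %w", instanceID, err)
	}
	for _, reservation := range out.Reservations {
		for _, instance := range reservation.Instances {
			switch aws.StringValue(instance.State.Name) {
			case ec2.InstanceStateNameShuttingDown, ec2.InstanceStateNameTerminated:
				return true, nil
			}
		}
	}
	return false, nil
}

If a check like this ran before the cordon/drain loop, the termination controller could skip straight to removing the finalizer for instances that no longer exist.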

@ellistarn
Contributor

IIUC the logic already attempts to do this, though perhaps we're missing a detail with the EC2 API.

@rayterrill
Contributor Author

I'll see if I can dig into our log management tool to find the logs from the Karpenter controller.

This is the 2nd time this has happened in our lab cluster - I actually went looking for it this morning to see if it had happened again, and indeed it had.

I looked at the cluster with Lens, and I could see the same situation as last week - two nodes sitting in cordoned/no schedule mode, and some pods appearing to be stuck in "Terminating" status.

Went into the EKS console UI, and I could see the nodes there in Unknown status.

Just to see what would happen, I uncordoned the nodes in Lens - Karpenter immediately cordoned them again.

To get rid of them, I manually edited the nodes and removed the Karpenter finalizer.

I'll keep an eye on our cluster and see if it happens again too - can try to snag the logs.

@rayterrill
Contributor Author

rayterrill commented Jan 19, 2022

Ok - Happened again in our env. ip-10-30-14-231.us-west-2.compute.internal is currently "stuck" - sitting there in cordoned state according to k8s, but gone from EC2.

Here are the logs.

Initial instance of karpenter-controller (looks like this pod restarted):

2022-01-18T16:59:19.938Z	INFO	controller.controller.volume	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim", "source": "kind source: /, Kind="}
2022-01-18T16:59:19.938Z	INFO	controller.controller.volume	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim", "source": "kind source: /, Kind="}
2022-01-18T16:59:19.938Z	INFO	controller.controller.volume	Starting Controller	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim"}
2022-01-18T16:59:19.938Z	INFO	controller.controller.termination	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-18T16:59:19.938Z	INFO	controller.controller.termination	Starting Controller	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node"}
2022-01-18T16:59:19.938Z	INFO	controller.controller.node	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-18T16:59:19.938Z	INFO	controller.controller.node	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-18T16:59:19.938Z	INFO	controller.controller.node	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-18T16:59:19.938Z	INFO	controller.controller.node	Starting Controller	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node"}
2022-01-18T16:59:19.938Z	INFO	controller.controller.podmetrics	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Pod", "source": "kind source: /, Kind="}
2022-01-18T16:59:19.939Z	INFO	controller.controller.podmetrics	Starting Controller	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Pod"}
2022-01-18T16:59:19.939Z	INFO	controller.controller.nodemetrics	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-18T16:59:19.939Z	INFO	controller.controller.nodemetrics	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-18T16:59:19.939Z	INFO	controller.controller.nodemetrics	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-18T16:59:19.939Z	INFO	controller.controller.nodemetrics	Starting Controller	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node"}
2022-01-18T16:59:20.087Z	INFO	controller.controller.provisioning	Starting workers	{"commit": "7e79a67", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "worker count": 10}
2022-01-18T16:59:20.088Z	INFO	controller.controller.termination	Starting workers	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "worker count": 10}
2022-01-18T16:59:20.088Z	INFO	controller.controller.podmetrics	Starting workers	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Pod", "worker count": 1}
2022-01-18T16:59:20.098Z	INFO	controller.controller.counter	Starting workers	{"commit": "7e79a67", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "worker count": 10}
2022-01-18T16:59:20.100Z	INFO	controller.controller.volume	Starting workers	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim", "worker count": 1}
2022-01-18T16:59:20.100Z	INFO	controller.controller.node	Starting workers	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "worker count": 10}
2022-01-18T16:59:20.111Z	INFO	controller.controller.nodemetrics	Starting workers	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "worker count": 1}
2022-01-18T16:59:21.519Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "7e79a67", "provisioner": "arm64"}
2022-01-18T16:59:21.669Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "7e79a67", "provisioner": "default"}
2022-01-19T06:19:19.210Z	INFO	controller.termination	Cordoned node	{"commit": "7e79a67", "node": "ip-10-30-12-61.us-west-2.compute.internal"}
2022-01-19T06:24:21.547Z	INFO	controller.termination	Deleted node	{"commit": "7e79a67", "node": "ip-10-30-12-61.us-west-2.compute.internal"}
E0119 06:50:13.005146       1 leaderelection.go:361] Failed to update lock: Internal error occurred: failed calling webhook "objects.hnc.x-k8s.io": Post "https://hnc-webhook-service.hnc-system.svc:443/validate-objects?timeout=2s": dial tcp 10.30.12.218:9443: connect: connection refused
E0119 06:50:15.022746       1 leaderelection.go:361] Failed to update lock: Internal error occurred: failed calling webhook "objects.hnc.x-k8s.io": Post "https://hnc-webhook-service.hnc-system.svc:443/validate-objects?timeout=2s": dial tcp 10.30.12.218:9443: connect: connection refused
E0119 06:50:17.021832       1 leaderelection.go:361] Failed to update lock: Internal error occurred: failed calling webhook "objects.hnc.x-k8s.io": Post "https://hnc-webhook-service.hnc-system.svc:443/validate-objects?timeout=2s": dial tcp 10.30.12.218:9443: connect: connection refused
E0119 06:50:19.021057       1 leaderelection.go:361] Failed to update lock: Internal error occurred: failed calling webhook "objects.hnc.x-k8s.io": Post "https://hnc-webhook-service.hnc-system.svc:443/validate-objects?timeout=2s": dial tcp 10.30.12.218:9443: connect: connection refused
E0119 06:50:22.988777       1 leaderelection.go:361] Failed to update lock: Put "https://172.20.0.1:443/api/v1/namespaces/karpenter/configmaps/karpenter-leader-election": context deadline exceeded
I0119 06:50:22.988842       1 leaderelection.go:278] failed to renew lease karpenter/karpenter-leader-election: timed out waiting for the condition
panic: Unable to start manager, leader election lost

goroutine 1 [running]:
main.main()
	github.com/aws/karpenter/cmd/controller/main.go:99 +0xef9

Looks like something upset the hnc pod, which blocked leader election on karpenter.

2nd instance of karpenter-controller (running now):

2022-01-19T07:02:48.581Z	INFO	controller.controller.volume	Starting Controller	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim"}
2022-01-19T07:02:48.581Z	INFO	controller.controller.termination	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-19T07:02:48.581Z	INFO	controller.controller.termination	Starting Controller	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node"}
2022-01-19T07:02:48.581Z	INFO	controller.controller.node	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-19T07:02:48.581Z	INFO	controller.controller.node	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-19T07:02:48.581Z	INFO	controller.controller.node	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-19T07:02:48.581Z	INFO	controller.controller.node	Starting Controller	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node"}
2022-01-19T07:02:48.581Z	INFO	controller.controller.podmetrics	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Pod", "source": "kind source: /, Kind="}
2022-01-19T07:02:48.581Z	INFO	controller.controller.podmetrics	Starting Controller	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Pod"}
2022-01-19T07:02:48.581Z	INFO	controller.controller.nodemetrics	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-19T07:02:48.581Z	INFO	controller.controller.nodemetrics	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-19T07:02:48.581Z	INFO	controller.controller.nodemetrics	Starting EventSource	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-01-19T07:02:48.581Z	INFO	controller.controller.nodemetrics	Starting Controller	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node"}
2022-01-19T07:02:48.680Z	INFO	controller.controller.provisioning	Starting workers	{"commit": "7e79a67", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "worker count": 10}
2022-01-19T07:02:48.681Z	INFO	controller.controller.podmetrics	Starting workers	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Pod", "worker count": 1}
2022-01-19T07:02:48.700Z	INFO	controller.controller.termination	Starting workers	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "worker count": 10}
2022-01-19T07:02:48.700Z	INFO	controller.controller.counter	Starting workers	{"commit": "7e79a67", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "worker count": 10}
2022-01-19T07:02:48.702Z	INFO	controller.controller.volume	Starting workers	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim", "worker count": 1}
2022-01-19T07:02:48.702Z	INFO	controller.controller.node	Starting workers	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "worker count": 10}
2022-01-19T07:02:48.709Z	INFO	controller.controller.nodemetrics	Starting workers	{"commit": "7e79a67", "reconciler group": "", "reconciler kind": "Node", "worker count": 1}
2022-01-19T07:02:50.240Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "7e79a67", "provisioner": "arm64"}
2022-01-19T07:02:50.260Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "7e79a67", "provisioner": "default"}
2022-01-19T07:03:05.266Z	INFO	controller.termination	Cordoned node	{"commit": "7e79a67", "node": "ip-10-30-14-231.us-west-2.compute.internal"}
2022-01-19T08:30:54.867Z	INFO	controller.termination	Cordoned node	{"commit": "7e79a67", "node": "ip-10-30-12-197.us-west-2.compute.internal"}
2022-01-19T08:30:56.033Z	INFO	controller.provisioning	Batched 1 pods in 1.000770345s	{"commit": "7e79a67", "provisioner": "default"}
2022-01-19T08:30:56.384Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [c5ad.4xlarge c5a.4xlarge c5d.4xlarge c5.4xlarge]	{"commit": "7e79a67", "provisioner": "default"}
2022-01-19T08:30:59.747Z	INFO	controller.provisioning	Launched instance: i-023900719cd083d34, hostname: ip-10-30-15-36.us-west-2.compute.internal, type: c5.4xlarge, zone: us-west-2c, capacityType: spot	{"commit": "7e79a67", "provisioner": "default"}
2022-01-19T08:30:59.764Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-10-30-15-36.us-west-2.compute.internal	{"commit": "7e79a67", "provisioner": "default"}
2022-01-19T08:30:59.764Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "7e79a67", "provisioner": "default"}
2022-01-19T08:36:17.212Z	INFO	controller.termination	Deleted node	{"commit": "7e79a67", "node": "ip-10-30-12-197.us-west-2.compute.internal"}
2022-01-19T10:52:08.221Z	INFO	controller.termination	Cordoned node	{"commit": "7e79a67", "node": "ip-10-30-15-85.us-west-2.compute.internal"}
2022-01-19T10:52:09.720Z	INFO	controller.provisioning	Batched 2 pods in 1.11576356s	{"commit": "7e79a67", "provisioner": "default"}
2022-01-19T10:52:09.937Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 2 pod(s) with instance type option(s) [c5d.4xlarge c5a.4xlarge c5ad.4xlarge c5.4xlarge]	{"commit": "7e79a67", "provisioner": "default"}
2022-01-19T10:52:12.364Z	INFO	controller.provisioning	Launched instance: i-0da0e06e8fa747739, hostname: ip-10-30-12-171.us-west-2.compute.internal, type: c5.4xlarge, zone: us-west-2c, capacityType: spot	{"commit": "7e79a67", "provisioner": "default"}
2022-01-19T10:52:12.381Z	INFO	controller.provisioning	Bound 2 pod(s) to node ip-10-30-12-171.us-west-2.compute.internal	{"commit": "7e79a67", "provisioner": "default"}
2022-01-19T10:52:12.382Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "7e79a67", "provisioner": "default"}
2022-01-19T10:57:10.557Z	INFO	controller.termination	Deleted node	{"commit": "7e79a67", "node": "ip-10-30-15-85.us-west-2.compute.internal"}

You can see where Karpenter cordoned ip-10-30-14-231.us-west-2.compute.internal, but there's no message indicating the instance was deleted even though it's gone from EC2.

Same thing as before - If I manually uncordon ip-10-30-14-231.us-west-2.compute.internal, Karpenter immediately re-cordons it:

2022-01-19T16:37:51.952Z	INFO	controller.termination	Cordoned node	{"commit": "7e79a67", "node": "ip-10-30-14-231.us-west-2.compute.internal"}

I'm gonna go ahead and unstick this env (remove the finalizer from this node so it fully deletes) since it seems like we can repro this fairly regularly - let me know what else I can provide.

@ellistarn added the burning label on Jan 19, 2022
@ellistarn
Contributor

Adding :burning to this. We need to get this sorted out in the scale down logic.

@suket22 self-assigned this on Jan 24, 2022
@suket22
Contributor

suket22 commented Jan 26, 2022

I've tried a couple things, but haven't been able to reproduce.

  1. In the simplest case, I terminate the instance from EC2 directly. Node controller detects this and marks the node as unschedulable. I see the pods are evicted and the node is eventually cleaned up by the termination controller.
  2. Next, I tried deploying pods that tolerate all taints. My guess was that maybe the Karpenter controller is stuck in a loop, infinitely trying to drain a node, and therefore isn't getting to the point of actually removing the finalizer itself. I wasn't able to replicate your node getting stuck this way either.

Could you help me with -

  1. Debug logs of the Karpenter controller and a list of pods that are running on the stuck nodes. Just curious as to whether this node is empty or not and if the pods tolerate everything. For example, some calico pods do IIRC.
  2. Can you check who called the EC2 TerminateInstance API in CloudTrail? I find it quite surprising Karpenter would terminate the instance, but not successfully remove the finalizer since the termination only happens after the cordon and drain are complete.

I could potentially fix this by adding a check that if the cloudprovider declares the instance as gone, we short-circuit and directly remove the finalizer from the node object without a cordon or drain. Having said that, I'm still curious as to what could have triggered this.

@rayterrill
Contributor Author

I haven't seen this happen since. I just migrated Karpenter to Fargate and rolled this out to 2 more clusters in our lab env. Planning to let it marinate there for a bit to see if it happens again, and I can provide additional detail.

I def like the sound of the check to short circuit and remove the finalizer if something else removes the EC2 instance - that might be helpful in this case, but could also be helpful in case of operator error or something else external to Karpenter killing off the node.

@alekc
Contributor

alekc commented Jan 27, 2022

I've been able to isolate this behaviour on my end.

My scenario: kubernetes 1.21, karpenter 0.5.6, aws node termination handler working in sqs mode (probably unrelated)

On manual termination of a Karpenter node from the AWS console, the termination leaves a bunch of pods in Terminating state.

Since the kubelet is down at that point, they never recover from that state, and as a consequence this
https://github.com/aws/karpenter/blob/main/pkg/controllers/termination/controller.go#L89-L91
return is always triggered.

Potentially, what we could do is something like:
If the node is not yet drained, and it has been more than X since the deletion timestamp, get info about the node from the cloud provider, and if it doesn't exist, perform the deletion.
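
As a minimal sketch of that idea (the names and the grace period are assumptions, not the actual controller code), the drain path could fall back to a cloud provider lookup only once a node has been terminating for a while:

package sketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// drainGracePeriod is the "X" above; the value is only for illustration.
const drainGracePeriod = 5 * time.Minute

// shouldCheckCloudProvider returns true when the node still isn't drained but has
// been terminating for longer than the grace period, at which point asking the
// cloud provider whether the instance still exists is worth the extra API call.
func shouldCheckCloudProvider(node *v1.Node, drained bool) bool {
	if drained || node.DeletionTimestamp == nil {
		return false
	}
	return time.Since(node.DeletionTimestamp.Time) > drainGracePeriod
}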

@rayterrill
Contributor Author

Wait - Is it possible this is happening because we have spot instances and they're being "reclaimed" by AWS? One of the clusters I set up yesterday is in this state - let me dig into the logs to see what happened.

@alekc
Contributor

alekc commented Jan 27, 2022

@rayterrill might very well be - check if you have a pod stuck in Terminating state somewhere.

@rayterrill
Contributor Author

Here's a list of pods "running" on one of the stuck nodes:

rayterrill@C02FR4M4MD6R ~ % kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ip-10-30-44-70.ap-south-1.compute.internal
NAMESPACE                            NAME                                                              READY   STATUS        RESTARTS   AGE   IP             NODE                                         NOMINATED NODE   READINESS GATES
calico-system                        calico-kube-controllers-59f4448cd7-wqhc5                          1/1     Terminating   1          16h   10.30.47.225   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
calico-system                        calico-node-k457x                                                 1/1     Running       0          16h   10.30.44.70    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
calico-system                        calico-typha-688bbd479c-j74kr                                     1/1     Running       0          16h   10.30.44.70    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-cert-manager           cert-manager-6875b5cc95-zx6sz                                     1/1     Terminating   0          16h   10.30.45.200   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-cert-manager           cert-manager-cainjector-c9fc7b997-pn5cm                           1/1     Terminating   1          16h   10.30.47.42    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-cert-manager           cert-manager-webhook-58b458f465-48h75                             1/1     Terminating   0          16h   10.30.46.154   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-datadog-agent          datadog-agent-72lgd                                               4/4     Running       0          16h   10.30.44.70    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-datadog-agent          datadog-agent-cluster-agent-5f588485db-thglz                      1/1     Terminating   0          16h   10.30.47.156   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-datadog-agent          datadog-agent-clusterchecks-57b4f798b7-t84sr                      1/1     Terminating   0          16h   10.30.44.54    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-external-dns           external-dns-657964d4ff-74gqp                                     1/1     Terminating   1          16h   10.30.46.253   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-image-secrets-puller   image-secrets-puller-deployment-7544fbb7f9-klg85                  1/1     Terminating   0          16h   10.30.46.2     ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-metrics-server         metrics-server-85fcf9bdbf-lxsz4                                   1/1     Terminating   1          16h   10.30.45.128   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-metrics-server         metrics-server-85fcf9bdbf-xvzb9                                   1/1     Terminating   0          16h   10.30.45.74    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-nginx-ingress          nginx-ingress-external-controller-7f985b97c5-4vnrs                1/1     Terminating   0          16h   10.30.47.253   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-nginx-ingress          nginx-ingress-external-controller-7f985b97c5-dkzhn                1/1     Terminating   0          16h   10.30.47.150   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-nginx-ingress          nginx-ingress-external-v2-ingress-nginx-controller-6b7c56dfxsnm   1/1     Terminating   0          16h   10.30.45.222   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-nginx-ingress          nginx-ingress-external-v2-ingress-nginx-controller-6b7c56dqlnn9   1/1     Terminating   0          16h   10.30.47.48    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-nginx-ingress          nginx-ingress-internal-controller-77d78b8f6c-bqlpn                1/1     Terminating   0          16h   10.30.47.104   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-nginx-ingress          nginx-ingress-internal-controller-77d78b8f6c-xr69q                1/1     Terminating   0          16h   10.30.46.98    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-nginx-ingress          nginx-ingress-internal-v2-ingress-nginx-controller-68c9f5b59m5x   1/1     Terminating   0          16h   10.30.45.75    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
cluster-addon-nginx-ingress          nginx-ingress-internal-v2-ingress-nginx-controller-68c9f5b5g24l   1/1     Terminating   0          16h   10.30.46.186   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
istio-system                         api-gateway-v2-79c56699b8-j7ltt                                   0/1     Terminating   0          16h   10.30.44.128   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
istio-system                         api-gateway-v2-79c56699b8-wffgn                                   0/1     Terminating   0          16h   10.30.46.200   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
istio-system                         api-gateway-v2-sandbox-769b8ffbdb-kbl9k                           1/1     Terminating   0          16h   10.30.47.176   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
istio-system                         istio-ingressgateway-f7c58dd7-vmhnk                               1/1     Terminating   0          16h   10.30.44.120   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
istio-system                         istiod-1-10-3-75645f44d5-kdlb2                                    1/1     Terminating   0          16h   10.30.45.196   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
istio-system                         istiod-1-10-3-75645f44d5-pbhdm                                    1/1     Terminating   0          16h   10.30.45.137   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
istio-system                         istiod-1-10-3-75645f44d5-t5d6z                                    1/1     Terminating   0          16h   10.30.46.245   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
kube-system                          aws-node-cj459                                                    1/1     Running       0          16h   10.30.44.70    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
kube-system                          coredns-6b4cdc67b4-vbcrg                                          1/1     Terminating   0          16h   10.30.46.113   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
kube-system                          coredns-coredns-6f4f44d478-f5w6k                                  1/1     Terminating   0          16h   10.30.45.98    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
kube-system                          coredns-coredns-6f4f44d478-m5wwr                                  1/1     Terminating   0          16h   10.30.47.119   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
kube-system                          ebs-csi-controller-5797b9f99b-b629n                               6/6     Running       1          16h   10.30.45.32    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
kube-system                          ebs-csi-node-6rxcv                                                3/3     Running       0          16h   10.30.44.70    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
kube-system                          ebs-snapshot-controller-0                                         1/1     Running       0          16h   10.30.47.126   ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
kube-system                          efs-csi-node-d2xkm                                                3/3     Running       0          16h   10.30.44.70    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
kube-system                          kube-proxy-pmz7f                                                  1/1     Running       0          16h   10.30.44.70    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>
tigera-operator                      tigera-operator-7dc6bc5777-nvd5l                                  1/1     Running       0          16h   10.30.44.70    ip-10-30-44-70.ap-south-1.compute.internal   <none>           <none>

@alekc
Contributor

alekc commented Jan 27, 2022

Yeah, try to kubectl get po | grep Terminating | awk '{print $1}' | xargs kubectl delete po --force --grace-period=0 (I am using kubens, if you are not add -n) to clean them up and see if your node is removed correctly after that.

@rayterrill
Contributor Author

Unfortunately losing those two nodes left me with zero nodes in my cluster other than the fargate nodes running Karpenter, so Karpenter is currently jacked up with no CoreDNS. Let me get that going so Fargate can use DNS without any pods.

This makes me think it might be nice to have an option to request at least one on-demand node when we have both on-demand and spot instance types in our provisioner - not sure how common it is to set up the CoreDNS/Fargate piece.

@alekc
Contributor

alekc commented Jan 27, 2022

Yeah, I've tried to use Karpenter on EKS Fargate but abandoned the idea pretty soon (it was not working well, and tbh I disliked Fargate's limitations a lot). Given that Fargate Spot is not supported on EKS, it costs the same to create a small ASG with only one node (i.e. t3.micro/small) during the bootstrap process and taint it with CriticalAddonsOnly=true:NoSchedule. This allows CoreDNS and Karpenter to be spawned on it and kick off the rest.

Karpenter installed with

webhook:
  replicas: 2
  tolerations:
    - key: "CriticalAddonsOnly"
      operator: "Exists"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
              - linux
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: karpenter.sh/provisioner-name
                operator: DoesNotExist
controller:
  env:
    - name: AWS_ENI_LIMITED_POD_DENSITY
      value: "false"
  replicas: 1
  tolerations:
    - key: "CriticalAddonsOnly"
      operator: "Exists"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
              - linux
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: karpenter.sh/provisioner-name
                operator: DoesNotExist
  clusterName: ${var.cluster_name}
  clusterEndpoint: ${data.aws_eks_cluster.cluster.endpoint}
  resources:
    requests:
      memory: 500Mi

@rayterrill
Contributor Author

That's exactly what I started with, but I flipped over to using Fargate because it felt "cleaner". Having second thoughts given the extra setup to get DNS working, which means whacking the coredns deployment to remove the ec2 annotation...

@rayterrill
Contributor Author

It does indeed look like our instances were terminated due to spot pre-emption (which looks like it gets recorded as BidEvictedEvent, not TerminateInstance):

{
    "eventVersion": "1.08",
    "userIdentity": {
        "accountId": "REDACTED",
        "invokedBy": "ec2.amazonaws.com"
    },
    "eventTime": "2022-01-27T01:16:15Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "BidEvictedEvent",
    "awsRegion": "ap-south-1",
    "sourceIPAddress": "ec2.amazonaws.com",
    "userAgent": "ec2.amazonaws.com",
    "requestParameters": null,
    "responseElements": null,
    "eventID": "f93d9c3f-b2cc-4b1a-92c9-5108b378afa0",
    "readOnly": false,
    "eventType": "AwsServiceEvent",
    "managementEvent": true,
    "recipientAccountId": "REDACTED",
    "serviceEventDetails": {
        "instanceIdSet": [
            "i-07d75f63adacdef9f"
        ]
    },
    "eventCategory": "Management"
}

@rayterrill
Contributor Author

Once I got another node provisioned (which got CoreDNS back up), the nodes did indeed disappear after force deleting the pods stuck in terminating status.

@rayterrill
Contributor Author

rayterrill commented Jan 27, 2022

Same issue on another one of the clusters I added yesterday - spot pre-emption leaving the node in a stuck state. Removed the finalizer to let the node go.

@njtran
Contributor

njtran commented Jan 27, 2022

Hey @rayterrill, regardless of whether an instance is terminated manually by an operator or "reclaimed" by AWS through spot interruption, the Kubernetes node controller should eventually realize the instance no longer exists and delete the node, triggering the termination workflow you've mentioned.

With regards to a previous comment:

Happened again in our env. ip-10-30-14-231.us-west-2.compute.internal is currently "stuck" - sitting there in cordoned state according to k8s, but gone from EC2.

If I manually uncordon ip-10-30-14-231.us-west-2.compute.internal, Karpenter immediately re-cordons it:

This behavior sounds like it's acting as intended if the underlying instance has already been terminated in EC2, spot or not. The finalizer on the node acts as a signal for a pre-deletion hook, so if someone tries to delete a node, Karpenter will reconcile that node state to be cordoned, fully drained, then terminated. Any attempts to uncordon in that process will be futile, as the controller will cordon it again.

As previously mentioned by @alekc, if any pods are unable to be evicted/terminated, then that will block deletion. Any pods that tolerate unschedulable or are daemonsets are ignored for this condition. W.r.t. pods that are stuck terminating even though we have deleted them, we've seen this if the node is of NotReady/Unknown status, where the kubelet is unable to register pods as terminated and remove them.
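
For illustration, the filtering described here could look roughly like the sketch below (an assumed helper name, not Karpenter's actual drain code): daemonset pods and pods tolerating the unschedulable taint are never counted as blocking.

package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// blocksDrain reports whether a pod should hold up node termination.
func blocksDrain(pod *v1.Pod) bool {
	// DaemonSet pods would just be recreated on the node, so they never block drain.
	for _, owner := range pod.OwnerReferences {
		if owner.Kind == "DaemonSet" {
			return false
		}
	}
	// Pods that tolerate the unschedulable taint (or tolerate everything) are
	// ignored as well, since evicting them only brings them back.
	for _, toleration := range pod.Spec.Tolerations {
		if toleration.Key == v1.TaintNodeUnschedulable ||
			(toleration.Key == "" && toleration.Operator == v1.TolerationOpExists) {
			return false
		}
	}
	return true
}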

Additionally in the same logs, it looks like Karpenter may have been scheduled to a node it had provisioned, evicting itself and eventually re-scheduling to another node, which may have caused a delay in your cluster.

the nodes did indeed disappear after force deleting the pods stuck in terminating status.

Would you be able to describe these pods if they're unable to be deleted? How long are they deleting before you force terminate them? Are they owned by any replication controllers, and do you see more replicas being brought up in response to the eviction call?

@rayterrill
Contributor Author

I can try to gather that additional detail if it happens again, @njtran.

As previously mentioned by @alekc, if any pods are unable to be evicted/terminated, then that will block deletion. Any pods that tolerate unschedulable or are daemonsets are ignored for this condition. W.r.t. pods that are stuck terminating even though we have deleted them, we've seen this if the node is of NotReady/Unknown status, where the kubelet is unable to register pods as terminated and remove them

The node being in Unknown status is exactly what we're seeing from the EKS console - the behavior you're describing where the kubelet is unable to register pods as terminated seems to be exactly what's happening.

regardless of if an instance is terminated manually by an operator or "reclaimed" by AWS through spot interruption, the Kubernetes node controller should eventually realize the instance no longer exists, and delete the node, triggering the termination workflow you've mentioned.

That's what we were thinking too, but that's not what we're seeing unfortunately. Maybe we're not waiting long enough? It seems to be happening overnight, and when I check on the clusters the following day, they're still stuck and require some manual intervention to unstick (it doesn't happen every night, but still fairly regularly).

@alekc
Contributor

alekc commented Jan 27, 2022

From what I've seen in my tests, it doesn't matter what kind of pods they are. When an operator terminates the instance from the console, EC2 doesn't wait for the node to drain, and at some point the kubelet is killed without reporting the state of the node. Pods are then stuck in Terminating state seemingly forever (in @rayterrill's case, as you can see above, over 16h).

Normally EKS would be intelligent enough to remove the node from the cluster, forcing the scheduler to reallocate those pods, but since we have a finalizer in place it's getting stuck.

With an autoscaling group you have lifecycle hooks, which help with this situation; in our case I guess we don't, so if the node has not drained after X we would need to ask the cloud provider whether that node still exists.

@rayterrill
Contributor Author

rayterrill commented Jan 27, 2022

Yes @alekc, that's exactly what I was thinking might be happening: the workflow @njtran is describing, where the node controller realizes a node is gone and should be removed from the cluster, is in fact running, but the Karpenter finalizer is blocking it from actually removing the node.

It seems like Karpenter should handle this case and remove its finalizer if the node is gone.

In fact, now that I'm thinking about it, that's likely exactly what I've simulated - if I manually remove the finalizer on the stuck node, the node controller indeed removes the node without any other intervention.

@suket22
Contributor

suket22 commented Jan 28, 2022

When you do find a node in this state again, could you provide us a list of all the pods that are on it? During our drain process, we skip any pods that are stuck in Terminating so I'm curious to see why we're stuck in a cordon and drain loop.

The Karpenter controller was otherwise healthy right?

@rayterrill
Contributor Author

@suket22 I provided that today from a stuck node before I manually intervened: #1166 (comment)

@alekc
Contributor

alekc commented Jan 28, 2022

I could not retrigger the condition during debugging this morning, so I can't say which pod was marked as not drained (might be a good idea to add a debug log there).

Looking through the code I've noticed these bits of code:

// 2. Separate pods as non-critical and critical
	// https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown
	for _, pod := range pods {
		if val := pod.Annotations[v1alpha5.DoNotEvictPodAnnotationKey]; val == "true" {
			logging.FromContext(ctx).Debugf("Unable to drain node, pod %s has do-not-evict annotation", pod.Name)
			return false, nil
		}
	}

Am I right to think that in the case of a deleted node that had a pod with that annotation, we won't ever be able to safely delete the node as it is (and just to be clear, this is NOT what's happening in either of our cases, because we would have seen that message in the debug log)?

For now I added some additional debugging to the drain function to see which pods are marked as not drained and I will try to retrigger the scenario later on.

@olemarkus
Contributor

The default rate limit for the EC2 API is 100/sec for non-mutating calls. It isn't that restrictive. I think that would only cause problems if a large number of nodes were deleted, but that would be throttled by PDBs and other things anyway.

@rayterrill
Contributor Author

rayterrill commented Jan 31, 2022

Even something like a gate where you only check whether a node has been "stuck" in Unknown status for something like 10m would cut down on the calls as well.

Something like:

  • Node has been in Unknown status for > 10m:
      • Ask EC2/the provider whether the node still exists
      • If not, remove the finalizer and let k8s clean the node up
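
In Go, that gate could be keyed off the node's Ready condition; a sketch with assumed names and threshold:

package sketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// unknownThreshold is the "10m" above; the value is only for illustration.
const unknownThreshold = 10 * time.Minute

// stuckInUnknown reports whether the node's Ready condition has been Unknown
// (i.e. the kubelet stopped reporting) for longer than the threshold.
func stuckInUnknown(node *v1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady && cond.Status == v1.ConditionUnknown {
			return time.Since(cond.LastTransitionTime.Time) > unknownThreshold
		}
	}
	return false
}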

@olemarkus
Contributor

It could be an idea to directly delete nodes once the unreachable taint has been added. It isn't possible to (gracefully) terminate pods when the kubelet is not responding anyway.

@ellistarn
Contributor

Is this possibly the root cause? #1257

@alekc mentioned this issue Feb 2, 2022
@alekc
Contributor

alekc commented Feb 2, 2022

@olemarkus please no - I've had the unreachable taint placed on nodes due to a spike in load average (the kubelet was not able to report properly); wouldn't that be too drastic?

@ellistarn it might be in some scenarios, but the underlying issue still persists (do we really care about pods on a terminated instance?)

I've also created a PR (above) which adds additional logging, eliminating the black hole where we are unable to understand why the node is not being terminated.

@olemarkus
Contributor

@alekc I meant that the finalizer would be removed if that was the case, i.e. someone would have had to try to delete the node and the unreachable taint would have to be there. I certainly did not intend that the unreachable taint would cause node deletion on its own.

@alekc
Contributor

alekc commented Feb 2, 2022

Oh, this is a very good point. If we remove the finalizer as soon as the node is tainted unreachable and put it back when it's reachable again, EKS should be able to clean everything up without Karpenter needing to do anything else. This solution wouldn't require a change in the cloud provider interface either.
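
A quick sketch of the taint check this idea would hinge on (the helper name is an assumption): the node lifecycle controller adds node.kubernetes.io/unreachable when the kubelet stops reporting, so the finalizer handling could key off that taint.

package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// isUnreachable reports whether the node currently carries the unreachable taint
// that the node lifecycle controller adds when the kubelet stops responding.
func isUnreachable(node *v1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == v1.TaintNodeUnreachable {
			return true
		}
	}
	return false
}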

I guess the only downside would be that in case (or better, when) we expand to other providers in the future, if the control plane is not able to remove missing nodes, should it be our responsibility?

@olemarkus
Contributor

It's the CCM's responsibility to remove nodes when the underlying instance is gone. It's a requirement of the CCM node controller interface. So if it is not able to remove missing nodes, the issue must be fixed there. To have Karpenter make up for a misbehaving CCM is not exactly ideal.

@ellistarn
Contributor

I don't think removing the finalizer will help. That means that if a node is partitioned, Karpenter will delete the node without cordon/drain. This means that the node may still exist, as well as the pods running on it, causing a wide array of problems like violating stateful set guarantees.

@ellistarn
Contributor

As far as we understand, logically, the current implementation does the right thing. We need to understand where our gap in understanding is before making changes, or we risk making the situation worse.

@rayterrill
Contributor Author

That means that if a node is partitioned, Karpenter will delete the node without cordon/drain.

@ellistarn Will Karpenter delete the node? Or will K8s itself delete the node if the Karpenter finalizer is removed?

@alekc
Contributor

alekc commented Feb 2, 2022

@ellistarn I did a quick poc (without testing) alekc-forks@dc24cc8

As far as I understand, that implementation would remove the finalizer in case there is a new taint assigned to the node (we are not talking about the deletion yet). I have yet to test what happens when the taint is removed and whether there is a conflict with Karpenter trying to reassign the finalizer, so, as I said, it's a minimal POC.

In theory (and I might be very wrong), if the node is without the finalizer (so it's in the not-reachable state), then for it to be deleted it would require either intervention from the CCM or an operator (or the TTL expiry, potentially).

As for

the current implementation does the right thing

I don't think it does, 100%, as I mentioned above.
For example, if the node is terminated and we have a pod scheduled on it with DoNotEvictPodAnnotationKey, or (and this should be tested) it has a PDB assigned to it, the node will be stuck forever.

So either a check for the existence of the node from the cloud (requiring an interface change) or removal of the finalizer in case the node is unschedulable would probably be needed.

@ellistarn assigned ellistarn and suket22 and unassigned suket22 and ellistarn Feb 8, 2022
@ellistarn
Contributor

ellistarn commented Feb 8, 2022

Sorry for the delay -- getting back to this

@rayterrill

@ellistarn Will Karpenter delete the node? Or will K8s itself delete the node if the Karpenter finalizer is removed?

Karpenter sends the deletion request, which sets the deletionTimestamp in the API server; that triggers cordon, drain, a delete call to EC2, and eventually removal of the finalizer from the node. Kubernetes garbage collection then deletes the node.
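
Sketched as code (stand-in helpers and an assumed finalizer name, not Karpenter's real functions), that ordering looks like this; the drain step is where a gone instance can wedge the flow:

package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// terminationFinalizer is an assumed name for illustration.
const terminationFinalizer = "karpenter.sh/termination"

type terminator struct {
	cordon    func(context.Context, *v1.Node) error
	drain     func(context.Context, *v1.Node) (bool, error) // true once all evictable pods are gone
	terminate func(context.Context, *v1.Node) error         // the call that terminates the instance
}

// finalize runs once the node has a deletion timestamp: cordon, drain, terminate
// the instance, and only then drop the finalizer so that Kubernetes garbage
// collection can delete the Node object.
func (t terminator) finalize(ctx context.Context, node *v1.Node) (requeue bool, err error) {
	if node.DeletionTimestamp == nil {
		return false, nil
	}
	if err := t.cordon(ctx, node); err != nil {
		return false, err
	}
	drained, err := t.drain(ctx, node)
	if err != nil {
		return false, err
	}
	if !drained {
		// Requeue and retry; this is the step that loops when pods are stuck Terminating.
		return true, nil
	}
	if err := t.terminate(ctx, node); err != nil {
		return false, err
	}
	node.Finalizers = removeString(node.Finalizers, terminationFinalizer)
	return false, nil
}

func removeString(items []string, target string) []string {
	out := make([]string, 0, len(items))
	for _, item := range items {
		if item != target {
			out = append(out, item)
		}
	}
	return out
}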

@ellistarn
Contributor

In example, if the node is terminated, and we have a pod scheduled on it with DoNotEvictPodAnnotationKey or (and this should be tested) it has a PDB assigned to it, the node will be stuck forever.
So, either a check for the existence of the node from the cloud (requiring interface change) or removal of finalizer in case the node is unschedulable probably would be needed

I think this may be the gap. Can you confirm that you're using the do-not-evict annotation?

@ellistarn
Contributor

I'm able to reproduce one case. I don't know if it's your exact case, since you don't have debugging logs on, so you wouldn't have seen these lines.

  • Set a do-not-evict annotation on a pod
  • Terminate the instance it's running on directly in EC2
  • The node sticks around forever
  • The pod is stuck terminating

Node stuck draining

ip-192-168-119-87.us-west-2.compute.internal   NotReady,SchedulingDisabled   <none>   4m36s   v1.20.11-eks-f17b81

Pod stuck terminating

inflate-b8fc9fdcc-np5fv   1/1     Terminating   0          7m53s

Logs

karpenter-controller-55c748cb47-hklqg manager 2022-02-08T19:59:00.326Z	DEBUG	controller.termination	Unable to drain node, pod inflate-b8fc9fdcc-np5fv has do-not-evict annotation	{"commit": "4c1fb8a", "node": "ip-192-168-119-87.us-west-2.compute.internal"}
karpenter-controller-55c748cb47-hklqg manager 2022-02-08T19:59:00.842Z	DEBUG	controller.termination	Unable to drain node, pod inflate-b8fc9fdcc-np5fv has do-not-evict annotation	{"commit": "4c1fb8a", "node": "ip-192-168-119-87.us-west-2.compute.internal"}
karpenter-controller-55c748cb47-hklqg manager 2022-02-08T19:59:10.843Z	DEBUG	controller.termination	Unable to drain node, pod inflate-b8fc9fdcc-np5fv has do-not-evict annotation	{"commit": "4c1fb8a", "node": "ip-192-168-119-87.us-west-2.compute.internal"}
karpenter-controller-55c748cb47-hklqg manager 2022-02-08T19:59:20.844Z	DEBUG	controller.termination	Unable to drain node, pod inflate-b8fc9fdcc-np5fv has do-not-evict annotation	{"commit": "4c1fb8a", "node": "ip-192-168-119-87.us-west-2.compute.internal"}
karpenter-controller-55c748cb47-hklqg manager 2022-02-08T19:59:30.845Z	DEBUG	controller.termination	Unable to drain node, pod inflate-b8fc9fdcc-np5fv has do-not-evict annotation	{"commit": "4c1fb8a", "node": "ip-192-168-119-87.us-west-2.compute.int

@rayterrill
Contributor Author

@ellistarn we are not using the do-not-evict key.

@ellistarn
Contributor

ellistarn commented Feb 8, 2022

Ah bummer -- well it's a case, if not your case. Will keep digging on this. Have you been able to successfully reproduce?

@rayterrill
Contributor Author

Just checked again today - we haven't had it happen in a while now.

I wonder if it's a combination of a couple of things - multiple spot instances recalled at once, PDBs on critical services, etc - basically just leaving things in a bad state. I know we have some pods that don't play nicely with disruptions.

@ellistarn
Contributor

Feel free to reopen if this recurs.

@ermiaqasemi

ermiaqasemi commented Jul 19, 2022

This is happening to me right now with version 0.13.2 - after the node becomes empty, it hangs on the finalizer!

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ${instance_types}
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  providerRef:
    name: default
  limits:
    resources:
      cpu: 1000
      memory: 1000Gi
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 1728000
  labels:
    eks.exanoke.so/node-role: worker
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  amiFamily: AL2
  instanceProfile: ${cluster_instance_profile}
  subnetSelector:
    karpenter.sh/discovery: ${cluster_name}
  securityGroupSelector:
    karpenter.sh/discovery: ${cluster_name}
  tags:
    karpenter.sh/discovery: ${cluster_name}
