
Available ENIs left dangling after node termination #608

Closed
krzysztof-bronk opened this issue Sep 4, 2019 · 34 comments
Labels
bug, priority/P1 (Must be staffed and worked currently or soon. Is a candidate for next release)

Comments

@krzysztof-bronk

Hello,

I have encountered an issue with aws-cni 1.5.1 (and possibly later) where, even in a single-node test cluster, if you terminate the node so that the ASG brings up a replacement, the terminated instance's ENI switches back to Available, still holding its IPs, and is seemingly never deleted.

Eventually one will exhaust the IP pool and pods will fail to be created.

This is a bit surprising as node recycling is the basis of autoscaling groups.
Is there some cleanup mechanism I am not aware of? Or is it a bug?

regards,
Krzysztof Bronk

mogren added the bug and priority/P1 (Must be staffed and worked currently or soon. Is a candidate for next release) labels on Sep 4, 2019
@mogren
Contributor

mogren commented Sep 4, 2019

Thanks @krzysztof-bronk for reporting, we will have to take a look at why this is happening. ENIs attached to the terminated node should be freed by EC2 automatically.

@krzysztof-bronk
Author

Thank you for acknowledging this. If your cluster has high node churn, or the IP pool is small, this can quickly become an issue. The current workaround, and an independent report of the issue, can be found here: #59 (comment)

@vipulsabhaya
Contributor

@krzysztof-bronk Do you happen to have the ipamd logs from one such instance?

@caiconkhicon

caiconkhicon commented Sep 6, 2019

> ENIs attached to the terminated node should be freed by EC2 automatically.

No, that's not true. Only the primary (eth0) ENI is cleaned up. The additional ENIs are not deleted; they are just detached and become Available.

@krzysztof-bronk
Author

Here's the situation:

Fresh test cluster, nothing fancy or custom (except external SNAT but I tested both and it's not relevant), a single m5.xlarge worker node.

Private IPs: 10.250.9.56 (eth0), 10.250.11.217 (eth1)
Secondary Private IPs: (a whole bunch of them)

ENIs: 3 total
  • 2 In-use (as confirmed by the Private IPs above). Interestingly, only 1 of them has a Description (aws-K8S-i-0da187e6c45d9d4d5); the other one has an empty Description.
  • 1 Available

aws-node logs are empty (even though they're in DEBUG mode and I've deployed a test nginx container):

===== Starting installing AWS-CNI =========
===== Starting amazon-k8s-agent ===========

Node terminated. ASG spun up a new one.

Private IPs: 10.250.19.53 (eth0), 10.250.16.192 (eth1)
Secondary Private IPs: (a whole bunch of them)

ENIs: 4 total
  • 2 In-use (as confirmed by the Private IPs above).
  • 2 Available - the 1 Available from the earlier instance is still there.

aws-node logs (with the successfully running nginx container):

===== Starting installing AWS-CNI =========
===== Starting amazon-k8s-agent ===========
ERROR: logging before flag.Parse: E0912 06:35:31.629160      13 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout)

So it looks like aws-cni is leaking the warmup ENIs.

@mogren
Contributor

mogren commented Sep 27, 2019

The latest pre-release, v1.6.0-rc2, has some changes to mitigate this problem.

@Pluies

Pluies commented Oct 29, 2019

Good to know this is getting worked on, it's an issue for us as well. 👍

We have nightly infra testing jobs that bring up a cluster (with Terraform), run tests, and delete the cluster. We've noticed that the terraform destroy step fails fairly often because ENIs left in the "available" state prevent the security group from being deleted.
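
For anyone hitting the same terraform destroy failure, here is a minimal diagnostic sketch (not from this thread) that lists the ENIs still referencing a security group, which is what blocks its deletion. It assumes the AWS SDK for Go v1; the security group ID is a placeholder.

```go
// Hedged diagnostic sketch: list ENIs still referencing a security group.
// Placeholder SG ID; region/credentials come from the usual SDK configuration.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	out, err := svc.DescribeNetworkInterfaces(&ec2.DescribeNetworkInterfacesInput{
		Filters: []*ec2.Filter{
			// Any ENI that still references the security group prevents it from being deleted.
			{Name: aws.String("group-id"), Values: []*string{aws.String("sg-0123456789abcdef0")}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, eni := range out.NetworkInterfaces {
		fmt.Printf("%s status=%s description=%q\n",
			aws.StringValue(eni.NetworkInterfaceId),
			aws.StringValue(eni.Status),
			aws.StringValue(eni.Description))
	}
}
```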

@robin-engineml

@Pluies We have a similar process, and the same problem. @mogren When can we look forward to a 1.6.0 release with the mitigating changes applied?

@robin-engineml

... or even a 1.6.0-rc4 (but without the problems of 1.5.4?)

@mogren
Contributor

mogren commented Nov 5, 2019

@robin-engineml Hey, please try v1.6.0-rc4, fresh out of the oven. 😄

@robin-engineml

@mogren The problem persists with v1.6.0-rc4, which we have been using for a couple of weeks. Anecdotally, it does seem to be less frequent.

@mogren
Contributor

mogren commented Nov 22, 2019

@robin-engineml Thanks for the update! Glad that it has improved a bit at least. There is still a small chance that a few ENIs will leak, but they should be cleaned up as long as there are at least some nodes running in the cluster.

@robin-engineml

@mogren This occurs upon cluster termination, for us. So, there are not "some nodes running in the cluster". Would this stop occurring if we were to allow some cluster nodes to remain alive longer? (We destroy the EKS cluster via the AWS API, via Terraform.)

@mogren
Contributor

mogren commented Dec 9, 2019

@robin-engineml The issue is that the EC2 API requires you to first detach an ENI, then wait 2-5 seconds before deleting it. If the instance gets terminated after the ENI is detached, but before we delete it, it will stay around. With the v1.6 branch, we try to clean up when the CNI starts up, or in the background once per hour.
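
For illustration, a minimal sketch of the detach-then-delete sequence described above, using the AWS SDK for Go v1 (not the CNI's actual code; the retry budget and the example IDs are assumptions). The gap between the two calls is exactly the window in which a terminating instance leaks the ENI.

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// deleteENI detaches an ENI and then retries the delete until the detach has
// completed. If the instance is terminated inside this window, the ENI is left
// in the "available" state and leaks.
func deleteENI(svc *ec2.EC2, eniID, attachmentID string) error {
	if _, err := svc.DetachNetworkInterface(&ec2.DetachNetworkInterfaceInput{
		AttachmentId: aws.String(attachmentID),
	}); err != nil {
		return fmt.Errorf("detach %s: %w", eniID, err)
	}

	var err error
	for i := 0; i < 5; i++ {
		// The detach typically takes a few seconds to complete.
		time.Sleep(2 * time.Second)
		if _, err = svc.DeleteNetworkInterface(&ec2.DeleteNetworkInterfaceInput{
			NetworkInterfaceId: aws.String(eniID),
		}); err == nil {
			return nil
		}
	}
	return fmt.Errorf("delete %s: %w", eniID, err)
}

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	// Placeholder IDs for illustration only.
	if err := deleteENI(svc, "eni-0123456789abcdef0", "eni-attach-0123456789abcdef0"); err != nil {
		fmt.Println(err)
	}
}
```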

@jlforester

Could this be integrated with ASG Lifecycle hooks to allow the processes more time to clean up on instance termination? Simply adding a lifecycle hook to an ASG isn't enough.

@krzysztof-bronk
Author

I've tested 1.6.0-rc5 a bit and I don't see much progress. After terminating a couple of nodes, I saw available ENIs dangling, so I terminated all nodes and now I have:

  • 2 fresh m5.xlarge nodes total working just fine (few pods, 2 ENIs attached with WARM_ENI_TARGET=1)
  • 6 available ENIs doing nothing

@mogren
Contributor

mogren commented Jan 20, 2020

@krzysztof-bronk Did those ENIs stay around? They should have been cleaned up if they were created by the CNI. Not directly, but within five minutes of another worker node being started.

@steven-cherry

Hi, has there been any progress on this issue? This is really affecting us; we are even considering changing the CNI vendor we use.

@mogren
Contributor

mogren commented Feb 11, 2020

Hi @steven-cherry. The base issue is that the EC2 API requires clients to detach ENIs before they can be deleted. If the node (or the aws-node pod) gets restarted during the roughly 2-3 seconds we have to wait for the detach to complete, an ENI with status "available" will be left around.

The code that does the cleanup is here.

It filters for ENIs with the tag key node.k8s.amazonaws.com/instance_id and status available, in order to only get ENIs that the CNI has created.

I've done some more tests with v1.6.0 on spot instances that get randomly terminated, and the leaked ENIs do get cleaned up eventually. The only sure way to not leak any ENIs is to have this handled outside the node, as in our 2.0 CNI design.
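
To illustrate the filter described above, a minimal sketch (assuming the AWS SDK for Go v1; this is not the CNI's actual cleanup routine) that lists available ENIs carrying the node.k8s.amazonaws.com/instance_id tag key and deletes them:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// Only detached ENIs that the CNI tagged at creation time are candidates.
	out, err := svc.DescribeNetworkInterfaces(&ec2.DescribeNetworkInterfacesInput{
		Filters: []*ec2.Filter{
			{Name: aws.String("status"), Values: []*string{aws.String("available")}},
			{Name: aws.String("tag-key"), Values: []*string{aws.String("node.k8s.amazonaws.com/instance_id")}},
		},
	})
	if err != nil {
		log.Fatalf("describe ENIs: %v", err)
	}

	for _, eni := range out.NetworkInterfaces {
		// Detached and CNI-tagged: safe to delete.
		if _, err := svc.DeleteNetworkInterface(&ec2.DeleteNetworkInterfaceInput{
			NetworkInterfaceId: eni.NetworkInterfaceId,
		}); err != nil {
			log.Printf("failed to delete %s: %v", aws.StringValue(eni.NetworkInterfaceId), err)
			continue
		}
		log.Printf("deleted leaked ENI %s", aws.StringValue(eni.NetworkInterfaceId))
	}
}
```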

@steven-cherry


Thanks @mogren. Any ETA regarding version 2.0 for production workloads?

@krzysztof-bronk
Author

I'll be getting back to this topic soon, so I will have a chance to test this once more.

@krzysztof-bronk
Author

@mogren
Setup:

  • CNI 1.6.0
  • k8s 1.14
  • 2x m5.xlarge nodes
  • pretty much no customisations introduced

instance 1 ENIs: 10.250.1.228, 10.250.5.254
instance 2 ENIs: 10.250.19.53, 10.250.21.204

The primary interfaces have a Description like "aws-K8S-i-0c841ac56fbadc9b3" indicating the node they belong to.
The secondary interfaces have no Description.

All 4 ENIs are Active.

Terminating instance 1. ASG kicks in a replacement.

Waiting 10 minutes.

There is a third ENI attached to the remaining instance; not sure why, since there were several pods running on the terminated instance, but not that many. However...

The primary interface of the terminated instance is now stuck in Available state.

Terminating instance 2 (the one with 3 ENIs). ASG kicks in a replacement.

Waiting 10 minutes.

Cluster now has 2 fresh nodes.
There are 4 total ENIs in Active state, 2 for each instance. However...

The primary interface of the second terminated instance is now stuck in Available state.

Maybe I'm triggering some special case but... the cleanup of Available ENIs simply does not happen.

errm added a commit to cookpad/terraform-aws-eks that referenced this issue Mar 9, 2020
This was removed in #49, but I think that change only fixed the issue
with EKS managed SG not being deleted. Stale ENIs are related to
this issue aws/amazon-vpc-cni-k8s#608
@krzysztof-bronk
Author

I'll do some further tests, because sometimes the interfaces do get cleaned up. How does the mechanism work exactly? Is only the instance that had the interfaces attached responsible for cleaning them up, with a race condition between instance termination and the cleanup code? Or is it that, as long as there is at least one node in the cluster, its aws-node pod will attempt to delete unused Available interfaces for the whole cluster?

@nickdgriffin

nickdgriffin commented Mar 23, 2020

Also noticed that the ENIs that appear to be leaking for us are missing the tags/description, which means they won't be picked up by the clean-up loop. We're not on 1.6 yet, but when we are I'll check if that's still the case.

EDIT: It is. We run our nodes in ASGs, and on scaling down a test cluster of 6 nodes it leaked all 6 ENIs and left them untagged, so they won't be cleaned up.

EDIT 2: It looks like it is the secondary ENIs that are getting leaked: they aren't being tagged or given the "special description" in the first place (i.e. even while "in-use"), so the clean-up can't catch them.

@nickdgriffin

We upgraded to 1.6.0 and fixed an oopsie and haven't had any ENIs leaking since.

@korjek

korjek commented Apr 28, 2020

We have upgraded to 1.6.1 and there is no issue with dangling ENIs anymore, thank you!

P.S. You should have "delete on termination" enabled for the primary interface so that it is cleaned up on the node's termination as well.
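
As an illustration of that P.S., a minimal sketch (AWS SDK for Go v1; the ENI and attachment IDs are placeholders) that enables delete-on-termination for an already attached interface. For the primary interface this is normally configured in the launch template or launch configuration, but the same attribute can be flipped after launch:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// Mark the attachment so EC2 deletes the ENI when the instance terminates.
	_, err := svc.ModifyNetworkInterfaceAttribute(&ec2.ModifyNetworkInterfaceAttributeInput{
		NetworkInterfaceId: aws.String("eni-0123456789abcdef0"),
		Attachment: &ec2.NetworkInterfaceAttachmentChanges{
			AttachmentId:        aws.String("eni-attach-0123456789abcdef0"),
			DeleteOnTermination: aws.Bool(true),
		},
	})
	if err != nil {
		log.Fatalf("failed to enable delete-on-termination: %v", err)
	}
	log.Println("delete-on-termination enabled")
}
```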

@mogren
Contributor

mogren commented Jun 9, 2020

There is still a small chance that ENIs will leak, but they should be cleaned up pretty quickly if there are any nodes still in the cluster. Also, I have seen that pods creating ALBs might create ENIs in subnets that then don't get cleaned up. If anyone sees ENIs still around in a cluster using CNI v1.6.1 or later, please gather logs and open a new ticket.

@mogren mogren closed this as completed Jun 9, 2020
@Nuru

Nuru commented Sep 18, 2020

@mogren I am having a problem with dangling ENIs using amazon-k8s-cni:v1.6.3-eksbuild.1. Please give me details about "gathering logs".

The basic problem is that I am using Terraform and trying to destroy a node group and the security group that goes with it, but I cannot, because an ENI is left dangling after the node group is deleted, so the deletion of the security group hangs.

Note that the dangling ENI has the tag node.k8s.amazonaws.com/instance_id=<instance-id> and the instance is terminated.

@mogren
Contributor

mogren commented Sep 18, 2020

Hi @Nuru,

This has been an issue forever when pods are being scaled down and then suddenly the whole instance gets deleted. The issue triggering this is that there is no EC2 API call to "delete" an ENI that is attached, so instead it first has to be detached, which takes a few seconds, and then deleted. If the instance gets terminated after the ENI has been detached, but before it has been deleted, it will be leaked. We have tried to mitigate this by, for example, having a 10s termination grace period on the aws-node daemonset and never detaching any ENIs while the CNI is shutting down, but none of this helps when the instance goes away.

Is this a managed nodegroup, or do you handle it on your own using Terraform? If so, terminating all the aws-node pods first, before terminating the instances, might at least prevent them from detaching any ENIs in the last few seconds while the other pods are being deleted.

Another option would be a setting to never detach any ENIs, since then the ENIs would get deleted when the instance gets deleted. The reason we don't do this by default is that running out of ENIs is also a common problem.

@Nuru

Nuru commented Sep 18, 2020

@mogren wrote

Is this a managed nodegroup, or do you handle it on your own using Terraform? If so, terminating all the aws-node pods first, before terminating the instances might at least prevent them from detaching any ENIs in the last few seconds when the other pods are being deleted.

In my immediate case, I am using the AWS Terraform provider to create an aws_eks_node_group resource; in other words, Terraform is creating a managed node group. It is up to EKS to drain the pods from the nodes and then shut down the instances. There are other node groups (and therefore other nodes) in the cluster, so EKS should be able to move all the pods around, but there are always going to be critical pods (like kube-proxy) that need to stick around until the very end.

Prior to the node being shut down, it is cordoned off, meaning it will be marked as "unschedulable", meaning no new pods should be assigned to the node. You could surely arrange things such that any ENIs that are freed while the node is marked unschedulable are not detached. You do not need to worry about running out of ENIs at that point because there should be no new ENIs getting created. Then the ENIs can be deleted with the instance on termination, or, if the node is marked "schedulable" again without being terminated, a detach/delete loop could be run when the node returns to the schedulable state. This, of course, requires the "delete on termination" option be set for the ENIs, such that they are automatically deleted when the instance is deleted. I do not see any downside to that setting always being set, as it still leaves you the option of detaching and deleting the ENI when a pod is deleted but the instance is intended to remain.

Maybe the building blocks were not there earlier, but it looks like the pieces of the solution are now ready to be put together. Am I missing something?

@mogren
Contributor

mogren commented Sep 18, 2020

@Nuru I do think you are right; having the VPC CNI be aware of the eks.amazonaws.com/nodegroup=unschedulable:NoSchedule tag does seem feasible. All ENIs that the CNI attaches are marked with DeleteOnTermination. Do you want to open a feature request issue to add this?

(Btw, kube-proxy is using host-networking, just like the aws-node pod does, so it is independent of the CNI being up.)
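
A minimal sketch of that idea (using client-go; this is not existing CNI behaviour): before detaching a freed ENI, check whether the node has been cordoned and, if so, skip the detach so the ENI is removed together with the instance via DeleteOnTermination. For simplicity this checks the node's Unschedulable flag rather than the specific taint mentioned above; the MY_NODE_NAME env var and the shouldDetachENI helper are assumptions.

```go
package main

import (
	"context"
	"log"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// shouldDetachENI reports whether a freed ENI should be detached now.
// A cordoned node is about to be drained and likely terminated, so we skip
// the detach and let the ENI be deleted along with the instance.
func shouldDetachENI(client kubernetes.Interface, nodeName string) (bool, error) {
	node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	return !node.Spec.Unschedulable, nil
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	detach, err := shouldDetachENI(client, os.Getenv("MY_NODE_NAME"))
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("detach freed ENIs: %v", detach)
}
```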

@Nuru

Nuru commented Sep 18, 2020

@mogren I would be happy to have you open the feature request, as you would know better how to put the request together (what parts of the code should react to what, and how) and see it through, and I would also be happy to lend my support to your request. I don't need credit or recognition for the feature request; I just want this done as quickly and efficiently as possible, so I would prefer you do it if you have the time. If it won't happen unless I do it, let me know and I will do it.


@Nuru

Nuru commented Sep 18, 2020

By the way, @mogren

> The issue triggering this is that there is no EC2 API call to "delete" an ENI that is attached

Have you opened a feature request for this feature? That would be even better than my suggestion.

@mogren
Contributor

mogren commented Sep 18, 2020

@Nuru I have checked, and there doesn't seem to be a lot of support for simplifying the API. I would love to have a single call to create or delete ENIs.

For now, I created #1223 to track improving how the VPC CNI is handling this.

rafael-mendes-pereira added a commit to Azure/telescope that referenced this issue Oct 15, 2024
…destroy (#336)

This is a workaround for the known VPC CNI addon's "leaked ENIs" issue:
See aws/amazon-vpc-cni-k8s#608

Co-authored-by: Rafael Mendes Pereira <[email protected]>
rafael-mendes-pereira added a commit to Azure/telescope that referenced this issue Nov 14, 2024
…destroy (#336)

This is a workaround for the known VPC CNI addon's "leaked ENIs" issue:
See aws/amazon-vpc-cni-k8s#608

Co-authored-by: Rafael Mendes Pereira <[email protected]>