ip rule leaked under some conditions #2048
Hi @cdemonchy, we will land here: cni.go#L364, because of
I forgot to send the logs; I've just sent them.
Hi @cdemonchy, based on the logs, the CRI is missing the container, hence the IP was never recovered. Here is a snapshot of the events:
IPAMD started -> Pod created -> Restart of IPAMD -> Container
Hence the delete failed and
We should have cleaned up the old rules before adding a new one; I will check whether that is in place.
Ideally, when aws-node restarts, we should clean up the rules associated with free IPs if any exist.
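The reconciliation suggested above could look roughly like the following. This is only a sketch: the `Rule` type and `findStaleRules` helper are hypothetical stand-ins (the real CNI works with netlink rules and ipamd's datastore, not these toy types).

```go
package main

import (
	"fmt"
	"net"
)

// Rule is a minimal stand-in for a routing policy rule; the real CNI
// uses netlink rules. Only the pod's source IP matters for this sketch.
type Rule struct {
	Src string // pod IP the rule routes traffic for
}

// findStaleRules returns the rules whose IP is no longer in the set of
// IPs that ipamd considers allocated. On an aws-node restart, deleting
// these would reconcile rules leaked by pods removed while ipamd was
// down. Hypothetical helper, not the actual CNI code.
func findStaleRules(rules []Rule, allocated map[string]bool) []Rule {
	var stale []Rule
	for _, r := range rules {
		ip := net.ParseIP(r.Src)
		if ip == nil {
			continue // skip malformed entries
		}
		if !allocated[ip.String()] {
			stale = append(stale, r)
		}
	}
	return stale
}

func main() {
	rules := []Rule{{Src: "10.0.1.5"}, {Src: "10.0.1.9"}}
	allocated := map[string]bool{"10.0.1.5": true} // 10.0.1.9 was freed
	for _, r := range findStaleRules(rules, allocated) {
		fmt.Println("would delete rule for", r.Src)
	}
}
```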
We managed to mitigate the issue by updating the cluster autoscaler configuration to not evict daemonset pods. We haven't had leaked rules since the update: aws-node is no longer evicted by the cluster autoscaler and doesn't restart. This confirms that the problem occurs when pods are deleted during a restart of aws-node. Regards,
@cdemonchy were you previously enabling daemonset eviction? It should be disabled by default.
Hi @jdn5126, daemonset eviction was enabled automatically when we upgraded cluster autoscaler from v1.20.2 to v1.21.2. To solve our issue, we upgraded to cluster autoscaler v1.21.3 and added the flag
Ah, thank you for pointing that out. I see that autoscaler has indeed changed the default behavior for occupied nodes.
This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.
/notstale |
Closing, as the PR has merged.
Hi,
What happened:
Sometimes the ip rules associated with a pod are not deleted when the pod is deleted. The leaked rules are never removed, but the IP is freed in the IPAM.
The IP can then be reallocated to another node in the cluster. When this happens, pods on the node with the leaked rules can't reach the pod with the reallocated IP.
We hit this case when the cluster autoscaler fails to evict a node: aws-node is stopped and restarts quickly.
Attach logs
The log for a pod during the delete phase:
We can see 3 attempts with a connection refused when connecting to ipamd. The last attempt fails because the container has been removed.
When looking at the rule tables, the rules associated with the IP are still present.
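We checked the rule tables by hand with `ip rule list` and grep. For illustration, a small helper like the following (hypothetical, not part of the CNI) captures the check we did on the node, given the captured command output as text:

```go
package main

import (
	"fmt"
	"strings"
)

// rulesMentioning scans the text output of `ip rule list` captured
// from the affected node and returns the lines that reference the
// given pod IP. If the pod is gone but lines still come back, the
// rules have leaked. Hypothetical helper for illustration only.
func rulesMentioning(ipRuleOutput, podIP string) []string {
	var hits []string
	for _, line := range strings.Split(ipRuleOutput, "\n") {
		if strings.Contains(line, podIP) {
			hits = append(hits, strings.TrimSpace(line))
		}
	}
	return hits
}

func main() {
	// Example output with two rules left over for a deleted pod's IP.
	out := "0:\tfrom all lookup local\n" +
		"512:\tfrom all to 10.0.1.9 lookup main\n" +
		"1536:\tfrom 10.0.1.9 lookup 2\n" +
		"32766:\tfrom all lookup main"
	for _, l := range rulesMentioning(out, "10.0.1.9") {
		fmt.Println("possibly leaked:", l)
	}
}
```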
I'll send the complete support tarball by mail.
What you expected to happen:
We expect that the rules are removed when a container is stopped.
Anything else we need to know?:
When looking at the del function, the logic seems to be:
In our case, during the first call, the function tryDelWithPrevResult didn't do anything, as the code seems to only handle pods with a security group attached. We're not attaching security groups to our pods; we're using the "Individual IP addresses assigned to network interface" mode.
During the last call, the call to ipamd seems to succeed, but the cleanup from the del function is not called (cni.go#L405) because we hit a "container not found" error before, at cni.go#L364.
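The ordering described above can be sketched as follows. The names and structure are our reading of the flow, not the real cni.go code: a lookup failure returns before the ip-rule teardown ever runs, which is exactly how the rule leaks.

```go
package main

import (
	"errors"
	"fmt"
)

var errContainerNotFound = errors.New("container not found")

// delSketch mirrors the ordering we observed in the del path: if the
// container lookup fails first (e.g. the CRI already removed it after
// an aws-node restart), the function returns before the ip-rule
// teardown runs, leaking the rule. Hypothetical sketch only.
func delSketch(containerExists bool, teardownRules func()) error {
	if !containerExists {
		// roughly cni.go#L364: error out before any cleanup
		return errContainerNotFound
	}
	// roughly cni.go#L405: only reached when the lookup succeeds
	teardownRules()
	return nil
}

func main() {
	cleaned := false
	err := delSketch(false, func() { cleaned = true })
	fmt.Println("err:", err, "| rules cleaned:", cleaned)
}
```

Running the teardown (or an equivalent rule cleanup) even on the "container not found" path would avoid the leak.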
Maybe it's possible to do the cleanup in tryDelWithPrevResult, like for branch ENI?
Environment:
Kubernetes version (use `kubectl version`): 1.21