External network past SNAT unavailable on pod start #732

Closed
gbarazer opened this issue Mar 29, 2021 · 3 comments · Fixed by #755, #1015 or #1346

@gbarazer

Following our discussion on Slack, we confirmed that any request to an external network made within the first 0-3 seconds after pod start is blocked, and as a result many pods fail (such as pods running helm, curl, or git clone as their first CMD).
During the investigation we confirmed that this behavior occurs only for requests to external networks, which led us to look at the SNAT implementation.
It seems that when the CNI attaches the network, it returns success as soon as the pod gateway is pingable, but at that point the SNAT rules for external access are not installed yet, because they are installed and updated only every 3 seconds via ipset/iptables. This in turn causes requests to the external network to be blocked, and most commands wait forever until the pod fails.
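For context, external SNAT here comes down to a POSTROUTING rule that matches an ipset of pod subnets, so outbound traffic is not masqueraded until the periodic sync has added the relevant entry. A rough sketch of the kind of rule involved (the set name and CIDR are placeholders, not taken from the kube-ovn source):

ipset create pod-subnets hash:net                 # hypothetical set name
ipset add pod-subnets 10.16.0.0/16                # placeholder pod CIDR
iptables -t nat -A POSTROUTING -m set --match-set pod-subnets src -j MASQUERADE
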
This can be reproduced using a pod run command like this one:
kubectl run testcurl --image=centos:8 --restart=Never -- sh -c 'for i in $(seq 1 10); do echo $i ; date; curl -vsI --connect-timeout 0.5 https://www.google.com/; echo ; sleep 0.1; done'

With other CNI implementations, the logs show that the external network is ready from the very first request, whereas with kube-ovn it becomes reachable only a few (1-3) seconds after pod creation.

@oilbeater
Collaborator

We see this issue again in 1.8.2, with lower probability after the patch; maybe there is still some delay before the ipset takes effect. We need to find a new way to fix it.
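
To pin down how much delay remains, one could poll on the node for the moment the SNAT ipset entry appears after the pod is created; a minimal sketch, assuming a hypothetical set name and subnet:

# print a timestamp until the entry shows up (set name and CIDR are placeholders)
while ! ipset test pod-subnets 10.16.0.64/26 2>/dev/null; do date +%T.%N; sleep 0.1; done
echo "ipset entry present"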

@hackeren
Contributor


Does this bug also occur in the following scenario?
A Pod using an EIP checks in its initContainer whether a required service is ready. The initContainer accesses nodelocaldns on the 169.xx range (which is effectively an external network), and at that point the initContainer cannot perform DNS resolution. From lr-router-list we can see that the source-IP route for this Pod has not been added.
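
If it is the same race, the missing route should be observable while the initContainer is still failing. A hypothetical way to check, assuming the default ovn-central deployment in kube-system and the default ovn-cluster logical router (replace <pod-ip> with the Pod's address):

# list static routes on the cluster logical router and look for the pod's source IP
kubectl -n kube-system exec deploy/ovn-central -- ovn-nbctl lr-route-list ovn-cluster | grep <pod-ip>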

@zhangzujian
Member

zhangzujian commented Mar 1, 2022

It seems to be a TCP-related problem. In my testing, it was reproduced only in TCP connections.
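
A variant of the reproducer above that fires an ICMP probe and a TCP connect to the external network side by side might help confirm that only TCP is affected; a sketch, assuming ping is available in the image and ICMP egress is allowed:

kubectl run testproto --image=centos:8 --restart=Never -- sh -c 'for i in $(seq 1 10); do date; ping -c1 -W1 8.8.8.8 | tail -1; curl -sI --connect-timeout 0.5 https://www.google.com/ | head -1; sleep 0.3; done'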

EDIT:

It's still an ipset issue. It should be fixed now.
