Docker tests instability #64
On it
networkservicemesh/cmd-forwarder-vpp#64 Signed-off-by: Ed Warnicke <[email protected]>
It seems we have two kinds of problems:
Missing interface, which looks like:
Client connection closing, which looks like:
What's weird about the missing interface case is that it happens after a successful ping.
@edwarnicke I retested with your fix for
@Tiberivs Thank you for the retest :) I posted the fix after I was 50 successful tests into a run... it failed at 80, and then failed earlier on subsequent runs... so I was tentatively optimistic at the time I posted the fix... but you are correct, it doesn't actually fix the issue.
The problem is related to veth interfaces and applies to these test scenarios: Kernel2Kernel, Memif2Kernel, Kernel2Memif.
Fail:
root@9bef63ebf491:/build/internal/tests# ip netns
endpoint-c5f8e8ae
client-f5a879ba (id: 1)
root@9bef63ebf491:/build/internal/tests# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/sit 0.0.0.0 brd 0.0.0.0
3: client-af0de89c@ns-af0de89c-41a: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 0a:cf:82:55:41:0a brd ff:ff:ff:ff:ff:ff
alias veth-client-af0de89c-41a1-4e5f-b504-91335b634960
4: ns-af0de89c-41a@client-af0de89c: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether da:da:35:1e:d1:c0 brd ff:ff:ff:ff:ff:ff
alias client-af0de89c-41a1-4e5f-b504-91335b634960
5: server-36a7fdb6@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 4e:33:cd:da:88:c4 brd ff:ff:ff:ff:ff:ff link-netns client-f5a879ba
alias veth-server-36a7fdb6-2c06-41ad-aa56-1d6af44d2702
294: eth0@if295: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
Success:
root@a958979d86b2:/build/internal/tests# ip netns
endpoint-f29f908a (id: 1)
client-d4fa0cd1 (id: 2)
root@a958979d86b2:/build/internal/tests# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/sit 0.0.0.0 brd 0.0.0.0
3: client-349e2785@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether f2:40:bb:2a:92:bc brd ff:ff:ff:ff:ff:ff link-netns endpoint-f29f908a
alias veth-client-349e2785-866a-4918-a125-3e48480e1f5f
5: server-7c48898c@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether ca:69:0b:05:bc:99 brd ff:ff:ff:ff:ff:ff link-netns client-d4fa0cd1
alias veth-server-7c48898c-67a7-4696-9fb0-380182b368ea
298: eth0@if299: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
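For reference, moving a veth peer into its target namespace is normally a single netlink operation. Below is a minimal sketch, assuming the github.com/vishvananda/netlink and github.com/vishvananda/netns packages; the interface and namespace names are taken from the failing dump above purely for illustration, and this is not the forwarder's actual code:

```go
package nsutil

import (
	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

// moveLinkToNamespace moves the named interface into the named network
// namespace. In the failing dump above, ns-af0de89c-41a should have ended
// up inside client-f5a879ba but is still visible in the root namespace.
func moveLinkToNamespace(linkName, nsName string) error {
	link, err := netlink.LinkByName(linkName)
	if err != nil {
		return err
	}
	handle, err := netns.GetFromName(nsName)
	if err != nil {
		return err
	}
	defer handle.Close()
	// Reassign the link to the namespace referenced by the fd.
	return netlink.LinkSetNsFd(link, int(handle))
}
```

The distinction matters for the question below: a failed LinkSetNsFd would mean the interface was never moved, while a later reappearance in the root namespace would mean it moved back.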
@Tiberivs Is the interface moving back to the namespace of its paired device, or is it never being moved to its own namespace at all?
Was networkservicemesh/sdk-vpp#106 supposed to be a partial or a full fix for this? I ask because I'm still seeing some instability after networkservicemesh/sdk-vpp#106 merged and propagated to cmd-forwarder-vpp.
Nope, networkservicemesh/sdk-vpp#106 fixes issue #66.
@edwarnicke I found where the problem occurs, but not its root cause.
The root of the problem is that the goroutine is switched to another OS thread after changing the network namespace. I will fix it.
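For context, here is a minimal sketch of the usual pattern for this, assuming the github.com/vishvananda/netns package: the goroutine has to be pinned to its OS thread with runtime.LockOSThread for as long as it is inside the foreign namespace, because the network namespace is a property of the OS thread and the Go scheduler is otherwise free to migrate the goroutine to a thread that is still in the original namespace.

```go
package nsutil

import (
	"runtime"

	"github.com/vishvananda/netns"
)

// inNamespace runs fn with the calling goroutine switched into target.
// LockOSThread is essential: without it the scheduler may move the
// goroutine to another OS thread mid-operation, and that thread still
// has the original network namespace.
func inNamespace(target netns.NsHandle, fn func() error) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Remember the current thread's namespace so it can be restored.
	origin, err := netns.Get()
	if err != nil {
		return err
	}
	defer origin.Close()
	defer netns.Set(origin) // restore before unlocking the thread

	if err := netns.Set(target); err != nil {
		return err
	}
	return fn()
}
```

Without the LockOSThread/UnlockOSThread pair, fn can start executing in the target namespace and finish in a different one, which matches the kind of intermittent missing-interface failures described above.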
@edwarnicke What sort of failures? Could you share a failure log?
@edwarnicke thanks! As I can see, you got an error
@Tiberivs Yes... the question is why :)
@Tiberivs I'm not sure that this is valid behavior. This test passes on CI in ~0.29s:
https://github.com/networkservicemesh/cmd-forwarder-vpp/runs/2032195233
That's actually pretty normal for me :)
@edwarnicke I can't reproduce your case; all my attempts fail with this error:
@Tiberivs OK... chase down the error you can reproduce :)
The
I don't quite understand this statement. Could you clarify?
I replaced
and all 5 packets were lost during testing.
@Tiberivs Interesting... I'm not getting that when I change the ping count there. I also changed:
To
And the tests continued working (at least in one-off testing). Are you seeing this as a means to make failure more likely, or does this change make it fail for you all the time?
@edwarnicke Rather, I mean that the number of ping packets doesn't influence the test at all. Initially I thought ping was losing the first one or two packets on failing tests, but it loses all packets. Sorry for the confusion.
@Tiberivs Got it :) So what are you seeing when you hit the error? Anything that we can chase down to fix?
Actually, I found nothing to fix. I'm leaving the project, and @denis-tingaikin asked me to describe my last steps on this issue.
I think this is likely fixed by: which then is included in our work in But we should keep an eye on it, as I'm sometimes seeing some errors related to:
The initial problem is fixed. Other problems will be moved to separate issues.
I wrote a simple script to run the Docker tests repeatedly and discovered that tests fail after several iterations (most often around 3-5). In rare cases the tests fail on the first run.
The issue was reproduced on two different platforms, including Ubuntu 20.04.
I collected logs from a few failed runs and attached them to the issue.
Script:
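(The script itself did not survive extraction here. Purely as an illustration of the kind of loop described above, a Go stand-in might look like the following; the test package path is an assumption, not taken from the repository:)

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// Run the Docker test suite repeatedly until it fails, mirroring the
// repro loop described in this issue. -count=1 disables test caching
// so every iteration actually re-runs the tests.
func main() {
	for i := 1; ; i++ {
		cmd := exec.Command("go", "test", "-count=1", "./internal/tests/...")
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			fmt.Printf("failed on iteration %d: %v\n", i, err)
			os.Exit(1)
		}
		fmt.Printf("iteration %d passed\n", i)
	}
}
```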