
Docker tests instability #64

Closed
Tiberivs opened this issue Feb 16, 2021 · 29 comments · Fixed by #86
Assignees
Labels
bug Something isn't working

Comments

@Tiberivs
Contributor

I wrote a simple script to run the Docker tests repeatedly and discovered that the tests fail after several iterations (most often around 3-5). In some cases the tests fail on the first run.

The issue was reproduced on two different platforms, including Ubuntu 20.04.
I collected logs from a few failed runs and attached them to this issue.

Script:

#!/bin/bash

DOCKER_IMAGE=$(docker build -q --target test .)
LOG_FILE='/tmp/test.log'

i=1
while true; do
  echo -n "Test ${i}..."
  docker run --privileged --rm ${DOCKER_IMAGE} > "${LOG_FILE}"
  result=$?
  if [ $result -ne 0 ]; then
    echo 'FAIL'
    exit $result
  fi
  echo 'DONE'
  i=$((i+1))
done
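The script builds the test image once, then runs it in a loop until a run exits non-zero; each iteration overwrites /tmp/test.log, so after a failure that file holds the output of the failing run.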
@denis-tingaikin denis-tingaikin added the bug Something isn't working label Feb 16, 2021
@edwarnicke
Member

On it

edwarnicke added a commit to edwarnicke/sdk-vpp that referenced this issue Feb 16, 2021
@edwarnicke
Member

It seems we have two kinds of problems:

  1. Missing interface
  2. Client connection closing

The missing interface problem looks like this:

Feb 17 02:11:37.821 [INFO] Verifying Connection (time since start: 148.6893ms)
PING 10.0.0.0 (10.0.0.0) 56(84) bytes of data.
64 bytes from 10.0.0.0: icmp_seq=1 ttl=64 time=0.662 ms

--- 10.0.0.0 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.662/0.662/0.662/0.000 ms
    suite_combinatronics_test.go:126: 
        	Error Trace:	suite_combinatronics_test.go:126
        	Error:      	Received unexpected error:
        	            	Link not found
        	            	unable to find interface "ns-e47ecd59-447"
        	            	github.com/networkservicemesh/cmd-forwarder-vpp/internal/tests.checkKernelInterface
        	            		/build/internal/tests/suite_kernel_test.go:201
        	            	github.com/networkservicemesh/cmd-forwarder-vpp/internal/tests.(*kernelVerifiableEndpoint).VerifyConnection
        	            		/build/internal/tests/suite_kernel_test.go:119
        	            	github.com/networkservicemesh/cmd-forwarder-vpp/internal/tests.(*ForwarderTestSuite).TestCombinations.func5
        	            		/build/internal/tests/suite_combinatronics_test.go:126
        	            	testing.tRunner
        	            		/go/src/testing/testing.go:1123
        	            	runtime.goexit
        	            		/go/src/runtime/asm_amd64.s:1374
        	Test:       	TestForwarderTestSuite/TestCombinations/KernelToKernel
=== RUN   TestForwarderTestSuite/TestCombinations/MemifToKernel
Feb 17 02:11:37.821 [INFO] Launching TestForwarderTestSuite/TestCombinations/MemifToKernel test server (time since start: 40.8µs)

The client connection closing problem looks like this:

Feb 17 02:11:38.872 [INFO] [name:forwarder] (36)                                    ⎆ sdk/pkg/networkservice/common/updatetoken/updateTokenClient.Request()
Feb 17 02:11:38.872 [INFO] [name:forwarder] (36.1)                                      request-diff={"connection":{"path":{"path_segments":{"1":{"token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy90ZXN0cy50ZXN0IiwiZXhwIjoxNjEzNTMxNDk4LCJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9mb3J3YXJkZXIifQ.tJ2f5zoN5PMYU9e7NBzx95Mkb4EoaCQ1OG56QUd_xWDskc5FZOKGHttntcAR15syxmwZr1eT0MxdqCgkTYWakA"}}}}}
Feb 17 02:11:38.872 [INFO] [name:forwarder] (37)                                     ⎆ api/pkg/api/networkservice/networkServiceClient.Request()
Feb 17 02:11:38.872 [ERRO] [name:forwarder] (37.1)                                       rpc error: code = Canceled desc = grpc: the client connection is closing

@edwarnicke
Member

What's weird about the missing interface case is that it happens after a successful ping.

@Tiberivs
Contributor Author

@edwarnicke I retested with your fix for sdk-vpp, but it doesn't affect the results.
I collected new logs from tests with the sdk-vpp fix.

@edwarnicke
Member

@Tiberivs Thank you for the retest :)

I posted the fix after I was 50 successful tests into a run... it failed at 80, and then failed earlier on subsequent runs... so I was tentatively optimistic at the time I posted the fix... but you are correct, it doesn't actually fix the issue.

@Tiberivs
Contributor Author

Tiberivs commented Mar 2, 2021

The problem is related to veth interfaces and applies to the test scenarios Kernel2Kernel, Memif2Kernel, and Kernel2Memif.
It seems one of the paired interfaces moves back from its own namespace to the namespace of its paired device:

Fail:

root@9bef63ebf491:/build/internal/tests# ip netns
endpoint-c5f8e8ae
client-f5a879ba (id: 1)
root@9bef63ebf491:/build/internal/tests# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
3: client-af0de89c@ns-af0de89c-41a: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 0a:cf:82:55:41:0a brd ff:ff:ff:ff:ff:ff
    alias veth-client-af0de89c-41a1-4e5f-b504-91335b634960
4: ns-af0de89c-41a@client-af0de89c: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether da:da:35:1e:d1:c0 brd ff:ff:ff:ff:ff:ff
    alias client-af0de89c-41a1-4e5f-b504-91335b634960
5: server-36a7fdb6@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 4e:33:cd:da:88:c4 brd ff:ff:ff:ff:ff:ff link-netns client-f5a879ba
    alias veth-server-36a7fdb6-2c06-41ad-aa56-1d6af44d2702
294: eth0@if295: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0

Success:

root@a958979d86b2:/build/internal/tests# ip netns
endpoint-f29f908a (id: 1)
client-d4fa0cd1 (id: 2)
root@a958979d86b2:/build/internal/tests# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
3: client-349e2785@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f2:40:bb:2a:92:bc brd ff:ff:ff:ff:ff:ff link-netns endpoint-f29f908a
    alias veth-client-349e2785-866a-4918-a125-3e48480e1f5f
5: server-7c48898c@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether ca:69:0b:05:bc:99 brd ff:ff:ff:ff:ff:ff link-netns client-d4fa0cd1
    alias veth-server-7c48898c-67a7-4696-9fb0-380182b368ea
298: eth0@if299: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0

@edwarnicke
Member

@Tiberivs Is the interface moving back to the namespace of its paired device, or is it never being moved to its own namespace at all?

@Tiberivs Tiberivs self-assigned this Mar 3, 2021
@edwarnicke
Member

Was networkservicemesh/sdk-vpp#106 supposed to be a partial or full fix for this?

I ask because I'm still seeing some instability after networkservicemesh/sdk-vpp#106 merged and propagated to cmd-forwarder-vpp.

@Tiberivs
Contributor Author

Tiberivs commented Mar 4, 2021

Nope, networkservicemesh/sdk-vpp#106 fixes issue #66.

@Tiberivs
Contributor Author

Tiberivs commented Mar 4, 2021

@edwarnicke I found where the problem occurs, but not its root cause.
The problem points here and here. The ns.Request() chain element successfully changes the current namespace to another one, but something changes it back before grpcfd.FilenameToURL() inside sendfd.Request() is called. Hence grpcfd.FilenameToURL() gets the wrong namespace inode.

@Tiberivs
Contributor Author

Tiberivs commented Mar 4, 2021

The root of the problem is the goroutine being moved to another OS thread after changing the namespace. I will fix it.
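For context, the general pattern for this kind of fix looks like the sketch below: pin the goroutine to its OS thread with runtime.LockOSThread() before switching namespaces, so the scheduler cannot move it to a thread that is still in the original namespace. This is a generic illustration assuming the github.com/vishvananda/netns package; the helper name doInNetNS is hypothetical, and this is not necessarily the exact change made in #86.

package nsutil // illustrative package, not part of the repository

import (
	"runtime"

	"github.com/vishvananda/netns"
)

// doInNetNS runs fn while the calling goroutine is pinned to a single OS
// thread that has been switched into the target network namespace.
// Without runtime.LockOSThread the scheduler may move the goroutine to a
// different OS thread after netns.Set, so later calls (such as reading the
// namespace inode) would observe the wrong namespace.
func doInNetNS(target netns.NsHandle, fn func() error) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Remember the namespace this thread started in so it can be restored.
	origin, err := netns.Get()
	if err != nil {
		return err
	}
	defer origin.Close()
	defer netns.Set(origin) // best-effort restore before unlocking the thread

	if err := netns.Set(target); err != nil {
		return err
	}
	return fn()
}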

@edwarnicke
Member

@Tiberivs #86 is a good catch, but I am still seeing failures from your test.sh script locally.

@edwarnicke edwarnicke reopened this Mar 4, 2021
@Tiberivs
Contributor Author

Tiberivs commented Mar 4, 2021

@edwarnicke What sort of failures? Could you share a failure log?

@edwarnicke
Member

@Tiberivs
Contributor Author

Tiberivs commented Mar 4, 2021

@edwarnicke Thanks! As I can see, you got the error Mar 4 15:07:07.380 [ERRO] [name:forwarder] context deadline exceeded.
This error seems valid, because the context lifetime is 100 seconds and your test failed after 100 seconds of execution:
FAIL: TestForwarderTestSuite/TestCombinations/KernelToKernel (100.00s)
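For illustration only, here is a generic sketch (not the suite's actual setup code) of how a 100-second context deadline produces exactly this error once the deadline passes:

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Hypothetical stand-in for the test context with a 100-second lifetime.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Second)
	defer cancel()

	<-ctx.Done()           // fires once the 100-second deadline passes
	fmt.Println(ctx.Err()) // prints "context deadline exceeded"
}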

@edwarnicke
Member

@Tiberivs Yes... the question is why :)

@denis-tingaikin
Member

denis-tingaikin commented Mar 4, 2021

@Tiberivs I'm not sure that this is valid behavior.

This test passes on CI in ~0.29s:

        --- PASS: TestForwarderTestSuite/TestCombinations/KernelToKernel (0.29s)

https://github.com/networkservicemesh/cmd-forwarder-vpp/runs/2032195233

@edwarnicke
Member

@Tiberivs I'm not sure that this is valid behavior.

This test passes on CI in ~0.29s:

        --- PASS: TestForwarderTestSuite/TestCombinations/KernelToKernel (0.29s)

https://github.com/networkservicemesh/cmd-forwarder-vpp/runs/2032195233

That's actually pretty normal for me :)

@Tiberivs
Contributor Author

Tiberivs commented Mar 9, 2021

@edwarnicke I can't reproduce your case; all my attempts fail with this error:

    suite_combinatronics_test.go:124: 
        	Error Trace:	suite_combinatronics_test.go:124
        	Error:      	Received unexpected error:
        	            	exit status 1
        	            	failed to ping with command "ping -c 1 10.0.0.1"
        	            	github.com/networkservicemesh/cmd-forwarder-vpp/internal/tests.pingKernel
        	            		/build/internal/tests/suite_kernel_test.go:244
        	            	github.com/networkservicemesh/cmd-forwarder-vpp/internal/tests.(*kernelVerifiableEndpoint).VerifyConnection
        	            		/build/internal/tests/suite_kernel_test.go:122
        	            	github.com/networkservicemesh/cmd-forwarder-vpp/internal/tests.(*ForwarderTestSuite).TestCombinations.func5
        	            		/build/internal/tests/suite_combinatronics_test.go:124
        	            	testing.tRunner
        	            		/go/src/testing/testing.go:1123
        	            	runtime.goexit
        	            		/go/src/runtime/asm_amd64.s:1374
        	Test:       	TestForwarderTestSuite/TestCombinations/MemifToKernel

forwarder.log

@edwarnicke
Member

@Tiberivs OK... chase down the error you can reproduce :)

@Tiberivs
Contributor Author

The ping problem is hard to reproduce; it appears after 80 successful tests in my case.
I tried testing with 5 ping packets instead of 1 (the default) - all packets were lost.
It seems the problem is not related to the initialization time of the components.

@Tiberivs Tiberivs removed their assignment Mar 19, 2021
@edwarnicke
Member

I tried testing with 5 ping packets instead of 1 (the default) - all packets were lost.
It seems the problem is not related to the initialization time of the components.

I don't quite understand this statement. Could you clarify?

@Tiberivs
Contributor Author

I replaced -c 1 with -c 5 here:

pingStr := fmt.Sprintf("ping -c 1 %s", ipnet.IP.String())

and all 5 packets were lost during testing.
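For clarity, the modified line presumably looked like this (my reconstruction of the experiment, not a committed change):

pingStr := fmt.Sprintf("ping -c 5 %s", ipnet.IP.String()) // 5 probes instead of the default 1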

@edwarnicke
Member

@Tiberivs Interesting... I'm not getting that when I change the ping count there. I also changed:

Cmd: fmt.Sprintf("ping %s interval 0.1 repeat 1 verbose", ip.String()),

To repeat 5

And the tests continued working (at least in one-off testing).

Are you seeing this as a means to make failure more likely, or does this change make it fail for you all the time?

@Tiberivs
Contributor Author

@edwarnicke Rather, I mean that the number of ping packets doesn't influence the test at all. Initially I thought ping was losing the first one or two packets on failing tests, but it loses all packets. Sorry for the confusion.

@edwarnicke
Member

@Tiberivs Got it :)

So what are you seeing when you see the error? Anything that we can chase down to fix?

@Tiberivs
Contributor Author

Actually, I haven't found anything to fix. I'm leaving the project, and @denis-tingaikin asked me to describe my last steps on this issue.

@edwarnicke
Member

I think this is likely fixed by:

edwarnicke/grpcfd#18

which then is included in our work in

networkservicemesh/sdk#909

But we should keep an eye on it, as I'm sometimes seeing some errors related to:

May  7 20:37:43.732 [WARN] [name:forwarder] (1.2)   an error during getting token from the context: metadata is missed in ctx
github.com/networkservicemesh/sdk/pkg/tools/token.FromContext
	/build/local/sdk/pkg/tools/token/context.go:38
github.com/networkservicemesh/sdk/pkg/networkservice/common/updatepath.(*updatePathServer).Request
	/build/local/sdk/pkg/networkservice/common/updatepath/server.go:64

@Mixaster995
Contributor

The initial problem is fixed. Other problems will be moved to separate issues.
