
Docker tests instability #64

Closed
Tiberivs opened this issue Feb 16, 2021 · 29 comments · Fixed by #86
Assignees
Labels
bug Something isn't working

Comments

@Tiberivs
Contributor

I wrote a simple script to run the Docker tests repeatedly and discovered that the tests fail after several iterations (most often around 3-5). In some cases the tests fail on the first run.

The issue was reproduced on two different platforms, including Ubuntu 20.04.
I collected logs from a few failed runs and attached them to this issue.

Script:

#!/bin/bash

DOCKER_IMAGE=$(docker build -q --target test .)
LOG_FILE='/tmp/test.log'

i=1
while true; do
  echo -n "Test ${i}..."
  docker run --privileged --rm ${DOCKER_IMAGE} > "${LOG_FILE}"
  result=$?
  if [ $result -ne 0 ]; then
    echo 'FAIL'
    exit $result
  fi
  echo 'DONE'
  i=$((i+1))
done
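The script builds the test image once, then runs it in a loop until a run exits non-zero; each iteration overwrites /tmp/test.log, so after a failure that file holds the output of the failing run.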
@denis-tingaikin denis-tingaikin added the bug Something isn't working label Feb 16, 2021
@edwarnicke
Member

On it

edwarnicke added a commit to edwarnicke/sdk-vpp that referenced this issue Feb 16, 2021
@edwarnicke
Member

It seems we have two kinds of problems:

  1. Missing interface
  2. Client connection closing

The missing interface problem looks like this:

Feb 17 02:11:37.821 [INFO] Verifying Connection (time since start: 148.6893ms)
PING 10.0.0.0 (10.0.0.0) 56(84) bytes of data.
64 bytes from 10.0.0.0: icmp_seq=1 ttl=64 time=0.662 ms

--- 10.0.0.0 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.662/0.662/0.662/0.000 ms
    suite_combinatronics_test.go:126: 
        	Error Trace:	suite_combinatronics_test.go:126
        	Error:      	Received unexpected error:
        	            	Link not found
        	            	unable to find interface "ns-e47ecd59-447"
        	            	github.com/networkservicemesh/cmd-forwarder-vpp/internal/tests.checkKernelInterface
        	            		/build/internal/tests/suite_kernel_test.go:201
        	            	github.com/networkservicemesh/cmd-forwarder-vpp/internal/tests.(*kernelVerifiableEndpoint).VerifyConnection
        	            		/build/internal/tests/suite_kernel_test.go:119
        	            	github.com/networkservicemesh/cmd-forwarder-vpp/internal/tests.(*ForwarderTestSuite).TestCombinations.func5
        	            		/build/internal/tests/suite_combinatronics_test.go:126
        	            	testing.tRunner
        	            		/go/src/testing/testing.go:1123
        	            	runtime.goexit
        	            		/go/src/runtime/asm_amd64.s:1374
        	Test:       	TestForwarderTestSuite/TestCombinations/KernelToKernel
=== RUN   TestForwarderTestSuite/TestCombinations/MemifToKernel
Feb 17 02:11:37.821 [INFO] Launching TestForwarderTestSuite/TestCombinations/MemifToKernel test server (time since start: 40.8µs)

The client connection closing problem looks like this:

Feb 17 02:11:38.872 [INFO] [name:forwarder] (36)                                    ⎆ sdk/pkg/networkservice/common/updatetoken/updateTokenClient.Request()
Feb 17 02:11:38.872 [INFO] [name:forwarder] (36.1)                                      request-diff={"connection":{"path":{"path_segments":{"1":{"token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy90ZXN0cy50ZXN0IiwiZXhwIjoxNjEzNTMxNDk4LCJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9mb3J3YXJkZXIifQ.tJ2f5zoN5PMYU9e7NBzx95Mkb4EoaCQ1OG56QUd_xWDskc5FZOKGHttntcAR15syxmwZr1eT0MxdqCgkTYWakA"}}}}}
Feb 17 02:11:38.872 [INFO] [name:forwarder] (37)                                     ⎆ api/pkg/api/networkservice/networkServiceClient.Request()
Feb 17 02:11:38.872 [ERRO] [name:forwarder] (37.1)                                       rpc error: code = Canceled desc = grpc: the client connection is closing

@edwarnicke
Member

What's weird about the missing interface case is that it happens after a successful ping.

@Tiberivs
Contributor Author

@edwarnicke I retested with your fix for sdk-vpp, but it doesn't affect the results.
I collected new logs from tests with the sdk-vpp fix.

@edwarnicke
Member

@Tiberivs Thank you for the retest :)

I posted the fix after I was 50 successful tests into a run... it failed at 80, and then failed earlier on subsequent runs... so I was tentatively optimistic at the time I posted the fix... but you are correct, it doesn't actually fix the issue.

@Tiberivs
Contributor Author

Tiberivs commented Mar 2, 2021

The problem is related to veth interfaces and applies to the test scenarios Kernel2Kernel, Memif2Kernel, and Kernel2Memif.
It seems one of the paired interfaces moves back from its own namespace to the namespace of its paired device:

Fail:

root@9bef63ebf491:/build/internal/tests# ip netns
endpoint-c5f8e8ae
client-f5a879ba (id: 1)
root@9bef63ebf491:/build/internal/tests# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
3: client-af0de89c@ns-af0de89c-41a: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 0a:cf:82:55:41:0a brd ff:ff:ff:ff:ff:ff
    alias veth-client-af0de89c-41a1-4e5f-b504-91335b634960
4: ns-af0de89c-41a@client-af0de89c: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether da:da:35:1e:d1:c0 brd ff:ff:ff:ff:ff:ff
    alias client-af0de89c-41a1-4e5f-b504-91335b634960
5: server-36a7fdb6@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 4e:33:cd:da:88:c4 brd ff:ff:ff:ff:ff:ff link-netns client-f5a879ba
    alias veth-server-36a7fdb6-2c06-41ad-aa56-1d6af44d2702
294: eth0@if295: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0

Success:

root@a958979d86b2:/build/internal/tests# ip netns
endpoint-f29f908a (id: 1)
client-d4fa0cd1 (id: 2)
root@a958979d86b2:/build/internal/tests# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
3: client-349e2785@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f2:40:bb:2a:92:bc brd ff:ff:ff:ff:ff:ff link-netns endpoint-f29f908a
    alias veth-client-349e2785-866a-4918-a125-3e48480e1f5f
5: server-7c48898c@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether ca:69:0b:05:bc:99 brd ff:ff:ff:ff:ff:ff link-netns client-d4fa0cd1
    alias veth-server-7c48898c-67a7-4696-9fb0-380182b368ea
298: eth0@if299: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0

@edwarnicke
Member

@Tiberivs Is the interface moving back to the namespace of its paired device, or is it never being moved to its own namespace at all?

@Tiberivs Tiberivs self-assigned this Mar 3, 2021
@edwarnicke
Member

Was networkservicemesh/sdk-vpp#106 supposed to be a partial or full fix for this?

I ask because I'm still seeing some instability after networkservicemesh/sdk-vpp#106 merged and propagated to cmd-forwarder-vpp.

@Tiberivs
Contributor Author

Tiberivs commented Mar 4, 2021

Nope, networkservicemesh/sdk-vpp#106 fixes issue #66.

@Tiberivs
Contributor Author

Tiberivs commented Mar 4, 2021

@edwarnicke I found where the problem occurs, but not its root cause.
The problem points here and here. The ns.Request() chain element successfully changes the current namespace to another one, but something changes it back before grpcfd.FilenameToURL() inside sendfd.Request() is called. Hence grpcfd.FilenameToURL() gets the wrong namespace inode.

@Tiberivs
Contributor Author

Tiberivs commented Mar 4, 2021

The root of the problem is the goroutine being moved to another OS thread after changing the namespace. I will fix it.
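For context, the general pattern for this kind of fix looks like the sketch below: pin the goroutine to its OS thread with runtime.LockOSThread() before switching namespaces, so the scheduler cannot move it to a thread that is still in the original namespace. This is a generic illustration assuming the github.com/vishvananda/netns package; the helper name doInNetNS is hypothetical, and this is not necessarily the exact change made in #86.

package nsutil // illustrative package, not part of the repository

import (
	"runtime"

	"github.com/vishvananda/netns"
)

// doInNetNS runs fn while the calling goroutine is pinned to a single OS
// thread that has been switched into the target network namespace.
// Without runtime.LockOSThread the scheduler may move the goroutine to a
// different OS thread after netns.Set, so later calls (such as reading the
// namespace inode) would observe the wrong namespace.
func doInNetNS(target netns.NsHandle, fn func() error) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Remember the namespace this thread started in so it can be restored.
	origin, err := netns.Get()
	if err != nil {
		return err
	}
	defer origin.Close()
	defer netns.Set(origin) // best-effort restore before unlocking the thread

	if err := netns.Set(target); err != nil {
		return err
	}
	return fn()
}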

@edwarnicke
Member

@Tiberivs #86 is a good catch, but I am still seeing failures from your test.sh script locally.

@edwarnicke edwarnicke reopened this Mar 4, 2021
@Tiberivs
Contributor Author

Tiberivs commented Mar 4, 2021

@edwarnicke What sort of failures? Could you share a failure log?

@edwarnicke
Member

@Tiberivs
Contributor Author

Tiberivs commented Mar 4, 2021

@edwarnicke Thanks! As I can see, you got the error Mar 4 15:07:07.380 [ERRO] [name:forwarder] context deadline exceeded.
This error seems valid, because the context lifetime is 100 seconds and your test failed after 100 seconds of execution:
FAIL: TestForwarderTestSuite/TestCombinations/KernelToKernel (100.00s)
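For illustration only, here is a generic sketch (not the suite's actual setup code) of how a 100-second context deadline produces exactly this error once the deadline passes:

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Hypothetical stand-in for the test context with a 100-second lifetime.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Second)
	defer cancel()

	<-ctx.Done()           // fires once the 100-second deadline passes
	fmt.Println(ctx.Err()) // prints "context deadline exceeded"
}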

@edwarnicke
Member

@Tiberivs Yes... the question is why :)

@denis-tingaikin
Member

denis-tingaikin commented Mar 4, 2021

@Tiberivs I'm not sure that this is valid behavior.

This test passes on CI in ~0.29s:

        --- PASS: TestForwarderTestSuite/TestCombinations/KernelToKernel (0.29s)

https://github.com/networkservicemesh/cmd-forwarder-vpp/runs/2032195233

@edwarnicke
Member

@Tiberivs I'm not sure that this is valid behavior.

This test passes on CI in ~0.29s:

        --- PASS: TestForwarderTestSuite/TestCombinations/KernelToKernel (0.29s)

https://github.com/networkservicemesh/cmd-forwarder-vpp/runs/2032195233

That's actually pretty normal for me :)

@Tiberivs
Contributor Author

Tiberivs commented Mar 9, 2021

@edwarnicke I can't reproduce your case; all my attempts fail with this error:

    suite_combinatronics_test.go:124: 
        	Error Trace:	suite_combinatronics_test.go:124
        	Error:      	Received unexpected error:
        	            	exit status 1
        	            	failed to ping with command "ping -c 1 10.0.0.1"
        	            	github.com/networkservicemesh/cmd-forwarder-vpp/internal/tests.pingKernel
        	            		/build/internal/tests/suite_kernel_test.go:244
        	            	github.com/networkservicemesh/cmd-forwarder-vpp/internal/tests.(*kernelVerifiableEndpoint).VerifyConnection
        	            		/build/internal/tests/suite_kernel_test.go:122
        	            	github.com/networkservicemesh/cmd-forwarder-vpp/internal/tests.(*ForwarderTestSuite).TestCombinations.func5
        	            		/build/internal/tests/suite_combinatronics_test.go:124
        	            	testing.tRunner
        	            		/go/src/testing/testing.go:1123
        	            	runtime.goexit
        	            		/go/src/runtime/asm_amd64.s:1374
        	Test:       	TestForwarderTestSuite/TestCombinations/MemifToKernel

forwarder.log

@edwarnicke
Member

@Tiberivs OK... chase down the error you can reproduce :)

@Tiberivs
Contributor Author

The ping problem is hard to reproduce; it appears after 80 successful tests in my case.
I tried testing with 5 ping packets instead of 1 (the default) - all packets were lost.
It seems the problem is not related to the initialization time of the components.

@Tiberivs Tiberivs removed their assignment Mar 19, 2021
@edwarnicke
Member

I tried testing with 5 ping packets instead of 1 (the default) - all packets were lost.
It seems the problem is not related to the initialization time of the components.

I don't quite understand this statement. Could you clarify?

@Tiberivs
Contributor Author

I replaced -c 1 with -c 5 here:

pingStr := fmt.Sprintf("ping -c 1 %s", ipnet.IP.String())

and all 5 packets were lost during testing.
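For clarity, the modified line presumably looked like this (my reconstruction of the experiment, not a committed change):

pingStr := fmt.Sprintf("ping -c 5 %s", ipnet.IP.String()) // 5 probes instead of the default 1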

@edwarnicke
Member

@Tiberivs Interesting... I'm not getting that when I change the ping count there. I also changed:

Cmd: fmt.Sprintf("ping %s interval 0.1 repeat 1 verbose", ip.String()),

To repeat 5

And the tests continued working (at least in one-off testing).

Are you seeing this as a means to make failure more likely, or does this change make it fail for you all the time?

@Tiberivs
Contributor Author

@edwarnicke Rather, I mean that the number of ping packets doesn't influence the test at all. Initially I thought ping was losing the first one or two packets on failing tests, but it loses all packets. Sorry for the confusion.

@edwarnicke
Member

@Tiberivs Got it :)

So what are you seeing when you see the error? Anything that we can chase down to fix?

@Tiberivs
Contributor Author

Actually, I haven't found anything to fix. I'm leaving the project, and @denis-tingaikin asked me to describe my last steps on this issue.

@edwarnicke
Member

I think this is likely fixed by:

edwarnicke/grpcfd#18

which then is included in our work in

networkservicemesh/sdk#909

But we should keep an eye on it, as I'm sometimes seeing some errors related to:

May  7 20:37:43.732 [WARN] [name:forwarder] (1.2)   an error during getting token from the context: metadata is missed in ctx
github.com/networkservicemesh/sdk/pkg/tools/token.FromContext
	/build/local/sdk/pkg/tools/token/context.go:38
github.com/networkservicemesh/sdk/pkg/networkservice/common/updatepath.(*updatePathServer).Request
	/build/local/sdk/pkg/networkservice/common/updatepath/server.go:64

@Mixaster995
Contributor

The initial problem is fixed. Other problems will be moved to separate issues.
