Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aardvark/Netavark flakes #14173

Closed
edsantiago opened this issue May 9, 2022 · 16 comments · Fixed by #14822
Closed

Aardvark/Netavark flakes #14173

edsantiago opened this issue May 9, 2022 · 16 comments · Fixed by #14822
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. stale-issue

Comments

@edsantiago
Copy link
Member

Aardvark Test 2: Two containers, same subnet
...

Expected
               <[]string | len:0, cap:0>: nil
           to have length 2
         
           /var/tmp/go/src/github.com[/containers/podman/test/e2e/run_aardvark_test.go:105](https://github.com/containers/podman/blob/bb8f53a727aeebb49873b2f4c9c12d6f1d0ce1d3/test/e2e/run_aardvark_test.go#L105)

Podman run networking [It] Aardvark Test 2: Two containers, same subnet

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label May 9, 2022
@edsantiago
Copy link
Member Author

This is flaking a lot lately. Can a network person PTAL.

@vrothberg
Copy link
Member

@flouthoc @Luap99 in case you have/find time.

@flouthoc
Copy link
Collaborator

flouthoc commented May 20, 2022

I am unable to reproduce this flake on host not entirely sure but I think i know why this happens on CI one reason it below but there can be more reasons for it.

One root cause could be that netavark spawns aardvark-dns using Command::spawn() and never waits for it especially in case of systemd-run. Usually it does not makes any difference since by the time podman actually runs command on container , aardvark-dns should be ready but on CI env on certain runs we are surpassing this window.

I think adding a backoff on netavark end should help here. Something like this: #14173

Above case is specially for the test [It] podman run check dnsname plugin with Netavark where this can happen more often.

@flouthoc
Copy link
Collaborator

PS: Also while running tests on host I am only running test/e2e/run_networking_test.go test/e2e/run_aardvark_test.go with commons assuming that tests are isolated with each other they share nothing.

flouthoc added a commit to flouthoc/aardvark-dns that referenced this issue Jun 14, 2022
This is one the redesign PR for aardvark-dns and netavark. Design
proposes that double-forking will happen at aardvark-end instead of
netavark.

Redesign proposal

* Aardvark will invoke server on the child process by double-forking.
* Parent waits for child to show up  and verify against a dummy DNS
  query to check if server is running.
* Exit parent on success and deatch child.

* Calling process will wait for aardvark's parent process to return.
* On successful return from parent it will be assumed that aardvark is
  running properly

One new design is implemented and merged and netavark starts using this
it should close

* containers/podman#14173
* containers/podman#14171

Signed-off-by: Aditya R <[email protected]>
@flouthoc
Copy link
Collaborator

There is a bigger design change under discussion which should help in this as well as some other issue, please track it here. However it will take some time to get merged: containers/aardvark-dns#148

flouthoc added a commit to flouthoc/aardvark-dns that referenced this issue Jun 14, 2022
This is one the redesign PR for aardvark-dns and netavark. Design
proposes that double-forking will happen at aardvark-end instead of
netavark.

Redesign proposal

* Aardvark will invoke server on the child process by double-forking.
* Parent waits for child to show up  and verify against a dummy DNS
  query to check if server is running.
* Exit parent on success and deatch child.

* Calling process will wait for aardvark's parent process to return.
* On successful return from parent it will be assumed that aardvark is
  running properly

One new design is implemented and merged and netavark starts using this
it should close

* containers/podman#14173
* containers/podman#14171

Signed-off-by: Aditya R <[email protected]>
flouthoc added a commit to flouthoc/aardvark-dns that referenced this issue Jun 15, 2022
This is one the redesign PR for aardvark-dns and netavark. Design
proposes that  forking will happen at aardvark-end instead of
netavark and aardvark will verify if servers are up before parent goes
away.

Redesign proposal

* Aardvark will invoke server on the child process by  forking.
* Parent waits for child to show up  and verify against a dummy DNS
  query to check if server is running.
* Exit parent on success and deatch child.

* Calling process will wait for aardvark's parent process to return.
* On successful return from parent it will be assumed that aardvark is
  running properly

One new design is implemented and merged and netavark starts using this
it should close

* containers/podman#14173
* containers/podman#14171

Signed-off-by: Aditya R <[email protected]>
flouthoc added a commit to flouthoc/aardvark-dns that referenced this issue Jun 15, 2022
This is one the redesign PR for aardvark-dns and netavark. Design
proposes that  forking will happen at aardvark-end instead of
netavark and aardvark will verify if servers are up before parent goes
away.

Redesign proposal

* Aardvark will invoke server on the child process by  forking.
* Parent waits for child to show up  and verify against a dummy DNS
  query to check if server is running.
* Exit parent on success and deatch child.

* Calling process will wait for aardvark's parent process to return.
* On successful return from parent it will be assumed that aardvark is
  running properly

One new design is implemented and merged and netavark starts using this
it should close

* containers/podman#14173
* containers/podman#14171

Signed-off-by: Aditya R <[email protected]>
flouthoc added a commit to flouthoc/aardvark-dns that referenced this issue Jun 15, 2022
This is one the redesign PR for aardvark-dns and netavark. Design
proposes that  forking will happen at aardvark-end instead of
netavark and aardvark will verify if servers are up before parent goes
away.

Redesign proposal

* Aardvark will invoke server on the child process by  forking.
* Parent waits for child to show up  and verify against a dummy DNS
  query to check if server is running.
* Exit parent on success and deatch child.

* Calling process will wait for aardvark's parent process to return.
* On successful return from parent it will be assumed that aardvark is
  running properly

One new design is implemented and merged and netavark starts using this
it should close

* containers/podman#14173
* containers/podman#14171

Signed-off-by: Aditya R <[email protected]>
flouthoc added a commit to flouthoc/aardvark-dns that referenced this issue Jun 15, 2022
This is one the redesign PR for aardvark-dns and netavark. Design
proposes that  forking will happen at aardvark-end instead of
netavark and aardvark will verify if servers are up before parent goes
away.

Redesign proposal

* Aardvark will invoke server on the child process by  forking.
* Parent waits for child to show up  and verify against a dummy DNS
  query to check if server is running.
* Exit parent on success and deatch child.

* Calling process will wait for aardvark's parent process to return.
* On successful return from parent it will be assumed that aardvark is
  running properly

One new design is implemented and merged and netavark starts using this
it should close

* containers/podman#14173
* containers/podman#14171

Signed-off-by: Aditya R <[email protected]>
flouthoc added a commit to flouthoc/aardvark-dns that referenced this issue Jun 15, 2022
This is one the redesign PR for aardvark-dns and netavark. Design
proposes that  forking will happen at aardvark-end instead of
netavark and aardvark will verify if servers are up before parent goes
away.

Redesign proposal

* Aardvark will invoke server on the child process by  forking.
* Parent waits for child to show up  and verify against a dummy DNS
  query to check if server is running.
* Exit parent on success and deatch child.

* Calling process will wait for aardvark's parent process to return.
* On successful return from parent it will be assumed that aardvark is
  running properly

One new design is implemented and merged and netavark starts using this
it should close

* containers/podman#14173
* containers/podman#14171

Signed-off-by: Aditya R <[email protected]>
flouthoc added a commit to flouthoc/aardvark-dns that referenced this issue Jun 15, 2022
This is one the redesign PR for aardvark-dns and netavark. Design
proposes that  forking will happen at aardvark-end instead of
netavark and aardvark will verify if servers are up before parent goes
away.

Redesign proposal

* Aardvark will invoke server on the child process by  forking.
* Parent waits for child to show up  and verify against a dummy DNS
  query to check if server is running.
* Exit parent on success and deatch child.

* Calling process will wait for aardvark's parent process to return.
* On successful return from parent it will be assumed that aardvark is
  running properly

One new design is implemented and merged and netavark starts using this
it should close

* containers/podman#14173
* containers/podman#14171

Signed-off-by: Aditya R <[email protected]>
flouthoc added a commit to flouthoc/aardvark-dns that referenced this issue Jun 16, 2022
This is one the redesign PR for aardvark-dns and netavark. Design
proposes that  forking will happen at aardvark-end instead of
netavark and aardvark will verify if servers are up before parent goes
away.

Redesign proposal

* Aardvark will invoke server on the child process by  forking.
* Parent waits for child to show up  and verify against a dummy DNS
  query to check if server is running.
* Exit parent on success and deatch child.

* Calling process will wait for aardvark's parent process to return.
* On successful return from parent it will be assumed that aardvark is
  running properly

One new design is implemented and merged and netavark starts using this
it should close

* containers/podman#14173
* containers/podman#14171

Signed-off-by: Aditya R <[email protected]>
flouthoc added a commit to flouthoc/aardvark-dns that referenced this issue Jun 16, 2022
This is one the redesign PR for aardvark-dns and netavark. Design
proposes that  forking will happen at aardvark-end instead of
netavark and aardvark will verify if servers are up before parent goes
away.

Redesign proposal

* Aardvark will invoke server on the child process by  forking.
* Parent waits for child to show up  and verify against a dummy DNS
  query to check if server is running.
* Exit parent on success and deatch child.

* Calling process will wait for aardvark's parent process to return.
* On successful return from parent it will be assumed that aardvark is
  running properly

One new design is implemented and merged and netavark starts using this
it should close

* containers/podman#14173
* containers/podman#14171

Signed-off-by: Aditya R <[email protected]>
@Luap99
Copy link
Member

Luap99 commented Jul 4, 2022

The netavark/aardvark changes are fine but I don't think it is required to fix this flake in CI. The test already "tries" to implement a back off logic:

podman/test/e2e/common_test.go

Lines 1045 to 1062 in 6dffa45

func digShort(container, lookupName string, matchNames []string, p *PodmanTestIntegration) {
digInterval := time.Millisecond * 250
for i := 0; i < 6; i++ {
time.Sleep(digInterval * time.Duration(i))
dig := p.Podman([]string{"exec", container, "dig", "+short", lookupName})
dig.WaitWithDefaultTimeout()
if dig.ExitCode() == 0 {
output := dig.OutputToString()
Expect(output).To(MatchRegexp(IPRegex))
for _, name := range matchNames {
Expect(output).To(Equal(name))
}
// success
return
}
}
Fail("dns is not responding")
}

This does not work because dig always returns exit code 0 even when there is no match.

Luap99 added a commit to Luap99/libpod that referenced this issue Jul 4, 2022
The retry logic in digshort() did not work because dig always exits with
0 even when the domain name is not found. To make it work we have to
check the standard output.

We work on fixing the underlying issue in aardvark/netavark but
this will take more time.

Fixes containers#14173
Fixes containers#14171

Signed-off-by: Paul Holzinger <[email protected]>
@flouthoc
Copy link
Collaborator

flouthoc commented Jul 4, 2022

Netavark/aardvark change is for more scenarios like #14356 (comment) i am not able to collect all the reference but it was reported few times that on the first container run dns resolve failed intermittently. But we can close these issues early if dig output match is fixed.

@edsantiago
Copy link
Member Author

I'm still seeing flakes on this test, and I'm somewhat sure that it's after the "fix" merged. Error message seems different:

dns is not responding

Shall I file a new issue?

Podman run networking [It] Aardvark Test 2: Two containers, same subnet

@Luap99
Copy link
Member

Luap99 commented Jul 7, 2022

Yeah something is wrong, I have to add more debugging information but at this point it does not look like a race. There should be more than enough timout.

@Luap99 Luap99 reopened this Jul 7, 2022
flouthoc added a commit to flouthoc/aardvark-dns that referenced this issue Jul 13, 2022
This is one the redesign PR for aardvark-dns and netavark. Design
proposes that  forking will happen at aardvark-end instead of
netavark and aardvark will verify if servers are up before parent goes
away.

Redesign proposal

* Aardvark will invoke server on the child process by  forking.
* Parent waits for child to show up  and verify against a dummy DNS
  query to check if server is running.
* Exit parent on success and deatch child.

* Calling process will wait for aardvark's parent process to return.
* On successful return from parent it will be assumed that aardvark is
  running properly

One new design is implemented and merged and netavark starts using this
it should close

* containers/podman#14173
* containers/podman#14171

Signed-off-by: Aditya R <[email protected]>
@edsantiago
Copy link
Member Author

New set of symptoms in remote f36 root:

Execing /usr/sbin/ip netns add XVlBzgbaiCMR

Cannot create namespace file "/var/run/netns/XVlBzgbaiCMR": File exists

The surprising thing about this is that a lot of tests fail, and all of them have the same presumably-random string. I would expect each test to have a unique string. Whoever groks the random-netns-name code, that might be a good starting place to look.

@edsantiago edsantiago changed the title Aardvark Test 2: Two containers, same subnet Aardvark/Netavark flakes Jul 14, 2022
@edsantiago
Copy link
Member Author

Renaming & repurposing this issue as a general place to log *vark flakes

@Luap99
Copy link
Member

Luap99 commented Jul 14, 2022

New set of symptoms in remote f36 root:

Execing /usr/sbin/ip netns add XVlBzgbaiCMR

Cannot create namespace file "/var/run/netns/XVlBzgbaiCMR": File exists

The surprising thing about this is that a lot of tests fail, and all of them have the same presumably-random string. I would expect each test to have a unique string. Whoever groks the random-netns-name code, that might be a good starting place to look.

This should not effect netavark/aardvark at all, using a custom net namespace will never call into the network backend.

@edsantiago
Copy link
Member Author

Another one, this time podman (non-remote) f36 root, same "file exists" symptom

@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Copy link
Member Author

Last seen November 10. I will assume this is fixed.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 5, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. stale-issue
Projects
None yet
4 participants