[e2e] network connect: tearing down network namespace configuration: netavark: IO error: aardvark pid not found #18325

Closed
edsantiago opened this issue Apr 24, 2023 · 7 comments
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. network Networking related issue or feature stale-issue

Comments

@edsantiago
Member

In rawhide:

  podman network connect
...
# podman [options] rm --time=0 -f test
time="2023-04-24T11:25:57-05:00" level=error msg="Unable to clean up network for container <cid>:
      \"tearing down network namespace configuration for container <cid>:
      netavark: IO error: aardvark pid not found\""

Only one instance, but I expect this to keep recurring. [reason: this happened within hours of enabling rawhide]

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Apr 24, 2023
@Luap99 Luap99 added the network Networking related issue or feature label Apr 25, 2023
@Luap99
Member

Luap99 commented Apr 25, 2023

I am not sure what could be causing this. Unfortunately, netavark throws away the error context, so it is hard to tell without a reproducer.

In any case, I will patch netavark to report a better error.

@Luap99
Member

Luap99 commented Apr 25, 2023

From the journal around the time the error was logged:

Apr 24 11:25:53 cirrus-task-6638728005812224 systemd[1]: Started run-r574bf4ef0fed4912b8c230f7ceca24e3.scope - /usr/libexec/podman/aardvark-dns --config /run/containers/networks/aardvark-dns -p 53 run.
Apr 24 11:25:53 cirrus-task-6638728005812224 systemd[1]: Started libpod-b3598785b3e16413734384429fd24087d629698ba0f8c812a0f523b2aa81495b.scope - libcrun container.
Apr 24 11:25:53 cirrus-task-6638728005812224 audit: BPF prog-id=2135 op=LOAD
Apr 24 11:25:53 cirrus-task-6638728005812224 systemd[1]: run-r574bf4ef0fed4912b8c230f7ceca24e3.scope: Deactivated successfully.

So it looks like aardvark-dns stopped immediately after starting; when we later tried to stop it, it was already gone.

Luap99 added a commit to Luap99/netavark that referenced this issue Apr 25, 2023
Ed reported a flake in podman where aardvark-dns pid was not found when
we tried to stop it. It is not clear why but the error report is far
from optimal, we should never throw away the original error context.
This is very important to understand what is going on.

Port most over to our netavark error type and wrap where needed.
Also change the pid type to pid_t. It is i32 so it did not cause any
issues before but this makes it more obvious that we have the correct
type.

[1] containers/podman#18325

Signed-off-by: Paul Holzinger <[email protected]>
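
For readers following along, here is a minimal sketch of what "never throw away the original error context" can look like when reading a pid file. It is not netavark's code: `PidFileError`, `read_pid`, and the `Pid` alias are hypothetical names made up for illustration, and `Pid` stands in for `libc::pid_t` (which is `i32` on Linux) so that the example needs only the standard library.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::num::ParseIntError;
use std::path::{Path, PathBuf};

/// Stand-in for `libc::pid_t`, which is `i32` on Linux (assumption: no libc crate here).
type Pid = i32;

/// Hypothetical error type for illustration; not netavark's actual error type.
#[derive(Debug)]
enum PidFileError {
    /// Reading the file failed; the original io::Error is kept as the source.
    Read { path: PathBuf, source: io::Error },
    /// The file was read but did not contain a valid pid.
    Parse { path: PathBuf, source: ParseIntError },
}

impl fmt::Display for PidFileError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PidFileError::Read { path, source } => {
                write!(f, "failed to read aardvark pid file {}: {}", path.display(), source)
            }
            PidFileError::Parse { path, source } => {
                write!(f, "invalid pid in {}: {}", path.display(), source)
            }
        }
    }
}

impl std::error::Error for PidFileError {
    // Expose the underlying error so callers can walk the error chain.
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            PidFileError::Read { source, .. } => Some(source),
            PidFileError::Parse { source, .. } => Some(source),
        }
    }
}

/// Read a pid from `path`, wrapping any failure with the path and the original error.
fn read_pid(path: &Path) -> Result<Pid, PidFileError> {
    let contents = fs::read_to_string(path).map_err(|source| PidFileError::Read {
        path: path.to_path_buf(),
        source,
    })?;
    contents
        .trim()
        .parse::<Pid>()
        .map_err(|source| PidFileError::Parse {
            path: path.to_path_buf(),
            source,
        })
}
```

With an error shaped like this, a missing pid file would be reported together with its path and the underlying "not found" error, rather than only as "aardvark pid not found".
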
@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

Still happening. Do these include the new better errors?

  • fedora-38 : int remote fedora-38 root host boltdb [remote]
    • 05-30 09:35 in podman verify network scoped DNS server and also verify updating network dns server
  • rawhide : int podman rawhide root host sqlite
    • 06-07 17:07 in Podman run networking Aardvark Test 6: Three subnets, first container on 1/2 and second on 2/3, w/ network aliases
    • 04-24-2023 12:38 in Podman network connect and disconnect [It] podman network connect

Also, is this flake (rawhide root) the same thing, or a different bug?

  [It] podman verify network scoped DNS server and also verify updating network dns server
...
# podman [options] run -d --name con1 --network IntTestfaf5dc0e55 busybox top
dc9b31d661fe71a979c3e28a0c1a0a5e0950809797c28474c8c46d8036dca53a
# podman [options] exec con1 nslookup google.com 10.89.0.1
nslookup: write to '10.89.0.1': Connection refused
;; connection timed out; no servers could be reached
         
# podman [options] network rm -f IntTestfaf5dc0e55
IntTestfaf5dc0e55
# podman [options] network rm -f IntTestfaf5dc0e55

[FAILED] Expected
    <int>: 1
to match exit code:
    <int>: 0

@Luap99
Member

Luap99 commented Jun 8, 2023

No, we haven't made a netavark release with the new error message. In any case, my changes will not fix anything; they only change the error message. The only realistic reason for this error is that the pid file got deleted, which makes sense, as aardvark-dns removes the pidfile on exit.

The journal from #18810 looks similar, so this could be related.
We need to figure out why aardvark-dns exits on its own before netavark instructs it to do so via SIGHUP. Signals are asynchronous, so we are definitely open to race conditions. It is clear to me that we need a bidirectional channel between netavark and aardvark-dns. This would allow us to block container execution until aardvark-dns is ready, which should fix most DNS flakes. Even more importantly, we could return reasonable errors instead of hiding them in the journal (for example, "address in use" when failing to bind port 53).
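
As a rough illustration of the readiness handshake described above, the sketch below simulates it inside one process: a thread stands in for aardvark-dns, and a standard-library channel stands in for the bidirectional channel. Everything here is an assumption made for illustration (`spawn_server`, `start_and_wait`, the use of a thread instead of a separate process); it is not how netavark or aardvark-dns are implemented, and a real version would need a pipe or socket between the two processes.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the DNS server: bind first, then report the
/// outcome of the bind to the caller before doing any serving.
fn spawn_server(addr: SocketAddr) -> mpsc::Receiver<io::Result<SocketAddr>> {
    let (ready_tx, ready_rx) = mpsc::channel();
    thread::spawn(move || match UdpSocket::bind(addr) {
        Ok(socket) => {
            // Report readiness together with the address actually bound.
            let _ = ready_tx.send(socket.local_addr());
            // Placeholder for a serve loop: wait briefly for one packet, then exit.
            let _ = socket.set_read_timeout(Some(Duration::from_millis(200)));
            let mut buf = [0u8; 512];
            let _ = socket.recv_from(&mut buf);
        }
        Err(err) => {
            // Hand the bind error (for example "address in use") back to the caller
            // instead of only logging it.
            let _ = ready_tx.send(Err(err));
        }
    });
    ready_rx
}

/// Start the server and block until it reports ready, reports an error,
/// or the timeout expires.
fn start_and_wait(addr: SocketAddr, timeout: Duration) -> io::Result<SocketAddr> {
    match spawn_server(addr).recv_timeout(timeout) {
        Ok(result) => result,
        Err(_) => Err(io::Error::new(
            io::ErrorKind::TimedOut,
            "server did not report readiness",
        )),
    }
}
```

The point of the design is that the caller only proceeds (for example, starts the container) after `start_and_wait` returns `Ok`, and a bind failure surfaces as a normal error value rather than a journal entry.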

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

Not seen since June 7.

@edsantiago edsantiago closed this as not planned Won't fix, can't repro, duplicate, stale Aug 28, 2023
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Nov 27, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 27, 2023