podman stop: Unable to clean up network: netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied #22103

Closed
edsantiago opened this issue Mar 20, 2024 · 6 comments · Fixed by containers/netavark#956
Labels: flakes, In Progress, locked - please file new issue/PR, network

Comments

@edsantiago
Member

This is one of those nasty ones that hides in logs, making it impossible for me to get full data.

Best I can tell, the first instance was Feb 9, in rawhide rootless. Seen also in f39 root.

$ podman [options] stop --all -t 0
time="2024-03-20T12:25:48-05:00" level=error msg="Unable to clean up network for container SHA: \"1 error occurred:\\n\\t* netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\\n\\n\""

Incomplete list below. There are maybe 3-4 others; it is way too hard to get a complete list.

  • fedora-39 : int podman fedora-39 root host sqlite
    • 03-04 07:36 in TOP-LEVEL [AfterEach] Podman kube play with auto update annotations for first container only
  • fedora-39 : int podman fedora-39 rootless host sqlite
    • 03-20 13:38 in TOP-LEVEL [AfterEach] Podman kube play with image data
  • rawhide : int podman rawhide root host sqlite
    • 03-05 13:32 in TOP-LEVEL [AfterEach] Podman kube play no security context
int(3) podman(3) fedora-39(2) root(2) host(3) sqlite(3)
rawhide(1) rootless(1)
@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Mar 20, 2024
@Luap99
Member

Luap99 commented Mar 21, 2024

I don't get why it would fail with EACCES even as root.
These are the only two lines that could fail https://github.com/containers/netavark/blob/cc3f35d2e87defa2e12d0ffeb59a57035e8a5902/src/dns/aardvark.rs#L131-L132

And I really do not see why this would fail with anything other than ENOENT, which the code already ignores. I can see how EACCES might happen rootless, if the aardvark pid was already reused by another process that we have no privileges over, but as root that can never be the case.

@Luap99
Member

Luap99 commented Mar 21, 2024

OK, I guess we need to ignore more errors. I am using something like this to reproduce the logic easily:
while :; do sleep 10 & kill -HUP $! && ls -l /proc/$!/ns/net 2>&1 | tee /dev/stderr | grep -E "No such file or directory|net:" || break ; done
I wrongly assumed the only possible error was ENOENT; however, while running this several times I also saw ESRCH and, importantly, the EACCES reported here.

So at this point I wonder whether we should simply ignore all errors. This check is only a nice-to-have that makes us aware of an inconsistent aardvark-dns vs rootless-netns state: #20396.
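
The one-liner above can be unrolled into a small script that labels each outcome. This is only a sketch: the function names `classify_probe`, `probe_once`, and `run_probe` are made up for illustration, the message strings assume the C locale, and whether the race ever shows up depends on the kernel, the shell, and timing.

```shell
#!/bin/sh
# Hypothetical helper names; only the sleep/kill/ls sequence comes from the thread.

# Map one `ls -l /proc/PID/ns/net` result to a short label.
classify_probe() {
    case "$1" in
        *"net:["*)                     echo ok ;;
        *"No such file or directory"*) echo ENOENT ;;
        *"No such process"*)           echo ESRCH ;;
        *"Permission denied"*)         echo EACCES ;;
        *)                             echo other ;;
    esac
}

# Start a short-lived process, signal it, and immediately look at its netns link.
probe_once() {
    sleep 10 >/dev/null 2>&1 &
    probe_pid=$!
    kill -HUP "$probe_pid" 2>/dev/null || true
    LC_ALL=C ls -l "/proc/$probe_pid/ns/net" 2>&1 || true
}

# Repeat until something other than success or ENOENT shows up,
# or until max_tries (default 1000) is reached.
run_probe() {
    max_tries=${1:-1000}
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        result=$(classify_probe "$(probe_once)")
        case "$result" in
            ok|ENOENT) ;;
            *) echo "unexpected result after $i tries: $result"; return 0 ;;
        esac
        i=$((i + 1))
    done
    echo "no unexpected result in $max_tries tries"
}
```

Calling `run_probe` with no argument tries up to 1000 times. Unlike the original loop it is bounded, so it terminates even on a system where the race never triggers.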

@Luap99 Luap99 added the network Networking related issue or feature label Mar 21, 2024
@edsantiago
Member Author

ping

  • fedora-39 : int podman fedora-39 rootless host sqlite
    • 03-26 14:31 in TOP-LEVEL [AfterEach] Podman kube play test with reserved Label annotation in yaml
  • rawhide : int podman rawhide root host sqlite
    • 03-26 07:50 in Podman checkpoint podman checkpoint container with established tcp connections
  • rawhide : int podman rawhide rootless host sqlite
    • 03-26 10:03 in TOP-LEVEL [AfterEach] Podman kube play use network mode from config
int(3) podman(3) rawhide(2) rootless(2) host(3) sqlite(3)
fedora-39(1) root(1)

@Luap99 Luap99 self-assigned this Apr 3, 2024
@Luap99 Luap99 added the In Progress This issue is actively being worked by the assignee, please do not work on this at this time. label Apr 3, 2024
Luap99 added a commit to Luap99/netavark that referenced this issue Apr 3, 2024
Right now there is a race condition where we return errors even in
cases where they should be ignored. When we send SIGHUP to aardvark on
teardown it might exit when all containers are removed. This means the
check afterwards might read the netns path at a weird time while the
process is being removed from the kernel structures. I assumed the only
error can be ENOENT but I was wrong, in CI we also see EACCES and in my
reproducer I also saw ESRCH. Given the check is a nice-to-have, ignore
all errors there.

Fixes containers/podman#22103

Signed-off-by: Paul Holzinger <[email protected]>
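
The approach described in the commit message, treating the namespace check as best effort, might look roughly like the following. This is a hypothetical shell sketch, not the netavark change itself (netavark is written in Rust); the function name `check_netns_best_effort` and the comparison against `/proc/self/ns/net` are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch, not netavark's code.
# Compare the network namespace of a PID with our own. Any failure to read
# either link (ENOENT, ESRCH, EACCES, ...) means "cannot tell" and is ignored,
# because the target process may be exiting while we look at it.
check_netns_best_effort() {
    target_pid=$1
    self_ns=$(readlink /proc/self/ns/net 2>/dev/null) || return 0
    target_ns=$(readlink "/proc/$target_pid/ns/net" 2>/dev/null) || return 0
    if [ "$self_ns" != "$target_ns" ]; then
        echo "warning: pid $target_pid runs in a different netns" >&2
        return 1
    fi
    return 0
}
```

With this shape, only a link that was read successfully and differs from our own is reported; a process that vanished mid-check never turns teardown into an error.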
@Luap99
Member

Luap99 commented Apr 3, 2024

containers/netavark#956

@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Jul 10, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Jul 10, 2024
@edsantiago
Member Author

Looks like the same bug, except ENOENT instead of EACCES:

# podman [options] stop --all -t 0
[cid1]
Error: removing container [cid2] network: netavark: remove aardvark entries: failed to get aardvark pid: IO error: No such file or directory (os error 2)

In f40 root. File a new bug, or reopen this one?

@Luap99
Member

Luap99 commented Sep 6, 2024

I saw that earlier. We can reopen this, but on stop it works differently, and I very much fear that there is no way around these races until containers/aardvark-dns#338 is addressed.
