
ci: unable to find network with name or ID podman-default-kube-network #17946

Closed · edsantiago opened this issue Mar 27, 2023 · 10 comments · Fixed by #18085 or #18281
Labels: flakes (Flakes from Continuous Integration), kind/bug (Categorizes issue or PR as related to a bug), locked - please file new issue/PR

Comments

@edsantiago (Member)

In e2e tests:

podman play kube --no-host
...
podman [options] play kube --no-hosts /tmp/podman_test3817601232/kube.yaml
...
starting container <sha>: unable to find network with name or ID podman-default-kube-network: network not found
starting container <sha>: a dependency of container <sha> failed to start: container state improper
Error: failed to start 2 containers

Probably a collision between multiple tests. Predicted solution: rewrite the tests to stop using the default network, or at least ensure that at most one test uses it.

edsantiago added the flakes label Mar 27, 2023
@edsantiago (Member Author)

...but then again, there's this flake:

  podman network create with name and IPv6 flag (dual-stack)
...
# podman [options] run -it --rm --network dual-36384dcdcad634f5feb9a53eadb5e202d18e479679d0d891d61b6cf7340b1a56 quay.io/libpod/alpine:latest sh -c ip addr show eth0 |  grep global | awk ' /inet6 / {print $2}'
time="2023-03-24T18:57:46-05:00" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
fd00:4:3:2::2/64

# podman [options] run -it --rm --network dual-36384dcdcad634f5feb9a53eadb5e202d18e479679d0d891d61b6cf7340b1a56 quay.io/libpod/alpine:latest sh -c ip addr show eth0 |  awk ' /inet / {print $2}'
time="2023-03-24T18:57:47-05:00" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
Error: unable to find network with name or ID dual-36384dcdcad634f5feb9a53eadb5e202d18e479679d0d891d61b6cf7340b1a56: network not found

The string "3638" does not appear anywhere else in this log. And it's generated via stringid.GenerateRandomID(), hence is unlikely to be a collision. I'm wondering if some test is doing podman network rm -a (could not find that in test dir), or maybe the system reset test is not being properly locked?

@Luap99 (Member) commented Mar 27, 2023

There is no --all flag for podman network rm, but yes, the system reset or prune commands can cause issues.
In the past I fixed these tests to use their own custom network config dir, but I guess there are still some without that fix.
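A rough sketch of that kind of fix: --network-config-dir is a real podman global option, while the wiring around it below is illustrative, not the actual test helpers.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Give this "test" its own network config dir so a prune/reset
	// running in a parallel test cannot delete its network definitions.
	confDir, err := os.MkdirTemp("", "netconf")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(confDir)

	out, err := exec.Command("podman",
		"--network-config-dir", confDir,
		"network", "create", "isolated-test-net").CombinedOutput()
	fmt.Printf("%s(err=%v)\n", out, err)
}
```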

@edsantiago (Member Author)

Yet another possibly similar failure:

  podman verify network scoped DNS server and also verify updating network dns server
...
# podman-remote [options] network update IntTestf438e02a77 --dns-add 7.7.7.7
Error: unable to find network with name or ID IntTestf438e02a77: network not found
# podman-remote [options] network rm -f IntTestf438e02a77
time="2023-03-27T17:05:55-05:00" level=error msg="IPAM error: could not find network \"IntTestf438e02a77\""
time="2023-03-27T17:05:55-05:00" level=error msg="Unable to clean up network for
      container 528ec6ba9f4e0d266b138bb10579250e83dd9d7fa5ce6fbc77e8bee3ce367d7d:
      \"tearing down network namespace configuration for
      container 528ec6ba9f4e0d266b138bb10579250e83dd9d7fa5ce6fbc77e8bee3ce367d7d: 
     failed to convert net opts: unable to find network with name or ID IntTestf438e02a77: network not found\""

edsantiago added the kind/bug label Mar 28, 2023
@Luap99 (Member) commented Mar 29, 2023

I think it is time to go with the big hammer and make every test case use its own network config dir, just like --root and --runroot.
This will make sure there are no conflicts, and in theory we can remove this stupid extra defer podmanTest.removeNetwork(...) that we have to use in every single test.
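A hedged sketch of what that big hammer could look like in the Ginkgo suite; podmanTest.removeNetwork is quoted from the comment above, while the setup names here are illustrative only:

```go
package integration

import (
	"os"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// netConfDir would be threaded into every podman invocation via
// --network-config-dir, mirroring the existing per-test --root/--runroot
// isolation. Illustrative sketch, not the actual suite API.
var netConfDir string

var _ = BeforeEach(func() {
	var err error
	netConfDir, err = os.MkdirTemp("", "netconf")
	Expect(err).ToNot(HaveOccurred())
})

var _ = AfterEach(func() {
	// With a per-test dir there is nothing to unregister by hand, so the
	// per-test `defer podmanTest.removeNetwork(...)` becomes unnecessary.
	Expect(os.RemoveAll(netConfDir)).To(Succeed())
})
```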

@vrothberg (Member)

That sounds very reasonable, @Luap99.

Luap99 self-assigned this Mar 29, 2023
Luap99 added a commit to Luap99/libpod that referenced this issue Mar 30, 2023
The e2e tests are isolated and have their own --root/--runroot arguments.
However, networks were always shared; this causes problems with tests that
do a prune or reset, because they can affect other parallel running
tests.

Over time I fixed some of these cases to use their own config dir,
but containers#17946 suggests that this is not enough. Instead of trying to find
and fix these tests, just go with the big hammer and make every test use
a new clean network config directory.

This will also make the use of `defer podmanTest.removeNetwork(...)`
unnecessary. It is required at the moment for every test which creates
a network. However, to keep the diff small and to see if it is even
working, I will do it later in a follow-up commit.

Fixes containers#17946

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99 (Member) commented Mar 30, 2023

Just linking #17975 (comment) here again: my change will not work, so we actually have to go through all tests which do prune or reset.

@edsantiago (Member Author)

Flakes in the past six days; I'm reporting them in case it's helpful to see which tests are failing, so you can at least target those:

Luap99 added a commit to Luap99/libpod that referenced this issue Apr 6, 2023
Since commit f250560 the play kube command uses its own network.
This is racy by design, because we create the network followed by
creating/running the pod/containers. In the meantime another
prune or reset process could wipe out the network config, because we have
to share the network config directory by design in the tests.

The problem is we only have one host netns, which is shared between
tests. If the network config dir is not shared we cannot make conflict
checks for interface names and IP addresses. This results in different
tests trying to use the same interface and/or IP address, which will
cause runtime failures in CNI and netavark.

The only solution I see is to make sure only the reset/prune tests are
using a custom network dir. This makes sure they do not wipe configs
that are otherwise required by other parallel running tests.

Fixes containers#17946

Signed-off-by: Paul Holzinger <[email protected]>
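To illustrate the inverted approach the commit describes, a small sketch: only the destructive test isolates itself. Here --network-config-dir and podman system reset --force are real podman options; the surrounding wiring is illustrative.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// The reset test runs against a private network config dir and
	// therefore cannot wipe configs that parallel tests still depend on.
	confDir, err := os.MkdirTemp("", "reset-netconf")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(confDir)

	out, err := exec.Command("podman",
		"--network-config-dir", confDir,
		"system", "reset", "--force").CombinedOutput()
	fmt.Printf("%s(err=%v)\n", out, err)
}
```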
edsantiago added a commit to edsantiago/libpod that referenced this issue Apr 20, 2023
...in "built using Dockerfile" test and "play kube fail with
custom selinux label" test. The latter, since it's in a test
file with lots of other kube tests, I just put into BeforeEach().

References: Issue containers#17946, PR containers#18085

Signed-off-by: Ed Santiago <[email protected]>
@edsantiago (Member Author)

Seen yesterday, in a fully-rebased PR, f36 root. Reopening.

edsantiago reopened this Apr 20, 2023
@Luap99 (Member) commented Apr 20, 2023

I found two prune tests which were missing the custom network dir.

@Luap99 (Member) commented Apr 20, 2023

podman system prune --volume is logged directly after the failing test; it deletes the config, which explains the flake. I'll create a PR.

Luap99 added a commit to Luap99/libpod that referenced this issue Apr 20, 2023
Adds two custom config dirs to tests that were missed in
commit dc9a65e.

Fixes containers#17946 (hopefully finally)

Signed-off-by: Paul Holzinger <[email protected]>
github-actions bot added the locked - please file new issue/PR label Aug 26, 2023
github-actions bot locked as resolved and limited conversation to collaborators Aug 26, 2023