upgrade from v3.1.2: cni plugin bridge failed: failed to allocate #13679

Closed
edsantiago opened this issue Mar 28, 2022 · 5 comments · Fixed by #13692
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@edsantiago
Member

This one isn't new -- the first instance I see is from January -- but it's starting to happen daily

[+0033s] not ok 10 network - restart
         # (from function `die' in file test/upgrade/../system/[helpers.bash, line 500](https://github.com/containers/podman/blob/640c2d53a88f46e997d4e5a594cfc85a57e74d36/test/system/helpers.bash#L500),
         #  from function `run_podman' in file test/upgrade/../system/[helpers.bash, line 219](https://github.com/containers/podman/blob/640c2d53a88f46e997d4e5a594cfc85a57e74d36/test/system/helpers.bash#L219),
         #  in test file test/upgrade/[test-upgrade.bats, line 254](https://github.com/containers/podman/blob/640c2d53a88f46e997d4e5a594cfc85a57e74d36/test/system/test-upgrade.bats#L254))
         #   `run_podman start myrunningcontainer' failed with status 125
         # # podman stop -t0 myrunningcontainer
         # myrunningcontainer
         # # podman start myrunningcontainer
         # Error: unable to start container "c1e9132d70f2879283fe46d4e9aa5e4a0740766c3c2d38c3b61f38a2db7c13a6": plugin type="bridge" failed (add): cni plugin bridge failed: failed to allocate for range 0: 10.89.0.2 has been allocated to c1e9132d70f2879283fe46d4e9aa5e4a0740766c3c2d38c3b61f38a2db7c13a6, duplicate allocation is not allowed
         # [ rc=125 (** EXPECTED 0 **) ]

[Upgrade] 12 exec

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Mar 28, 2022
@Luap99
Member

Luap99 commented Mar 28, 2022

Yeah I think there was a reason why I added podman start && podman stop instead of restart.
I will take a look.

@Luap99
Member

Luap99 commented Mar 28, 2022

Good news: I can reproduce it. Bad news: the upgrade test looks terribly broken (it hangs) when your host uses netavark, because the old podman will obviously still use cni.

@Luap99
Member

Luap99 commented Mar 29, 2022

I think I understand the root cause now.

Podman4 uses a new db structure for networks. Podman3 cannot read it.
To understand how it flakes, we need to look at how podman container cleanup works. Running podman stop always causes a race between the podman container cleanup process and podman stop: both processes try to clean up the mounts/networks, but only one can do it (locked operation). Normally this is not a problem, since both processes use the same version. In this particular test setup, however, the cleanup process is spawned with the old podman version.
On a slow system such as CI, the stop process usually wins and network cleanup works. On a fast system with many cores, however, the cleanup process wins, so it fails 100% of the time for me locally.
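
For readers following along, the race described above can be modeled with a toy script. This is only a sketch and does not involve podman at all: the lock directory, the `ip-allocated` marker file, and the `contend` helper are hypothetical stand-ins for the container lock, the leased IP address, and the two competing processes.

```shell
#!/usr/bin/env bash
# Toy model only: no podman involved. All names here are hypothetical.
state=$(mktemp -d)
touch "$state/ip-allocated"          # the container's IP is currently leased

# Each contender tries to take the lock; only the winner runs the teardown.
# $1 = label, $2 = "yes" if this version can read the network db format.
contend() {
    if mkdir "$state/lock" 2>/dev/null; then   # mkdir is atomic: one winner
        echo "$1" > "$state/winner"
        if [ "$2" = yes ]; then
            rm -f "$state/ip-allocated"        # teardown releases the IP
        fi
    fi
    # The loser finds the lock taken and skips the teardown.
}

contend "podman stop (v4)" yes &
contend "podman container cleanup (v3)" no &
wait

echo "winner: $(cat "$state/winner")"
if [ -e "$state/ip-allocated" ]; then
    echo "IP still allocated: the next start would fail"
else
    echo "IP released: the next start would succeed"
fi
```

Which contender wins varies from run to run, which is the point: the outcome depends on scheduling, not on anything the test controls.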

One fix is to remove the network connect/disconnect test before the stop/start, since it causes a v4 network db migration, but I do not want that because the test should cover connect/disconnect. This would also explain why the upgrade tests from 2.X are not flaking: we skip network connect/disconnect there, since it was only added in 3.0.

@mheon Any ideas if we could influence the behaviour between stop and cleanup so that stop would always win?

If that is not possible, we can manually do a network teardown with network disconnect before stopping; this should also work.
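
A minimal sketch of that manual teardown, assuming the container is attached to a named network. The helper name `stop_with_manual_teardown` and the overridable `PODMAN` variable are illustrative, not part of the test suite, and the network name is whatever the caller passes in. The thread does not say whether the network has to be reconnected before the next start, so that step is left out.

```shell
#!/usr/bin/env bash
# Sketch only. PODMAN is overridable so the sequence can be dry-run with
# PODMAN=echo; network and container names are supplied by the caller.
PODMAN=${PODMAN:-podman}

# Detach the container from its network before stopping it, so the network
# teardown does not depend on which process wins the stop/cleanup race.
stop_with_manual_teardown() {
    local network=$1 container=$2
    "$PODMAN" network disconnect "$network" "$container" &&
        "$PODMAN" stop -t0 "$container"
}

# Example (hypothetical network name "mynetwork"):
#   stop_with_manual_teardown mynetwork myrunningcontainer
```

Setting `PODMAN=echo` before calling the function prints the two commands instead of running them, which is a cheap way to check the order.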

@mheon
Member

mheon commented Mar 29, 2022

Usually podman stop will always win because it has a higher priority (assuming it is run keyboard-interactive). However, if run from a script, we lose that.

Maybe we can deliberately nice our podman cleanup processes, to try and guarantee other Podman processes get CPU time first? It's not a guarantee but it's better than nothing.
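
As a generic illustration of what "nicing" a process means, here is a small sketch. It is not how podman spawns its cleanup process; `run_deprioritized` is a hypothetical helper, and the example relies on `nice` printing the current niceness when called without arguments, as GNU coreutils does.

```shell
#!/usr/bin/env bash
# Generic illustration of niceness, not podman's actual spawn path.
# run_deprioritized is a hypothetical helper name.
run_deprioritized() {
    # nice -n adds to the current niceness; a higher value means the
    # scheduler favors other processes when CPU time is contended.
    nice -n 10 "$@"
}

# With no arguments, nice prints the niceness it is running at.
echo "current niceness: $(nice)"
echo "child niceness:   $(run_deprioritized nice)"
```

As noted above, this only biases the scheduler; it does not guarantee an ordering between two processes.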

Luap99 added a commit to Luap99/libpod that referenced this issue Mar 29, 2022
With podman4 we support netavark, however old versions will still use
cni. Since netavark and cni can conflict we should not mix them.
Remove the network setup from the initial podman command and create the
directories manually to prevent such conflicts.

Also the update to 4.0 changes the network db structure. While it is
compatible from 3.X to 4.0 it will fail the other way around. In this
test it will happen because the cleanup process still uses the old
podman while the network connect/disconnect test already changed the db
format. Therefore the cleanup process cannot see any networks and will
not tear it down. The following start will fail because the ip address
is already assigned.

Fixes containers#13679

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99
Member

Luap99 commented Mar 29, 2022

> Usually podman stop will always win because it has a higher priority (assuming it is run keyboard-interactive). However, if run from a script, we lose that.

I cannot confirm this on my laptop: if I run podman --log-level debug stop interactively, it always shows Network is already cleaned up, skipping...

Anyway, I fixed it in the test; I don't think this is a real-world problem.

mheon pushed a commit to mheon/libpod that referenced this issue Mar 30, 2022
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 20, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023