Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

race(?) in podman run --rm then podman rm? #23640

Closed
edsantiago opened this issue Aug 15, 2024 · 4 comments · Fixed by #23646
Closed

race(?) in podman run --rm then podman rm? #23640

edsantiago opened this issue Aug 15, 2024 · 4 comments · Fixed by #23646
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@edsantiago
Copy link
Member

This is NOT in a parallel test:

[+0821s] not ok 376 [800] podman CONTAINERS_CONF - override runtime name in 2441ms
         # (from function `bail-now' in file test/system/[helpers.bash, line 192](https://github.com/containers/podman/blob/39893e2725cdd5a71a386c19de203197ec82ea01/test/system/helpers.bash#L192),
         #  from function `die' in file test/system/[helpers.bash, line 944](https://github.com/containers/podman/blob/39893e2725cdd5a71a386c19de203197ec82ea01/test/system/helpers.bash#L944),
         #  from function `run_podman' in file test/system/[helpers.bash, line 546](https://github.com/containers/podman/blob/39893e2725cdd5a71a386c19de203197ec82ea01/test/system/helpers.bash#L546),
         #  in test file test/system/[800-config.bats, line 82](https://github.com/containers/podman/blob/39893e2725cdd5a71a386c19de203197ec82ea01/test/system/800-config.bats#L82))
         #   `CONTAINERS_CONF="$conf_tmp" run_podman 1 rm "$cid"' failed
         #
<+     > # # podman info --format {{ .Host.OCIRuntime.Path }}
<+590ms> # /usr/bin/crun
         #
<+010ms> # # podman info --format {{ .Host.DatabaseBackend }}
<+590ms> # sqlite
         #
<+013ms> # # podman run -d --rm quay.io/libpod/testimage:20240123 true
<+245ms> # 6a7e6cd12a3246d01ff62d6ddb199052ba5b9f288b4308bee8ea82c9754bb828
         #
<+010ms> # # podman wait 6a7e6cd12a3246d01ff62d6ddb199052ba5b9f288b4308bee8ea82c9754bb828
<+054ms> # 0
         #
<+010ms> # # podman rm 6a7e6cd12a3246d01ff62d6ddb199052ba5b9f288b4308bee8ea82c9754bb828
<+314ms> # 6a7e6cd12a3246d01ff62d6ddb199052ba5b9f288b4308bee8ea82c9754bb828
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: exit code is 0; expected 1
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

debian root. Technically it is in #23275, my parallel test, but this happened in the FIRST Bats run, the serial one.

The comments in this part of the code read:

# The --rm option means the container should no longer exist.
# However https://github.com/containers/podman/issues/12917 meant
# that the container cleanup triggered by conmon's --exit-cmd
# could fail, leaving the container in place.
#
# We verify that the container is indeed gone, by checking that a
# podman rm *fails* here - and it has the side effect of cleaning
# up in the case this test fails.
CONTAINERS_CONF="$conf_tmp" run_podman 1 rm "$cid"
is "$output" "Error:.*no such container"

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Aug 15, 2024
@Luap99
Copy link
Member

Luap99 commented Aug 16, 2024

This already failed on #23601, I thought I fixed it there, see the conversation on the PR.

In general wait waits for stopping not removal... so maybe we should just fix the broken assumption instead of trying to make it work...

@Luap99
Copy link
Member

Luap99 commented Aug 16, 2024

But I don't see how this can happen though. wait waits for the conmon pid to be gone, but conmon will wait for the podman container cleanup process to finish first (and that process deletes the ctr). So in theory I do not see how wait can return before the cleanup process finished.

@Luap99
Copy link
Member

Luap99 commented Aug 16, 2024

Took me an hour but I got it to fail locally...

800-config.bats
 ✗ [800] podman CONTAINERS_CONF - override runtime name [678]
   (from function `bail-now' in file test/system/helpers.bash, line 184,
    from function `die' in file test/system/helpers.bash, line 932,
    from function `run_podman' in file test/system/helpers.bash, line 534,
    in test file test/system/800-config.bats, line 82)
     `CONTAINERS_CONF="$conf_tmp" run_podman 1 rm "$cid"' failed
   
   [11:56:54.535097362] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman info --format {{ .Host.OCIRuntime.Path }}
   [11:56:54.638004684] /usr/bin/crun
   
   [11:56:54.644345703] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman info --format {{ .Host.DatabaseBackend }}
   [11:56:54.741087640] sqlite
   
   [11:56:54.747929515] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman run -d --rm quay.io/libpod/testimage:20240123 true
   [11:56:54.848971681] bcd6c75a32d79c12d135f6b36e4f1d01731356aef900736f344e1abaedd67450
   
   [11:56:54.855732493] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman wait bcd6c75a32d79c12d135f6b36e4f1d01731356aef900736f344e1abaedd67450
   [11:56:54.902993461] 0
   
   [11:56:54.909089002] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman rm bcd6c75a32d79c12d135f6b36e4f1d01731356aef900736f344e1abaedd67450
   [11:56:54.975835038] bcd6c75a32d79c12d135f6b36e4f1d01731356aef900736f344e1abaedd67450
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #| FAIL: exit code is 0; expected 1
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   # [teardown]

1 test, 1 failure in 1 seconds

@Luap99
Copy link
Member

Luap99 commented Aug 16, 2024

Ok using my ebpf script I found the issue and there is just no sane way for podman wait to wait until ctr is removed as it only wait for the ctr to be stopped and the status is already updated so wait exits right away but cleanup wasn't done removing as it lock/unlock sync twice and if waits get the lock ofet the first sync in cleanup we fail here.

So I say the test assumption is wrong. What we could do is make something like podman wait --condition removing work.

@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Nov 15, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants