-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
container stop: handle race #18457
container stop: handle race #18457
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: vrothberg The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@Luap99 WDYT? |
There is an inherent race when stopping/killing a container with other processes attempting to do the same and also with the container having exited in the meantime. In those cases, the OCI runtime may fail to kill the container as it has already exited. Handle those races by first checking if the container state has changed before returning the error. [NO NEW TESTS NEEDED] - as it's a hard to test race. Fixes: containers#18452 Signed-off-by: Valentin Rothberg <[email protected]>
sounds good
I cherry pick this into #18442 which should give us much more insight. |
I still see open pidfd flakes in https://api.cirrus-ci.com/v1/artifact/task/6578178697723904/html/int-podman-fedora-38-root-container-boltdb.log.html so this doe snot seem to help/change much |
There's probably another call path. I'll take another look. |
@Luap99 could you try the following patch in your PR? diff --git a/libpod/container_internal.go b/libpod/container_internal.go
index cf4fa1e0cb8b..f03016322042 100644
--- a/libpod/container_internal.go
+++ b/libpod/container_internal.go
@@ -1358,6 +1358,7 @@ func (c *Container) stop(timeout uint) error {
// If the container has already been removed (e.g., via
// the cleanup process), set the container state to "stopped".
c.state.State = define.ContainerStateStopped
+ logrus.Errorf("FIX ME HERE, PLEASE! %v", stopErr)
return stopErr
} My suspicion is that we need to tweak the |
yes, pushed |
I couldn't find |
That is surprising. |
@Luap99 how's it going? News on the bug? |
I don't know, this doesn't seem to make any difference. Neither did my cleanup patch. You can just grep for It is a pretty bad flake, so far no idea how to reproduce though. |
Also keep in mind that this is a crun error log. It doesn't seem to cause podman to exit >0 so it must be some some code path were podman ignores oci runtime errors. |
There is an inherent race when stopping/killing a container with other processes attempting to do the same and also with the container having exited in the meantime. In those cases, the OCI runtime may fail to kill the container as it has already exited.
Handle those races by first checking if the container state has changed before returning the error.
[NO NEW TESTS NEEDED] - as it's a hard to test race.
Fixes: #18452
Does this PR introduce a user-facing change?