container stop: handle race #18457

vrothberg · 2023-05-04T09:03:42Z

There is an inherent race when stopping or killing a container: other processes may be attempting to do the same, and the container may have exited in the meantime. In those cases, the OCI runtime may fail to kill the container because it has already exited.

Handle those races by first checking if the container state has changed before returning the error.
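As a rough illustration of the approach described above (not the actual libpod change), here is a minimal Go sketch. Everything in it is hypothetical and simplified for illustration: the `fakeContainer` type, its `runtimeKill` and `syncState` methods, and the `State` constants are stand-ins, and the error text is only borrowed from the linked issue title.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical, simplified container state.
type State int

const (
	StateRunning State = iota
	StateStopped
	StateExited
)

// errNoSuchProcess imitates the kind of error a runtime may report
// when the container process is already gone.
var errNoSuchProcess = errors.New("open pidfd: no such process")

// fakeContainer is an illustrative stand-in, not a libpod type.
type fakeContainer struct {
	state          State
	killErr        error // error the simulated runtime returns on kill
	stateAfterSync State // state the simulated runtime reports on re-check
}

// runtimeKill simulates asking the OCI runtime to kill the container.
func (c *fakeContainer) runtimeKill() error {
	if c.killErr != nil {
		return c.killErr
	}
	c.state = StateStopped
	return nil
}

// syncState simulates re-reading the container state from the runtime.
func (c *fakeContainer) syncState() {
	c.state = c.stateAfterSync
}

// stop tries to kill the container. If the kill fails, it re-checks the
// state first and only returns the error if the container is still alive.
func (c *fakeContainer) stop() error {
	killErr := c.runtimeKill()
	if killErr == nil {
		return nil
	}
	c.syncState()
	if c.state == StateStopped || c.state == StateExited {
		// Lost the race: someone else stopped it, or it exited on its own.
		return nil
	}
	return fmt.Errorf("stopping container: %w", killErr)
}

func main() {
	raced := &fakeContainer{state: StateRunning, killErr: errNoSuchProcess, stateAfterSync: StateExited}
	fmt.Println("raced container:", raced.stop())

	stuck := &fakeContainer{state: StateRunning, killErr: errNoSuchProcess, stateAfterSync: StateRunning}
	fmt.Println("still-running container:", stuck.stop())
}
```

The point of the sketch is the ordering: the kill error is kept aside, the state is refreshed, and the error is surfaced only when the refreshed state shows the container is still running.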

[NO NEW TESTS NEEDED] - as it's a hard-to-test race.

Fixes: #18452

Does this PR introduce a user-facing change?

Improve error handling when stopping or killing a container.

openshift-ci · 2023-05-04T09:03:49Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [vrothberg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vrothberg · 2023-05-04T09:04:02Z

@Luap99 WDYT?

There is an inherent race when stopping/killing a container with other processes attempting to do the same and also with the container having exited in the meantime. In those cases, the OCI runtime may fail to kill the container as it has already exited.

Handle those races by first checking if the container state has changed before returning the error.

[NO NEW TESTS NEEDED] - as it's a hard to test race.

Fixes: containers#18452

Signed-off-by: Valentin Rothberg <[email protected]>

Luap99 · 2023-05-04T09:09:10Z

sounds good

[NO NEW TESTS NEEDED] - as it's a hard to test race.

I'll cherry-pick this into #18442, which should give us much more insight.

Luap99 · 2023-05-04T10:18:49Z

I still see open pidfd flakes in https://api.cirrus-ci.com/v1/artifact/task/6578178697723904/html/int-podman-fedora-38-root-container-boltdb.log.html so this does not seem to help/change much

vrothberg · 2023-05-04T10:57:41Z

I still see open pidfd flakes in https://api.cirrus-ci.com/v1/artifact/task/6578178697723904/html/int-podman-fedora-38-root-container-boltdb.log.html so this does not seem to help/change much

There's probably another call path. I'll take another look.

vrothberg · 2023-05-04T11:17:38Z

@Luap99 could you try the following patch in your PR?

diff --git a/libpod/container_internal.go b/libpod/container_internal.go
index cf4fa1e0cb8b..f03016322042 100644
--- a/libpod/container_internal.go
+++ b/libpod/container_internal.go
@@ -1358,6 +1358,7 @@ func (c *Container) stop(timeout uint) error {
                                // If the container has already been removed (e.g., via
                                // the cleanup process), set the container state to "stopped".
                                c.state.State = define.ContainerStateStopped
+                               logrus.Errorf("FIX ME HERE, PLEASE! %v", stopErr)
                                return stopErr
                        }

My suspicion is that we need to tweak the return stopErr but I first want to make sure it's the right place.

Luap99 · 2023-05-04T11:22:32Z

@Luap99 could you try the following patch in your PR?

yes, pushed

Luap99 · 2023-05-04T13:10:12Z

I couldn't find FIX ME HERE in the failed logs but I see the open pidfd one. I wonder if #18462 changes the behaviour, I will add this to my PR.

vrothberg · 2023-05-04T13:16:28Z

That is surprising.

vrothberg · 2023-05-05T07:44:07Z

@Luap99 how's it going? News on the bug?

Luap99 · 2023-05-05T10:05:22Z

I don't know, this doesn't seem to make any difference. Neither did my cleanup patch.

You can just grep for open pidfd in this PR's logs, e.g. https://api.cirrus-ci.com/v1/artifact/task/5572269305495552/html/int-podman-fedora-37-root-container-boltdb.log.html
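For anyone who prefers to script the log check rather than grep by hand, a minimal Go sketch could look like the following. The helper names `countMatches` and `countMatchesInString` are made up for this example, and the sample log lines in `main` are invented placeholders, not real CI output.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countMatches returns the number of lines read from r that contain needle.
func countMatches(r io.Reader, needle string) (int, error) {
	scanner := bufio.NewScanner(r)
	// Allow lines up to 1 MiB, since HTML-rendered logs can have long lines.
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	count := 0
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), needle) {
			count++
		}
	}
	return count, scanner.Err()
}

// countMatchesInString is a convenience wrapper for in-memory logs.
func countMatchesInString(log, needle string) (int, error) {
	return countMatches(strings.NewReader(log), needle)
}

func main() {
	// Invented sample lines, only to exercise the helper.
	sample := "starting test\nopen pidfd: no such process\ntest passed\nopen pidfd: no such process\n"
	n, err := countMatchesInString(sample, "open pidfd")
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Println("lines mentioning open pidfd:", n)
}
```

To check a downloaded log, pass an `*os.File` opened on the saved file to `countMatches` instead of the in-memory string.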

It is a pretty bad flake, so far no idea how to reproduce though.

Luap99 · 2023-05-05T10:06:48Z

Also keep in mind that this is a crun error log. It doesn't seem to cause podman to exit >0, so it must be some code path where podman ignores OCI runtime errors.

openshift-ci bot added the do-not-merge/work-in-progress and release-note labels May 4, 2023

openshift-ci bot added the approved label May 4, 2023

vrothberg mentioned this pull request May 4, 2023

open pidfd: no such process #18452

Closed

vrothberg force-pushed the fix-18452 branch from 0e5eaa9 to 8d0050a Compare May 4, 2023 09:05

vrothberg closed this May 9, 2023

vrothberg deleted the fix-18452 branch May 9, 2023 09:11

github-actions bot added the locked - please file new issue/PR label Aug 24, 2023

github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023
