container stop: handle race #18457

Closed

vrothberg
Member

@vrothberg vrothberg commented May 4, 2023

There is an inherent race when stopping/killing a container with other processes attempting to do the same and also with the container having exited in the meantime. In those cases, the OCI runtime may fail to kill the container as it has already exited.

Handle those races by first checking if the container state has changed before returning the error.
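As a rough illustration of the idea (not the actual libpod change), the check could look like the sketch below. The names `State`, `killWithRaceCheck`, `refresh`, and `errNoSuchProcess` are all hypothetical and exist only for this example.

```go
package main

import "fmt"

// State is a hypothetical, simplified container state.
type State int

const (
	StateRunning State = iota
	StateStopped
	StateExited
)

// errNoSuchProcess stands in for a runtime failure such as
// "open pidfd: no such process"; it is not a real libpod error value.
var errNoSuchProcess = fmt.Errorf("open pidfd: no such process")

// killWithRaceCheck calls kill and, if it fails, re-reads the container
// state via refresh. If the container is no longer running, the kill
// error is assumed to come from a lost race and is dropped.
func killWithRaceCheck(kill func() error, refresh func() (State, error)) error {
	killErr := kill()
	if killErr == nil {
		return nil
	}
	state, err := refresh()
	if err != nil {
		// The state could not be determined, so keep the original error.
		return fmt.Errorf("%w (refreshing state also failed: %v)", killErr, err)
	}
	if state == StateStopped || state == StateExited {
		// Someone else stopped the container, or it exited on its own.
		return nil
	}
	return killErr
}

func main() {
	failingKill := func() error { return errNoSuchProcess }
	exited := func() (State, error) { return StateExited, nil }
	running := func() (State, error) { return StateRunning, nil }

	fmt.Println(killWithRaceCheck(failingKill, exited))
	fmt.Println(killWithRaceCheck(failingKill, running))
}
```

The error is only suppressed when the refreshed state shows the container is gone; a failed kill on a container that is still running is reported as before.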

[NO NEW TESTS NEEDED] - as it's a hard-to-test race.

Fixes: #18452

Does this PR introduce a user-facing change?

Improve error handling when stopping or killing a container.

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note labels May 4, 2023
@openshift-ci
Contributor

openshift-ci bot commented May 4, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 4, 2023
@vrothberg
Member Author

@Luap99 WDYT?

There is an inherent race when stopping/killing a container with other
processes attempting to do the same and also with the container having
exited in the meantime.  In those cases, the OCI runtime may fail to
kill the container as it has already exited.

Handle those races by first checking if the container state has changed
before returning the error.

[NO NEW TESTS NEEDED] - as it's a hard to test race.

Fixes: containers#18452
Signed-off-by: Valentin Rothberg <[email protected]>
@Luap99
Member

Luap99 commented May 4, 2023

sounds good

[NO NEW TESTS NEEDED] - as it's a hard to test race.

I cherry pick this into #18442 which should give us much more insight.

@Luap99
Member

Luap99 commented May 4, 2023

I still see open pidfd flakes in https://api.cirrus-ci.com/v1/artifact/task/6578178697723904/html/int-podman-fedora-38-root-container-boltdb.log.html so this does not seem to help/change much

@vrothberg
Member Author

I still see open pidfd flakes in https://api.cirrus-ci.com/v1/artifact/task/6578178697723904/html/int-podman-fedora-38-root-container-boltdb.log.html so this does not seem to help/change much

There's probably another call path. I'll take another look.

@vrothberg
Member Author

@Luap99 could you try the following patch in your PR?

diff --git a/libpod/container_internal.go b/libpod/container_internal.go
index cf4fa1e0cb8b..f03016322042 100644
--- a/libpod/container_internal.go
+++ b/libpod/container_internal.go
@@ -1358,6 +1358,7 @@ func (c *Container) stop(timeout uint) error {
                                // If the container has already been removed (e.g., via
                                // the cleanup process), set the container state to "stopped".
                                c.state.State = define.ContainerStateStopped
+                               logrus.Errorf("FIX ME HERE, PLEASE! %v", stopErr)
                                return stopErr
                        }

My suspicion is that we need to tweak the return stopErr but I first want to make sure it's the right place.

@Luap99
Member

Luap99 commented May 4, 2023

@Luap99 could you try the following patch in your PR?

yes, pushed

@Luap99
Member

Luap99 commented May 4, 2023

I couldn't find FIX ME HERE in the failed logs but I see the open pidfd one. I wonder if #18462 changes the behaviour, I will add this to my PR.

@vrothberg
Member Author

That is surprising.

@vrothberg
Member Author

@Luap99 how's it going? News on the bug?

@Luap99
Member

Luap99 commented May 5, 2023

I don't know, this doesn't seem to make any difference. Neither did my cleanup patch.

You can just grep for open pidfd in this PR's logs, e.g. https://api.cirrus-ci.com/v1/artifact/task/5572269305495552/html/int-podman-fedora-37-root-container-boltdb.log.html
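For anyone who wants to script that search over a saved copy of a log, a minimal Go sketch follows. The helper `matchingLines` is made up for this example, and the program assumes the log has been downloaded and is piped in on stdin.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// matchingLines returns every line of log that contains needle,
// prefixed with its 1-based line number.
func matchingLines(log, needle string) []string {
	var hits []string
	for i, line := range strings.Split(log, "\n") {
		if strings.Contains(line, needle) {
			hits = append(hits, fmt.Sprintf("%d: %s", i+1, line))
		}
	}
	return hits
}

func main() {
	// Read the whole saved log from stdin.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, hit := range matchingLines(string(data), "open pidfd") {
		fmt.Println(hit)
	}
}
```

Running it as `go run . < saved-log.html` prints the matching lines with their line numbers, which is equivalent to `grep -n "open pidfd"` on the same file.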

It is a pretty bad flake, so far no idea how to reproduce though.

@Luap99
Member

Luap99 commented May 5, 2023

Also keep in mind that this is a crun error log. It doesn't seem to cause podman to exit >0, so it must be some code path where podman ignores OCI runtime errors.
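To make that pattern concrete, here is a small hypothetical sketch of how a runtime's error line can show up in the logs while the caller still exits 0. `bestEffortKill` and `errRuntime` are invented for illustration and do not correspond to any specific podman function.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errRuntime stands in for the failure reported by the OCI runtime.
var errRuntime = errors.New("open pidfd: no such process")

// bestEffortKill invokes kill but deliberately drops its error, as a
// best-effort cleanup path might. Callers always see success.
func bestEffortKill(kill func() error) error {
	_ = kill() // runtime error intentionally ignored
	return nil
}

func main() {
	// The fake runtime prints its own error to stderr, as crun would,
	// and also returns it to the caller.
	fakeRuntime := func() error {
		fmt.Fprintln(os.Stderr, "Error:", errRuntime)
		return errRuntime
	}
	if err := bestEffortKill(fakeRuntime); err != nil {
		os.Exit(1)
	}
	// Reaching this point means the exit status is 0, even though an
	// error line was written to stderr.
}
```

If the real call site looks anything like this, the message in the logs would come from the runtime itself, which would also explain why an extra log statement in a different code path never appears.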

@vrothberg vrothberg closed this May 9, 2023
@vrothberg vrothberg deleted the fix-18452 branch May 9, 2023 09:11
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 24, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023
Development

Successfully merging this pull request may close these issues.

open pidfd: no such process
2 participants