container stop: release lock before calling the runtime #8906

Merged 1 commit into containers:master from the fix-8501 branch on Jan 14, 2021

Conversation

vrothberg
Member

Podman defers stopping the container to the runtime, which can take some
time. Keeping the lock while waiting for the runtime to complete the
stop procedure prevents other commands from acquiring the lock, as shown
in #8501.

To improve the user experience, release the lock before invoking the
runtime, and re-acquire the lock when the runtime is finished. Also
introduce an intermediate "stopping" state to properly distinguish
containers that are being stopped from "stopped" containers.

Fixes: #8501
Signed-off-by: Valentin Rothberg [email protected]
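
The locking pattern described above can be illustrated with a minimal Go sketch. This is not the libpod implementation: the `Container` type, the `State` constants, and the `Stop` and `CurrentState` methods are hypothetical, and an in-process `sync.Mutex` stands in for Podman's container lock. The rollback to "running" on failure is likewise an assumption of the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// State is a simplified, hypothetical container state.
type State int

const (
	StateRunning State = iota
	StateStopping
	StateStopped
)

// Container is a hypothetical stand-in for a lock-guarded container object.
type Container struct {
	mu    sync.Mutex
	state State
}

// Stop marks the container as stopping, releases the lock while the
// (potentially slow) runtime call runs, and re-acquires it afterwards.
func (c *Container) Stop(runtimeStop func() error) error {
	c.mu.Lock()
	if c.state != StateRunning {
		c.mu.Unlock()
		return errors.New("container is not running")
	}
	// Record the intermediate state so other commands can see a stop is in flight.
	c.state = StateStopping
	c.mu.Unlock()

	// The runtime call happens without the lock held.
	err := runtimeStop()

	c.mu.Lock()
	defer c.mu.Unlock()
	if err != nil {
		c.state = StateRunning // assumption of this sketch: roll back on failure
		return err
	}
	c.state = StateStopped
	return nil
}

// CurrentState takes the lock briefly, as a status query would.
func (c *Container) CurrentState() State {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.state
}

func main() {
	c := &Container{state: StateRunning}
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan error, 1)

	go func() {
		done <- c.Stop(func() error {
			close(started) // the runtime call has begun
			<-release      // simulate a slow runtime
			return nil
		})
	}()

	<-started
	// The query does not block even though the stop is still in progress.
	fmt.Println(c.CurrentState() == StateStopping) // true
	close(release)
	fmt.Println(<-done, c.CurrentState() == StateStopped) // <nil> true
}
```

The point of the intermediate state is that a concurrent reader never has to guess: while the lock is released, the container is neither "running" nor "stopped".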

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2021
@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 7, 2021
@edsantiago
Member

I don't think you ran the BATS tests...

@vrothberg
Member Author

> I don't think you ran the BATS tests...

That's not very helpful. At the moment, the tests fail locally even on master on my machine. Not sure what's going on.

@edsantiago
Member

I'm sorry; I don't know how to help, because I don't understand the change. What first caught my eye was the stop -t 50 followed by ps: there's no way that can work, because the container is run with --rm (unless you changed stop to not block, and it doesn't look like you did). Then I pulled your PR, ran tests, and even podman stop basic fails. I can't really help you debug it, but the tests are deliberately written to be simple so that it should be trivial to reproduce the failure. And indeed, I can easily manually reproduce the basic test failure.

@mheon
Member

mheon commented Jan 7, 2021

@vrothberg One suggestion: I think that the container remove code may need a small change to ensure that it treats stopping containers properly, and they can still be removed - I'm concerned about podman stop being killed by the user mid-way, and the container becoming stuck in a strange state, unable to be removed.

@vrothberg
Member Author

Thanks, @edsantiago and @mheon! Stuck in meetings for the rest of the day.

The test is definitely wrong as is. I made some progress locally but will continue tomorrow morning.

@edsantiago
Member

It's more than the test. Look at the existing basic. The new stop is leaving containers in stopping state, with status 0 instead of 137. IOW, something is really badly broken.

@vrothberg
Member Author

> It's more than the test. Look at the existing basic. The new stop is leaving containers in stopping state, with status 0 instead of 137. IOW, something is really badly broken.

Yes, there's more work. Also kill behaved oddly in local tests.

@vrothberg vrothberg force-pushed the fix-8501 branch 2 times, most recently from 19f2892 to edd177f on January 11, 2021 11:56
libpod/container_internal.go (outdated, resolved)
@vrothberg vrothberg changed the title WIP - container stop: release lock before calling the runtime container stop: release lock before calling the runtime Jan 11, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 11, 2021
@vrothberg
Member Author

@baude @mheon @rhatdan this is looking good now :)

@@ -758,7 +758,7 @@ func (c *Container) isStopped() (bool, error) {
 		return true, err
 	}

-	return !c.ensureState(define.ContainerStateRunning, define.ContainerStatePaused), nil
+	return !c.ensureState(define.ContainerStateRunning, define.ContainerStatePaused, define.ContainerStateStopping), nil
Member

Note to self: in future we should require all of these ensureState invocations to not be inverted - will make them safer when we add states, as all valid states will have to be explicitly listed.

Member Author

+1

Member Author

Or add private functions for isRunning etc.
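
The concern raised in this review thread can be shown with a small Go sketch. It is illustrative only: the reduced `ContainerState` set and the helpers `isStoppedInverted`, `isStoppedExplicit`, and `isRunning` are hypothetical names, not libpod code. Only `ensureState` mirrors the name used in the diff, and its body here is an assumption.

```go
package main

import "fmt"

// ContainerState is a reduced, hypothetical set of container states.
type ContainerState int

const (
	StateCreated ContainerState = iota
	StateRunning
	StatePaused
	StateStopping
	StateStopped
	StateExited
)

// Container is a hypothetical stand-in holding only a state.
type Container struct{ state ContainerState }

// ensureState reports whether the container is in one of the given states.
func (c *Container) ensureState(states ...ContainerState) bool {
	for _, s := range states {
		if c.state == s {
			return true
		}
	}
	return false
}

// isStoppedInverted defines "stopped" as "not in a live state". Any newly
// added live state that is forgotten here is silently treated as stopped.
func (c *Container) isStoppedInverted() bool {
	return !c.ensureState(StateRunning, StatePaused)
}

// isStoppedExplicit lists every state that counts as stopped, so a newly
// added state is excluded until someone deliberately adds it.
func (c *Container) isStoppedExplicit() bool {
	return c.ensureState(StateCreated, StateStopped, StateExited)
}

// isRunning is the kind of small private helper suggested in the thread.
func (c *Container) isRunning() bool {
	return c.ensureState(StateRunning)
}

func main() {
	c := &Container{state: StateStopping}
	fmt.Println(c.isStoppedInverted()) // true: the new state slipped through
	fmt.Println(c.isStoppedExplicit()) // false
	fmt.Println(c.isRunning())         // false
}
```

With the inverted form, adding the "stopping" state requires touching every call site, which is what the one-line diff above does for `isStopped`.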

@mheon
Member

mheon commented Jan 14, 2021

LGTM

@rhatdan
Member

rhatdan commented Jan 14, 2021

/lgtm
/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 14, 2021
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 14, 2021
@vrothberg
Member Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 14, 2021
@openshift-merge-robot openshift-merge-robot merged commit a1b4974 into containers:master Jan 14, 2021
mheon added a commit to mheon/libpod that referenced this pull request May 26, 2021
After containers#8906, there is a potential race condition in container
removal of running containers with `--rm`. Running containers
must first be stopped, which was changed to unlock the container
to allow commands like `podman ps` to continue to run while
stopping; however, this also means that the cleanup process can
potentially run before we re-lock, and remove the container from
under us, resulting in error messages from `podman rm`. The end
result is unchanged, the container is still cleanly removed, but
the `podman rm` command will seem to have failed.

Work around this by pinging the database after we stop the
container to make sure it still exists. If it doesn't, our job is
done and we can exit cleanly.

Signed-off-by: Matthew Heon <[email protected]>
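
The workaround described in this commit message can be sketched in Go as follows. This is a simplified illustration under stated assumptions, not the libpod code: the in-memory `store`, the `errNoSuchCtr` sentinel, and `removeContainer` are hypothetical, and the sketch does not hold a container lock between the existence check and the removal as real code would.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNoSuchCtr is a hypothetical sentinel error for a missing container.
var errNoSuchCtr = errors.New("no such container")

// store is a hypothetical in-memory stand-in for the container database.
type store struct {
	mu   sync.Mutex
	ctrs map[string]bool
}

// exists reports whether the container is still recorded in the store.
func (s *store) exists(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.ctrs[id]
}

// remove deletes the container or returns errNoSuchCtr if it is gone.
func (s *store) remove(id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.ctrs[id] {
		return errNoSuchCtr
	}
	delete(s.ctrs, id)
	return nil
}

// removeContainer stops a container and then removes it. Because stopping
// may release the container lock, a concurrent cleanup (for example one
// triggered by --rm) can remove the container before we get back here.
func removeContainer(s *store, id string, stop func() error) error {
	if !s.exists(id) {
		return errNoSuchCtr
	}
	if err := stop(); err != nil {
		return err
	}
	// Re-check after stopping: if the container is already gone, the job
	// is done and there is nothing to report as an error.
	if !s.exists(id) {
		return nil
	}
	return s.remove(id)
}

func main() {
	s := &store{ctrs: map[string]bool{"ctr1": true}}
	// Simulate the cleanup process removing the container during the stop.
	err := removeContainer(s, "ctr1", func() error { return s.remove("ctr1") })
	fmt.Println(err) // <nil>

	// A container that never existed is still reported as an error.
	err = removeContainer(s, "ctr2", func() error { return nil })
	fmt.Println(errors.Is(err, errNoSuchCtr)) // true
}
```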
sstosh added a commit to sstosh/podman that referenced this pull request Jun 24, 2022
This bug was introduced in containers#8906.

When a command such as 'podman rm', 'podman restart', 'podman stop',
or 'podman kill' is run against a container started with --rm, the
OCI runtime directory is left behind at /run/<runtime name> (root user)
or /run/user/<user id>/<runtime name> (rootless user).

This can cause follow-on bugs.
For example, checkpointing a container running with --rm
(podman checkpoint --export) and restoring it
(podman restore --import) with crun fails with the error message
"Error: OCI runtime error: crun: container `<container id>`
already exists".
This error is caused by an attempt to restore the container with
the same container ID as the one left behind in the OCI runtime
directory.

Therefore, make sure that the cleanupRuntime() function runs and
removes the OCI runtime directory,
even if the container has already been removed by the --rm option.

Signed-off-by: Toshiki Sonoda <[email protected]>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Development

Successfully merging this pull request may close these issues.

podman stop causes other commands to hang
6 participants