container stop: release lock before calling the runtime #8906

vrothberg · 2021-01-07T12:23:30Z

Podman defers stopping the container to the runtime, which can take some
time. Keeping the lock while waiting for the runtime to complete the
stop procedure prevents other commands from acquiring the lock, as shown
in #8501.

To improve the user experience, release the lock before invoking the
runtime, and re-acquire the lock when the runtime is finished. Also
introduce an intermediate "stopping" state to properly distinguish such
containers from "stopped" containers etc.

Fixes: #8501
Signed-off-by: Valentin Rothberg [email protected]
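
The locking pattern described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the libpod implementation: the `Container` type, the `State` constants, and the `runtimeStop` callback are hypothetical stand-ins invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a simplified stand-in for the container states discussed above.
type State int

const (
	Running State = iota
	Stopping
	Stopped
)

// Container is a hypothetical, minimal model; it is not libpod's Container type.
type Container struct {
	mu    sync.Mutex
	state State
}

// State reads the current state under the lock, as a "ps"-like command would.
func (c *Container) State() State {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.state
}

// Stop marks the container as stopping, releases the lock while the
// (possibly slow) runtime call runs, then re-acquires it to record the result.
func (c *Container) Stop(runtimeStop func()) error {
	c.mu.Lock()
	if c.state != Running {
		c.mu.Unlock()
		return fmt.Errorf("container is not running")
	}
	c.state = Stopping
	c.mu.Unlock() // other commands can take the lock from here on

	runtimeStop() // assumed stand-in for the OCI runtime's stop call

	c.mu.Lock()
	defer c.mu.Unlock()
	c.state = Stopped
	return nil
}

func main() {
	c := &Container{state: Running}
	entered := make(chan struct{})
	release := make(chan struct{})
	done := make(chan error)

	go func() {
		done <- c.Stop(func() {
			close(entered) // the "runtime" has started stopping
			<-release      // ...and takes a while
		})
	}()

	<-entered
	// The lock is free while the runtime is busy, so this does not block.
	fmt.Println("stopping while runtime runs:", c.State() == Stopping)
	close(release)
	err := <-done
	fmt.Println("error:", err, "stopped:", c.State() == Stopped)
}
```

The intermediate state matters because other commands can now observe the container between the two critical sections; without it they would still see "running" while the stop is already in flight.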

openshift-ci-robot · 2021-01-07T12:23:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [vrothberg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

edsantiago · 2021-01-07T12:57:57Z

I don't think you ran the BATS tests...

vrothberg · 2021-01-07T13:41:50Z

> I don't think you ran the BATS tests...

That's not very helpful. At the moment, the tests fail locally even on master on my machine. Not sure what's going on.

edsantiago · 2021-01-07T14:24:51Z

I'm sorry; I don't know how to help, because I don't understand the change. What first caught my eye was the `stop -t 50` followed by `ps`: there's no way that can work, because the container is run with `--rm` (unless you changed `stop` to not block, and it doesn't look like you did). Then I pulled your PR, ran the tests, and even the basic `podman stop` test fails. I can't really help you debug it, but the tests are deliberately written to be simple, so it should be trivial to reproduce the failure. And indeed, I can easily reproduce the basic test failure manually.

mheon · 2021-01-07T14:28:58Z

@vrothberg One suggestion: I think that the container remove code may need a small change to ensure that it treats stopping containers properly, and they can still be removed - I'm concerned about podman stop being killed by the user mid-way, and the container becoming stuck in a strange state, unable to be removed.

vrothberg · 2021-01-07T14:55:34Z

Thanks, @edsantiago and @mheon! Stuck in meetings for the rest of the day.

The test is definitely wrong as is. I made some progress locally but will continue tomorrow morning.

edsantiago · 2021-01-07T15:04:19Z

It's more than the test. Look at the existing basic. The new stop is leaving containers in stopping state, with status 0 instead of 137. IOW, something is really badly broken.

vrothberg · 2021-01-07T15:04:58Z

> It's more than the test. Look at the existing basic. The new stop is leaving containers in stopping state, with status 0 instead of 137. IOW, something is really badly broken.

Yes, there's more work. Also kill behaved oddly in local tests.


vrothberg · 2021-01-14T17:37:20Z

@baude @mheon @rhatdan this is looking good now :)

mheon · 2021-01-14T17:40:49Z

libpod/container_internal.go

@@ -758,7 +758,7 @@ func (c *Container) isStopped() (bool, error) {
 		return true, err
 	}

-	return !c.ensureState(define.ContainerStateRunning, define.ContainerStatePaused), nil
+	return !c.ensureState(define.ContainerStateRunning, define.ContainerStatePaused, define.ContainerStateStopping), nil


Note to self: in future we should require all of these ensureState invocations to not be inverted - will make them safer when we add states, as all valid states will have to be explicitly listed.

Or add private functions for isRunning etc.
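
To illustrate the concern about inverted `ensureState` checks, here is a small sketch. It assumes a simplified state set and a toy `Container` type; `isStoppedInverted` and `isStoppedExplicit` are hypothetical names used only for this example, and the real libpod state list is longer.

```go
package main

import "fmt"

// State is a simplified stand-in; the real set of container states is larger.
type State int

const (
	Running State = iota
	Paused
	Stopping
	Exited
)

// Container is a toy type for illustration only.
type Container struct{ state State }

// ensureState reports whether the container is in any of the given states.
func (c *Container) ensureState(states ...State) bool {
	for _, s := range states {
		if c.state == s {
			return true
		}
	}
	return false
}

// isStoppedInverted treats every state that is NOT listed as "stopped".
// A newly added state (here: Stopping) silently lands on the "stopped" side
// until someone remembers to add it to the list.
func (c *Container) isStoppedInverted() bool {
	return !c.ensureState(Running, Paused)
}

// isStoppedExplicit lists the states that count as "stopped".
// A newly added state is excluded until it is listed on purpose.
func (c *Container) isStoppedExplicit() bool {
	return c.ensureState(Exited)
}

func main() {
	c := &Container{state: Stopping}
	fmt.Println("inverted check says stopped:", c.isStoppedInverted())
	fmt.Println("explicit check says stopped:", c.isStoppedExplicit())
}
```

With the inverted form, the diff above has to add the new state to the exclusion list; with the explicit form, forgetting to update a check fails in the conservative direction.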

mheon · 2021-01-14T17:44:03Z

LGTM

rhatdan · 2021-01-14T18:19:44Z

/lgtm
/hold

vrothberg · 2021-01-14T18:26:48Z

/hold cancel

After containers#8906, there is a potential race condition in container removal of running containers with `--rm`. Running containers must first be stopped, which was changed to unlock the container to allow commands like `podman ps` to continue to run while stopping; however, this also means that the cleanup process can potentially run before we re-lock, and remove the container from under us, resulting in error messages from `podman rm`. The end result is unchanged, the container is still cleanly removed, but the `podman rm` command will seem to have failed.

Work around this by pinging the database after we stop the container to make sure it still exists. If it doesn't, our job is done and we can exit cleanly.

Signed-off-by: Matthew Heon <[email protected]>

This bug was introduced in containers#8906. When a command such as `podman rm`, `podman restart`, `podman stop`, or `podman kill` is run against a container started with `--rm`, the OCI runtime directory is left behind at /run/<runtime name> (root user) or /run/user/<user id>/<runtime name> (rootless user).

This can cause further bugs. For example, when a container running with `--rm` is checkpointed (`podman checkpoint --export`) and then restored (`podman restore --import`) with crun, the error "Error: OCI runtime error: crun: container `<container id>` already exists" is printed. The error occurs because the restore tries to reuse the container ID that the leftover OCI runtime directory still holds.

Therefore, make sure the cleanupRuntime() function runs and removes the OCI runtime directory, even if the container has already been removed by the `--rm` option.

Signed-off-by: Toshiki Sonoda <[email protected]>
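
The idea of running the runtime cleanup on every exit path can be sketched as below. This is only an illustration under assumptions: `Runtime`, `finishStop`, and the map-based "database" are hypothetical, and only the name `cleanupRuntime` is borrowed from the commit message above.

```go
package main

import "fmt"

// Runtime is a hypothetical model of the per-container OCI runtime directories.
type Runtime struct{ dirs map[string]bool }

// cleanupRuntime removes the runtime directory for id, if present.
func (r *Runtime) cleanupRuntime(id string) { delete(r.dirs, id) }

// finishStop runs after the runtime has stopped the container. db models the
// set of containers still known to the database. It reports whether the
// container still exists.
func finishStop(db map[string]bool, r *Runtime, id string) bool {
	// Deferred so that the runtime directory is removed on every return
	// path, including the early return taken when --rm already removed
	// the container.
	defer r.cleanupRuntime(id)

	if !db[id] {
		return false // container is gone; the deferred cleanup still runs
	}
	// ...state updates for a container that still exists would go here...
	return true
}

func main() {
	r := &Runtime{dirs: map[string]bool{"ctr1": true}}
	db := map[string]bool{} // "ctr1" was already removed by --rm

	fmt.Println("container still exists:", finishStop(db, r, "ctr1"))
	fmt.Println("runtime dir left behind:", r.dirs["ctr1"])
}
```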

openshift-ci-robot added the do-not-merge/work-in-progress label Jan 7, 2021

openshift-ci-robot added the approved label Jan 7, 2021

vrothberg force-pushed the fix-8501 branch from d641df8 to 71852ed January 7, 2021 12:25

vrothberg force-pushed the fix-8501 branch 2 times, most recently from 19f2892 to edd177f January 11, 2021 11:56

mheon reviewed Jan 11, 2021


vrothberg force-pushed the fix-8501 branch from edd177f to 3b65834 January 11, 2021 16:54

vrothberg changed the title ~~WIP - container stop: release lock before calling the runtime~~ container stop: release lock before calling the runtime Jan 11, 2021

openshift-ci-robot removed the do-not-merge/work-in-progress label Jan 11, 2021

vrothberg force-pushed the fix-8501 branch from 3b65834 to d6a62ee January 12, 2021 10:39

rhatdan added the 3.0 Features label Jan 12, 2021

vrothberg force-pushed the fix-8501 branch from d6a62ee to d54478d January 14, 2021 16:45

mheon reviewed Jan 14, 2021


openshift-ci-robot added the do-not-merge/hold label Jan 14, 2021

openshift-ci-robot assigned rhatdan Jan 14, 2021

openshift-ci-robot added the lgtm label Jan 14, 2021

openshift-ci-robot removed the do-not-merge/hold label Jan 14, 2021

openshift-merge-robot merged commit a1b4974 into containers:master Jan 14, 2021

mheon mentioned this pull request May 26, 2021

Ensure that container still exists when removing #10476

Merged

sstosh mentioned this pull request Jun 24, 2022

Fix: Prevent OCI runtime directory remain #14720

Merged

github-actions bot added the locked - please file new issue/PR label Sep 23, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023
