
Fix a potential race around the exec cleanup process #13600

Merged: 1 commit merged into containers:main on Mar 23, 2022

Conversation

@mheon mheon (Member) commented Mar 22, 2022

Every exec session run attached will, on exit, do two things: it will signal the associated podman exec that it is finished (to allow Podman to collect the exit code and exit), and spawn a cleanup process to clean up the exec session (in case the podman exec process died, we still need to clean up). If an exec session is created that exits almost instantly, but generates a large amount of output (e.g. prints thousands of lines), the cleanup process can potentially execute before podman exec has a chance to read the exit code, resulting in errors. Handle this by detecting if the cleanup process has already removed the exec session before handling the error from reading the exec exit code.

[NO NEW TESTS NEEDED] I have no idea how to test this in CI.

Fixes #13227
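
The handling described above amounts to: if reading the exit code fails because the exit file is already gone, but the session record shows the cleanup process has stopped the session, take the exit code from the database instead of erroring. Below is a minimal Go sketch of that pattern; the names (execSession, readExitFile) and paths are illustrative stand-ins, not Podman's actual API.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// execSession is a stand-in for the exec session record Podman keeps in its
// database; the cleanup process marks the session stopped and stores the
// exit code before removing conmon's exit file.
type execSession struct {
	Stopped  bool
	ExitCode int
}

// readExitFile stands in for reading the exit code conmon writes for the
// exec session.
func readExitFile(path string) (int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	var code int
	_, err = fmt.Sscanf(string(data), "%d", &code)
	return code, err
}

// waitForExit returns the session's exit code. If the exit file is already
// gone but the session is marked stopped, the cleanup process won the race
// and already recorded the code, so the missing file is not an error.
func waitForExit(sess *execSession, exitFile string) (int, error) {
	code, err := readExitFile(exitFile)
	if err == nil {
		return code, nil
	}
	if sess.Stopped && errors.Is(err, os.ErrNotExist) {
		return sess.ExitCode, nil
	}
	return 0, err
}

func main() {
	// Simulate the race: the cleanup process already removed the exit file
	// and stored the exit code in the session record.
	sess := &execSession{Stopped: true, ExitCode: 0}
	code, err := waitForExit(sess, "/path/that/does/not/exist")
	fmt.Println(code, err) // 0 <nil>
}
```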

@openshift-ci openshift-ci bot (Contributor) commented Mar 22, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon


@openshift-ci openshift-ci bot added the 'approved' label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Mar 22, 2022
@mheon mheon force-pushed the exec_cleanup_race branch from 2154093 to b8cf48a on March 22, 2022 at 18:44
// If we can't do this, no point in continuing, any attempt to save
// would write garbage to the DB.
if err := c.syncContainer(); err != nil {
if errors.Cause(err) == define.ErrNoSuchCtr || errors.Cause(err) == define.ErrCtrRemoved {

please start using errors.Is for new code
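
For illustration, a minimal sketch of what that would look like; the sentinel errors here are local stand-ins for libpod's define.ErrNoSuchCtr and define.ErrCtrRemoved. errors.Is walks the wrap chain, so it matches sentinels even after the error has been wrapped with context.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for libpod's define.ErrNoSuchCtr and define.ErrCtrRemoved.
var (
	errNoSuchCtr  = errors.New("no such container")
	errCtrRemoved = errors.New("container has been removed")
)

func main() {
	// An error as syncContainer() might return it, wrapped with context.
	err := fmt.Errorf("syncing container state: %w", errNoSuchCtr)

	// errors.Is unwraps the chain, so the sentinel still matches.
	if errors.Is(err, errNoSuchCtr) || errors.Is(err, errCtrRemoved) {
		fmt.Println("container is gone; nothing left to sync or save")
	}
}
```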


logrus.Debugf("Container %s exec session %s completed with exit code %d", c.ID(), session.ID(), exitCode)
if newSess.State == define.ExecStateStopped && os.IsNotExist(exitCodeErr) {

same here use errors.Is instead of os.IsNotExist
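
A small sketch of that replacement, assuming the exit-code error comes from a file read: errors.Is(err, os.ErrNotExist) matches the *PathError returned for a missing file just as os.IsNotExist does, and additionally matches errors wrapped with %w. The path below is a placeholder, not a real Podman location.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

func main() {
	// Reading a missing exit file produces an error wrapping fs.ErrNotExist.
	_, exitCodeErr := os.ReadFile("/tmp/example-exit-file-that-does-not-exist")

	// Preferred modern form: errors.Is instead of os.IsNotExist.
	if errors.Is(exitCodeErr, os.ErrNotExist) {
		fmt.Println("exit file already removed, likely by the cleanup process")
	}
}
```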

@mheon mheon (Member, Author) commented Mar 22, 2022

Adding WIP; this isn't doing what I expect.

@mheon mheon changed the title from 'Fix a potential race around the exec cleanup process' to 'WIP: Fix a potential race around the exec cleanup process' on Mar 22, 2022
@openshift-ci openshift-ci bot added the 'do-not-merge/work-in-progress' label (Indicates that a PR should not merge because it is a work in progress.) on Mar 22, 2022
@mheon mheon (Member, Author) commented Mar 22, 2022

I think I fixed it, going to repush

@mheon mheon force-pushed the exec_cleanup_race branch from b8cf48a to 72bba5c on March 22, 2022 at 19:23
@mheon mheon changed the title from 'WIP: Fix a potential race around the exec cleanup process' back to 'Fix a potential race around the exec cleanup process' on Mar 22, 2022
@openshift-ci openshift-ci bot removed the 'do-not-merge/work-in-progress' label on Mar 22, 2022
@mheon mheon force-pushed the exec_cleanup_race branch from 72bba5c to 111ed2f on March 22, 2022 at 19:23
@mheon mheon (Member, Author) commented Mar 22, 2022

@edsantiago PTAL, I have this running on a 1MT and it seems to get through the reproducer without issue.

@edsantiago edsantiago (Member) commented:

@thrix FYI

@edsantiago edsantiago left a comment


There's no way I can review the locking flow, and I shudder at the thought of building this in a CentOS container. I'll trust your testing. Two very minor comments.


// Lock again.
// Important: we must lock and sync *before* the above error is handled.
// We need into from the database to handle the error.

I'm guessing that's a typo for "info"?

Comment on lines 359 to 361
// Container's entirely removed. We can't save status,
// but the container's entirely removed, so we don't
// need to. Exit without error.

Repetitive comment; maybe clearer as "Container is entirely removed, so there's no need to save status. Exit without error."

@mheon mheon force-pushed the exec_cleanup_race branch from 111ed2f to 780f448 on March 22, 2022 at 20:16
@mheon mheon (Member, Author) commented Mar 22, 2022

@containers/podman-maintainers PTAL

@giuseppe giuseppe (Member) commented:

> If an exec session is created that exits almost instantly, but generates a large amount of output (e.g. prints thousands of lines), the cleanup process can potentially execute before podman exec has a chance to read the exit code, resulting in errors.

Is it a race in conmon? I don't think it should call the cleanup process before reading all the container output and writing the exit code.

@mheon mheon (Member, Author) commented Mar 23, 2022

@giuseppe I don't think so? My suspicion is this: Conmon sends all the output (a considerable amount, probably 10kb?) over the Unix socket to Podman, detects that the container has exited (it does so almost instantly), and spawns the cleanup process. Podman, meanwhile, is still reading the container's output from the Unix socket and writing it to the terminal, which, per Ed's experimentation, seems to be the limiting factor (is writing to TTYs rate-limited in some way? I should look into this) and takes more time than it does for Conmon to finish sending output, write the exit file, and invoke the cleanup process. The cleanup process wakes up and runs in the background, reads (and removes) the exit file, and writes the exit code to the DB. Podman presumably finishes writing sometime during this, but the container lock forces it to wait until after the cleanup process is finished, by which point the exit file is gone, hence the error messages.

Every exec session run attached will, on exit, do two things: it
will signal the associated `podman exec` that it is finished (to
allow Podman to collect the exit code and exit), and spawn a
cleanup process to clean up the exec session (in case the `podman
exec` process died, we still need to clean up). If an exec
session is created that exits almost instantly, but generates a
large amount of output (e.g. prints thousands of lines), the
cleanup process can potentially execute before `podman exec` has
a chance to read the exit code, resulting in errors. Handle this
by detecting if the cleanup process has already removed the exec
session before handling the error from reading the exec exit
code.

[NO NEW TESTS NEEDED] I have no idea how to test this in CI.

Fixes containers#13227

Signed-off-by: Matthew Heon <[email protected]>
@mheon mheon force-pushed the exec_cleanup_race branch from 780f448 to 5b2597d on March 23, 2022 at 13:33
@mheon mheon (Member, Author) commented Mar 23, 2022

Can we merge this, or are there further questions?

@mheon mheon (Member, Author) commented Mar 23, 2022

@baude @Luap99 @vrothberg PTAL

@rhatdan rhatdan (Member) commented Mar 23, 2022

LGTM

@baude baude (Member) commented Mar 23, 2022

/lgtm

@openshift-ci openshift-ci bot added the 'lgtm' label (Indicates that a PR is ready to be merged.) on Mar 23, 2022
@openshift-merge-robot openshift-merge-robot merged commit a1e2897 into containers:main Mar 23, 2022
@github-actions github-actions bot added the 'locked - please file new issue/PR' label (Assist humans wanting to comment on an old issue or PR with locked comments.) on Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023
Successfully merging this pull request may close these issues.

podman-in-podman: Error: timed out waiting for file: internal libpod error