Fix a potential race around the exec cleanup process #13600
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mheon. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Force-pushed from 2154093 to b8cf48a
libpod/container_exec.go
Outdated
// If we can't do this, no point in continuing, any attempt to save
// would write garbage to the DB.
if err := c.syncContainer(); err != nil {
	if errors.Cause(err) == define.ErrNoSuchCtr || errors.Cause(err) == define.ErrCtrRemoved {
please start using errors.Is for new code
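For illustration only, here is a minimal self-contained sketch of the suggested pattern; the sentinel error and helper below are stand-ins, not podman's actual code. errors.Cause from github.com/pkg/errors only unwraps its own wrapper types, while the standard library's errors.Is also walks fmt.Errorf %w chains, so it keeps working as code migrates off pkg/errors.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSuchCtr stands in for define.ErrNoSuchCtr; it is not podman's actual sentinel.
var errNoSuchCtr = errors.New("no such container")

// syncContainer stands in for c.syncContainer() and returns a wrapped sentinel.
func syncContainer() error {
	// Wrapping with %w keeps the sentinel reachable for errors.Is.
	return fmt.Errorf("syncing container: %w", errNoSuchCtr)
}

func main() {
	if err := syncContainer(); err != nil {
		// errors.Is matches the sentinel even through %w wrapping,
		// which errors.Cause would not unwrap.
		if errors.Is(err, errNoSuchCtr) {
			fmt.Println("container is gone; nothing to save")
			return
		}
		fmt.Println("unexpected error:", err)
	}
}
```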
libpod/container_exec.go
Outdated
logrus.Debugf("Container %s exec session %s completed with exit code %d", c.ID(), session.ID(), exitCode)
if newSess.State == define.ExecStateStopped && os.IsNotExist(exitCodeErr) {
same here, use errors.Is instead of os.IsNotExist
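As a concrete (hypothetical) example of the suggested form, the check below uses errors.Is with fs.ErrNotExist, which behaves like os.IsNotExist but also matches wrapped errors; the file path is made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

func main() {
	// Reading a file that does not exist, analogous to reading a missing
	// exec exit file after the cleanup process has already removed it.
	_, err := os.ReadFile("/nonexistent/exit-file")

	// errors.Is(err, fs.ErrNotExist) is the stdlib-preferred replacement
	// for os.IsNotExist(err) and also matches errors wrapped with %w.
	if errors.Is(err, fs.ErrNotExist) {
		fmt.Println("exit file is already gone")
	}
}
```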
Adding WIP, this isn't doing what I expect
I think I fixed it, going to repush
Force-pushed from b8cf48a to 72bba5c
Force-pushed from 72bba5c to 111ed2f
@edsantiago PTAL, I have this running on a 1MT and it seems to be running through the reproducer without issue.
@thrix FYI
There's no way I can review the locking flow; and I shudder at the thought of building this in a CentOS container. I'll trust your testing. Two very minor comments.
libpod/container_exec.go
Outdated
// Lock again.
// Important: we must lock and sync *before* the above error is handled.
// We need into from the database to handle the error.
I'm guessing that's a typo for "info"?
libpod/container_exec.go
Outdated
// Container's entirely removed. We can't save status,
// but the container's entirely removed, so we don't
// need to. Exit without error.
Repetitive comment; maybe clearer as "Container is entirely removed, so there's no need to save status. Exit without error."
Force-pushed from 111ed2f to 780f448
@containers/podman-maintainers PTAL
Is it a race in conmon? I don't think it should call the cleanup process before reading all the container output and writing the exit code.
@giuseppe I don't think so? My suspicion is this: Conmon sends all the output (a considerable amount, probably 10kb?) over the Unix socket to Podman, detects that the container has exited (it does so almost instantly), and spawns the cleanup process. Podman, meanwhile, is still reading the container's output from the Unix socket and writing it to the terminal - which, per Ed's experimentation, seems to be the limiting factor (is writing to TTYs rate-limited in some way? I should look into this) and takes more time than it does for Conmon to finish sending output, write the exit file, and invoke the cleanup process. The cleanup process wakes up and runs in the background, reads (and removes) the exit file, and writes the exit code to the DB. Podman presumably finishes writing sometime during this, but the container lock forces it to wait until after the cleanup process is finished, by which point there is no exit file, hence the error messages.
Every exec session run attached will, on exit, do two things: it will signal the associated `podman exec` that it is finished (to allow Podman to collect the exit code and exit), and spawn a cleanup process to clean up the exec session (in case the `podman exec` process died, we still need to clean up).

If an exec session is created that exits almost instantly, but generates a large amount of output (e.g. prints thousands of lines), the cleanup process can potentially execute before `podman exec` has a chance to read the exit code, resulting in errors. Handle this by detecting if the cleanup process has already removed the exec session before handling the error from reading the exec exit code.

[NO NEW TESTS NEEDED] I have no idea how to test this in CI.

Fixes containers#13227

Signed-off-by: Matthew Heon <[email protected]>
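To make the handling described above and in the commit message concrete, here is a hedged, self-contained sketch of the pattern, not the PR's actual code; execSession, readExitFile, and the stored-exit-code fallback are illustrative stand-ins. If reading the exit file fails because the file is gone and the cleanup process has already marked the session stopped, use the exit code the cleanup process stored instead of reporting an error.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// execSession is an illustrative stand-in for libpod's exec session state.
type execSession struct {
	stopped  bool // set by the cleanup process once it has finished
	exitCode int  // exit code the cleanup process stored in the DB
}

// readExitFile is a stand-in for reading conmon's exit file from disk.
func readExitFile(path string) (int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	var code int
	_, err = fmt.Sscanf(string(data), "%d", &code)
	return code, err
}

// waitExitCode mirrors the shape of the fix: if the exit file is already
// gone but the cleanup process marked the session stopped, trust the code
// it stored rather than returning a spurious error.
func waitExitCode(sess *execSession, exitFile string) (int, error) {
	code, err := readExitFile(exitFile)
	if err == nil {
		return code, nil
	}
	// In the real code, re-lock and sync with the database here, under the
	// container lock, before deciding how to handle the error.
	if sess.stopped && errors.Is(err, fs.ErrNotExist) {
		// The cleanup process won the race: it already read and removed the
		// exit file and saved the exit code. Use the stored value.
		return sess.exitCode, nil
	}
	return 0, err
}

func main() {
	sess := &execSession{stopped: true, exitCode: 0}
	code, err := waitExitCode(sess, "/nonexistent/exit/file")
	fmt.Println(code, err)
}
```

This also reflects why the review comments above stress locking and syncing before handling the error: the decision depends on state that the cleanup process may have written to the database.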
Force-pushed from 780f448 to 5b2597d
Can we merge this, or are there further questions?
@baude @Luap99 @vrothberg PTAL
LGTM
/lgtm