Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote: exec: do not leak session IDs on errors #20405

Merged

Conversation

giuseppe
Copy link
Member

commit fa19e1b partially introduced the fix, but was merged too quickly and didn't work with remote.

Introduce a new binding to allow removing a session from the remote client.

[NO NEW TESTS NEEDED] since it fixes a CI failure

Does this PR introduce a user-facing change?

None

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 18, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 18, 2023
@github-actions github-actions bot added the kind/api-change Change to remote API; merits scrutiny label Oct 18, 2023
@giuseppe giuseppe force-pushed the do-not-leak-sessions-with-remote branch from bc1fb91 to edf1654 Compare October 18, 2023 19:16
@edsantiago
Copy link
Member

You're going to need to rebase & repush, to incorporate #20404 and then remove that skip, but no need to do that yet. Let's see how this CI run goes. Thank you for working on this.

@giuseppe giuseppe force-pushed the do-not-leak-sessions-with-remote branch from edf1654 to 7765d46 Compare October 18, 2023 19:40
@mheon
Copy link
Member

mheon commented Oct 18, 2023

Idea LGTM. Haven't done a solid look over the code yet, though.

@giuseppe giuseppe force-pushed the do-not-leak-sessions-with-remote branch from d7dae7c to ddecb51 Compare October 19, 2023 07:42
Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to expose a REST endpoint for this?
AFAIK via remote we use the 5 mins delay for the conmon cleanup process so the expectation of the test is just wrong IMO. (conf.Engine.ExitCommandDelay)

So if we want to cleanup right away this should be just set to 0 but that then would cause race conditions on the client as it has to inspect after exec finishes to get the exit code.

So I am not sure if that has to be fixed at all as the exec session will be cleanup up eventually. And in your test we only have a error condition so I see no reason why the server shouldn't handle the remove to make it work. Trusting lcients to do the right things sounds wrong to me.

As far as I can tell docker does also only cleanup exec session every 5 min:
https://github.com/moby/moby/blob/0253fedf03f4964c20906795e119404633cb9c1a/daemon/exec.go#L323

Comment on lines +622 to +626
defer func() {
if retErr != nil {
_ = containers.ExecRemove(ic.ClientCtx, sessionID, nil)
}
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be handled on the server side? We should not expect all API clients to know about this.

@giuseppe giuseppe force-pushed the do-not-leak-sessions-with-remote branch from ddecb51 to 00d2303 Compare October 19, 2023 09:25
commit fa19e1b partially introduced
the fix, but was merged too quickly and didn't work with remote.

Introduce a new binding to allow removing a session from the remote
client.

Signed-off-by: Giuseppe Scrivano <[email protected]>
This reverts commit 44ed415.

Signed-off-by: Giuseppe Scrivano <[email protected]>
@giuseppe giuseppe force-pushed the do-not-leak-sessions-with-remote branch from 00d2303 to 1d2589c Compare October 19, 2023 09:34
@giuseppe
Copy link
Member Author

Having the cleanup server-side would simplify the implementation but IMO it is cleaner if this is handled by the client since the client is already responsible for creating the session through ExecCreate, and now we are just exposing a way to delete it whenever necessary without waiting for the timeout.

If we handle it server-side, then should the session be deleted on any kind of error (e.g. invalid JSON request) or only on internal errors? Not sure if it classifies as a breaking change but it changes the current behavior since the session ID used by the client won't be valid anymore after an error, which is not the case now.

@Luap99
Copy link
Member

Luap99 commented Oct 19, 2023

Yes I guess that is a fair question but IMO dumping responsibility on clients makes it worse now, as they have the responsibility to cleanup exec session that they did not have before (ok it was just leaked for now). If we talk about podman-remote only I would agree that we can take care of it but this endpoint will be part of our public stable API.
Users like podman-py or really anyone using exec via REST API may need to use that as podman seems unable to correctly cleanup exec sessions which is a bug and the server should take care of leaked sessions regardless, we should never trust the client to do this.

Also if I read the moby code correctly deleting the exec session on a failed start seems to be what docker is doing so I doubt that this would cause problems for API users
https://github.com/moby/moby/blob/0253fedf03f4964c20906795e119404633cb9c1a/daemon/exec.go#L181-L193

@giuseppe giuseppe force-pushed the do-not-leak-sessions-with-remote branch 2 times, most recently from aa7abd9 to 106b097 Compare October 19, 2023 12:09
@giuseppe
Copy link
Member Author

Also if I read the moby code correctly deleting the exec session on a failed start seems to be what docker is doing so I doubt that this would cause problems for API users
https://github.com/moby/moby/blob/0253fedf03f4964c20906795e119404633cb9c1a/daemon/exec.go#L181-L193

thanks for checking that. If that is the expectation for Docker compatibility, let's go with it. Pushed a simpler implementation with the server-side cleanup

@giuseppe giuseppe force-pushed the do-not-leak-sessions-with-remote branch from 106b097 to 5a6252a Compare October 19, 2023 12:28
@mheon
Copy link
Member

mheon commented Oct 19, 2023

LGTM, tests are red though

@giuseppe
Copy link
Member Author

if we delete the session ID on errors, we cannot retrieve the exit code later

@giuseppe giuseppe force-pushed the do-not-leak-sessions-with-remote branch from 5a6252a to 1d2589c Compare October 19, 2023 15:40
@giuseppe
Copy link
Member Author

we would break the current use case where we can query the sessionID after an error.

I've reverted to the previous version of doing the removal client-side, if you don't like this approach, we can just disable the test for remote and rely on the timeout mechanism

@rhatdan
Copy link
Member

rhatdan commented Oct 19, 2023

/lgtm

Can we also do it server side in a separate PR?

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2023
@giuseppe
Copy link
Member Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 19, 2023
@giuseppe
Copy link
Member Author

Can we also do it server side in a separate PR?

I've tried doing it server side before, but it breaks the current client code where we check the exit status after the failure. If we delete the session immediately, the client won't be able to retrieve the exit code. Server-side I think the best we can do is what we already have with the timeout mechanism, unless we do it in a new API, which seems overkill for such a minor issue. I've not worked much on the server/client part, so if anyone has better ideas, I am all ears.

@cevich CI is currently failing with:

Failed to start an instance: FAILED_PRECONDITION: Monthly free compute limit exceeded!

@mheon
Copy link
Member

mheon commented Oct 19, 2023

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 19, 2023
@mheon
Copy link
Member

mheon commented Oct 19, 2023

CI has returned to normal

@openshift-ci openshift-ci bot merged commit 37292a1 into containers:main Oct 19, 2023
100 checks passed
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Jan 18, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 18, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/api-change Change to remote API; merits scrutiny lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note-none
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants