-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rootless: use a pause process to keep namespaces alive #3084
rootless: use a pause process to keep namespaces alive #3084
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: giuseppe The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
0353ec2
to
3d4f40f
Compare
20f2b54
to
9c8ffc5
Compare
3766e4f
to
e80d342
Compare
74fe761
to
7d7f7cb
Compare
dfa45eb
to
a8f585c
Compare
a8f585c
to
e3a2152
Compare
tests are passing here, @mheon if there are no objections, I'd like to get this into the next Podman release so that we have some time to play with it before 8.1 |
I'm reluctant to get this into 1.3.1, which is supposed to be a small bugfix release... Maybe we cut a 1.4 very soon (before end of month) to get this out quickly? |
@@ -113,6 +114,35 @@ func setupRootless(cmd *cobra.Command, args []string) error { | |||
MainGlobalOpts, | |||
remoteclient, | |||
} | |||
|
|||
pausePidPath, err := util.GetRootlessPauseProcessPidPath() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we start moving things like this block of code out of cmd/podman
and consolidating them someplace? Maybe pkg/rootless
? I want to make it easier to integrate this into other Libpod consumers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I agree. The only reason I've left it here for now, is that the logic for detecting running containers and try to join their namespaces (and this is particularly important to not break rootless containers once again on an upgrade) cannot be easily moved around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a look at it now, but there are parts that cannot be easily moved down to the pkg/rootless package. For example we depend on util.GetRootlessRuntimeDir()
and in the case of not having a pause process running, we need to retrieve a list of conmon processes.
☔ The latest upstream changes (presumably #3104) made this pull request unmergeable. Please resolve the merge conflicts. |
this leaves the containers stopped but we won't risk to use the wrong user namespace. Signed-off-by: Giuseppe Scrivano <[email protected]>
use a pause process to keep the user and mount namespace alive. The pause process is created immediately on reload, and all successive Podman processes will refer to it for joining the user&mount namespace. This solves all the race conditions we had on joining the correct namespaces using the conmon processes. As a fallback if the join fails for any reason (e.g. the pause process was killed), then we try to join the running containers as we were doing before. Signed-off-by: Giuseppe Scrivano <[email protected]>
add a shortcut for joining immediately the namespace so we don't need to re-exec Podman. With the pause process simplificaton, we can now attempt to join the namespaces as soon as Podman starts (and before the Go runtime kicks in), so that we don't need to re-exec and use just one process. Signed-off-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Giuseppe Scrivano <[email protected]>
e3a2152
to
53a7622
Compare
|
||
/* There can be another process at this point trying to configure the user namespace and the pause | ||
process, do not override the pid file if it already exists. */ | ||
if (renameat2 (AT_FDCWD, tmp_file_path, AT_FDCWD, pause_pid_file_path, RENAME_NOREPLACE) < 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this doesn't really look racy, but I think it's still worth asking: should we consider a file lock here to ensure only one process is trying to make the PID file and become the init process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to avoid the lock using only a renameat2(RENAME_NOREPLACE)
. If the file already exists, then it means another process managed to create it faster as it wasn't there earlier. If we fail at this point then Podman attempts to join the pause process instead (which we assume was created by the process who won the race).
If the file exists but the process is not running, then Podman raises an error and asks the user to fix it by deleting the file. We can improve this in future, but for now I preferred to have a clear error message
One question, but overall I think we're pretty good to merge. LGTM |
/lgtm |
@giuseppe It appears that
note:
I'm still trying to figure out the root cause. It occurs reliably every time a particular CI job of mine runs, which executes various podman commands in quick succession (build, run, top, inspect, rm). Do you have a hunch what might be the cause of this, so I can investigate further? Where in podman is the call which is supposed to delete |
The pause process does not appear to be pausing - which is our problem. If it's exiting cleanly, it's not doing its job of living as long as a user's session to preserve namespaces |
the pause process should be left alive after the first execution. It is implemented to avoid some race conditions trying to join the existing namespaces. Does it exist immediately after you run podman? What do you see if you run something like the following?
If you exit the current user session, the pause process might be killed by systemd. There is a fix for this issue in: #3188 In particular this patch: 30ef6ba You can try it by yourself with |
Thanks for the info! It appears to work correctly now. I had to manually enable-linger for my user as root, because Debian appears not to install policykit by default, so non-root users cannot enable-linger for themselves. |
Perhaps the deb package should recommend/suggest policykit-1 as a dependency |
@lsm5 can we add that dependency? |
use a pause process to keep the user and mount namespace alive.
The pause process is created immediately on reload, and all successive Podman processes will refer to it for joining the user&mount namespace.
This solves all the race conditions we had on joining the correct namespaces using the conmon processes.
As a fallback if the join fails for any reason (e.g. the pause process was killed), then we try to join the running containers as we were doing before.
This PR also adds an alternative way of joining the namespaces. We can now attempt to join them as soon as Podman starts, before the Go runtime kicks in, so that we don't need to re-exec and use just one process 🎉
This version doesn't try to recover and re-create a pause process if for any reason it terminated, but prints an error. This is not trivial to recover from, as there can be more podman processes already running and will require some sort of locking.