
cmd/run: Ensure underlying container is stopped when toolbox is killed #1207

Open · wants to merge 1 commit into main from just-die-after-i-hang-up

Conversation

halfline

Right now "toolbox enter" creates a container on the fly, but then lets it linger after the foreground toolbox process is killed (for instance, from a terminal hangup).

Not killing the underlying container has the negative side effect of stalling shutdown if a toolbox shell is running.

This commit addresses that problem by detecting when the toolbox process is signaled, and then in response, kills off the entire cgroup associated with the underlying container.

Closes #1157
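
For illustration only, here is a minimal Go sketch of the general idea, not the actual patch: install a signal handler in the toolbox process and kill everything in the container's cgroup. The cgroup path is a hypothetical input here, and cgroup v2 with the cgroup.kill file (Linux >= 5.14) is assumed.

package main

import (
	"os"
	"os/signal"
	"path/filepath"
	"syscall"
)

// killCgroupOnSignal waits for a termination signal and then kills every
// process in the given cgroup.
func killCgroupOnSignal(cgroupPath string) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM)
	<-sigs

	// Writing "1" to cgroup.kill (cgroup v2, Linux >= 5.14) sends SIGKILL to
	// every member of the cgroup and all of its descendants.
	_ = os.WriteFile(filepath.Join(cgroupPath, "cgroup.kill"), []byte("1"), 0o644)
}

func main() {
	// Hypothetical cgroup path; real code would have to resolve it from a
	// PID that actually belongs to the container.
	killCgroupOnSignal("/sys/fs/cgroup/user.slice/libpod-example.scope")
}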

@halfline halfline force-pushed the just-die-after-i-hang-up branch from f28ca61 to af0fc8b Compare December 22, 2022 01:54
@halfline
Author

@matthiasclasen I think this is probably a fix for the shutdown issue you told me about yesterday

@softwarefactory-project-zuul

Build failed.

❌ unit-test FAILURE in 24m 23s
❌ unit-test-migration-path-for-coreos-toolbox FAILURE in 3m 12s
✔️ system-test-fedora-rawhide SUCCESS in 34m 09s
✔️ system-test-fedora-36 SUCCESS in 12m 45s
✔️ system-test-fedora-35 SUCCESS in 14m 19s

Right now "toolbox enter" creates a container on the fly, but
then lets it linger after the foreground toolbox process is
killed (for instance, from a terminal hangup).

Not killing the underlying container has the negative side effect of
stalling shutdown if a toolbox shell is running.

This commit addresses that problem by detecting when the toolbox process
is signaled, and then in response, kills off the entire cgroup associated
with the underlying container.

Closes containers#1157

Signed-off-by: Ray Strode <[email protected]>
@halfline halfline force-pushed the just-die-after-i-hang-up branch from db770e5 to 2c0ec31 Compare December 22, 2022 03:14
@softwarefactory-project-zuul

Build succeeded.

✔️ unit-test SUCCESS in 27m 38s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 09s
✔️ system-test-fedora-rawhide SUCCESS in 38m 38s
✔️ system-test-fedora-36 SUCCESS in 12m 23s
✔️ system-test-fedora-35 SUCCESS in 13m 53s

@debarshiray
Member

debarshiray commented Sep 30, 2023

Just for the sake of posterity ...

@matthiasclasen I think this is probably a fix for the shutdown issue you told me about yesterday

I think the blocked shutdown problem was reported as:

... and might have been fixed by:
containers/podman#17025

Note that while containers/podman#14531 points at containers/podman#16785 , I don't think it had anything to do with fixing blocked shutdowns caused by lingering podman exec sessions.

@debarshiray
Member

debarshiray left a comment

Thanks for working on this @halfline - my apologies for not looking at it sooner.

I don't think the changes in this pull request are working as expected, and the commit message suggests that there might be some misunderstanding.

Longer explanation follows ...

First, toolbox enter doesn't quite create a container on the fly. It is confusing because there's some syntactic sugar where toolbox enter will offer to create a container if you have none. You could think of it as an embedded toolbox create.

The canonical flow is that a container is created with toolbox create, which wraps podman create. Then that container is entered with toolbox enter, which wraps podman start and podman exec. The podman start launches the container's entry point process, which would have been PID 1 inside the container had we been using a PID namespace, but we aren't. For Toolbx containers, this entry point is toolbox init-container. The podman exec launches the process that the user actually interacts with, e.g., a CLI shell, Emacs, Vim, etc.

The same container can be entered multiple times from multiple terminals at the same time. When that happens the podman start is basically a NOP because a container can have only one entry point and it's already running.
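
For readers following along, this is a minimal Go sketch of that flow, not Toolbx's actual code; the container name and command in main are hypothetical.

package main

import (
	"os"
	"os/exec"
)

// enterContainer starts the container's entry point (a NOP if it is already
// running), then runs the interactive foreground process with podman exec.
func enterContainer(container string, command ...string) error {
	if err := exec.Command("podman", "start", container).Run(); err != nil {
		return err
	}

	args := append([]string{"exec", "--interactive", "--tty", container}, command...)
	cmd := exec.Command("podman", args...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// Hypothetical container name and command.
	_ = enterContainer("fedora-toolbox-39", "bash", "-l")
}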

Neither the entry point nor the foreground processes are parented by podman start or podman exec. Each of them is parented by a separate conmon(8) process, which is double-forked from podman(1) and reparented to the systemd user instance.

The blocked shutdown problem was that when SIGTERM was sent to the container's entry point, either through podman stop or the shutdown sequence, the processes launched by podman exec wouldn't be terminated. This would end up blocking shutdown. My understanding is that this was caused by Toolbx containers not having a separate PID namespace and was fixed by containers/podman#17025

This uncovers another problem. How did those processes launched by podman exec manage to outlive their respective terminals? This is #1204. I suspect this happens because the signal sent by the terminal only reaches up to the wrapper podman exec and doesn't get to the corresponding conmon(8) process.

With the changes in this branch, if I have multiple graphical terminals open with toolbox enter against the same container, then closing any one of those terminals with the cross button terminates the container's entry point and the foreground container process in all the other terminals. This shouldn't happen. Closing a terminal should only terminate its own foreground container process.

I don't understand cgroups very well, so I don't know why this is happening.

Toolbx containers also use the host's cgroup namespace, and, as far as I know, each terminal created by GNOME Terminal has its own cgroup. For each of the foreground container processes, I get:

$ cat /proc/56151/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-2b4111a9-f4a9-4f2b-8acd-4adc9e673bac.scope
$
$ cat /proc/55919/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-9bf0516d-bd6f-4094-9504-0be37b1ce6bf.scope
$
$ cat /proc/56363/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-6ecb9e7e-aa06-42d6-b9b6-bb9d6cdb44f4.scope

So, I have no idea why terminating the processes in one cgroup is affecting the others.

@halfline
Author

halfline commented Oct 1, 2023

Sorry, it's been a while since I looked at this, and even when I was working on it, I only put it together in a few hours after Matthias visited and mentioned the problem offhand, so I may be off base here... I haven't done much deep diving and I may have made wrong assumptions.

Having said that, my impression from your message is you believe I'm sending the hangup signal to the terminal's cgroup, but that's not the intent. The idea was to send the signal to the cgroup of the bash running in the container.

I got (at least) two things wrong I think:

  • I had assumed each conmon would create a new cgroup for each podman-exec invocation. It seems that assumption was perhaps wrong. Indeed, every shell seems to be using a shared cgroup. I get:

╎❯ cat /proc/self/cgroup
0::/user.slice/user-4153.slice/user@4153.service/user.slice/libpod-4e141e468f1a756b43407f64f0273b5c18fcf5b784ea2a13050d0a9f4e1528ab.scope/container

in all my terminals. I find this really surprising. I do see crun has a --cgroup option, but indeed looking in ps output, it's not getting used.

  • I was looking at the wrong pid to find the cgroup. I'm erroneously using the pid of the toolbox init-container process to find the cgroup.

This is #1204. I suspect this happens because the signal sent by the terminal only reaches up to the wrapper podman exec and doesn't get to the corresponding conmon(8) process.

Right, when a terminal is closed, a SIGHUP is sent to all processes in the terminal "session" (in the setsid() sense of the word). These sessions are inherited from their parents, unless a child creates a fresh session. There's no way in unix afaik to move a child into a pre-existing session; it either uses the one it started with, or it starts its own. conmon isn't part of the child hierarchy of podman-exec, so it won't inherit the terminal session. This means the hangup signal needs to be ferried either from podman-exec or toolbox to the process started by conmon (or if children had their own cgroup like I thought, we could just send the signal to the whole cgroup)
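
To illustrate the ferrying idea, a minimal Go sketch, assuming the PID of the conmon-started process were already known (obtaining it is exactly the open question here):

package main

import (
	"os"
	"os/signal"
	"syscall"
)

// relayHangup forwards SIGHUP received by this process to the process inside
// the container.
func relayHangup(containerPid int) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGHUP)

	go func() {
		for range sigs {
			_ = syscall.Kill(containerPid, syscall.SIGHUP)
		}
	}()
}

func main() {
	relayHangup(12345) // hypothetical PID of the conmon-started process
	select {}          // keep running so the relay can fire
}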

@debarshiray
Member

debarshiray commented Oct 3, 2023

Having said that, my impression from your message is you believe I'm sending the hangup signal to the terminal's cgroup, but that's not the intent. The idea was to send the signal to the cgroup of the bash running in the container.

I managed to confuse myself as well.

GNOME Terminal puts every VteTerminal into a separate systemd scope or cgroup. That's what I was showing here, because these process IDs are of the toolbox enter processes, and not of conmon(8) or the processes inside the container:

$ cat /proc/56151/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-2b4111a9-f4a9-4f2b-8acd-4adc9e673bac.scope
$
$ cat /proc/55919/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-9bf0516d-bd6f-4094-9504-0be37b1ce6bf.scope
$
$ cat /proc/56363/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-6ecb9e7e-aa06-42d6-b9b6-bb9d6cdb44f4.scope

That's why I didn't realize that all the conmon(8) processes for the same container share the same cgroup. However, they don't share it across different containers.

I had assumed each conmon would create a new cgroup for each podman-exec invocation. It seems that assumption was perhaps wrong. Indeed, every shell seems to be using a shared cgroup. I get:

╎❯ cat /proc/self/cgroup 0::/user.slice/user-4153.slice/user@4153.service/user.slice/libpod-4e141e468f1a756b43407f64f0273b5c18fcf5b784ea2a13050d0a9f4e1528ab.scope/container

in all my terminals. I find this really surprising. I do see crun has a --cgroup option, but indeed looking in ps output, it's not getting used.

Right. Thanks for the hint, because I had missed it before.

I was looking at the wrong pid to find the cgroup. I'm erroneously using the pid of the toolbox init-container process to find the cgroup.

nod

This is #1204. I suspect this happens because the signal sent by the terminal only reaches up to the wrapper podman exec and doesn't get to the corresponding conmon(8) process.

Right, when a terminal is closed, a SIGHUP is sent to all processes in the terminal "session" (in the setsid() sense of the word). These sessions are inherited from their parents, unless a child creates a fresh session. There's no way in unix afaik to move a child into a pre-existing session; it either uses the one it started with, or it starts its own. conmon isn't part of the child hierarchy of podman-exec, so it won't inherit the terminal session. This means the hangup signal needs to be ferried either from podman-exec or toolbox to the process started by conmon (or if children had their own cgroup like I thought, we could just send the signal to the whole cgroup)

nod

I don't see an obvious way to get the PID of the conmon(8) process or the process inside the container started by it. That was my question in #1204

Otherwise, we can insert a process that we control between conmon(8) and the user-requested process inside the container that will relay the PID inside the container back to the toolbox enter process. Then we can ferry the SIGHUP from the terminal to the container process. I don't know what would be the best way to do it. Maybe we can send a file descriptor?

This intermediate process under our control sounds a lot like the toolbox shell idea that I had before. We don't have to call it toolbox shell, though. It's just that I am really bad at naming. So, suggestions welcome.
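
A minimal Go sketch of such a shim, with a hypothetical fd-passing convention; none of this is an agreed design:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// A hypothetical in-container shim: report our PID to the host-side
// toolbox enter process over an inherited file descriptor, then exec the
// user-requested command.
func main() {
	// Assume fd 3 is the write end of a pipe set up by the caller
	// (e.g. via podman exec --preserve-fds=1).
	if pidPipe := os.NewFile(3, "pid-pipe"); pidPipe != nil {
		fmt.Fprintf(pidPipe, "%d\n", os.Getpid())
		pidPipe.Close()
	}

	argv := os.Args[1:]
	if len(argv) == 0 {
		fmt.Fprintln(os.Stderr, "usage: shim COMMAND [ARGS...]")
		os.Exit(125)
	}

	path, err := exec.LookPath(argv[0])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(127)
	}

	// Replace the shim with the real command; signals sent to this PID now
	// reach the user's foreground process directly.
	if err := syscall.Exec(path, argv, os.Environ()); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(126)
	}
}

Because the shim execs the user-requested command rather than forking it, the PID it reported stays valid for the lifetime of the foreground process.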

@halfline
Author

halfline commented Oct 3, 2023

I think it probably is possible to get the right pid. Just as an experiment:

╎❯ bundle_path=$(podman inspect 4e141e468f1a756b43407f64f0273b5c18fcf5b784ea2a13050d0a9f4e1528a | jq -r '.[0].StaticDir')
╎❯ exec_id=$(podman exec -d 4e141e468f1a756b43407f64f0273b5c18fcf5b784ea2a13050d0a9f4e1528a sleep 300)
╎❯ pid=$(podman exec 4e141e468f1a756b43407f64f0273b5c18fcf5b784ea2a13050d0a9f4e1528a cat $bundle_path/$exec_id/exec_pid)
╎❯ echo $pid
193097

There may be a more direct command for it, not sure. libpod has an API for it:

// Get PID file path for a container's exec session
func (c *Container) execPidPath(sessionID string) string {
        return filepath.Join(c.execBundlePath(sessionID), "exec_pid")
}

though it looks like Toolbx doesn't use libpod?
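
A rough Go equivalent of the experiment above, shelling out to podman; the exec_pid file layout is Podman's internal state rather than a stable interface, and the container and exec session IDs in main are hypothetical:

package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"path/filepath"
	"strings"
)

// execSessionPid looks up the container's StaticDir from podman inspect and
// then reads <StaticDir>/<execID>/exec_pid, mirroring the shell experiment.
func execSessionPid(container, execID string) (string, error) {
	out, err := exec.Command("podman", "inspect", container).Output()
	if err != nil {
		return "", err
	}

	var data []struct {
		StaticDir string
	}
	if err := json.Unmarshal(out, &data); err != nil {
		return "", err
	}
	if len(data) == 0 {
		return "", fmt.Errorf("container %s not found", container)
	}

	pid, err := exec.Command("podman", "exec", container, "cat",
		filepath.Join(data[0].StaticDir, execID, "exec_pid")).Output()
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(pid)), nil
}

func main() {
	// Hypothetical container name and exec session ID.
	fmt.Println(execSessionPid("fedora-toolbox-39", "abc123"))
}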

@halfline
Author

halfline commented Oct 3, 2023

I guess the complication is that the exec_id is only printed for background exec invocations. You can fish it out of podman inspect output, but you would have to diff the list of exec IDs before and after to find the right one, which is a little racy.
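
A minimal Go sketch of that diffing approach, assuming the ExecIDs field of podman inspect output; as noted, it is racy if several exec sessions start concurrently, and the container name in main is hypothetical:

package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// execIDs returns the container's current exec session IDs as reported by
// podman inspect.
func execIDs(container string) (map[string]bool, error) {
	out, err := exec.Command("podman", "inspect", container).Output()
	if err != nil {
		return nil, err
	}

	var data []struct {
		ExecIDs []string
	}
	if err := json.Unmarshal(out, &data); err != nil {
		return nil, err
	}

	ids := make(map[string]bool)
	if len(data) > 0 {
		for _, id := range data[0].ExecIDs {
			ids[id] = true
		}
	}
	return ids, nil
}

// newExecID returns an ID present in after but not in before.
func newExecID(before, after map[string]bool) string {
	for id := range after {
		if !before[id] {
			return id
		}
	}
	return ""
}

func main() {
	before, _ := execIDs("fedora-toolbox-39")
	// ... start the interactive podman exec here ...
	after, _ := execIDs("fedora-toolbox-39")
	fmt.Println(newExecID(before, after))
}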

@debarshiray
Member

I think it probably is possible to get the right pid. Just as an experiment:

Cool. I didn't know that exec IDs are a thing. I think I never tried podman exec --detach until now.

though it looks like Toolbx doesn't use libpod?

When we rewrote Toolbx in Go, we wanted to use libpod and the other backend Go packages directly instead of going through podman(1). However, the Podman folks discouraged us from doing that, so we didn't.

These days, for other reasons, I am considering implementing some small things directly inside toolbox(1) to avoid going through podman(1), where we can rely on a stable interface. I have no idea if this is such a case.

If we can't get the PID inside the container through a Podman interface, do you think the alternative of inserting a shim is sane?

@debarshiray
Member

podman exec --interactive --tty, which is what toolbox enter does, creates a nested pseudo-terminal device and wires it up to the terminal device that the user is interacting with. If the outer terminal goes away, then shouldn't it break the connection and somehow trigger the inner terminal?

@halfline
Author

halfline commented Oct 3, 2023

podman exec --interactive --tty, which is what toolbox enter does, creates a nested pseudo-terminal device and wires it up to the terminal device that the user is interacting with. If the outer terminal goes away, then shouldn't it break the connection and somehow trigger the inner terminal?

Interesting. Right, when the terminal is closed, podman exec would be receiving a hangup signal (SIGHUP), except:

╎❯ echo $$
203855

╎❯ ps -eocomm,pid,ppid,tpgid,sess,pgrp,tty |grep -E 'podman|bash|toolbox|conmon|COMM'
COMMAND             PID    PPID   TPGID    SESS    PGRP TT
bash             203855    3069  207753  203855  203855 pts/18


╎❯ toolbox enter f39
╎❯ ps -eocomm,pid,ppid,tpgid,sess,pgrp,tty |grep -E 'podman|bash|toolbox|conmon|COMMAND|pts.18'
COMMAND             PID    PPID   TPGID    SESS    PGRP TT
bash             203855    3069  207768  203855  203855 ?
toolbox          207768  203855  207768  203855  207768 ?
podman           207888  207768  207768  203855  207768 ?
conmon           207903    1899      -1  207903  207903 ?
bash             207909  207903  207947  207909  207909 pts/3

You can see podman exec is in the toolbox process group and toolbox doesn't have a controlling tty. Neither does conmon. In fact, nothing has /dev/pts/18 as the controlling tty anymore. This means when the terminal is closed, the hangup signal doesn't get sent. I guess maybe something became the foreground process group and then exited? Or maybe there was a TIOCNOTTY ioctl?

Regardless, whatever is splicing input between the two ttys (podman? conmon?) should be getting a hangup in poll, so it should know to tear things down, I guess.
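
For illustration, a minimal sketch (using golang.org/x/sys/unix, not anything from Podman) of what getting a hangup in poll looks like on the outer terminal's file descriptor:

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// waitForHangup blocks until fd reports POLLHUP or POLLERR. poll(2) always
// reports POLLHUP and POLLERR, even when no events are requested.
func waitForHangup(fd int) error {
	for {
		fds := []unix.PollFd{{Fd: int32(fd), Events: 0}}
		if _, err := unix.Poll(fds, -1); err != nil {
			if err == unix.EINTR {
				continue
			}
			return err
		}
		if fds[0].Revents&(unix.POLLHUP|unix.POLLERR) != 0 {
			return nil
		}
	}
}

func main() {
	// e.g. wait for the terminal on stdin to go away.
	if err := waitForHangup(int(os.Stdin.Fd())); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Fprintln(os.Stderr, "terminal hung up; tear down the exec session here")
}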

@halfline
Author

halfline commented Oct 5, 2023

Just browsing around a little, I think the splicing happens here:

func setupStdioChannels(streams *define.AttachStreams, conn *net.UnixConn, detachKeys []byte) (chan error, chan error) {
        receiveStdoutError := make(chan error)
        go func() {
                receiveStdoutError <- redirectResponseToOutputStreams(streams.OutputStream, streams.ErrorStream, streams.AttachOutput, streams.AttachError, conn)
        }()

        stdinDone := make(chan error)
        go func() {
                var err error
                if streams.AttachInput {
                        _, err = util.CopyDetachable(conn, streams.InputStream, detachKeys)
                }
                stdinDone <- err
        }()

        return receiveStdoutError, stdinDone
}

So maybe something like this would fix it
(untested, I don't know Go, etc. etc.)
(EDIT: changed the idea a bit to not slurp up the stdinDone before it can be processed by readStdio)
(EDIT2: changed readStdio to be a method so it has access to the container object, still completely untested)

diff --git a/libpod/oci_conmon_attach_common.go b/libpod/oci_conmon_attach_common.go
index 8d5a7f8..fa1dbad 100644
--- a/libpod/oci_conmon_attach_common.go
+++ b/libpod/oci_conmon_attach_common.go
@@ -101,5 +101,5 @@ func (r *ConmonOCIRuntime) Attach(c *Container, params *AttachOptions) error {
                params.AttachReady <- true
        }
-       return readStdio(conn, params.Streams, receiveStdoutError, stdinDone)
+       return c.readStdio(conn, params.Streams, receiveStdoutError, stdinDone)
 }
 
@@ -193,5 +193,5 @@ func (c *Container) attachToExec(streams *define.AttachStreams, keys *string, se
        }
 
-       return readStdio(conn, streams, receiveStdoutError, stdinDone)
+       return c.readStdio(conn, streams, receiveStdoutError, stdinDone)
 }
 
@@ -288,5 +288,5 @@ func redirectResponseToOutputStreams(outputStream, errorStream io.Writer, writeO
 }
 
-func readStdio(conn *net.UnixConn, streams *define.AttachStreams, receiveStdoutError, stdinDone chan error) error {
+func (c *Container) readStdio(conn *net.UnixConn, streams *define.AttachStreams, receiveStdoutError, stdinDone chan error) error {
        var err error
        select {
@@ -309,4 +309,7 @@ func readStdio(conn *net.UnixConn, streams *define.AttachStreams, receiveStdoutE
                        }
                }
+               if streams.AttachInput {
+                       c.Kill(uint(syscall.SIGHUP))
+               }
                if streams.AttachOutput || streams.AttachError {
                        return <-receiveStdoutError

Successfully merging this pull request may close these issues:

toolbox run command isn't terminated properly when terminal emulator is exited