Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli/plugins: use same pgid + skip signal forwarding when attached to a TTY #4792

Merged
merged 3 commits into from
Jan 16, 2024

Conversation

laurazard
Copy link
Contributor

@laurazard laurazard commented Jan 15, 2024

- What I did

Reverted ef5e5fa and solved the "double notification" issue by only using the socket to notify the plugin process when not attached to a tty/not already going to receive a signal.

- How I did it

- How to verify it

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@laurazard laurazard self-assigned this Jan 15, 2024
@laurazard
Copy link
Contributor Author

@thaJeztah @crazy-max

@laurazard
Copy link
Contributor Author

laurazard commented Jan 15, 2024

Ah I missed DCO on the revert commit 🤦‍♀️ fixed ✅

This reverts commit ef5e5fa.

Running new plugins under a new pgid isn't a viable solution due to
it causing issues with plugin processes attempting to read from the
TTY (see: moby/moby#47073).

Signed-off-by: Laura Brehm <[email protected]>
In order to solve the "double notification" issue (see:
docker@ef5e5fa)
without running the plugin process under a new pgid (see:
moby/moby#47073) we instead check if we're
attached to a TTY, and if so skip signalling the plugin process since it
will already be signalled.

Signed-off-by: Laura Brehm <[email protected]>
@laurazard laurazard force-pushed the alternative-fix-no-socket-notif branch from 3973885 to 5f6c55a Compare January 15, 2024 13:30
@codecov-commenter
Copy link

codecov-commenter commented Jan 15, 2024

Codecov Report

Merging #4792 (508346e) into master (077d07c) will increase coverage by 59.63%.
Report is 6 commits behind head on master.
The diff coverage is 0.00%.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #4792       +/-   ##
===========================================
+ Coverage        0   59.63%   +59.63%     
===========================================
  Files           0      287      +287     
  Lines           0    24777    +24777     
===========================================
+ Hits            0    14775    +14775     
- Misses          0     9114     +9114     
- Partials        0      888      +888     

@thaJeztah thaJeztah added this to the 25.0.0 milestone Jan 15, 2024
@laurazard laurazard marked this pull request as draft January 15, 2024 13:34
Copy link
Member

@crazy-max crazy-max left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Tested with:

$ ./build/docker-linux-amd64 version
Client:
 Version:           24.0.0-rc.2-628-g5f6c55a724.m
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        5f6c55a724
 Built:             Mon Jan 15 13:41:28 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Desktop
 Engine:
  Version:          25.0.0-rc.1
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.5
  Git commit:       9cebefa
  Built:            Thu Jan  4 16:30:11 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.26
  GitCommit:        3dd1e886e55dd695541fdcd67420c2888645a495
 runc:
  Version:          1.1.10
  GitCommit:        v1.1.10-0-g18a0cb0
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
BUILDX_EXPERIMENTAL=1 ./build/docker-linux-amd64 build .
#0 building with "default" instance using docker driver

#1 [internal] connecting to local controller
#1 DONE 0.0s

#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 4.90kB 0.0s done
#2 DONE 0.0s

#3 resolve image config for docker.io/docker/dockerfile:1
#3 DONE 0.5s

#4 docker-image://docker.io/docker/dockerfile:1@sha256:ac85f380a63b13dfcefa89046420e1781752bab202122f8f50032edf31be0021
#4 CACHED

...

@laurazard
Copy link
Contributor Author

I was testing with Compose and noticed that we'd lost the desired "notify plugin to exit via socket if detached" behavior and mostly sure I nailed it down to

func accept(listener *net.UnixListener, conn **net.UnixConn) {
defer listener.Close()
go func() {
for {
// ignore error here, if we failed to accept a connection,
// conn is nil and we fallback to previous behavior
*conn, _ = listener.AcceptUnix()
}
}()
}

specifically, the defer listener.Close() line since it gets called before all those gouroutine executions.

With the defer listener.Close() line and logging the error:

func accept(listener *net.UnixListener, conn **net.UnixConn) {
	defer listener.Close()

	go func() {
		for {
			// ignore error here, if we failed to accept a connection,
			// conn is nil and we fallback to previous behavior
			newConn, err := listener.AcceptUnix()
			*conn = newConn
			if err != nil {
				println(err.Error())
			}
			// perform any platform-specific actions on accept (e.g. unlink non-abstract sockets)
			onAccept(*conn, listener)
		}
	}()
}
accept unix @docker_cli_1c8d52df-4d23-4c5a-9a1a-6a91b798da88: use of closed network connection
accept unix @docker_cli_1c8d52df-4d23-4c5a-9a1a-6a91b798da88: use of closed network connection
accept unix @docker_cli_1c8d52df-4d23-4c5a-9a1a-6a91b798da88: use of closed network connection
accept unix @docker_cli_1c8d52df-4d23-4c5a-9a1a-6a91b798da88: use of closed network connection
accept unix @docker_cli_1c8d52df-4d23-4c5a-9a1a-6a91b798da88: use of closed network connection
accept unix @docker_cli_1c8d52df-4d23-4c5a-9a1a-6a91b798da88: use of closed network connection
accept unix @docker_cli_1c8d52df-4d23-4c5a-9a1a-6a91b798da88: use of closed network connection
accept unix @docker_cli_1c8d52df-4d23-4c5a-9a1a-6a91b798da88: use of closed network connection
accept unix @docker_cli_1c8d52df-4d23-4c5a-9a1a-6a91b798da88: use of closed network connection
...

Looks like during the refactor we moved that defer and now it gets called too early. I'll move it back.

@laurazard laurazard marked this pull request as ready for review January 15, 2024 14:33
}

accept(listener, conn)

return EnvKey + "=" + listener.Addr().String(), nil
closeListener := func() {
listener.Close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps take the opportunity to make linters and IDEs "happy";

Suggested change
listener.Close()
_ = listener.Close()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! My linters don't complain about this, I wonder if it's Goland (maybe we can add something to .golangci.yml so that CI/people not using Goland would pick it up too). Btw, does it also complain about this in a defer?

cli-plugins/socket/socket.go Outdated Show resolved Hide resolved
@laurazard
Copy link
Contributor Author

Also tested with Compose v2.24.0 – both "attached" to a TTY and not-attached seem to be working correctly ✅

@laurazard laurazard force-pushed the alternative-fix-no-socket-notif branch from dcab77e to bec75eb Compare January 15, 2024 15:03
@laurazard
Copy link
Contributor Author

amended with comments, can you TAL @thaJeztah?

Copy link
Member

@neersighted neersighted left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on the defer 😅

I'm not entirely convinced this is the right approach: now, each plugin will have to implement the 'graceful termination' behavior; which is a thing we can do/build into the cli-plugins package, but still, I'm not sure having two copies of that logic that run conditionally really makes sense.

It feels doubly dubious when the logic built into the parent CLI will almost never trigger in a way that is useful (given that it won't trigger when a TTY is present).

I do feel like the pgid change is a beneficial one, unless there is a definite reason that plugins must be able to read from the controlling TTY (they will still be able to read from a non-TTY stdin like a file).

cli-plugins/socket/socket.go Outdated Show resolved Hide resolved
@laurazard
Copy link
Contributor Author

That's a valid thought (and one I had while discussing this). The impetus to make this change, as is, now, was:

  1. The current state is not great (BUILDX_EXPERIMENTAL is completely broken, and users will/have noticed it)
  2. Simply reverting the pgid commit and shipping is a worse solution as it will cause docker compose up to break, as it will receive double SIGINTs on every CTRL+C so when the user tries to teardown we will always force-kill and not let the containers exit, which imo is even worse than having Buildx broken when using the experimental env var
  3. We want to get a change out quickly so we can make another RC

The other way we could explore this would be to build functionality into the CLI plugins API so that plugins can use that instead of reading from TTY – however, that's not trivial, should be well-explored, and not something we rush out to get a fix in. Since that's not an option for right now, and reverting the pgid changes without another patch (such as this one) leads to an even worse situation for right now, I think the current approach (or something similar) is the only solution for right now – even if we then take some time and figure out that functionality later so we can ultimatively move over for that.

Moreover, re: implementing an API for this in the CLI, we do need to think that all of these new features will be gated behind an "if running through the CLI" (for Buildx at least) and the plugin will still have to read directly from the TTY when running standalone, so those new features will have to be thoroughly thought out/not something we can do out of the gate.

@laurazard laurazard force-pushed the alternative-fix-no-socket-notif branch from bec75eb to 508346e Compare January 15, 2024 15:49
@laurazard
Copy link
Contributor Author

laurazard commented Jan 15, 2024

@neersighted we should decide on this to unblock v25 (@thaJeztah's probably waiting on it to make another RC 😅) – we either leave it as-is and own that we'll break buildx, or merge this change now and work on getting the pgid changes in later (together with an API for plugins to read from TTY).

@neersighted
Copy link
Member

To me I think it depends on how critical reading from the TTY is to the buildx folks; I'm okay with a fast revert if we think it's necessary to buildx (for now). However, if this is not a critical behavior and buildx could be changed, I'd suggest that it's better to adopt the pgid change now to prevent future issues of this nature (and I'd suggest that the regression with older buildx is fine given it's under _EXPERIMENTAL).

@thaJeztah
Copy link
Member

I think we all agree that this is not ideal, but the options we currently have limit our choices;

  1. keep the pgid patch; BUILDX_EXPERIMENTAL=1 is completely broken. While it's "experimental" for sure, it's also something that looks to have been around since buildx v0.11 (added in monitor: Enable to run build and invoke in background buildx#1296), and breaking that feature means that it'll be broken in docker (but still works with other tools, such as nerdctl). Of course ideally, we'd fix this in buildx, but I don't know what the timeline will be to have such a fix in a released version of buildx.
  2. revert the pgid patch; docker compose will be very much broken (terminating immediately on CTRL-C), which is pretty bad as well. We could possibly apply some fixes to compose, but that would have to be looked into.
  3. use the conditional code from this PR; it's not ideal, and we should continue looking at fixing buildx so that we can re-apply the pgid improvement, but AFIU, at least it would make both buildx and compose "work" (and signal handling improved compared to "pre v25").

Our of options above, I'm leaning towards 3

(LOL, and after writing... my list of options is ... well.. basically fully the same as the list @laurazard posted in #4792 (comment) above)

@docker/runtime-team any thoughts?

@krissetto
Copy link
Contributor

LGTM. Considering the options available, I'd also lean towards n.3, using the conditional code from this PR.

Shall we also open an issue to track this "suboptimal" solution so we don't forget about looking into it for too long? Or do we track these things somewhere else?

@neersighted
Copy link
Member

neersighted commented Jan 16, 2024

I'm really not sure how nerdctl factors in, as it has a completely separate implementation that wraps buildctl: https://github.com/containerd/nerdctl/blob/main/pkg/buildkitutil/buildkitutil.go

In any case, it seems like option 3 is the consensus (though I'd still really like to hear from the buildx folks WRT why the I/O is needed; cc @crazy-max); while with the compromise here my first thought is we can remove the 3-signal-to-exit feature from the host CLI (as it's no longer doing much good), if we plan to resolve the signaling/IO issues in the near future such that the functionality can primarily live in the host CLI, it likely makes sense to take this as is.

@crazy-max
Copy link
Member

@neersighted BUILDX_EXPERIMENTAL runs controller build instead of the basic one: https://github.com/docker/buildx/blob/78adfc80a90ede124ae30457dc0f3c2f688f5121/commands/build.go#L293

There is some pipe handling when detach mode is enabled to handle debugging but in the case of this issue it will use the local one: "connecting to local controller".

This might be related to https://github.com/docker/buildx/blob/78adfc80a90ede124ae30457dc0f3c2f688f5121/controller/processes/processes.go#L101

I will take a look. @ktock @jedevc if you have an idea in any case 🙏

Copy link
Member

@neersighted neersighted left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like an okay band-aid to me

@thaJeztah
Copy link
Member

Let me post a quick update on discussions we had;

  • Yes, this PR isn't ideal
  • It looks like work is done in buildx that may help with this issue (runControllerBuild: Read from stdin only when needed buildx#2196)
  • But that work has not been reviewed / tested / shipped yet, and even when released, users may not be using "latest" buildx immediately
  • We don't consider the current patch to be the "final" solution, but it works "for now", and we can ship an improved version in a (patch) release, as it's not immediately user-facing (just improved implementation)

I'll merge this PR to unblock the v25.0.0 release, but we'll look at follow-ups 👍

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thaJeztah thaJeztah merged commit 350d6bc into docker:master Jan 16, 2024
77 checks passed
laurazard added a commit to laurazard/compose that referenced this pull request Jan 17, 2024
The changes in
docker@dcbf005
fixed the "cancellable context" detection, and made it so that Compose
would conditionally set up signal handling when the context was already
not cancellable/when the plugin was running through the CLI, as we'd
introduced a mechanism into the CLI to signal plugins to exit through a
socket instead of handling signals themselves.

This had some (not noticed at the time) issues when running through the
CLI as, due to sharing a process group id with the parent CLI process,
when a user CTRL-Cs the CLI will notify the plugin via the socket but
the plugin process itself will also be signalled if attached to the TTY.
This impacted some Compose commands that don't set up signal handling -
so not `compose up`, but other commands would immediately quit instead
of getting some "graceful" cancelled output.

We initially attempted to address this "double notification" issue in
the CLI by executing plugins under a new pgid so that they wouldn't be
signalled, but that posed an issue with Buildx reading from the TTY,
(see: moby/moby#47073) so we reverted the
process group id changes and ended at a temporary solution in
docker/cli#4792 where the CLI will only notify
plugins via the socket when they are not already going to be signalled
(when attached to a TTY).

Due to this, plugins should always set up some signal handling, which
this commit implements.

Signed-off-by: Laura Brehm <[email protected]>
laurazard added a commit to laurazard/compose that referenced this pull request Jan 17, 2024
The changes in
docker@dcbf005
fixed the "cancellable context" detection, and made it so that Compose
would conditionally set up signal handling when the context was already
not cancellable/when the plugin was running through the CLI, as we'd
introduced a mechanism into the CLI to signal plugins to exit through a
socket instead of handling signals themselves.

This had some (not noticed at the time) issues when running through the
CLI as, due to sharing a process group id with the parent CLI process,
when a user CTRL-Cs the CLI will notify the plugin via the socket but
the plugin process itself will also be signalled if attached to the TTY.
This impacted some Compose commands that don't set up signal handling -
so not `compose up`, but other commands would immediately quit instead
of getting some "graceful" cancelled output.

We initially attempted to address this "double notification" issue in
the CLI by executing plugins under a new pgid so that they wouldn't be
signalled, but that posed an issue with Buildx reading from the TTY,
(see: moby/moby#47073) so we reverted the
process group id changes and ended at a temporary solution in
docker/cli#4792 where the CLI will only notify
plugins via the socket when they are not already going to be signalled
(when attached to a TTY).

Due to this, plugins should always set up some signal handling, which
this commit implements.

Signed-off-by: Laura Brehm <[email protected]>
@tonistiigi
Copy link
Member

revert the pgid patch; docker compose will be very much broken (terminating immediately on CTRL-C), which is pretty bad as well. We could possibly apply some fixes to compose, but that would have to be looked into.

@thaJeztah What is this issue and how is that a regression? I assume if compose "terminates immediately", then it can only be if it is lacking support for handling signals.

@@ -240,16 +241,17 @@ func tryPluginRun(dockerCli command.Cli, cmd *cobra.Command, subcommand string,
// we send a SIGKILL to the plugin process and exit
go func() {
retries := 0
for s := range signals {
for range signals {
if dockerCli.Out().IsTerminal() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if dockerCli.Out() is ideal here, eg. for a command like build -o type=tar . > t.tar . Although afaics, the actual behavior change from this line is minimal.

@laurazard
Copy link
Contributor Author

What is this issue and how is that a regression? I assume if compose "terminates immediately", then it can only be if it is lacking support for handling signals.

No, that wasn't the issue. The issue is well-described in the linked commit in the PR description – ef5e5fa

For convenience:

Changes were made in 1554ac3 to provide
a mechanism for the CLI to notify running plugin processes that they
should exit, in order to improve the general CLI/plugin UX. The current
implementation boils down to:

  1. The CLI creates a socket
  2. The CLI executes the plugin
  3. The plugin connects to the socket
  4. (When) the CLI receives a termination signal, it uses the socket to
    notify the plugin that it should exit
  5. The plugin's gets notified via the socket, and cancels it's cmd.Context,
    which then gets handled appropriately

This change works in most cases and fixes the issue it sets out to solve
(see: docker/compose#11292) however, in the case
where the user has a TTY attached and the plugin is not already handling
received signals, steps 4+ changes:
4. (When) the CLI receives a termination signal, before it can use the
socket to notify the plugin that it should exit, the plugin process
also receives a signal due to sharing the pgid with the CLI

Since Compose up works like: 1st CTRL-C - start "graceful" teardown / 2nd CTRL-C - forceful teardown/exit, the double notification (via the plugin socket, and the process itself getting signalled) resulted in every time the user hits CTRL-C, Compose receives two "cancel signals" and skips the graceful teardown directly to the forceful exit.

That same "double notification" issue was what the process group id changes tried to address (fix the double notification by running the plugin process under a new process group id, so it doesn't also get signalled) and this PR solves it with a different approach (by not notifying via the socket when the plugin is attached to a TTY and hence already going to receive a signal).

@tonistiigi
Copy link
Member

By "double notification" issue, do you mean this line https://github.com/docker/cli/pull/4792/files#diff-3dcd6325b70a5f5db393a3ecc08060b69007fcef371fb4dabf3164da95646fb6L252 that has now been reverted by this PR and does not exist in v24. There is no "double notification" issue in v24.

I think what makes sense is to leave everything as was in v24 (plugins are in same processgroup, cli does not duplicate its signals). Additionally, this unix socket can be used to send custom termination request. Handling that request is optional and it doesn't cause any conflict for the plugin as when a plugin decides to support the socket, it can make sure that if it does its own signal counting logic, the new socket mechanism does not cause it to go out of sync. I believe this is mostly the current state after this PR, ignoring some isTerminal checks. From the previous comment, it seemed that there was some issue between this PR and compose, but if not then all good from my side.

@laurazard
Copy link
Contributor Author

Ah, no. The previous comments were discussing whether to "simply revert the pgid patch" (which would have reintroduced that issue with Compose) or to go with this change, which reverts that but also adds the isTerminal logic.

temenuzhka-thede pushed a commit to temenuzhka-thede/compose that referenced this pull request Sep 17, 2024
The changes in
docker@dcbf005
fixed the "cancellable context" detection, and made it so that Compose
would conditionally set up signal handling when the context was already
not cancellable/when the plugin was running through the CLI, as we'd
introduced a mechanism into the CLI to signal plugins to exit through a
socket instead of handling signals themselves.

This had some (not noticed at the time) issues when running through the
CLI as, due to sharing a process group id with the parent CLI process,
when a user CTRL-Cs the CLI will notify the plugin via the socket but
the plugin process itself will also be signalled if attached to the TTY.
This impacted some Compose commands that don't set up signal handling -
so not `compose up`, but other commands would immediately quit instead
of getting some "graceful" cancelled output.

We initially attempted to address this "double notification" issue in
the CLI by executing plugins under a new pgid so that they wouldn't be
signalled, but that posed an issue with Buildx reading from the TTY,
(see: moby/moby#47073) so we reverted the
process group id changes and ended at a temporary solution in
docker/cli#4792 where the CLI will only notify
plugins via the socket when they are not already going to be signalled
(when attached to a TTY).

Due to this, plugins should always set up some signal handling, which
this commit implements.

Signed-off-by: Laura Brehm <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants