Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clisqlshell: handle interactive query cancellation #76437

Merged
merged 1 commit into from
Feb 12, 2022

Conversation

knz
Copy link
Contributor

@knz knz commented Feb 11, 2022

Fixes #76433.

As of this PR, there's a bug in lib/pq which forces the session to
terminate when any query gets cancelled. We find this unacceptable
and we plan to fix it later.

Release note (cli change): The interactive SQL shell (cockroach sql,
cockroach demo) now supports interrupting a currently running
query with Ctrl+C, without losing access to the shell.

@knz knz requested review from rafiss and otan February 11, 2022 16:14
@knz knz requested review from a team as code owners February 11, 2022 16:14
@knz knz requested review from a team and adityamaru and removed request for a team February 11, 2022 16:14
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz
Copy link
Contributor Author

knz commented Feb 11, 2022

@rafiss could you download this branch and try it out?

I'm seeing the following annoying behavior: the cancel is processed by the server (the client receives the cancel error, as expected), but the server also immediately closes the connection. I think that's either a bug in the handling of the cancel request server-side, or lib/pq getting confused by its own context cancel support? But I'm not sure. any ideas how to troubleshoot this?

@knz knz force-pushed the 20220211-sql-cancel-shell branch from 8951c0d to 9f8582f Compare February 11, 2022 16:29
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you download this branch and try it out?

trying now

the cancel is processed by the server (the client receives the cancel error, as expected), but the server also immediately closes the connection.

do you mean the server immediately closes the original SQL connection, or immediately closes the new connection made to send the cancel request?

I think that's either a bug in the handling of the cancel request server-side, or lib/pq getting confused by its own context cancel support?

I did hit some annoyance with how lib/pq supports cancels when testing. Specifically, this block: https://github.com/lib/pq/blob/8446d16b8935fdf2b5c0fe333538ac395e3e1e4b/conn_go18.go#L177-L181

If I understand correctly, that means lib/pq requires the server to close the connection used to send the cancel request

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @otan)

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean the server immediately closes the original SQL connection, or immediately closes the new connection made to send the cancel request?

We do not have control over the new connection, that's deep inside lib/pq, and that is essentially invisible here.
It's the original SQL connection that gets closed: after lib/pq returns from running the canceled query, the sql conn object is not available any more.

If I understand correctly, that means lib/pq requires the server to close the connection used to send the cancel request

Yes, that's what psql does too. We're not concerned about this.

We're focusin on the original SQL conn, and this should remain open. Unfortunately, it does not. I don't know why.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @otan)

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I captured the traffic using Wireshark. > denotes client-to-server traffic and < denotes server-to-client traffic.

    1   0.000000    127.0.0.1 → 127.0.0.1    PGSQL 81 >Q       # client sends pg_sleep(5)
    2   1.116952    127.0.0.1 → 127.0.0.1    PGSQL 72 >        # client sends cancel request
    3   1.117599    127.0.0.1 → 127.0.0.1    PGSQL 165 <T/E    # server sends "RowDescription" followed by "query execution canceled" ErrorResponse
    4   1.117613    127.0.0.1 → 127.0.0.1    PGSQL 62 <Z       # server sends ReadyForQuery message
    5   1.117684    127.0.0.1 → 127.0.0.1    PGSQL 61 >X       # client sends Termination message
    6   1.118599    127.0.0.1 → 127.0.0.1    PGSQL 169 >       # client sends StartupMessage to begin a new session

So either the CLI or lib/pq itself is sending a Termination message. ... After a bit of searching, I think it probably is lib/pq https://github.com/lib/pq/blob/8446d16b8935fdf2b5c0fe333538ac395e3e1e4b/conn_go18.go#L134

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @otan)

@rafiss
Copy link
Collaborator

rafiss commented Feb 11, 2022

That change was made as part of lib/pq#1000

@knz
Copy link
Contributor Author

knz commented Feb 11, 2022

That change was made as part of lib/pq#1000

Ouch, that's bad right? We really want the cancellation to leave the session open, with all its session state. So this means that lib/pq is hosed for us? No chance but to switch to pgx then?

@rafiss
Copy link
Collaborator

rafiss commented Feb 11, 2022

Yeah I don't agree with that lib/pq behavior... We could try patching it, but I guess it would be a pretty big compat change for lib/pq users who rely on that.

Using pgx should work as we want. Also relevant, it look like pgx does not send a pgwire CancelRequest when the context is done, so we'd have to do that ourselves if we want to start relying on context deadlines.

@knz
Copy link
Contributor Author

knz commented Feb 11, 2022

it look like pgx does not send a pgwire CancelRequest when the context is done

I think it does but it's very weird how it does it:

  • there's a "context watcher" that observes context cancellation
  • when a context cancel is detected, the onCancel callback on that watcher is called
  • pgconn.go set this to a callback that sets a deadline very far in the past
  • also, all the exec functions have a defer asyncClose() which does send a pg cancel query message.

So what I think happens is that if you cancel the context, the code sets a timeout far in the past, expects the go runtime to see that the I/O read (waiting for the query result) will never succeed, fail with a context deadline, then the deferred cancellation runs in the main goroutine.

This is a clever way to handle context cancellation without spawning a new goroutine every time, and while avoiding the ordering problem that motivated the (misguided) change in lib/pq.

So yeah, we could perhaps use pgx, even though it doesn't fill me with joy (it's a larger refactor!)
I wonder if we should not fork lib/pq in the interim.

@rafiss
Copy link
Collaborator

rafiss commented Feb 11, 2022

Do you think a patch to lib/pq to expose func (cn *conn) cancel publicly would be valuable? It wouldn't solve our problem for context cancellations, but it would make this Ctrl+C feature work.

@knz
Copy link
Contributor Author

knz commented Feb 11, 2022

Do you think a patch to lib/pq to expose func (cn *conn) cancel publicly would be valuable?

yes, even for the broader community of folk using lib/pq.

It wouldn't solve our problem for context cancellations, but it would make this Ctrl+C feature work.

Well it'd make it work by changing the code to have us peek inside the lib/pq API, instead of using a context cancel. That somewhat breaks the goal of the clisqlclient.Conn abstraction, to make that code independent of the SQL driver. Or maybe it's inside the clisqlclient.Conn.Exec/Query methods that we should watch the context cancel, and call pq.Cancel?

@knz
Copy link
Contributor Author

knz commented Feb 11, 2022

What about this:

  • we accept that in this PR, the session also gets trashed. It's temporarily somewhat OK, because our shell auto-reconnects as needed.
  • I add the tests that check that queries get canceled (since that works already)
  • we do some experiments separately, to compare the lib/pq approach vs a refactor to use pgx (Which we had wanted to do anyway), so see which one gets us to our goal. Then we make a separate PR for this.

WDYT?

@rafiss
Copy link
Collaborator

rafiss commented Feb 11, 2022

Your plan sounds good to me! Maybe we could even make the shell expect the current lib/pq behavior, so it doesn't show the warning: connection lost! message to the user, and instead just auto-reconnects.

@knz
Copy link
Contributor Author

knz commented Feb 11, 2022

Maybe we could even make the shell expect the current lib/pq behavior, so it doesn't show the warning: connection lost! message to the user, and instead just auto-reconnects.

Nah that's a dark pattern: the state of the session is lost, including any session var customizations. We don't want to hide that.

Also we want to fix it anyway, so why add some polish for a behavior we're going to not have in the next PR...

@knz
Copy link
Contributor Author

knz commented Feb 11, 2022

Now with tests. RFAL

@knz knz force-pushed the 20220211-sql-cancel-shell branch 2 times, most recently from 5ca0e1a to 19a4e65 Compare February 11, 2022 20:41
As of this PR, there's a bug in lib/pq which forces the session to
terminate when any query gets cancelled. We find this unacceptable
and we plan to fix it later.

Release note (cli change): The interactive SQL shell (`cockroach sql`,
`cockroach demo`) now supports interrupting a currently running
query with Ctrl+C, without losing access to the shell.
@knz knz force-pushed the 20220211-sql-cancel-shell branch from 19a4e65 to afc0786 Compare February 11, 2022 20:44
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @knz, and @otan)


pkg/cli/clisqlshell/sql.go, line 2055 at r7 (raw file):

	// The cancellation function can be used by the Ctrl+C handler
	// to cancel this query.
	ctx, cancel := context.WithCancel(context.Background())

[coming more from a lack of understanding, rather than giving feedback]: why do we need a new context each time we run a query here? if we were to use the same cancellable context to run all the queries, then wouldn't we be able to avoid having to set/unset cancelFn all the time?

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @otan, and @rafiss)


pkg/cli/clisqlshell/sql.go, line 2055 at r7 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

[coming more from a lack of understanding, rather than giving feedback]: why do we need a new context each time we run a query here? if we were to use the same cancellable context to run all the queries, then wouldn't we be able to avoid having to set/unset cancelFn all the time?

a context can only be cancelled once. If we reused the same, after the 1st query cancelled we wouldn't be able to cancel any more query after that (in fact, no further query would ever run since the context would be cancelled already)

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @knz, and @otan)


pkg/cli/clisqlshell/sql.go, line 2055 at r7 (raw file):

Previously, knz (kena) wrote…

a context can only be cancelled once. If we reused the same, after the 1st query cancelled we wouldn't be able to cancel any more query after that (in fact, no further query would ever run since the context would be cancelled already)

right, makes sense!

@knz
Copy link
Contributor Author

knz commented Feb 11, 2022

TFYR!

bors r=rafiss

@craig
Copy link
Contributor

craig bot commented Feb 12, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cli/sql: support query cancellation in the SQL shell
3 participants