Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flowinfra: Do not inhibit server shutdown #82765

Closed
miretskiy opened this issue Jun 10, 2022 · 9 comments · Fixed by #82752
Closed

flowinfra: Do not inhibit server shutdown #82765

miretskiy opened this issue Jun 10, 2022 · 9 comments · Fixed by #82752
Assignees
Labels
A-sql-execution Relating to SQL execution. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team

Comments

@miretskiy
Copy link
Contributor

miretskiy commented Jun 10, 2022

A long running distSQL flow may inhibit server shutdown.

It appears that any bulk-io or changefeed processors running on the node at the time it
is being drained, may inhibit server shutdown.

Changefeed/builk-y processors are interesting in that they usually have 1 coordinator, and N processors that send updates
to the coordinator (changeFrontier and changeAggregator processors in changefeed case).
Furthermore, those flows are not usually tied to the "user" connection -- they are started by the jobs system.

Those leaf processors have no incoming inputs, and thus are not registered with the flow registry.
When the node is being drained, the registry is drained. However, only flows that have incoming inputs are
ever registered with the flow registry, and only those get cancelled when the stopper quiesces.
https://github.com/cockroachdb/cockroach/blob/master/pkg/server/drain.go#L361-L361
https://github.com/cockroachdb/cockroach/blob/master/pkg/server/drain.go#L361-L361
https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/flowinfra/flow_registry.go#L466-L466

As a result, when the node holding leaf processor (changeAggregator) is being shutdown, the processor does not
get any cancellation signal, and therefore, if the processor also uses stopper, as is the case with changefeedProcessor
https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/changefeedccl/changefeed_processors.go#L332, then
the stopper will wait for the "task" to finish (https://github.com/cockroachdb/cockroach/blob/master/pkg/util/stop/stopper.go#L615-L615), but this will never happen.

As a workaround, changefeed will arrange for its context to be cancelled when server quiesces; however, distflow infra should ensure that flows are cancelled in general and not rely on processor implementation to arrange for that.

Jira issue: CRDB-16647

@miretskiy miretskiy added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-execution Relating to SQL execution. T-sql-execution labels Jun 10, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jun 10, 2022
Full fix tracked in cockroachdb#82765

This test demonstrates that running changfeeds inhibit node shutdown.
The issue is on the dist flow side.  This is a temporary fix.
The test itself takes more than 1 minute to run.  This is the reason
why it's being pushed as a proof of concept PR.
However, it does demonstrate the issue; and that the context
cancellation upon node shutdown added in change aggregator fixes the
issue.

Once we determine the cause of this test taking so much time, then
it can also be added.

Release Notes: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jun 10, 2022
Informs cockroachdb#82765

Permanet fix is being tracked by the above issue.
This is a temporary fix to ensure that change aggregators
cancel their context when the server quiesces so that the
server shutdown is not inhibited by the running changefeeds.

The test for this functionality is not being merged due
to the fact that it takes too long to run; however, the test
can be seen here: cockroachdb#82767

Release Notes (enterprise change): Ensure running changefeeds do
not inhibit node shutdown.
@miretskiy miretskiy changed the title distflow: Do not inhibit server shutdown flowinfra: Do not inhibit server shutdown Jun 10, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jun 13, 2022
Informs cockroachdb#82765

Permanet fix is being tracked by the above issue.
This is a temporary fix to ensure that change aggregators
cancel their context when the server quiesces so that the
server shutdown is not inhibited by the running changefeeds.

The test for this functionality is not being merged due
to the fact that it takes too long to run; however, the test
can be seen here: cockroachdb#82767

Release Notes (bug fix): Ensure running changefeeds do
not inhibit node shutdown.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jun 13, 2022
Informs cockroachdb#82765

Permanet fix is being tracked by the above issue.
This is a temporary fix to ensure that change aggregators
cancel their context when the server quiesces so that the
server shutdown is not inhibited by the running changefeeds.

The test for this functionality is not being merged due
to the fact that it takes too long to run; however, the test
can be seen here: cockroachdb#82767

Release Notes (bug fix): Ensure running changefeeds do
not inhibit node shutdown.
craig bot pushed a commit that referenced this issue Jun 14, 2022
82489: sql: disallow adding column as primary key r=jasonmchan a=jasonmchan

Previously, the behavior of ALTER TABLE ... ADD COLUMN ... PRIMARY KEY
was undefined. With the legacy schema changer, this statement would
appear to succeed, but it would break the new schema changer. This
commit changes this by returning an explicit error.

In the future, we should support this statement if the table's primary
key was originally the default rowid primary key (see #82735).

Fixes #82489

Release note: None

82768: changefeedccl: Do not inhibit node shutdown r=miretskiy a=miretskiy

See individual commits for details:
  * Ensure running changefeed do not inhibit node shutdown (Informs #82765)
  * Treat node unavailable error as retryable.

Test not being merged as part of this PR -- see #82767

Release Notes (bug fix): Ensure running changefeeds do
not inhibit node shutdown.
Release Notes (bug fix): Treat node unavailable error as a retryable changefeed error.

Co-authored-by: Jason Chan <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jun 14, 2022
Informs cockroachdb#82765

Permanet fix is being tracked by the above issue.
This is a temporary fix to ensure that change aggregators
cancel their context when the server quiesces so that the
server shutdown is not inhibited by the running changefeeds.

The test for this functionality is not being merged due
to the fact that it takes too long to run; however, the test
can be seen here: cockroachdb#82767

Release Notes (bug fix): Ensure running changefeeds do
not inhibit node shutdown.
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jun 22, 2022
miretskiy pushed a commit that referenced this issue Jul 5, 2022
Informs #82765

Permanet fix is being tracked by the above issue.
This is a temporary fix to ensure that change aggregators
cancel their context when the server quiesces so that the
server shutdown is not inhibited by the running changefeeds.

The test for this functionality is not being merged due
to the fact that it takes too long to run; however, the test
can be seen here: #82767

Release Notes (bug fix): Ensure running changefeeds do
not inhibit node shutdown.
@rytaft
Copy link
Collaborator

rytaft commented Jul 5, 2022

cc @vy-ton to discuss how this affects server shutdown process

@rytaft
Copy link
Collaborator

rytaft commented Jul 5, 2022

We think that we should cancel all queries -- there is an open question about whether we should increase the grace period from 10s to, say, 60s?

@vy-ton
Copy link
Contributor

vy-ton commented Jul 6, 2022

This was identified as a longstanding bug where long-running queries/jobs can prevent server shutdown because they are not cancelled after the query_wait grace period. A fix for changefeeds was merged after a user experienced the issue.

Queries would like to cancel all queries but is concerned about the backwards compatibility to server shutdown. Theoretically, it's better to see some query errors than to prevent server shutdown forever. Queries was thinking of only making this change for 22.2+. @mwang1026 curious about your Server opinion?

@mwang1026
Copy link

what types of queries were forever running? seems like an anti-pattern / not intended?

@vy-ton
Copy link
Contributor

vy-ton commented Jul 6, 2022

It appears that any bulk-io or changefeed processors running on the node at the time it
is being drained, may inhibit server shutdown.

Not just user queries but long-running jobs. So we would cancel an IMPORT as an example

@yuzefovich
Copy link
Member

Do I have a product approval for proceeding with cancellation of all long-running queries on a node when the node is being drained? @vy-ton @mwang1026

@mwang1026
Copy link

I don't know about the specifics and defer to @stevendanna and @miretskiy on the behavior here.

Specifically, what does "cancellation" mean here? And how does that impact the various "long running queries" such as changefeeds and import and backup and restore?

cc @livlobo @amruss

@yuzefovich
Copy link
Member

Specifically, what does "cancellation" mean here?

The distributed plan will result in an error propagated to the "client". I think in case of changefeeds, the "client" is one of the CRDB nodes and is smart enough to recognize such errors, and the new distributed plan will be setup (without using the node that is being drained) transparently to the user, not sure about BulkIO things.

@yuzefovich
Copy link
Member

We will proceed with this cancellation on 22.2 but introduce a cluster setting as an escape hatch to revert to the previous behavior should we need it.

@craig craig bot closed this as completed in 2087103 Aug 5, 2022
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-execution Relating to SQL execution. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants