flowinfra: Do not inhibit server shutdown #82765

miretskiy · 2022-06-10T23:10:48Z

A long running distSQL flow may inhibit server shutdown.

It appears that any bulk-io or changefeed processors running on the node at the time it
is being drained, may inhibit server shutdown.

Changefeed/builk-y processors are interesting in that they usually have 1 coordinator, and N processors that send updates
to the coordinator (changeFrontier and changeAggregator processors in changefeed case).
Furthermore, those flows are not usually tied to the "user" connection -- they are started by the jobs system.

Those leaf processors have no incoming inputs, and thus are not registered with the flow registry.
When the node is being drained, the registry is drained. However, only flows that have incoming inputs are
ever registered with the flow registry, and only those get cancelled when the stopper quiesces.
https://github.com/cockroachdb/cockroach/blob/master/pkg/server/drain.go#L361-L361
https://github.com/cockroachdb/cockroach/blob/master/pkg/server/drain.go#L361-L361
https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/flowinfra/flow_registry.go#L466-L466

As a result, when the node holding leaf processor (changeAggregator) is being shutdown, the processor does not
get any cancellation signal, and therefore, if the processor also uses stopper, as is the case with changefeedProcessor
https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/changefeedccl/changefeed_processors.go#L332, then
the stopper will wait for the "task" to finish (https://github.com/cockroachdb/cockroach/blob/master/pkg/util/stop/stopper.go#L615-L615), but this will never happen.

As a workaround, changefeed will arrange for its context to be cancelled when server quiesces; however, distflow infra should ensure that flows are cancelled in general and not rely on processor implementation to arrange for that.

Jira issue: CRDB-16647

Full fix tracked in cockroachdb#82765 This test demonstrates that running changfeeds inhibit node shutdown. The issue is on the dist flow side. This is a temporary fix. The test itself takes more than 1 minute to run. This is the reason why it's being pushed as a proof of concept PR. However, it does demonstrate the issue; and that the context cancellation upon node shutdown added in change aggregator fixes the issue. Once we determine the cause of this test taking so much time, then it can also be added. Release Notes: None

Informs cockroachdb#82765 Permanet fix is being tracked by the above issue. This is a temporary fix to ensure that change aggregators cancel their context when the server quiesces so that the server shutdown is not inhibited by the running changefeeds. The test for this functionality is not being merged due to the fact that it takes too long to run; however, the test can be seen here: cockroachdb#82767 Release Notes (enterprise change): Ensure running changefeeds do not inhibit node shutdown.

Informs cockroachdb#82765 Permanet fix is being tracked by the above issue. This is a temporary fix to ensure that change aggregators cancel their context when the server quiesces so that the server shutdown is not inhibited by the running changefeeds. The test for this functionality is not being merged due to the fact that it takes too long to run; however, the test can be seen here: cockroachdb#82767 Release Notes (bug fix): Ensure running changefeeds do not inhibit node shutdown.

82489: sql: disallow adding column as primary key r=jasonmchan a=jasonmchan Previously, the behavior of ALTER TABLE ... ADD COLUMN ... PRIMARY KEY was undefined. With the legacy schema changer, this statement would appear to succeed, but it would break the new schema changer. This commit changes this by returning an explicit error. In the future, we should support this statement if the table's primary key was originally the default rowid primary key (see #82735). Fixes #82489 Release note: None 82768: changefeedccl: Do not inhibit node shutdown r=miretskiy a=miretskiy See individual commits for details: * Ensure running changefeed do not inhibit node shutdown (Informs #82765) * Treat node unavailable error as retryable. Test not being merged as part of this PR -- see #82767 Release Notes (bug fix): Ensure running changefeeds do not inhibit node shutdown. Release Notes (bug fix): Treat node unavailable error as a retryable changefeed error. Co-authored-by: Jason Chan <[email protected]> Co-authored-by: Yevgeniy Miretskiy <[email protected]>

Informs cockroachdb#82765 Permanet fix is being tracked by the above issue. This is a temporary fix to ensure that change aggregators cancel their context when the server quiesces so that the server shutdown is not inhibited by the running changefeeds. The test for this functionality is not being merged due to the fact that it takes too long to run; however, the test can be seen here: cockroachdb#82767 Release Notes (bug fix): Ensure running changefeeds do not inhibit node shutdown.

Informs #82765 Permanet fix is being tracked by the above issue. This is a temporary fix to ensure that change aggregators cancel their context when the server quiesces so that the server shutdown is not inhibited by the running changefeeds. The test for this functionality is not being merged due to the fact that it takes too long to run; however, the test can be seen here: #82767 Release Notes (bug fix): Ensure running changefeeds do not inhibit node shutdown.

rytaft · 2022-07-05T18:33:47Z

cc @vy-ton to discuss how this affects server shutdown process

rytaft · 2022-07-05T18:37:48Z

We think that we should cancel all queries -- there is an open question about whether we should increase the grace period from 10s to, say, 60s?

vy-ton · 2022-07-06T17:15:02Z

This was identified as a longstanding bug where long-running queries/jobs can prevent server shutdown because they are not cancelled after the query_wait grace period. A fix for changefeeds was merged after a user experienced the issue.

Queries would like to cancel all queries but is concerned about the backwards compatibility to server shutdown. Theoretically, it's better to see some query errors than to prevent server shutdown forever. Queries was thinking of only making this change for 22.2+. @mwang1026 curious about your Server opinion?

mwang1026 · 2022-07-06T17:31:18Z

what types of queries were forever running? seems like an anti-pattern / not intended?

vy-ton · 2022-07-06T17:34:42Z

It appears that any bulk-io or changefeed processors running on the node at the time it
is being drained, may inhibit server shutdown.

Not just user queries but long-running jobs. So we would cancel an IMPORT as an example

yuzefovich · 2022-07-26T17:09:04Z

Do I have a product approval for proceeding with cancellation of all long-running queries on a node when the node is being drained? @vy-ton @mwang1026

mwang1026 · 2022-08-01T16:55:58Z

I don't know about the specifics and defer to @stevendanna and @miretskiy on the behavior here.

Specifically, what does "cancellation" mean here? And how does that impact the various "long running queries" such as changefeeds and import and backup and restore?

cc @livlobo @amruss

yuzefovich · 2022-08-01T17:08:40Z

Specifically, what does "cancellation" mean here?

The distributed plan will result in an error propagated to the "client". I think in case of changefeeds, the "client" is one of the CRDB nodes and is smart enough to recognize such errors, and the new distributed plan will be setup (without using the node that is being drained) transparently to the user, not sure about BulkIO things.

yuzefovich · 2022-08-04T20:55:05Z

We will proceed with this cancellation on 22.2 but introduce a cluster setting as an escape hatch to revert to the previous behavior should we need it.

miretskiy added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-execution Relating to SQL execution. T-sql-execution labels Jun 10, 2022

This was referenced Jun 10, 2022

changefeedccl: Test demonstrating changefeed inhibiting node shutdown. #82767

Closed

changefeedccl: Do not inhibit node shutdown #82768

Merged

miretskiy mentioned this issue Jun 10, 2022

flowinfra: cancel remote flows when node is drained #82752

Merged

miretskiy changed the title ~~distflow: Do not inhibit server shutdown~~ flowinfra: Do not inhibit server shutdown Jun 10, 2022

miretskiy assigned yuzefovich Jun 10, 2022

miretskiy mentioned this issue Jun 14, 2022

release-22.1: changefeedccl: Do not inhibit node shutdown #82874

Merged

miretskiy mentioned this issue Jun 14, 2022

release-21.2: changefeedccl: Do not inhibit node shutdown #82876

Merged

blathers-crl bot added the T-sql-queries SQL Queries Team label Jun 22, 2022

craig bot closed this as completed in 2087103 Aug 5, 2022

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to Done in SQL Queries Jul 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

flowinfra: Do not inhibit server shutdown #82765

flowinfra: Do not inhibit server shutdown #82765

miretskiy commented Jun 10, 2022 •

edited by cockroach-jira-scripts

Loading

rytaft commented Jul 5, 2022

rytaft commented Jul 5, 2022

vy-ton commented Jul 6, 2022

mwang1026 commented Jul 6, 2022

vy-ton commented Jul 6, 2022

yuzefovich commented Jul 26, 2022

mwang1026 commented Aug 1, 2022

yuzefovich commented Aug 1, 2022

yuzefovich commented Aug 4, 2022

flowinfra: Do not inhibit server shutdown #82765

flowinfra: Do not inhibit server shutdown #82765

Comments

miretskiy commented Jun 10, 2022 • edited by cockroach-jira-scripts Loading

rytaft commented Jul 5, 2022

rytaft commented Jul 5, 2022

vy-ton commented Jul 6, 2022

mwang1026 commented Jul 6, 2022

vy-ton commented Jul 6, 2022

yuzefovich commented Jul 26, 2022

mwang1026 commented Aug 1, 2022

yuzefovich commented Aug 1, 2022

yuzefovich commented Aug 4, 2022

miretskiy commented Jun 10, 2022 •

edited by cockroach-jira-scripts

Loading