*: check node decommissioned/draining state for DistSQL/consistency #66632

erikgrinaker · 2021-06-18T16:58:18Z

The DistSQL planner and consistency queue did not take the nodes'
decommissioned or draining states into account, which in particular
could cause spurious errors when interacting with decommissioned nodes.

This patch adds convenience methods for checking node availability and
draining states, and avoids scheduling DistSQL flows on
unavailable nodes and consistency checks on unavailable/draining nodes.

Touches #66586, touches #45123.

Release note (bug fix): Avoid interacting with decommissioned nodes
during DistSQL planning and consistency checking.

/cc @cockroachdb/kv

cockroach-teamcity · 2021-06-18T16:58:27Z

This change is Reviewable

erikgrinaker · 2021-06-21T13:46:33Z

@cockroachdb/kv Any opinions on adding NodeLiveness.IsAvailable(), which checks live && !decommissioned, vs. changing the behavior of NodeLiveness.IsLive()? I get the impression from the code and other discussions that we'd want liveness to be distinct from other states (e.g. decommissioned and draining), even though this means callers have to take care to check those states themselves. However, this conflicts with some other methods such as GetIsLiveMap(), which treat a decommissioned or decommissioning node as non-live (this seems buggy in itself, since decommissioning nodes may still hold leases and such).
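To illustrate the distinction being asked about, here is a minimal, self-contained sketch. The `nodeState` type, the `membership` constants, and the free functions `isLive` and `isAvailable` are hypothetical stand-ins invented for this example; they are not the actual `NodeLiveness` types or signatures, and only the `live && !decommissioned` rule is taken from the comment above.

```go
package main

import "fmt"

// membership is a hypothetical stand-in for a node's membership status.
type membership int

const (
	active membership = iota
	decommissioning
	decommissioned
)

// nodeState is an illustrative record, not the real liveness record.
type nodeState struct {
	live       bool
	membership membership
	draining   bool
}

// isLive reports liveness only; it deliberately ignores membership and
// draining so that liveness stays distinct from the other states.
func isLive(n nodeState) bool {
	return n.live
}

// isAvailable applies the rule from the comment: live && !decommissioned.
// A decommissioning node still counts as available, since it may hold leases.
func isAvailable(n nodeState) bool {
	return n.live && n.membership != decommissioned
}

func main() {
	n := nodeState{live: true, membership: decommissioning}
	fmt.Println(isLive(n), isAvailable(n))

	n.membership = decommissioned
	fmt.Println(isLive(n), isAvailable(n))
}
```

Keeping the two predicates separate means a caller that only wants to know whether a node is heartbeating does not silently inherit a membership check, at the cost of callers having to pick the right predicate.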

tbg

Reviewed 9 of 9 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

pkg/kv/kvserver/liveness/liveness.go, line 655 at r1 (raw file):

// IsAvailableNotDraining returns whether or not the specified node is available
// to serve requests and is not draining/decommissioning. Note that draining

nit: draining/decommissioning/decommissioned.

pkg/sql/distsql_physical_planner.go, line 862 at r1 (raw file):

	if !h.isAvailable(nodeID) {
		return pgerror.Newf(pgcode.CannotConnectNow,
			"not using n%d due to liveness: not available", errors.Safe(nodeID))

not using n%d since it is not available? The reference to liveness is perhaps best dropped. Your call.

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)

pkg/sql/distsql_physical_planner.go, line 862 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

not using n%d since it is not available? The reference to liveness is perhaps best dropped. Your call.

you don't need errors.Safe here. NodeID is already safe.

erikgrinaker

One implication here is that ranges with a leaseholder on a decommissioning node won't get DistSQL processors scheduled locally until the leases have been moved elsewhere. This may negatively affect latency for these ranges until they've been moved. There is a tradeoff here between latency of small/fast DistSQL flows and stability of longer-running DistSQL flows. This change picks stability (since the motivation was rangefeed planning), but it may cause a performance cliff for smaller queries that now have to do table reads across the network.

@cockroachdb/sql-execution Would like to get your take on this.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)

pkg/kv/kvserver/liveness/liveness.go, line 655 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: draining/decommissioning/decommissioned.

Not being decommissioned is already implied by being available (as defined by IsAvailable), but I spelled this out.

pkg/sql/distsql_physical_planner.go, line 862 at r1 (raw file):

Previously, knz (kena) wrote…

you don't need errors.Safe here. NodeID is already safe.

Updated. Thanks @knz, this code was there from before, but I removed the Safe() calls.

erikgrinaker

Decided to be conservative here, and keep scheduling of DistSQL flows onto decommissioning/draining nodes to avoid a latency cliff. This makes a backport less risky. Added a TODO comment to consider changing this later.
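The resulting policy can be sketched as two predicates, one per caller. This is a hedged illustration only: `nodeInfo`, `canScheduleFlow`, and `canCheckConsistency` are hypothetical names made up for this example, and the real planner and consistency queue code may be structured differently. The sketch assumes that "available" means live and not decommissioned, and that "available and not draining" additionally excludes draining and decommissioning nodes, as described in the review comments above.

```go
package main

import "fmt"

// nodeInfo is a hypothetical, simplified view of a node's state.
type nodeInfo struct {
	live            bool
	decommissioning bool
	decommissioned  bool
	draining        bool
}

// canScheduleFlow models the conservative choice for DistSQL: only
// unavailable nodes (non-live or decommissioned) are skipped, so draining
// and decommissioning nodes keep receiving flows to avoid a latency cliff.
func canScheduleFlow(n nodeInfo) bool {
	return n.live && !n.decommissioned
}

// canCheckConsistency models the stricter choice for consistency checks:
// the node must be available and neither draining nor decommissioning.
func canCheckConsistency(n nodeInfo) bool {
	return canScheduleFlow(n) && !n.draining && !n.decommissioning
}

func main() {
	cases := []struct {
		name string
		node nodeInfo
	}{
		{"healthy", nodeInfo{live: true}},
		{"draining", nodeInfo{live: true, draining: true}},
		{"decommissioning", nodeInfo{live: true, decommissioning: true}},
		{"decommissioned", nodeInfo{live: true, decommissioned: true}},
		{"dead", nodeInfo{}},
	}
	for _, c := range cases {
		fmt.Printf("%-16s flow=%-5t consistency=%t\n",
			c.name, canScheduleFlow(c.node), canCheckConsistency(c.node))
	}
}
```

Under these assumptions, only the healthy node qualifies for both, while draining and decommissioning nodes still accept flows but are skipped for consistency checks.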

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)

erikgrinaker · 2021-06-25T14:38:57Z

bors r=tbg,knz

craig · 2021-06-25T16:18:25Z

Build succeeded:

GitHub CI (Cockroach)

erikgrinaker self-assigned this Jun 18, 2021

erikgrinaker force-pushed the check-decommissioned branch from 72b77de to eb6d210 Compare June 21, 2021 13:39

erikgrinaker force-pushed the check-decommissioned branch from eb6d210 to b869775 Compare June 21, 2021 15:05

This was referenced Jun 21, 2021

kv: operations failing when encountering decommissioned nodes #66586

Closed

distsql: Avoid decomissioned nodes. #66671

Closed

erikgrinaker force-pushed the check-decommissioned branch 2 times, most recently from 0eb64b9 to 2f42de7 Compare June 24, 2021 07:51

erikgrinaker changed the title from "*: check node decommissioned state where appropriate" to "*: check node decommissioned/draining state for DistSQL/consistency" Jun 24, 2021

erikgrinaker marked this pull request as ready for review June 24, 2021 08:43

erikgrinaker requested a review from tbg June 24, 2021 08:44

erikgrinaker force-pushed the check-decommissioned branch from 2f42de7 to a578a7c Compare June 24, 2021 08:45

tbg approved these changes Jun 24, 2021

knz reviewed Jun 24, 2021

erikgrinaker force-pushed the check-decommissioned branch from a578a7c to 1dc810c Compare June 24, 2021 12:55

erikgrinaker commented Jun 24, 2021

erikgrinaker force-pushed the check-decommissioned branch from 1dc810c to f266aea Compare June 24, 2021 15:14

erikgrinaker requested a review from a team June 24, 2021 15:15

erikgrinaker force-pushed the check-decommissioned branch from f266aea to eaff814 Compare June 25, 2021 11:37

erikgrinaker commented Jun 25, 2021

erikgrinaker force-pushed the check-decommissioned branch from eaff814 to 78688ea Compare June 25, 2021 12:54

craig bot merged commit 1918976 into cockroachdb:master Jun 25, 2021

erikgrinaker added backport-21.1.x labels Jun 25, 2021

This was referenced Jun 28, 2021

release-21.1: *: check node decommissioned/draining state for DistSQL/consistency #66950

Merged

release-20.2: *: check node decommissioned/draining state for DistSQL/consistency #66951

Merged

erikgrinaker deleted the check-decommissioned branch June 28, 2021 12:34
