Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: check node decommissioned/draining state for DistSQL/consistency #66632

Merged
merged 1 commit into from
Jun 25, 2021

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Jun 18, 2021

The DistSQL planner and consistency queue did not take the nodes'
decommissioned or draining states into account, which in particular
could cause spurious errors when interacting with decommissioned nodes.

This patch adds convenience methods for checking node availability and
draining states, and avoids scheduling DistSQL flows on
unavailable nodes and consistency checks on unavailable/draining nodes.

Touches #66586, touches #45123.

Release note (bug fix): Avoid interacting with decommissioned nodes
during DistSQL planning and consistency checking.

/cc @cockroachdb/kv

@erikgrinaker erikgrinaker self-assigned this Jun 18, 2021
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker erikgrinaker force-pushed the check-decommissioned branch from 72b77de to eb6d210 Compare June 21, 2021 13:39
@erikgrinaker
Copy link
Contributor Author

@cockroachdb/kv Any opinions on adding NodeLiveness.IsAvailable() that checks live && !decommissioned, vs. changing the behavior of NodeLiveness.IsLive()? I get the impression from the code and other discussions that we'd want liveness to be distinct from other states (e.g. decommissioned and draining), even though this means callers have to take care with this. However, this conflicts with some other methods such as GetIsLiveMap() which take a decommissioned or decommissioning node to be non-live (this seems buggy in itself, since decommissioning nodes may still hold leases and such).

@erikgrinaker erikgrinaker force-pushed the check-decommissioned branch from eb6d210 to b869775 Compare June 21, 2021 15:05
@erikgrinaker erikgrinaker force-pushed the check-decommissioned branch 2 times, most recently from 0eb64b9 to 2f42de7 Compare June 24, 2021 07:51
@erikgrinaker erikgrinaker changed the title *: check node decommissioned state where appropriate *: check node decommissioned/draining state for DistSQL/consistency Jun 24, 2021
@erikgrinaker erikgrinaker marked this pull request as ready for review June 24, 2021 08:43
@erikgrinaker erikgrinaker requested a review from tbg June 24, 2021 08:44
@erikgrinaker erikgrinaker force-pushed the check-decommissioned branch from 2f42de7 to a578a7c Compare June 24, 2021 08:45
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/liveness/liveness.go, line 655 at r1 (raw file):

// IsAvailableNotDraining returns whether or not the specified node is available
// to serve requests and is not draining/decommissioning. Note that draining

nit: draining/decommissioning/decommissioned.


pkg/sql/distsql_physical_planner.go, line 862 at r1 (raw file):

	if !h.isAvailable(nodeID) {
		return pgerror.Newf(pgcode.CannotConnectNow,
			"not using n%d due to liveness: not available", errors.Safe(nodeID))

not using n%d since it is not available? The reference to liveness is perhaps best dropped. Your call.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


pkg/sql/distsql_physical_planner.go, line 862 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

not using n%d since it is not available? The reference to liveness is perhaps best dropped. Your call.

you don't need errors.Safe here. NodeID is already safe.

@erikgrinaker erikgrinaker force-pushed the check-decommissioned branch from a578a7c to 1dc810c Compare June 24, 2021 12:55
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One implication here is that ranges with a leaseholder on a decommissioning node won't get DistSQL processors scheduled locally until the leases have been moved elsewhere. This may negatively affect latency for these ranges until they've been moved. There is a tradeoff here between latency of small/fast DistSQL flows and stability of longer-running DistSQL flows. This change picks stability (since the motivation was rangefeed planning), but it may cause a performance cliff for smaller queries that now have to do table reads across the network.

@cockroachdb/sql-execution Would like to get your take on this.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/kv/kvserver/liveness/liveness.go, line 655 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: draining/decommissioning/decommissioned.

decommissioned is already implied by available (as defined by IsAvailable), but I spelled this out.


pkg/sql/distsql_physical_planner.go, line 862 at r1 (raw file):

Previously, knz (kena) wrote…

you don't need errors.Safe here. NodeID is already safe.

Updated. Thanks @knz, this code was there from before, but I removed the Safe() calls.

@erikgrinaker erikgrinaker force-pushed the check-decommissioned branch from 1dc810c to f266aea Compare June 24, 2021 15:14
@erikgrinaker erikgrinaker requested a review from a team June 24, 2021 15:15
@erikgrinaker erikgrinaker force-pushed the check-decommissioned branch from f266aea to eaff814 Compare June 25, 2021 11:37
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to be conservative here, and keep scheduling of DistSQL flows onto decommissioning/draining nodes to avoid a latency cliff. This makes a backport less risky. Added a TODO comment to consider changing this later.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)

The DistSQL planner and consistency queue did not take the nodes'
decommissioned or draining states into account, which in particular
could cause spurious errors when interacting with decommissioned nodes.

This patch adds convenience methods for checking node availability and
draining states, and avoids scheduling DistSQL flows on
unavailable nodes and consistency checks on unavailable/draining nodes.

Release note (bug fix): Avoid interacting with decommissioned nodes
during DistSQL planning and consistency checking.
@erikgrinaker erikgrinaker force-pushed the check-decommissioned branch from eaff814 to 78688ea Compare June 25, 2021 12:54
@erikgrinaker
Copy link
Contributor Author

bors r=tbg,knz

@craig
Copy link
Contributor

craig bot commented Jun 25, 2021

Build succeeded:

@erikgrinaker erikgrinaker deleted the check-decommissioned branch June 28, 2021 12:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants