*: check node decommissioned/draining state for DistSQL/consistency #66632
@cockroachdb/kv Any opinions on adding |
Reviewed 9 of 9 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/liveness/liveness.go, line 655 at r1 (raw file):
// IsAvailableNotDraining returns whether or not the specified node is available
// to serve requests and is not draining/decommissioning. Note that draining
nit: draining/decommissioning/decommissioned.
pkg/sql/distsql_physical_planner.go, line 862 at r1 (raw file):
if !h.isAvailable(nodeID) {
	return pgerror.Newf(pgcode.CannotConnectNow, "not using n%d due to liveness: not available", errors.Safe(nodeID))
"not using n%d since it is not available"? The reference to liveness is perhaps best dropped. Your call.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/sql/distsql_physical_planner.go, line 862 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
"not using n%d since it is not available"? The reference to liveness is perhaps best dropped. Your call.
You don't need errors.Safe here. NodeID is already safe.
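To illustrate why an explicit wrapper can be redundant, here is a toy model of a marker-interface approach to redaction. The SafeValue interface, the NodeID type, and the render helper below are all hypothetical stand-ins defined for this sketch; they are not the actual API of the errors or redaction libraries used in this PR.

```go
package main

import "fmt"

// SafeValue is a hypothetical marker interface: a type that implements it
// declares that its values never contain sensitive data.
type SafeValue interface{ SafeValue() }

// NodeID is a toy node identifier that carries the marker itself.
type NodeID int32

// SafeValue marks NodeID as safe to print without redaction.
func (NodeID) SafeValue() {}

// render formats one argument of a log or error message, masking anything
// whose type is not marked safe.
func render(arg interface{}) string {
	if _, ok := arg.(SafeValue); ok {
		return fmt.Sprint(arg)
	}
	return "<redacted>"
}
```

Under these assumptions, render(NodeID(3)) yields "3" while render("some user data") yields "<redacted>", so wrapping a NodeID in an extra "safe" wrapper adds nothing.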
One implication here is that ranges with a leaseholder on a decommissioning node won't get DistSQL processors scheduled locally until the leases have been moved elsewhere. This may negatively affect latency for these ranges until they've been moved. There is a tradeoff here between latency of small/fast DistSQL flows and stability of longer-running DistSQL flows. This change picks stability (since the motivation was rangefeed planning), but it may cause a performance cliff for smaller queries that now have to do table reads across the network.
@cockroachdb/sql-execution Would like to get your take on this.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/kv/kvserver/liveness/liveness.go, line 655 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: draining/decommissioning/decommissioned.
decommissioned is already implied by available (as defined by IsAvailable), but I spelled this out.
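A minimal sketch of the relationship between the two predicates discussed here, assuming a simplified liveness record. The Record and Membership types and their fields are hypothetical and only mirror the semantics described in this thread; the real liveness package differs.

```go
package main

// Membership is a hypothetical stand-in for a node's membership status.
type Membership int

const (
	Active Membership = iota
	Decommissioning
	Decommissioned
)

// Record is a hypothetical, simplified liveness record for one node.
type Record struct {
	Live       bool // the node's liveness heartbeat has not expired
	Draining   bool // the node is draining
	Membership Membership
}

// IsAvailable reports whether the node can serve requests: it must be live
// and must not be decommissioned.
func (r Record) IsAvailable() bool {
	return r.Live && r.Membership != Decommissioned
}

// IsAvailableNotDraining additionally excludes draining and decommissioning
// nodes. Decommissioned nodes are already excluded via IsAvailable.
func (r Record) IsAvailableNotDraining() bool {
	return r.IsAvailable() && !r.Draining && r.Membership == Active
}
```

In this model a live, decommissioning node is available but fails the stricter check, and a decommissioned node fails both.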
pkg/sql/distsql_physical_planner.go, line 862 at r1 (raw file):
Previously, knz (kena) wrote…
You don't need errors.Safe here. NodeID is already safe.
Updated. Thanks @knz, this code was there from before, but I removed the Safe() calls.
Decided to be conservative here, and keep scheduling of DistSQL flows onto decommissioning/draining nodes to avoid a latency cliff. This makes a backport less risky. Added a TODO comment to consider changing this later.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
bors r=tbg,knz
Build succeeded.
The DistSQL planner and consistency queue did not take the nodes'
decommissioned or draining states into account, which in particular
could cause spurious errors when interacting with decommissioned nodes.
This patch adds convenience methods for checking node availability and
draining states, and avoids scheduling DistSQL flows on
unavailable nodes and consistency checks on unavailable/draining nodes.
Touches #66586, touches #45123.
Release note (bug fix): Avoid interacting with decommissioned nodes
during DistSQL planning and consistency checking.
/cc @cockroachdb/kv
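
The two policies described above can be sketched roughly as follows. This is a toy model, not the code from this PR: the nodeState type and both helper functions are hypothetical, and the assumption that a consistency check is skipped whenever any replica's node is unavailable or draining is made only for illustration.

```go
package main

// NodeID identifies a node (toy type for this sketch).
type NodeID int

// nodeState is a hypothetical summary of what liveness knows about a node.
type nodeState struct {
	available bool // live and not decommissioned
	draining  bool // draining or decommissioning
}

// distSQLTargets returns the candidates that DistSQL flows may be scheduled
// on. Only unavailable nodes are filtered out; draining and decommissioning
// nodes are kept, matching the conservative choice made in this PR.
func distSQLTargets(nodes map[NodeID]nodeState, candidates []NodeID) []NodeID {
	var out []NodeID
	for _, id := range candidates {
		if nodes[id].available {
			out = append(out, id)
		}
	}
	return out
}

// consistencyCheckAllowed reports whether every replica's node is available
// and not draining. Unknown nodes have the zero state and count as unavailable.
func consistencyCheckAllowed(nodes map[NodeID]nodeState, replicas []NodeID) bool {
	for _, id := range replicas {
		s := nodes[id]
		if !s.available || s.draining {
			return false
		}
	}
	return true
}
```

With a healthy n1, a draining n2, and a decommissioned n3, distSQLTargets keeps n1 and n2, while consistencyCheckAllowed is true only for replica sets that avoid both n2 and n3.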