server,rpc: block decommissioned node from interacting with the cluster #55286
Conversation
Force-pushed from dca9b45 to 168c490
Friendly ping @irfansharif
While I have you here, I have a few questions. Is the following hazard possible? I think it is (though I'm less clear on whether we care about it, and if so, why).
- Cluster comprised of n1, n2, n3
- User decommissions n3 through n2
- n2 updates n3's liveness record in KV, marking it as fully decommissioned
- n2 crashes before invoking OnNodeDecommissioned, so it doesn't place the n3 marker on disk
- n2 restarts and, not finding anything on disk to suggest otherwise, connects to n3 (it still hasn't learned through gossip that n3 is decommissioned)
- ???
What does this mean for long-running migrations? If the orchestrator process (say on n1) is looking to push out feature gates to n2, and n2 still has an open connection to n3 (admittedly the window where this is possible is very short), isn't that a violation of the kind of guarantee we want? I'm actually unclear on what happens if the connection to n3 is maintained.
Given that the authoritative source for "all decommissioned nodes" is still KV (as opposed to store-local keys): if the open connection to n3 is unacceptable (again, I'm unclear on the implications), would we need the orchestrator to inform all live nodes of the current set of decommissioned nodes, and then wait for each one to confirm that it has persisted this store-local key and disconnected any relevant open connections?
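To make the window in the scenario above concrete, here is a minimal sketch. The names (`setDecommissionedInKV`, `persistLocalTombstone`) are hypothetical stand-ins, not the actual CockroachDB APIs; the only point is that the KV liveness update and the store-local tombstone write are two separate, non-atomic steps:

```go
package main

import (
	"context"
	"fmt"
)

// NodeID and the helpers below are illustrative stand-ins, not the real APIs.
type NodeID int

func setDecommissionedInKV(ctx context.Context, target NodeID) error {
	// Step 1: update the target's liveness record in KV (what n2 does to n3).
	return nil
}

func persistLocalTombstone(ctx context.Context, target NodeID) error {
	// Step 2: write the store-local tombstone so the decision survives locally.
	return nil
}

func decommission(ctx context.Context, target NodeID) error {
	if err := setDecommissionedInKV(ctx, target); err != nil {
		return err
	}
	// A crash right here leaves KV saying "decommissioned" while this node
	// has no local tombstone. After a restart, and until gossip catches up,
	// nothing on disk stops it from dialing the target again.
	return persistLocalTombstone(ctx, target)
}

func main() {
	fmt.Println(decommission(context.Background(), NodeID(3)))
}
```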
Reviewed 4 of 4 files at r8, 4 of 5 files at r9.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
pkg/keys/printer.go, line 209 at r8 (raw file):
return fmt.Sprintf("<invalid: %s>", err)
}
return fmt.Sprint(nodeID)
I'm not sure what's custom here in pkg/keys, but what about "n"?
pkg/rpc/context.go, line 1207 at r9 (raw file):
checkVersion(goCtx, ctx.Settings, response.ServerVersion),
"version compatibility check failed on ping response")
if err != nil {
This is always true.
pkg/server/connectivity_test.go, line 363 at r9 (raw file):
defer tc.Stopper().Stop(ctx)
//tc.StopServer(2) // stop n3
Stray line?
pkg/server/node_tombstone_storage.go, line 118 at r9 (raw file):
// We've populated the cache, now write through to disk.
[nit] Remove empty lines here and right below.
TFTR!
re: your comment, yes, the blocklist is best-effort. We discussed this somewhere on the RFC PR (I believe around here), and the outcome was basically that the operator is ultimately still in charge of not bringing back nodes that are not allowed to come back, though we will make it really hard for them to break something that way (because the decommissioned status usually blocks the nodes). So the question is whether we're "done" after this PR or whether we ought to do more. One simple safety hatch is that the orchestrator, upon looking at the liveness table, makes sure any DECOMMISSIONING nodes are not live. We could also push out decommission bits proactively, etc., and actively force a pass through the open connections in the RPC layer, but that seems like more trouble than it's worth.
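As a rough illustration of the "simple safety hatch" mentioned above, here is a sketch assuming a hypothetical `livenessRecord` shape (the field names are illustrative, not the actual liveness proto): the orchestrator scans the liveness table and refuses to proceed while any DECOMMISSIONING node is still live.

```go
package main

import "fmt"

// livenessRecord is a hypothetical, simplified view of a node's liveness state.
type livenessRecord struct {
	NodeID          int
	Decommissioning bool
	Live            bool
}

// checkDecommissioningNotLive refuses to proceed (e.g. with pushing out
// feature gates) while any node marked DECOMMISSIONING is still live.
func checkDecommissioningNotLive(records []livenessRecord) error {
	for _, r := range records {
		if r.Decommissioning && r.Live {
			return fmt.Errorf("n%d is decommissioning but still live; not safe to proceed", r.NodeID)
		}
	}
	return nil
}

func main() {
	records := []livenessRecord{
		{NodeID: 1, Live: true},
		{NodeID: 2, Live: true},
		{NodeID: 3, Decommissioning: true, Live: false},
	}
	if err := checkDecommissioningNotLive(records); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("all DECOMMISSIONING nodes are non-live; safe to proceed")
}
```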
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/keys/printer.go, line 209 at r8 (raw file):
Previously, irfansharif (irfan sharif) wrote…
I'm not sure what's custom here in pkg/keys, but what about "n"?
I'm not sure what you're suggesting, can you clarify?
pkg/rpc/context.go, line 1207 at r9 (raw file):
Previously, irfansharif (irfan sharif) wrote…
This is always true.
No, `errors.Wrap(nil, _)` is `nil`.
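For context, a tiny standalone example of the behavior being referenced (this holds for `github.com/cockroachdb/errors`, which the code base uses, just as for `github.com/pkg/errors`): wrapping a nil error yields nil, so the subsequent `err != nil` check is not trivially true.

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

func main() {
	// Wrap returns nil when the wrapped error is nil, so the branch
	// guarded by `if err != nil` is only taken on a real failure.
	err := errors.Wrap(nil, "version compatibility check failed on ping response")
	fmt.Println(err == nil) // true
}
```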
pkg/server/connectivity_test.go, line 363 at r9 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Stray line?
Done.
pkg/server/node_tombstone_storage.go, line 118 at r9 (raw file):
Previously, irfansharif (irfan sharif) wrote…
[nit] Remove empty lines here and right below.
Done.
These keys live in the store-local keyspace and contain nodes known to have been marked as decommissioned (=permanently removed from the cluster). These keys will be used to prevent tombstoned nodes from establishing RPC connections to the cluster (and vice versa). Release note: None
This introduces the component that provides a cached view of the locally persisted node tombstones. This component is populated via NodeLiveness (which in turn listens to Gossip) and queried in the RPC layer, meaning that in effect, once the gossip info has propagated, nodes will, within a heartbeat timeout (a few seconds), cease communication with the decommissioned node. I verified manually that the refusal message makes it to the logs on the decommissioned node.

Some changes to the RPC connection pool were necessary to get the desired behavior. Essentially, the pool holds on to and hands out existing connections by default, meaning that the injected errors would be ignored if they started occurring on an already active connection (as in `TestDecommissionedNodeCannotConnect`). We need to make sure the heartbeat loop actually *returns* these injected errors, which is now the case.

Release note: None
Release note: None
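Below is a rough sketch of the shape described in the commit message above, with hypothetical names (`tombstoneCache`, `MarkDecommissioned`, `checkHeartbeat`) rather than the actual implementation: a cache of decommissioned node IDs fed by liveness/gossip updates, written through to local storage, and consulted when handling RPC heartbeats so that connections in either direction are refused.

```go
package main

import (
	"fmt"
	"sync"
)

type NodeID int

// tombstoneCache is a hypothetical in-memory view of decommissioned nodes.
type tombstoneCache struct {
	mu   sync.RWMutex
	seen map[NodeID]bool
}

func newTombstoneCache() *tombstoneCache {
	return &tombstoneCache{seen: map[NodeID]bool{}}
}

// MarkDecommissioned is invoked when a liveness/gossip update reports a node
// as decommissioned. A real implementation would also write a store-local key
// here so the information survives restarts.
func (c *tombstoneCache) MarkDecommissioned(id NodeID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.seen[id] = true
}

func (c *tombstoneCache) IsDecommissioned(id NodeID) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.seen[id]
}

// checkHeartbeat stands in for the RPC-layer check: heartbeats involving a
// decommissioned node fail, and since a successful heartbeat is a
// prerequisite for a connection, connections are refused in either direction.
func checkHeartbeat(c *tombstoneCache, peer NodeID) error {
	if c.IsDecommissioned(peer) {
		return fmt.Errorf("n%d is decommissioned; refusing connection", peer)
	}
	return nil
}

func main() {
	c := newTombstoneCache()
	c.MarkDecommissioned(3)
	fmt.Println(checkHeartbeat(c, 3)) // refused
	fmt.Println(checkHeartbeat(c, 2)) // nil
}
```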
Force-pushed from 168c490 to 40cdd5a
bors r=irfansharif
Build succeeded:
Just meant pretty printing the node ID as "n123" instead of just 123. Eh, wtvs.
👍 opened new PR for that. |
The test could end up using fully decommissioned nodes for cli commands, which does not work as of cockroachdb#55286. Additionally, decommissioned nodes now become non-live after a short while, so various cli output checks had to be adjusted. Fixes cockroachdb#55581. Release note: None
55560: backupccl: avoid div-by-zero crash on failed node count r=dt a=dt

We've seen a report of a node that crashed due to a divide-by-zero hit during metrics collection, specifically when computing the throughput-per-node by dividing the backup size by node count. Since this is currently only used for that metric, make a failure to count nodes a warning only for release builds (and fall back to 1), and make any error while counting, or a count of zero or fewer, a returned error in non-release builds.

Release note (bug fix): avoid crashing when BACKUP is unable to count the total nodes in the cluster.

55809: roachtest: fix decommission/randomize r=irfansharif a=tbg

The test could end up using fully decommissioned nodes for cli commands, which does not work as of #55286.

Fixes #55581.

Release note: None

56019: lexbase: pin `reserved_keywords.go` within Bazel r=irfansharif a=irfansharif

It's an auto-generated file that bazel doesn't yet know how to construct within the sandbox. Before this PR, `make bazel-generate` would show spurious diffs on a clean checkout without this file present. Now it no longer will. Unfortunately, it also means that successful bazel builds require `reserved_keywords.go` to be pre-generated ahead of time (it's not checked into the repo). Once Bazel is taught to generate this file, this will no longer be the case. It was just something that I missed in #55687.

Release note: None

Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
This PR completes the work started in #54936 by introducing a component that
keeps track of decommissioned nodes and hooks it up to the RPC layer, where it
is used to prevent RPC-layer heartbeats between the cluster and decommissioned
nodes. Since these heartbeats are a prerequisite for opening a connection in
either direction, by failing these heartbeats we are effectively preventing
decommissioned nodes from interacting with the cluster.
First commits from #54936. New here:
Release note: None