
server,rpc: block decommissioned node from interacting with the cluster #55286

Merged
merged 3 commits from decom-bit-rpc-step2 into cockroachdb:master on Oct 14, 2020

Conversation


@tbg tbg commented Oct 7, 2020

This PR completes the work started in #54936 by introducing a component that
keeps track of decommissioned nodes and hooks it up to the RPC layer, where it
is used to prevent RPC-layer heartbeats between the cluster and decommissioned
nodes. Since these heartbeats are a prerequisite for opening a connection in
either direction, by failing these heartbeats we are effectively preventing
decommissioned nodes from interacting with the cluster.

First commits from #54936. New here:

  • keys: introduce NodeTombstoneKey
  • server: introduce node tombstone storage

Release note: None
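
For intuition, a hedged sketch of the mechanism described above (the names below are illustrative and do not mirror the PR's actual code): the heartbeat consults a set of known-decommissioned node IDs and fails fast; since heartbeats gate connections in both directions, the decommissioned node is effectively cut off.

```go
package main

import (
	"errors"
	"fmt"
)

// decommissionedChecker reports whether a node ID is known to be permanently
// removed from the cluster (e.g. backed by the node tombstone storage).
type decommissionedChecker func(nodeID int) bool

var errDecommissioned = errors.New("peer is decommissioned; refusing connection")

// heartbeat stands in for the RPC-layer heartbeat that gates every connection
// attempt in either direction; failing it keeps the connection from being
// established at all.
func heartbeat(peerNodeID int, isDecommissioned decommissionedChecker) error {
	if isDecommissioned(peerNodeID) {
		return errDecommissioned
	}
	// ... version check, clock-offset measurement, etc. would follow here.
	return nil
}

func main() {
	tombstones := map[int]bool{3: true} // n3 has been decommissioned
	check := func(nodeID int) bool { return tombstones[nodeID] }

	fmt.Println(heartbeat(2, check)) // <nil>: n2 may connect
	fmt.Println(heartbeat(3, check)) // n3 is refused
}
```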

@tbg tbg requested a review from irfansharif October 7, 2020 14:37
@cockroach-teamcity

This change is Reviewable

@tbg tbg force-pushed the decom-bit-rpc-step2 branch from dca9b45 to 168c490 Compare October 8, 2020 10:27

tbg commented Oct 13, 2020

Friendly ping @irfansharif

@irfansharif irfansharif left a comment

:lgtm:, but while I have you here, I have a few questions. Is the following hazard possible? I think it is (though I'm less clear on whether we care about it, and if so, why).

  • Cluster comprised of n1, n2, n3
  • User decommissions n3 through n2
  • n2 updates n3's liveness record in KV, marking it as fully decommissioned
  • n2 crashes before invoking OnNodeDecommissioned, so it doesn't place the n3 marker on disk
  • n2 restarts, and not finding anything on disk to suggest otherwise, connects to n3 (it still hasn't learned through gossip that n3 is decommissioned)
  • ???

What does this mean for long-running migrations? If the orchestrator process (say, on n1) is looking to push out feature gates to n2, and n2 still has an open connection to n3 (admittedly the window where this is possible is very short), doesn't that break exactly the kind of guarantee we want? I'm actually unclear on what happens if the connection to n3 is maintained.

Given that the authoritative source of "all decommissioned nodes" is still KV (as opposed to store-local keys): if the open connection to n3 is unacceptable (again, I'm unclear on the implications), would we need the orchestrator to inform all live nodes about the current set of decommissioned nodes, and then wait for each one to confirm that it has persisted this store-local key and disconnected any relevant open connections?

Reviewed 4 of 4 files at r8, 4 of 5 files at r9.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)


pkg/keys/printer.go, line 209 at r8 (raw file):

		return fmt.Sprintf("<invalid: %s>", err)
	}
	return fmt.Sprint(nodeID)

I'm not sure what's custom here in pkg/keys, but what about "n"?


pkg/rpc/context.go, line 1207 at r9 (raw file):

					checkVersion(goCtx, ctx.Settings, response.ServerVersion),
					"version compatibility check failed on ping response")
				if err != nil {

This is always true.


pkg/server/connectivity_test.go, line 363 at r9 (raw file):

	defer tc.Stopper().Stop(ctx)

	//tc.StopServer(2) // stop n3

Stray line?


pkg/server/node_tombstone_storage.go, line 118 at r9 (raw file):

	// We've populated the cache, now write through to disk.

[nit] Remove empty lines here and right below.

@tbg tbg requested a review from irfansharif October 14, 2020 09:23

@tbg tbg left a comment

TFTR!

re: your comment, yes, the blocklist is best-effort. We discussed this somewhere on the RFC PR (I believe around here), and the outcome was that the operator is ultimately still in charge of not bringing back nodes that aren't allowed to come back, though we make it really hard for them to break something that way (because the decommissioned status usually blocks the nodes). So the question is whether we're "done" after this PR or whether we ought to do more. One simple safety hatch is that the orchestrator, upon looking at the liveness table, makes sure any DECOMMISSIONING nodes are not live. We could also push out decommission bits proactively, actively force a pass through the open connections in the RPC layer, etc., but that seems more trouble than it's worth.
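
For concreteness, a minimal sketch of that safety hatch under stated assumptions (the types and function names below are hypothetical, not CockroachDB's actual liveness API): the orchestrator scans the liveness records and refuses to kick off migrations while any DECOMMISSIONING or DECOMMISSIONED node still appears live.

```go
package main

import "fmt"

// MembershipStatus loosely mirrors the liveness membership states.
type MembershipStatus string

const (
	Active          MembershipStatus = "active"
	Decommissioning MembershipStatus = "decommissioning"
	Decommissioned  MembershipStatus = "decommissioned"
)

// LivenessRecord is a simplified stand-in for a node's liveness entry.
type LivenessRecord struct {
	NodeID     int
	Membership MembershipStatus
	Live       bool // whether the liveness record is still being heartbeated
}

// checkNoDecommissioningNodesAreLive is the "simple safety hatch": before
// pushing out migrations, refuse to proceed while any node that has begun
// (or finished) decommissioning still appears live. The orchestrator would
// retry until this check passes.
func checkNoDecommissioningNodesAreLive(records []LivenessRecord) error {
	for _, r := range records {
		if r.Membership != Active && r.Live {
			return fmt.Errorf("n%d is %s but still live; not running migrations yet", r.NodeID, r.Membership)
		}
	}
	return nil
}

func main() {
	recs := []LivenessRecord{
		{NodeID: 1, Membership: Active, Live: true},
		{NodeID: 3, Membership: Decommissioned, Live: true}, // the hazard above
	}
	if err := checkNoDecommissioningNodesAreLive(recs); err != nil {
		fmt.Println("orchestrator backs off:", err)
	}
}
```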

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/keys/printer.go, line 209 at r8 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I'm not sure what's custom here in pkg/keys, but what about "n"?

I'm not sure what you're suggesting, can you clarify?


pkg/rpc/context.go, line 1207 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

This is always true.

No, errors.Wrap(nil, _) is nil.
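
A small standalone illustration of that behavior (using github.com/cockroachdb/errors, which follows the pkg/errors convention of returning nil when wrapping a nil error; the checkVersion stub below is hypothetical):

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

// checkVersion stands in for the real version-compatibility check.
func checkVersion(ok bool) error {
	if !ok {
		return errors.New("version compatibility check failed")
	}
	return nil
}

func main() {
	// Wrapping a nil error yields nil, so the `if err != nil` branch is not
	// always taken.
	err := errors.Wrap(checkVersion(true), "version compatibility check failed on ping response")
	fmt.Println(err == nil) // true

	// Wrapping a real error yields a non-nil, prefixed error.
	err = errors.Wrap(checkVersion(false), "version compatibility check failed on ping response")
	fmt.Println(err) // version compatibility check failed on ping response: version compatibility check failed
}
```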


pkg/server/connectivity_test.go, line 363 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Stray line?

Done.


pkg/server/node_tombstone_storage.go, line 118 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[nit] Remove empty lines here and right below.

Done.

tbg added 3 commits October 14, 2020 12:38
These keys live in the store-local keyspace and contain nodes known to
have been marked as decommissioned (=permanently removed from the
cluster). These keys will be used to prevent tombstoned nodes from
establishing RPC connections to the cluster (and vice versa).

Release note: None
This introduces the component that provides a cached view of the
locally persisted node tombstones.

This component is populated via NodeLiveness (which in turn listens
to Gossip) and queried in the RPC layer, meaning that in effect,
after the gossip info has propagated out, nodes will, within a
heartbeat timeout (a few seconds), cease communication with the
decommissioned node.

I verified manually that the refusal message makes it to the logs
on the decommissioned node.

Some changes to the RPC connection pool were necessary to get the
desired behavior. Essentially, the pool holds on to and hands out
existing connections by default, meaning that the injected errors
would be ignored if they started occurring on an already active
connection (as in `TestDecommissionedNodeCannotConnect`). We need
to make sure the heartbeat loop actually *returns* these injected
errors, which is now the case.

Release note: None
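
For orientation, a hedged sketch of the write-through shape described in this commit message (hypothetical names and types; not the actual node_tombstone_storage.go code): an in-memory cache of decommissioned node IDs that is persisted to store-local keys and consulted by the RPC layer.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a stand-in for the engine holding store-local keys.
type store interface {
	PutTombstone(nodeID int) error
}

type memStore struct{ keys map[int]bool }

func (m *memStore) PutTombstone(nodeID int) error {
	if m.keys == nil {
		m.keys = map[int]bool{}
	}
	m.keys[nodeID] = true
	return nil
}

// nodeTombstoneStorage caches decommissioned node IDs in memory and writes
// them through to the store-local keyspace, so the RPC layer can consult the
// cache cheaply on every heartbeat.
type nodeTombstoneStorage struct {
	mu    sync.RWMutex
	cache map[int]bool
	disk  store
}

// SetDecommissioned records nodeID as decommissioned: populate the cache,
// then write through to disk so the fact survives restarts.
func (s *nodeTombstoneStorage) SetDecommissioned(nodeID int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.cache == nil {
		s.cache = map[int]bool{}
	}
	if s.cache[nodeID] {
		return nil // already known, nothing to do
	}
	s.cache[nodeID] = true
	return s.disk.PutTombstone(nodeID)
}

// IsDecommissioned is what an RPC heartbeat check would consult.
func (s *nodeTombstoneStorage) IsDecommissioned(nodeID int) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.cache[nodeID]
}

func main() {
	s := &nodeTombstoneStorage{disk: &memStore{}}
	_ = s.SetDecommissioned(3)
	fmt.Println(s.IsDecommissioned(3)) // true
	fmt.Println(s.IsDecommissioned(2)) // false
}
```
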
@tbg tbg force-pushed the decom-bit-rpc-step2 branch from 168c490 to 40cdd5a Compare October 14, 2020 10:38

tbg commented Oct 14, 2020

bors r=irfansharif


craig bot commented Oct 14, 2020

Build succeeded:

@craig craig bot merged commit 654a061 into cockroachdb:master Oct 14, 2020
@irfansharif irfansharif commented

Previously, tbg wrote…

I'm not sure what you're suggesting, can you clarify?

Just meant pretty printing the node ID as "n123" instead of just 123. Eh, wtvs.
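
In other words (illustrative only):

```go
package main

import "fmt"

func main() {
	nodeID := 123
	fmt.Println(fmt.Sprint(nodeID))         // "123" (current behavior)
	fmt.Println(fmt.Sprintf("n%d", nodeID)) // "n123" (the suggested pretty-printing)
}
```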


tbg commented Oct 14, 2020

👍 opened new PR for that.

@tbg tbg deleted the decom-bit-rpc-step2 branch October 14, 2020 15:41
tbg added a commit to tbg/cockroach that referenced this pull request Oct 27, 2020
The test could end up using fully decommissioned nodes for cli commands,
which does not work as of cockroachdb#55286.
Additionally, decommissioned nodes now become non-live after a short
while, so various cli output checks had to be adjusted.

Fixes cockroachdb#55581.

Release note: None
craig bot pushed a commit that referenced this pull request Oct 27, 2020
55560: backupccl: avoid div-by-zero crash on failed node count r=dt a=dt

We've seen a report of a node that crashed due to a divide-by-zero
hit during metrics collection, specifically when computing the
throughput-per-node by dividing the backup size by node count.

Since the node count is now only used for that metric, make a failure to count
nodes a warning only in release builds (falling back to 1), and make any error
while counting, or a count that is not greater than zero, a returned error in
non-release builds.

Release note (bug fix): avoid crashing when BACKUP is unable to count the total nodes in the cluster.
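
A hedged sketch of the guard described above (hypothetical names; not the actual backupccl code):

```go
package main

import (
	"errors"
	"fmt"
)

// perNodeThroughput divides total backup bytes by node count. If the count
// could not be determined, release builds fall back to 1 node (the metric is
// best-effort), while non-release builds surface the problem as an error.
func perNodeThroughput(backupBytes int64, nodeCount int, countErr error, releaseBuild bool) (int64, error) {
	if countErr != nil || nodeCount <= 0 {
		if !releaseBuild {
			if countErr == nil {
				countErr = fmt.Errorf("invalid node count %d", nodeCount)
			}
			return 0, countErr
		}
		// Release build: continue with a safe fallback instead of crashing
		// with a divide-by-zero.
		nodeCount = 1
	}
	return backupBytes / int64(nodeCount), nil
}

func main() {
	tp, err := perNodeThroughput(1<<30, 0, errors.New("node count unavailable"), true /* releaseBuild */)
	fmt.Println(tp, err) // falls back to a single node, no error
}
```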

55809: roachtest: fix decommission/randomize r=irfansharif a=tbg

The test could end up using fully decommissioned nodes for cli commands,
which does not work as of #55286.

Fixes #55581.

Release note: None


56019: lexbase: pin `reserved_keywords.go` within Bazel r=irfansharif a=irfansharif

It's an auto-generated file that bazel doesn't yet know how to construct
within the sandbox. Before this PR `make bazel-generate` would show
spurious diffs on a clean checkout without this file present. Now it
no longer will.

Unfortunately, it also means that successful bazel builds require
`reserved_keywords.go` to be generated ahead of time (it's not
checked into the repo). Once Bazel is taught to generate this file
however, this will no longer be the case. It was just something that I
missed in #55687.

Release note: None

Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
tbg added a commit to tbg/cockroach that referenced this pull request Nov 4, 2020
The test could end up using fully decommissioned nodes for cli commands,
which does not work as of cockroachdb#55286.
Additionally, decommissioned nodes now become non-live after a short
while, so various cli output checks had to be adjusted.

Fixes cockroachdb#55581.

Release note: None