gossip: avoid removing nodes that get a new address #34155
Conversation
@petermattis can you hand-hold me about where to implement a test for this? Should this be a unit test, a roachtest, or something else entirely (a CLI acceptance test, perhaps)? Where should that test live (which directory)? Here's the test script I'm working with, for reference:
#! /bin/sh
C=$PWD/cockroach
d=$PWD/repro-34120
killall -9 cockroach cockroachshort
rm -rf "$d"
mkdir -p "$d"
cd "$d"
export COCKROACH_SCAN_MAX_IDLE_TIME=20ms
# Start cluster
for n in 1 2 3 4; do
(
mkdir -p d$n
cd d$n
pn=$(expr $n - 1)
$C start --background --listen-addr=localhost:2600$n --http-addr=localhost:808$pn --insecure --join=localhost:26001
)
if test $n = 1; then
# Init cluster
$C init --host localhost:26001 --insecure
fi
# We sleep to ensure node IDs are generated in the same order as directories.
sleep 1
done
# Make store death detected faster.
$C sql --host localhost:26001 --insecure --echo-sql -e "SET CLUSTER SETTING server.time_until_store_dead = '1m15s'"
# Create some ranges
(
for i in `seq 1 10`; do
echo "create table t$i(x int); insert into t$i(x) values(1);"
done
) | $C sql --insecure --host localhost:26001
# Wait for cluster to stabilize
tail -f d1/cockroach-data/logs/cockroach.log &
tailpid=$!
while true; do
sleep 2
lastlines=$(tail -n 100 d1/cockroach-data/logs/cockroach.log | grep -v 'sending heartbeat' | grep -v 'received heartbeat' | tail -n 2 | grep 'runtime stats' | wc -l)
if test $lastlines -lt 2; then
continue
fi
break
done
kill -9 $tailpid
echo "OK"
# Start a load on all tables
(
echo '\set show_times on'
while true; do
sleep 1
for i in `seq 1 10`; do
echo "insert into t$i values(1);"
done
done
) | $C sql --insecure --echo-sql --host localhost:26001 &>sql.log &
# Show what's going on in the first node log's
tail -f d1/cockroach-data/logs/cockroach.log &
tailpid=$!
# Stop the last two nodes.
$C quit --host localhost:26004 --insecure
$C quit --host localhost:26003 --insecure
# Restart with node shift
(
cd d3
$C start --background --listen-addr=localhost:26004 --http-addr=localhost:27004 --insecure
)
wait $tailpid
(Note that I'm waiting on CI to tell me which other tests need adjusting. I'll do that adjustment myself.)
I'm a bit puzzled about why Raft is confused. I'd expect the restarted node to ensure the other nodes have an updated address for it.
Perhaps the store-not-found error is not causing
PS: I connected with @knz privately on testing.
Hmm, there is code in |
I'll investigate that, thanks.
pkg/gossip/gossip.go
@@ -840,21 +840,8 @@ func (g *Gossip) updateNodeAddress(key string, content roachpb.Value) {
log.Infof(ctx, "removing n%d which was at same address (%s) as new node %v",
	oldNodeID, desc.Address, desc)
g.removeNodeDescriptorLocked(oldNodeID)
Why isn't this also gone?
I figured that that gossip entry is known to be stale at that point so it may as well be removed from gossip until a fresh one is received.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @petermattis and @tbg)
pkg/gossip/gossip.go, line 842 at r1 (raw file):
Previously, knz (kena) wrote…
I figured that that gossip entry is known to be stale at that point so it may as well be removed from gossip until a fresh one is received.
This isn't removing the gossip entry, but the entry from Gossip.nodeDescs. But that entry might be up to date. Consider the scenario where node 1 and node 2 swap addresses. Node 1 might start up and gossip its new address. This will update the Gossip.nodeDescs map on every receiving node, and remove the entry for node 2. When node 2 gossips its new address (node 1's old address), this line will remove the nodeDescs entry for node 1, which is not correct. We could fix this by only removing the nodeDescs entry if the address matches what is in bootstrapAddrs.
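To make the swap scenario above concrete, here is a minimal, self-contained Go sketch (not the actual gossip.go code: the map, the nodeDesc type, and this updateNodeAddress are simplified stand-ins) of a removal that is guarded by the tracked entry's current address, so an entry that has already been refreshed is left alone:

package main

import "fmt"

type nodeDesc struct {
	nodeID  int
	address string
}

// updateNodeAddress is a toy stand-in for gossip address updates. It only
// evicts an old entry when that entry still points at the address being
// claimed, so a node that already re-gossiped its new address is left alone.
func updateNodeAddress(descs map[int]nodeDesc, incoming nodeDesc) {
	for id, old := range descs {
		if id != incoming.nodeID && old.address == incoming.address {
			// The old entry is provably stale: another node now owns this address.
			fmt.Printf("removing n%d, which was at the same address (%s) as new node n%d\n",
				id, old.address, incoming.nodeID)
			delete(descs, id)
		}
	}
	descs[incoming.nodeID] = incoming
}

func main() {
	descs := map[int]nodeDesc{
		1: {1, "host:26001"},
		2: {2, "host:26002"},
	}
	// n1 and n2 swap addresses; n1 gossips first, then n2.
	updateNodeAddress(descs, nodeDesc{1, "host:26002"}) // evicts the stale entry for n2
	updateNodeAddress(descs, nodeDesc{2, "host:26001"}) // must not evict n1's fresh entry
	fmt.Println(descs)
}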
Thanks for explaining.
knz force-pushed from 6fdc4a8 to 321a146
I have modified the PR to make the tests pass. That may be sufficient for the customer who raised the issue.
@petermattis I also re-analyzed why this code was introduced in the first place in #34120 (comment). Seems to me that it was just a completely incorrect fix.
Reworked the PR; added a 2nd commit with the fix to #34158. The following warnings remain however, emitted on n1 when n4 is restarted at n3's address:
I'd like to make these warnings go away, or become more descriptive. How come the error I added in the code ("client requested node ID doesn't match") is not printed by grpc? Does it not get transmitted back somehow?
Never mind, I found the RPC error I was expecting, it was a bit later in the log. The warnings I pasted above are also expected, they come from the node that was shut down entirely.
knz force-pushed from 111c31b to 85671d5
Ok so except for the missing roachtest/roachprod harness, which may take me a day or two, I think the bug fix is ready! (Note: the PR does have unit tests at the level of the RPC dial logic and gossip. Perhaps we can decouple the bug fix and the more comprehensive acceptance test?)
The GRPC logging is a mess. I think you should ignore it for now, but we should consider bumping it to ERROR:
Line 447 in d591a49
grpcutil.SetSeverity(log.Severity_WARNING)
Filed #34165. For this PR, I think that leaves the circuit breaker events, which again I would massage separately (taking a node down today yields the same errors).
I would change this line:
cockroach/pkg/rpc/nodedialer/nodedialer.go, line 206 in 8636a0c:
name := fmt.Sprintf("rpc %v->%v", n.rpcContext.Config.Addr, nodeID)
to use %v [n%d] instead of %v->%v (which is just confusing). The breaker should log only once per minute, so I am confused why you're seeing it log much more frequently.
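For concreteness, the suggested rename amounts to something like the following standalone sketch (the values are hypothetical; the real code builds the breaker name inside nodedialer):

package main

import "fmt"

func main() {
	// Hypothetical values, just to show the two formats side by side.
	localAddr, nodeID := "127.0.0.1:26257", 4

	// Current format: reads like an address-to-address hop.
	fmt.Println(fmt.Sprintf("rpc %v->%v", localAddr, nodeID)) // rpc 127.0.0.1:26257->4

	// Suggested format: labels the remote side explicitly as a node ID.
	fmt.Println(fmt.Sprintf("rpc %v [n%d]", localAddr, nodeID)) // rpc 127.0.0.1:26257 [n4]
}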
I think you don't need a migration for the changes to PingRequest since the field is optional and verified by the recipient. I.e. the field will be dropped silently, which works fine.
Other than these remarks the PR looks pretty good to me!
Reviewed 2 of 2 files at r2, 16 of 16 files at r3, 1 of 1 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/gossip/gossip_test.go, line 64 at r2 (raw file):
// TestGossipMoveNode verifies that if a node is moved to a new address, it
// gets properly updated in gossip (including that any other node that was
Stale comment.
pkg/rpc/context.go, line 668 at r3 (raw file):
remain
pkg/rpc/context.go, line 674 at r3 (raw file):
GRPDDialNode
pkg/rpc/context.go, line 788 at r3 (raw file):
MaxOffsetNanos: maxOffsetNanos,
ClusterID:      &clusterID,
NodeID:         conn.remoteNodeID,
remoteNodeID won't change out from under us, though? The comment above suggests that.
pkg/rpc/context_test.go, line 1167 at r3 (raw file):
func BenchmarkGRPCDialNode(b *testing.B) {
	if testing.Short() {
		b.Skip("TODO: fix benchmark")
Reminder to self
knz force-pushed from 85671d5 to 99b5af2
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/gossip/gossip_test.go, line 64 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Stale comment.
Done.
pkg/rpc/context.go, line 668 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
remain
Done.
pkg/rpc/context.go, line 674 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
GRPDDialNode
Done.
pkg/rpc/context.go, line 788 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
remoteNodeID
won't change out from under us, though? The comment above suggests that.
Done.
pkg/rpc/context_test.go, line 1167 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Reminder to self
I copy/pasted the benchmark from above; is there anything I should change in the TODO?
Also done.
Out of curiosity: there's about 1 minute between the moment the node is restarted at its new address and the moment grpc dials into it. Where is that 1-minute delay? (The 3-second heartbeat ensures that the connections to both of the nodes that are shut down are dropped quickly; however, there must be another delay somewhere to pick up the restarted node. I don't know what that delay is.)
Let's wait for @petermattis to give this a thorough look, but it looks good to me.
Also done.
I don't see it, though I might've missed it.
Out of curiosity: there's about 1 minute between the moment the node is restarted at its new address, and the moment grpc dials into it. Where is that 1 minute delay?
What do you mean? Are you perhaps confused by the circuit breaker logging here?
const logPerNodeFailInterval = time.Minute
A node should be contacted as soon as it's available (with maybe a few sec of delay).
Writing the roachtest will be a little awkward since roachtests are really not made for this sort of thing (moving store directories around). I have wondered before whether some tests should just deploy a single-node bash script, but that will spiral out of control, and not in a good way. @petermattis how do you think we should handle this? See #33120 for another thing worth automating that would be awkward to roachtestify.
Reviewed 1 of 16 files at r3, 3 of 3 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/rpc/context_test.go, line 1167 at r3 (raw file):
Previously, knz (kena) wrote…
I copy/pasted the benchmark from above; is there anything I should change in the TODO?
I would remove the copy you introduced.
That's not what I see. Instead I see this:
Questions:
knz force-pushed from 43987af to f5042aa
I don't see it, though I might've missed it.
forgot to push
What do you mean? Are you perhaps confused by the circuit breaker logging here?
see my previous comment above
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/rpc/context_test.go, line 1167 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I would remove the copy you introduced.
Done.
Writing the roachtest will be a little awkward since they're really not made for this sort of thing (moving store directories around). I have wondered before whether some tests should just deploy a single-node bash script, but that will spiral out of control and not in a good way. @petermattis how do you think we should handle this?
My suggestion is to just move the store directories around. roachprod put already does scp between remote nodes in treedist mode. There might be a little plumbing, but it should be possible to roachprod run <cluster> -- scp <src> <dest>. Or perhaps this would be better as roachprod cp <cluster> <node>:<src> <node>:<dest>. Regardless, I think this can be done in a follow-on PR.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/rpc/context.go, line 671 at r6 (raw file):
// a new one is created without a node ID requirement.
func (ctx *Context) GRPCDial(target string) *Connection {
	return ctx.GRPCDialNode(target, 0)
So gossip will use a different connection than non-gossip? I think this effectively doubles the number of internal network connections. Is that a problem?
Rather than adding the remoteNodeID to the map key, I was imagining that the Connection would contain a remoteNodeID atomic that is populated by the first PingResponse containing a NodeID.
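A rough sketch of that alternative, assuming heavily simplified stand-in types (the real Connection, PingResponse, and heartbeat plumbing differ):

package main

import (
	"fmt"
	"sync/atomic"
)

// pingResponse stands in for the heartbeat response carrying the responder's
// node ID (0 when the responder does not know its ID yet).
type pingResponse struct{ NodeID int32 }

// connection stands in for rpc.Connection: the remote node ID is learned
// lazily from the first heartbeat instead of being part of the cache key.
type connection struct {
	remoteNodeID int32 // accessed atomically; 0 until a PingResponse carries a NodeID
}

// observePing records the remote node ID from a heartbeat and reports a
// mismatch if a different, nonzero ID was already recorded.
func (c *connection) observePing(resp pingResponse) error {
	if resp.NodeID == 0 {
		return nil
	}
	if atomic.CompareAndSwapInt32(&c.remoteNodeID, 0, resp.NodeID) {
		return nil
	}
	if got := atomic.LoadInt32(&c.remoteNodeID); got != resp.NodeID {
		return fmt.Errorf("remote node ID changed from n%d to n%d", got, resp.NodeID)
	}
	return nil
}

func main() {
	var c connection
	fmt.Println(c.observePing(pingResponse{NodeID: 3})) // <nil>: first ID learned
	fmt.Println(c.observePing(pingResponse{NodeID: 3})) // <nil>: same ID again
	fmt.Println(c.observePing(pingResponse{NodeID: 5})) // mismatch error
}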
pkg/rpc/context.go, line 680 at r6 (raw file):
// a connection already existed with a different node ID requirement
// it is first dropped and a new one with the proper node ID
// requirement is created.
This last sentence doesn't match the code, unless I'm misunderstanding the code. It seems like an existing connection under a different remote node ID is left untouched.
pkg/rpc/heartbeat.go, line 99 at r6 (raw file):
nodeID = hs.nodeID.Get()
}
if args.NodeID != 0 && nodeID != 0 && args.NodeID != nodeID {
Shouldn't it be an error if the local node doesn't have an ID yet (nodeID == 0) while the remote node requested a specific ID? I think this check should be args.NodeID != 0 && args.NodeID != nodeID.
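A small standalone comparison of the two conditions (the argument names mirror the quoted snippet; the surrounding HeartbeatService is omitted):

package main

import "fmt"

// currentCheck mirrors the quoted snippet: it only rejects when the local
// node already knows its own ID.
func currentCheck(requestedNodeID, localNodeID int32) bool {
	return requestedNodeID != 0 && localNodeID != 0 && requestedNodeID != localNodeID
}

// suggestedCheck mirrors the suggestion above: a request that names a specific
// node ID is also rejected while the local node has no ID yet.
func suggestedCheck(requestedNodeID, localNodeID int32) bool {
	return requestedNodeID != 0 && requestedNodeID != localNodeID
}

func main() {
	// Remote asked for n3, but the local node has not learned its ID yet.
	fmt.Println(currentCheck(3, 0))   // false: accepted
	fmt.Println(suggestedCheck(3, 0)) // true: rejected
}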
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)
pkg/rpc/context.go, line 671 at r6 (raw file):
Previously, petermattis (Peter Mattis) wrote…
So gossip will use a different connection than non-gossip? I think this effectively doubles the number of internal network connections. Is that a problem?
Rather than adding the remoteNodeID to the map key, I was imagining that the Connection would contain a remoteNodeID atomic that is populated by the first PingResponse containing a NodeID.
Ok I'll try that.
pkg/rpc/heartbeat.go, line 99 at r6 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Shouldn't it be an error if the local node doesn't have an ID yet (nodeID == 0) while the remote node requested a specific ID? I think this check should be args.NodeID != 0 && args.NodeID != nodeID.
I have found (at least in some tests) that there are places where a HeartbeatService is instantiated without a valid node ID and I wasn't sure about that, hence this conditional. I'll change it back and report with details.
I added new logging inside the circuit breaker itself. That new logging does not use log.Every, though maybe it should; to use it we'd need to modify the calls here. When adding that logging I wasn't overly concerned about generating too much noise. Perhaps I should have been.
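For reference, the kind of rate limiting being discussed can be sketched with just the standard library (illustrative only; the real code would presumably reuse the existing log.Every-style helper rather than an ad-hoc limiter like this one):

package main

import (
	"fmt"
	"sync"
	"time"
)

// everyN lets an event through at most once per interval; a stand-in for the
// log-rate-limiting helper mentioned above.
type everyN struct {
	mu       sync.Mutex
	interval time.Duration
	last     time.Time
}

func (e *everyN) shouldLog(now time.Time) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if now.Sub(e.last) < e.interval {
		return false
	}
	e.last = now
	return true
}

func main() {
	limiter := &everyN{interval: time.Minute}
	for i := 0; i < 3; i++ {
		if limiter.shouldLog(time.Now()) {
			fmt.Println("circuit breaker tripped for n4") // only the first iteration logs
		}
	}
}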
K8s deployments make it possible for a node to get restarted using an address previously attributed to another node, *while the other node is still alive* (for example, a re-shuffling of node addresses during a rolling restart).

Prior to this patch, the gossip code was assuming that if a node starts with an address previously attributed to another node, that other node must be dead, and thus was (incorrectly) *erasing* that node's entry, thereby removing it from the cluster.

This scenario can be reproduced like this:
- start 4 nodes n1-n4
- stop n3 and n4
- restart n3 with n4's address

Prior to this patch, this scenario would yield "n4 removed from the cluster" in other nodes, and n3 was not restarting properly. With the patch, there is a period of time (until `server.time_until_store_dead`) during which Raft is confused to not find n4 at n3's address, but where the cluster otherwise operates normally. After the store times out, n4 is properly marked as down and the log spam stops.

Release note (bug fix): CockroachDB now enables re-starting a node at an address previously allocated for another node.
knz force-pushed from 3719d38 to 33afd7f
I'm going to fork the RPC change to a separate PR, so that the first change can be merged ahead of full resolution.
knz force-pushed from 33afd7f to 5bce267
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)
pkg/rpc/context.go, line 696 at r8 (raw file):
// ensure we're registering the connection we just created for
// future use by these other dials.
_, _ = ctx.conns.LoadOrStore(connKey{target, 0}, value)
One problem with this approach is that we won't remove this conn if there is a heartbeat or RPC error: we'll only remove the entry with the non-zero remoteNodeID. I think that can be fixed, but I'm not sure the complexity is worth it. I know I led you down this path, but thinking about it more I think I was in error. One mistake in my earlier thinking is that the number of gossip connections per-node is limited. So we won't be doubling the number of connections, and even when we do we're talking about a relatively small number of connections in the cluster until we reach very large cluster sizes.
I'd prefer to just remove this block of code and let there be 2 connections per node.
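To illustrate the cleanup asymmetry being described, here is a toy sketch (a plain map stands in for ctx.conns, and the key shape mirrors connKey; the real cache stores *Connection values):

package main

import "fmt"

// connKey mirrors the cache key discussed above: address plus requested node ID.
type connKey struct {
	target string
	nodeID int32
}

func main() {
	conns := map[connKey]string{}

	// The quoted code registers the connection under both keys.
	conns[connKey{"host:26004", 4}] = "conn"
	conns[connKey{"host:26004", 0}] = "conn"

	// A heartbeat failure only evicts the node-ID-specific entry...
	delete(conns, connKey{"host:26004", 4})

	// ...so the zero-ID alias keeps the broken connection alive.
	fmt.Println(conns) // map[{host:26004 0}:conn]
}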
Done - I have rewound this PR to just the first commit; the other is #34197. Without the other PR #34197 we keep the log spam, for the reasons explained in that PR.
Will do in the other PR.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)
Ok then we'll defer the larger test work to a separate PR.
bors r+
34155: gossip: avoid removing nodes that get a new address r=knz a=knz

Fixes #34120.

K8s deployments make it possible for a node to get restarted using an address previously attributed to another node, *while the other node is still alive* (for example, a re-shuffling of node addresses during a rolling restart).

Prior to this patch, the gossip code was assuming that if a node starts with an address previously attributed to another node, that other node must be dead, and thus was (incorrectly) *erasing* that node's entry, thereby removing it from the cluster.

This scenario can be reproduced like this:
- start 4 nodes n1-n4
- stop n3 and n4
- restart n3 with n4's address

Prior to this patch, this scenario would yield "n4 removed from the cluster" in other nodes, and n3 was not restarting properly. With the patch, there is a period of time (until `server.time_until_store_dead`) during which Raft is confused to not find n4 at n3's address, but where the cluster otherwise operates normally. After the store times out, n4 is properly marked as down and the log spam stops.

Release note (bug fix): CockroachDB now enables re-starting a node at an address previously allocated for another node.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Build succeeded
Prior to this patch, it was possible for a RPC client to dial a node ID and get a connection to another node instead. This is because the mapping of node ID -> address may be stale, and a different node could take the address of the intended node from "under" the dialer. (See cockroachdb#34155 for a scenario.)

This happened to be "safe" in many cases where it matters because:
- RPC requests for distSQL are OK with being served on a different node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the remote node because the store ID in the request would not match.

However this safety is largely accidental, and we should not work with the assumption that any RPC request is safe to be mis-routed. (In fact, we have not audited all the RPC endpoints and cannot establish this safety exists throughout.)

This patch works to prevent these mis-routings by adding a check of the intended node ID during RPC heartbeats (including the initial heartbeat), when the intended node ID is known. A new API `GRPCDialNode()` is introduced to establish such connections. This behaves as follows:

- node ID zero given, no connection cached: creates new connection that doesn't validate NodeID. This is suitable for the initial GRPC handshake during gossip, before node IDs are known. It is also suitable for the CLI commands which do not care about which node they are talking to (and they do not know the node ID yet -- only the RPC address).
- nonzero NodeID given, but connection cached with node ID zero: opens new connection, leaves old connection in place (so dialing to node ID zero later still gives the unvalidated conn back.) This is suitable when setting up e.g. Raft clients after the peer node IDs are determined. At this point we want to introduce node ID validation. The old connection remains in place because the gossip code does not react well from having its connection closed from "under it".
- zero given, cached with nonzero: will use the cached connection. This is suitable when gossip needs to verify e.g. the health of some remote node known only by its address. In this case it's OK to have it use the connection that is already established.

This flexibility suggests that it is possible for client components to "opt out" of node ID validation by specifying a zero value, in other places than strictly necessary for gossip and CLI commands. In fact, the situation is even more uncomfortable: it requires extra work to set up the node ID and naive test code will be opting out of validation implicitly, without clear feedback. This mis-design is addressed by a subsequent commit.

Release note (bug fix): CockroachDB now performs fewer attempts to communicate with the wrong node, when a node is restarted with another node's address.
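A condensed sketch of those three lookup rules (simplified: strings stand in for *Connection values, and the heartbeat-based validation itself is elided):

package main

import "fmt"

type connKey struct {
	target string
	nodeID int32
}

type connCache map[connKey]string

// dialNode applies the lookup rules from the commit message: a zero node ID
// reuses whatever connection exists for the target, while a nonzero node ID
// gets its own validated connection alongside any unvalidated one.
func (c connCache) dialNode(target string, nodeID int32) string {
	if nodeID == 0 {
		// Zero given: reuse any cached connection for this target, otherwise
		// create an unvalidated one (initial gossip handshake, CLI commands).
		for k, conn := range c {
			if k.target == target {
				return conn
			}
		}
		c[connKey{target, 0}] = "unvalidated conn to " + target
		return c[connKey{target, 0}]
	}
	// Nonzero given: reuse a matching validated connection if cached.
	key := connKey{target, nodeID}
	if conn, ok := c[key]; ok {
		return conn
	}
	// Otherwise open a new validated connection; an existing zero-ID
	// connection is left in place for gossip.
	c[key] = fmt.Sprintf("validated conn to %s (n%d)", target, nodeID)
	return c[key]
}

func main() {
	cache := connCache{}
	fmt.Println(cache.dialNode("host:26001", 0)) // unvalidated (node ID not known yet)
	fmt.Println(cache.dialNode("host:26001", 1)) // new validated connection; old one stays
	fmt.Println(cache.dialNode("host:26001", 0)) // reuses a cached connection
}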
34197: server,rpc: validate node IDs in RPC heartbeats r=tbg a=knz

Fixes #34158.

Prior to this patch, it was possible for a RPC client to dial a node ID and get a connection to another node instead. This is because the mapping of node ID -> address may be stale, and a different node could take the address of the intended node from "under" the dialer. (See #34155 for a scenario.)

This happened to be "safe" in many cases where it matters because:
- RPC requests for distSQL are OK with being served on a different node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the remote node because the store ID in the request would not match.

However this safety is largely accidental, and we should not work with the assumption that any RPC request is safe to be mis-routed. (In fact, we have not audited all the RPC endpoints and cannot establish this safety exists throughout.)

This patch works to prevent these mis-routings by adding a check of the intended node ID during RPC heartbeats (including the initial heartbeat), when the intended node ID is known. A new API `GRPCDialNode()` is introduced to establish such connections.

Release note (bug fix): CockroachDB now performs fewer attempts to communicate with the wrong node, when a node is restarted with another node's address.

36952: storage: deflake TestNodeLivenessStatusMap r=tbg a=knz

Fixes #35675.

Prior to this patch, this test would fail `stressrace` after a few dozen iterations. With this patch, `stressrace` succeeds for thousands of iterations. I have checked that the test logic is preserved: if I change one of the expected statuses in `testData`, the test still fails properly.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Fixes #34120.
K8s deployments make it possible for a node to get restarted using an
address previously attributed to another node, while the other node
is still alive (for example, a re-shuffling of node addresses during
a rolling restart).
Prior to this patch, the gossip code was assuming that if a node
starts with an address previously attributed to another node, that
other node must be dead, and thus was (incorrectly) erasing that
node's entry, thereby removing it from the cluster.
This scenario can be reproduced like this:
- start 4 nodes n1-n4
- stop n3 and n4
- restart n3 with n4's address
Prior to this patch, this scenario would yield "n4 removed from the
cluster" in other nodes, and n3 was not restarting properly. With the
patch, there is a period of time (until server.time_until_store_dead)
during which Raft is confused to not find n4 at n3's address, but where
the cluster otherwise operates normally. After the store times out, n4
is properly marked as down and the log spam stops.
Release note (bug fix): CockroachDB now enables re-starting a node at
an address previously allocated for another node.