
gossip: avoid removing nodes that get a new address #34155

Merged
merged 1 commit into cockroachdb:master from the 20190121-fix-node-restart branch on Jan 23, 2019

Conversation

@knz (Contributor) commented Jan 21, 2019

Fixes #34120.

K8s deployments make it possible for a node to get restarted using an
address previously attributed to another node, while the other node
is still alive
(for example, a re-shuffling of node addresses during
a rolling restart).

Prior to this patch, the gossip code was assuming that if a node
starts with an address previously attributed to another node, that
other node must be dead, and thus was (incorrectly) erasing that
node's entry, thereby removing it from the cluster.

This scenario can be reproduced like this:

  • start 4 nodes n1-n4
  • stop n3 and n4
  • restart n3 with n4's address

Prior to this patch, this scenario would yield "n4 removed from the
cluster" on the other nodes, and n3 would not restart properly. With the
patch, there is a period of time (until
server.time_until_store_dead) during which Raft is confused by not
finding n4 at n3's address, but the cluster otherwise operates
normally. After the store times out, n4 is properly marked as down and
the log spam stops.

Release note (bug fix): CockroachDB now enables re-starting a node at
an address previously allocated for another node.
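
To make the behavioral change concrete, here is a minimal sketch of the idea, using simplified, hypothetical types rather than the actual pkg/gossip code: when a descriptor arrives whose address was previously attributed to a different node, the other node's descriptor is kept rather than erased, and liveness (server.time_until_store_dead) decides whether that node is really dead.

// Illustrative sketch only; names are simplified and do not match pkg/gossip.
package gossip

import "log"

type NodeID int

type NodeDescriptor struct {
    NodeID  NodeID
    Address string
}

type Gossip struct {
    nodeDescs map[NodeID]NodeDescriptor // last known descriptor per node
}

// updateNodeAddress is called when a fresh node descriptor is gossiped.
func (g *Gossip) updateNodeAddress(desc NodeDescriptor) {
    for id, old := range g.nodeDescs {
        if id != desc.NodeID && old.Address == desc.Address {
            // Before this patch: the other node was assumed dead and its
            // descriptor was removed, which broke clusters whose addresses
            // get reshuffled (e.g. rolling restarts under Kubernetes).
            // After this patch: keep it. If that node is really dead,
            // liveness marks it dead after server.time_until_store_dead;
            // if it is alive, it will gossip its own new address shortly.
            log.Printf("n%d is now at %s, previously attributed to n%d; keeping n%d",
                desc.NodeID, desc.Address, id, id)
        }
    }
    g.nodeDescs[desc.NodeID] = desc
}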

@knz knz requested review from tbg, petermattis and a team January 21, 2019 17:57
@cockroach-teamcity (Member) commented:

This change is Reviewable

@knz (Contributor, Author) commented Jan 21, 2019

@petermattis can you hand-hold me on where to implement a test for this? Should this be a unit test, a roachtest, or something else entirely? (A CLI acceptance test, perhaps?)

Where should that test live? (Which directory?)

here's the test script I'm working with, for reference:

#! /bin/sh

C=$PWD/cockroach
d=$PWD/repro-34120

killall -9 cockroach cockroachshort

rm -rf "$d"
mkdir -p "$d"
cd "$d"

export COCKROACH_SCAN_MAX_IDLE_TIME=20ms

# Start cluster
for n in 1 2 3 4; do
    (
        mkdir -p d$n
        cd d$n
        pn=$(expr $n - 1)
        $C start --background --listen-addr=localhost:2600$n --http-addr=localhost:808$pn --insecure --join=localhost:26001
    )
    if test $n = 1; then
        # Init cluster
        $C init --host localhost:26001 --insecure
    fi
    # We sleep to ensure node IDs are generated in the same order as directories.
    sleep 1
done

# Make store death detected faster.
$C sql --host localhost:26001 --insecure --echo-sql -e "SET CLUSTER SETTING server.time_until_store_dead = '1m15s'"

# Create some ranges
(
    for i in `seq 1 10`; do
        echo "create table t$i(x int); insert into t$i(x) values(1);"
    done
) | $C sql --insecure --host localhost:26001

# Wait for cluster to stabilize
tail -f d1/cockroach-data/logs/cockroach.log &
tailpid=$!
while true; do
    sleep 2
    lastlines=$(tail -n 100 d1/cockroach-data/logs/cockroach.log | grep -v 'sending heartbeat' | grep -v 'received heartbeat' | tail -n 2 | grep 'runtime stats' | wc -l)
    if test $lastlines -lt 2; then
        continue
    fi
    break
done
kill -9 $tailpid
echo "OK"

# Start a load on all tables
(
    echo '\set show_times on'
    while true; do
        sleep 1
        for i in `seq 1 10`; do
            echo "insert into t$i values(1);"
        done
    done
) | $C sql --insecure --echo-sql --host localhost:26001 &>sql.log &

# Show what's going on in the first node's log
tail -f d1/cockroach-data/logs/cockroach.log &
tailpid=$!

# Stop the last two nodes.
$C quit --host localhost:26004 --insecure
$C quit --host localhost:26003 --insecure

# Restart with node shift
(
    cd d3
    $C start --background --listen-addr=localhost:26004 --http-addr=localhost:27004 --insecure
)

wait $tailpid

@knz (Contributor, Author) commented Jan 21, 2019

(Note that I'm waiting on CI to tell me which other tests need adjusting. I'll do this adjustment myself)

@knz (Contributor, Author) commented Jan 21, 2019

I'm a bit puzzled about why raft is confused. I'd expect the restarted node to ensure the other nodes have an updated address for it.

@petermattis (Collaborator) commented:

I'm a bit puzzled about why raft is confused. I'd expect the restarted node to ensure the other nodes have an updated address for it.

Perhaps the store-not-found error is not causing storage/raft_transport.go to create a new connection. Note that currently we don't actually do any check that the receiving node is the correct node, even though we do have that information. See RaftTransport.handleRaftRequest.
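
For illustration, here is a hedged sketch of the kind of recipient check being described as missing; the types and field names are simplified and hypothetical, not the actual storage.RaftTransport code.

package storage

import "fmt"

type NodeID int

// RaftMessageRequest carries the intended recipient, which the receiving
// node could verify but currently does not.
type RaftMessageRequest struct {
    ToNodeID NodeID
}

type RaftTransport struct {
    localNodeID NodeID
}

func (t *RaftTransport) handleRaftRequest(req *RaftMessageRequest) error {
    if req.ToNodeID != 0 && req.ToNodeID != t.localNodeID {
        // The sender dialed a stale address: this node is not the intended
        // recipient, so reject instead of silently processing the request.
        return fmt.Errorf("raft message intended for n%d arrived at n%d",
            req.ToNodeID, t.localNodeID)
    }
    // ... normal processing elided ...
    return nil
}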

@petermattis (Collaborator) commented:

PS I connected with @knz privately on testing.

@petermattis (Collaborator) commented:

Hmm, there is code in Store.HandleRaftResponse that claims we do close down the Raft connection when receiving a StoreNotFoundError. Perhaps that isn't working correctly.

@knz (Contributor, Author) commented Jan 21, 2019 via email

@@ -840,21 +840,8 @@ func (g *Gossip) updateNodeAddress(key string, content roachpb.Value) {
log.Infof(ctx, "removing n%d which was at same address (%s) as new node %v",
oldNodeID, desc.Address, desc)
g.removeNodeDescriptorLocked(oldNodeID)
Review comment (Member):

Why isn't this also gone?

Reply (Contributor, Author):

I figured that that gossip entry is known to be stale at that point so it may as well be removed from gossip until a fresh one is received.

@petermattis (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @petermattis and @tbg)


pkg/gossip/gossip.go, line 842 at r1 (raw file):

Previously, knz (kena) wrote…

I figured that that gossip entry is known to be stale at that point so it may as well be removed from gossip until a fresh one is received.

This isn't removing the gossip entry, but the entry from Gossip.nodeDescs. But that entry might be up to date. Consider the scenario where node 1 and node 2 swap addresses. Node 1 might start up and gossip its new address. This will update the Gossip.nodeDescs map on every receiving node, and remove the entry for node 2. When node 2 gossips its new address (node 1's old address), this line will remove the nodeDescs entry for node 1 which is not correct. We could fix this by only removing the nodeDescs entry if the address matches what is in bootstrapAddrs.
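
As a sketch of the fix suggested here (again with simplified, hypothetical types, re-declared so the snippet stands alone): removal would be gated on the stale address being one we only know as a bootstrap address.

package gossip

type NodeID int

type NodeDescriptor struct {
    NodeID  NodeID
    Address string
}

type Gossip struct {
    nodeDescs      map[NodeID]NodeDescriptor
    bootstrapAddrs map[string]struct{} // addresses this node was told to join through
}

// maybeRemoveConflictingDesc drops another node's descriptor only when the
// conflicting address is a known bootstrap address; otherwise the entry may
// be perfectly up to date (e.g. two nodes that swapped addresses), so it is
// left alone.
func (g *Gossip) maybeRemoveConflictingDesc(newDesc NodeDescriptor) {
    for id, old := range g.nodeDescs {
        if id == newDesc.NodeID || old.Address != newDesc.Address {
            continue
        }
        if _, ok := g.bootstrapAddrs[old.Address]; ok {
            delete(g.nodeDescs, id)
        }
    }
}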

@knz (Contributor, Author) commented Jan 21, 2019

Thanks for explaining.

@knz (Contributor, Author) commented Jan 22, 2019

I have modified the PR to make the tests pass. That may be sufficient for the customer who raised the issue.
However I am also working to resolve #34158 to make the fix more ironclad.

@tbg (Member) commented Jan 22, 2019

@petermattis I also re-analyzed why this code was introduced in the first place in #34120 (comment). Seems to me that it was just a completely incorrect fix.

@knz knz requested review from a team January 22, 2019 14:15
@knz (Contributor, Author) commented Jan 22, 2019

Reworked the PR; added a second commit with the fix for #34158.
I have verified that it makes the warnings from the test scenario completely go away (Raft finds itself happy nearly instantaneously).

The following warnings remain, however, emitted on n1 when n3 is restarted at n4's address:

W190122 14:12:54.422039 885 vendor/google.golang.org/grpc/clientconn.go:1440  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
W190122 14:12:54.442055 2372 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26003 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:26003: connect: connection refused". Reconnecting...
I190122 14:12:54.442164 2358 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc 127.0.0.1:26001->3 tripped: failed to grpc dial n3 at localhost:26003: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest
connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:26003: connect: connection refused"
I190122 14:12:54.442186 2358 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc 127.0.0.1:26001->3 event: BreakerTripped
I190122 14:12:54.442196 2358 rpc/nodedialer/nodedialer.go:95  [ct-client] unable to connect to n3: failed to grpc dial n3 at localhost:26003: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transpo
rt: Error while dialing dial tcp [::1]:26003: connect: connection refused"
W190122 14:12:55.448289 2372 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26003 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190122 14:12:55.448312 2372 vendor/google.golang.org/grpc/clientconn.go:953  Failed to dial localhost:26003: grpc: the connection is closing; please retry.
W190122 14:12:55.482671 1289 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26004 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190122 14:12:55.482698 1295 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: rpc error: code = Unavailable desc = transport is closing:
W190122 14:12:55.482708 1289 vendor/google.golang.org/grpc/clientconn.go:1440  grpc: addrConn.transportMonitor exits due to: context canceled
W190122 14:12:55.648681 2377 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26004 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:26004: connect: connection refused". Reconnecting...
I190122 14:12:55.648764 2314 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc 127.0.0.1:26001->4 tripped: failed to grpc dial n4 at localhost:26004: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest
connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:26004: connect: connection refused"
I190122 14:12:55.648777 2314 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc 127.0.0.1:26001->4 event: BreakerTripped
I190122 14:12:55.648784 2314 rpc/nodedialer/nodedialer.go:95  [ct-client] unable to connect to n4: failed to grpc dial n4 at localhost:26004: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transpo
rt: Error while dialing dial tcp [::1]:26004: connect: connection refused"
W190122 14:12:56.655188 2377 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26004 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190122 14:12:56.655201 2377 vendor/google.golang.org/grpc/clientconn.go:953  Failed to dial localhost:26004: grpc: the connection is closing; please retry.
I190122 14:12:56.720259 105 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc 127.0.0.1:26001->3 event: BreakerReset

I'd like to make these warnings go away, or become more descriptive. How come the error I added in the code ("client requested node ID doesn't match") is not printed by grpc? Does it not get transmitted back somehow?

@knz (Contributor, Author) commented Jan 22, 2019

Never mind - I found the RPC error I was expecting; it was a bit later in the log.

The warnings I pasted above are also expected, they come from the node that was shut down entirely.

@knz knz force-pushed the 20190121-fix-node-restart branch from 111c31b to 85671d5 Compare January 22, 2019 14:46
@knz (Contributor, Author) commented Jan 22, 2019

Ok so except for the missing roachtest/roachprod harness, which may take me a day or two, I think the bug fix is ready!

(Note: the PR does have unit tests at the level of the RPC dial logic and gossip. Perhaps we can decouple the bug fix and the more comprehensive acceptance test?).

@tbg (Member) left a comment

The GRPC logging is a mess. I think you should ignore it for now, but we should consider bumping it to ERROR:

grpcutil.SetSeverity(log.Severity_WARNING)

Filed #34165. I think that, for this PR, this leaves the circuit breaker events, which I would again massage separately (taking a node down today yields the same errors).
I would change this line:

name := fmt.Sprintf("rpc %v->%v", n.rpcContext.Config.Addr, nodeID)
so that it explicitly says the node ID (%v [n%d] instead of %v->%v, which is just confusing). The breaker should log only once per minute, so I am confused why you're seeing it log much more frequently.
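
For concreteness, the rename being asked for would look roughly like this (a standalone sketch; the exact replacement string and the example address are assumptions based on the wording above):

package main

import "fmt"

func main() {
    addr := "127.0.0.1:26001" // example dialer address (placeholder)
    nodeID := 3

    // Current breaker name: "%v->%v" renders two values the same way, which
    // reads ambiguously in the logs.
    current := fmt.Sprintf("rpc %v->%v", addr, nodeID)
    // Suggested: call out the remote node ID explicitly.
    suggested := fmt.Sprintf("rpc %v [n%d]", addr, nodeID)

    fmt.Println(current)   // rpc 127.0.0.1:26001->3
    fmt.Println(suggested) // rpc 127.0.0.1:26001 [n3]
}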

I think you don't need a migration for the changes to PingRequest since the field is optional and verified by the recipient. I.e. the field will be dropped silently, which works fine.

Other than these remarks the PR looks pretty good to me!

Reviewed 2 of 2 files at r2, 16 of 16 files at r3, 1 of 1 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/gossip/gossip_test.go, line 64 at r2 (raw file):

// TestGossipMoveNode verifies that if a node is moved to a new address, it
// gets properly updated in gossip (including that any other node that was

Stale comment.


pkg/rpc/context.go, line 668 at r3 (raw file):

remain


pkg/rpc/context.go, line 674 at r3 (raw file):

GRPDDialNode


pkg/rpc/context.go, line 788 at r3 (raw file):

			MaxOffsetNanos: maxOffsetNanos,
			ClusterID:      &clusterID,
			NodeID:         conn.remoteNodeID,

remoteNodeID won't change out from under us, though? The comment above suggests that.


pkg/rpc/context_test.go, line 1167 at r3 (raw file):

func BenchmarkGRPCDialNode(b *testing.B) {
	if testing.Short() {
		b.Skip("TODO: fix benchmark")

Reminder to self

@knz knz force-pushed the 20190121-fix-node-restart branch from 85671d5 to 99b5af2 Compare January 22, 2019 15:30
@knz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/gossip/gossip_test.go, line 64 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Stale comment.

Done.


pkg/rpc/context.go, line 668 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

remain

Done.


pkg/rpc/context.go, line 674 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

GRPDDialNode

Done.


pkg/rpc/context.go, line 788 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

remoteNodeID won't change out from under us, though? The comment above suggests that.

Done.


pkg/rpc/context_test.go, line 1167 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Reminder to self

I copy/pasted the benchmark from above; is there anything I should change in the TODO?

@knz (Contributor, Author) commented Jan 22, 2019

I would change this line:
cockroach/pkg/rpc/nodedialer/nodedialer.go
so that it explicitly says the node ID (%v [n%d] instead of %v->%v, which is just confusing).

Also done.

@knz (Contributor, Author) commented Jan 22, 2019

Out of curiosity: there's about 1 minute between the moment the node is restarted at its new address, and the moment grpc dials into it. Where is that 1 minute delay?

(The 3-second heartbeat ensures that the connections to both of the nodes that were shut down are dropped quickly; however, there must be another delay somewhere to pick up the restarted node. I don't know what that delay is.)

@tbg (Member) left a comment

Let's wait for @petermattis to give this a thorough look, but it looks good to me.

Also done.

I don't see it, though I might've missed it.

Out of curiosity: there's about 1 minute between the moment the node is restarted at its new address, and the moment grpc dials into it. Where is that 1 minute delay?

What do you mean? Are you perhaps confused by the circuit breaker logging here?

const logPerNodeFailInterval = time.Minute

A node should be contacted as soon as it's available (with maybe a few sec of delay).

Writing the roachtest will be a little awkward since they're really not made for this sort of thing (moving store directories around). I have wondered before whether some tests should just deploy a single-node bash script, but that will spiral out of control and not in a good way. @petermattis how do you think we should handle this? See #33120 for another thing worth automating that would be awkward to roachtestify.

Reviewed 1 of 16 files at r3, 3 of 3 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/rpc/context_test.go, line 1167 at r3 (raw file):

Previously, knz (kena) wrote…

I copy/pasted the benchmark from above; is there anything I should change in the TODO?

I would remove the copy you introduced.

@knz (Contributor, Author) commented Jan 22, 2019

That's not what I see. Instead I see this:

  1. time 0: node n3 gets restarted at n4's previous address
  2. time 0 + epsilon: grpc errors on existing connections from n1 and n2 to n3/n4 - this is expected
  3. time 0 + ~1m: n1 receives the error "intended node ID 4 does not match actual node ID 3" - n1 has tried to connect to n4 (the dead node) at its old address, now taken by n3.
  4. time 0 + ~1m: n1 starts streaming snapshots to n3 - this seems independent from the previous step

Questions:

  • I don't care about the delay between steps 2 and 3, although I don't understand it. I had expected a reconnect to n4 (now dead) to happen immediately after the connection was lost at step 2.
  • The delay between steps 2 and 4 bothers me. I'd have expected n1 to establish a connection to n3 (restarted, fine albeit at a new address) within seconds, as soon as an updated address was received (which should be within seconds, thanks to gossip).

@knz knz force-pushed the 20190121-fix-node-restart branch 2 times, most recently from 43987af to f5042aa Compare January 22, 2019 22:21
@knz (Contributor, Author) left a comment

I don't see it, though I might've missed it.

forgot to push

What do you mean? Are you perhaps confused by the circuit breaker logging here?

see my previous comment above

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/rpc/context_test.go, line 1167 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I would remove the copy you introduced.

Done.

@petermattis (Collaborator) left a comment

Writing the roachtest will be a little awkward since they're really not made for this sort of thing (moving store directories around). I have wondered before whether some tests should just deploy a single-node bash script, but that will spiral out of control and not in a good way. @petermattis how do you think we should handle this?

My suggestion is to just move the store directories around. roachprod put already does scp between remote nodes in treedist mode. There might be a little plumbing, but it should be possible to roachprod run <cluster> -- scp <src> <dest>. Or perhaps this would be better as roachprod cp <cluster> <node>:<src> <node>:<dest>. Regardless, I think this can be done in a follow-on PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/rpc/context.go, line 671 at r6 (raw file):

// a new one is created without a node ID requirement.
func (ctx *Context) GRPCDial(target string) *Connection {
	return ctx.GRPCDialNode(target, 0)

So gossip will use a different connection than non-gossip? I think this effectively doubles the number of internal network connections. Is that a problem?

Rather that adding the remoteNodeID to the map key, I was imagining that the Connection would contain a remoteNodeID atomic that is populated by the first PingResponse containing a NodeID.


pkg/rpc/context.go, line 680 at r6 (raw file):

// a connection already existed with a different node ID requirement
// it is first dropped and a new one with the proper node ID
// requirement is created.

This last sentence doesn't match the code, unless I'm misunderstanding the code. It seems like an existing connection under a different remote node ID is left untouched.


pkg/rpc/heartbeat.go, line 99 at r6 (raw file):

		nodeID = hs.nodeID.Get()
	}
	if args.NodeID != 0 && nodeID != 0 && args.NodeID != nodeID {

Shouldn't it be an error if the local node doesn't have an ID yet (nodeID == 0) while the remote node requested a specific ID? I think this check should be args.NodeID != 0 && args.NodeID != nodeID.
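
A sketch of the stricter check being suggested, with simplified, hypothetical types rather than the actual pkg/rpc HeartbeatService: a ping that names a specific node ID is rejected whenever it does not match the local ID, including while the local ID is still unset.

package rpc

import "fmt"

type NodeID int

type PingRequest struct {
    NodeID NodeID // intended recipient; 0 means "no requirement"
}

type HeartbeatService struct {
    nodeID NodeID // the local node's ID; may still be 0 early in startup
}

// checkIntendedNodeID rejects pings whose requested node ID does not match
// the local node ID.
func (hs *HeartbeatService) checkIntendedNodeID(args *PingRequest) error {
    if args.NodeID != 0 && args.NodeID != hs.nodeID {
        return fmt.Errorf("client requested node ID %d, but this server is n%d",
            args.NodeID, hs.nodeID)
    }
    return nil
}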

@knz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)


pkg/rpc/context.go, line 671 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

So gossip will use a different connection that non-gossip? I think this effectively doubles the number of internal network connections. Is that a problem?

Rather that adding the remoteNodeID to the map key, I was imagining that the Connection would contain a remoteNodeID atomic that is populated by the first PingResponse containing a NodeID.

Ok I'll try that.


pkg/rpc/heartbeat.go, line 99 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Shouldn't it be an error if the local node doesn't have an ID yet (nodeID == 0) while the remote node requested a specific ID? I think this check should be args.NodeID != 0 && args.NodeID != nodeID.

I have found (at least in some tests) that there are places where a HeartbeatService is instantiated without a valid node ID and I wasn't sure about that, hence this conditional. I'll change it back and report with details.

@ajwerner (Contributor) commented:

The breaker should log only once per minute, so I am confused why you're seeing it log much more frequently.

I added new logging inside the circuit breaker itself. That new logging does not use log.Every, though maybe it should. To use it we'd need to modify the calls here.

When adding that logging I wasn't overly concerned about generating too much noise. Perhaps I should have been.
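
For reference, rate-limiting that logging would look something like the following self-contained sketch; this is a generic illustration of the idea, not the CockroachDB log.Every API.

package breakerlog

import (
    "log"
    "sync"
    "time"
)

// everyN allows an event through at most once per interval, mirroring the
// kind of throttling discussed above.
type everyN struct {
    mu       sync.Mutex
    interval time.Duration
    last     time.Time
}

func (e *everyN) shouldLog() bool {
    e.mu.Lock()
    defer e.mu.Unlock()
    if now := time.Now(); now.Sub(e.last) >= e.interval {
        e.last = now
        return true
    }
    return false
}

var breakerEvents = everyN{interval: time.Minute}

// logBreakerEvent shows how a breaker callback might throttle its output.
func logBreakerEvent(name, event string) {
    if breakerEvents.shouldLog() {
        log.Printf("circuitbreaker: %s event: %s", name, event)
    }
}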

@knz knz force-pushed the 20190121-fix-node-restart branch 2 times, most recently from 3719d38 to 33afd7f Compare January 23, 2019 17:16
@knz (Contributor, Author) commented Jan 23, 2019

I'm going to fork the RPC change to a separate PR, so that the first change can be merged ahead of full resolution.

@petermattis (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)


pkg/rpc/context.go, line 696 at r8 (raw file):

			// ensure we're registering the connection we just created for
			// future use by these other dials.
			_, _ = ctx.conns.LoadOrStore(connKey{target, 0}, value)

One problem with this approach is that we won't remove this conn if there is a heartbeat or RPC error: we'll only remove the entry with the non-zero remoteNodeID. I think that can be fixed, but I'm not sure the complexity is worth it. I know I led you down this path, but thinking about it more I think I was in error. One thing my earlier thinking missed is that the number of gossip connections per node is limited. So we won't be doubling the number of connections, and even when we do, we're talking about a relatively small number of connections in the cluster until we reach very large cluster sizes.

I'd prefer to just remove this block of code and let there be 2 connections per node.

@knz (Contributor, Author) commented Jan 23, 2019

Done - I have rewound this PR to just the first commit; the other commit is now #34197.

Without the other PR #34197 we keep the log spam, for the reasons explained in that PR.

W190123 18:43:46.037282 318 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
W190123 18:43:46.040255 3813 storage/store.go:3744  [n1,s1,r9/1:/Table/1{3-4}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):4: store 4 was not found
W190123 18:43:46.040267 3826 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
W190123 18:43:46.053504 3828 storage/store.go:3744  [n1,s1,r9/1:/Table/1{3-4}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):4: store 4 was not found
W190123 18:43:46.053517 3800 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
W190123 18:43:46.058747 3830 storage/store.go:3744  [n1,s1,r9/1:/Table/1{3-4}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):4: store 4 was not found

@knz (Contributor, Author) commented Jan 23, 2019

One problem with this approach is ...
I'd prefer to just remove this block of code and let there be 2 connections per node.

Will do in the other PR

@petermattis (Collaborator) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)

@knz (Contributor, Author) commented Jan 23, 2019

Ok then we'll defer the larger test work to a separate PR.

bors r+

craig bot pushed a commit that referenced this pull request Jan 23, 2019
34155: gossip: avoid removing nodes that get a new address r=knz a=knz

Fixes #34120.

K8s deployments make it possible for a node to get restarted using an
address previously attributed to another node, *while the other node
is still alive* (for example, a re-shuffling of node addresses during
a rolling restart).

Prior to this patch, the gossip code was assuming that if a node
starts with an address previously attributed to another node, that
other node must be dead, and thus was (incorrectly) *erasing* that
node's entry, thereby removing it from the cluster.

This scenario can be reproduced like this:

- start 4 nodes n1-n4
- stop n3 and n4
- restart n3 with n4's address

Prior to this patch, this scenario would yield "n4 removed from the
cluster" on the other nodes, and n3 would not restart properly. With the
patch, there is a period of time (until
`server.time_until_store_dead`) during which Raft is confused by not
finding n4 at n3's address, but the cluster otherwise operates
normally. After the store times out, n4 is properly marked as down and
the log spam stops.

Release note (bug fix): CockroachDB now enables re-starting a node at
an address previously allocated for another node.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
craig bot (@craig, Contributor) commented Jan 23, 2019

Build succeeded

@craig craig bot merged commit 5bce267 into cockroachdb:master Jan 23, 2019
@knz knz deleted the 20190121-fix-node-restart branch January 23, 2019 19:27
knz added a commit to knz/cockroach that referenced this pull request Apr 15, 2019
Prior to this patch, it was possible for an RPC client to dial a node
ID and get a connection to another node instead. This is because the
mapping of node ID -> address may be stale, and a different node could
take the address of the intended node from "under" the dialer.

(See cockroachdb#34155 for a scenario.)

This happened to be "safe" in many cases where it matters because:

- RPC requests for distSQL are OK with being served on a different
  node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different
  node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the
  remote node because the store ID in the request would not match.

However this safety is largely accidental, and we should not work with
the assumption that any RPC request is safe to be mis-routed. (In
fact, we have not audited all the RPC endpoints and cannot establish
this safety exists throughout.)

This patch works to prevent these mis-routings by adding a check of
the intended node ID during RPC heartbeats (including the initial
heartbeat), when the intended node ID is known. A new API
`GRPCDialNode()` is introduced to establish such connections.

This behaves as follows:

- node ID zero given, no connection cached: creates new connection
  that doesn't validate NodeID.

  This is suitable for the initial GRPC handshake during gossip,
  before node IDs are known. It is also suitable for the CLI
  commands which do not care about which node they are talking to (and
  they do not know the node ID yet -- only the RPC address).

- nonzero NodeID given, but connection cached with node ID zero: opens
  new connection, leaves old connection in place (so dialing to node
  ID zero later still gives the unvalidated conn back.)

  This is suitable when setting up e.g. Raft clients after the
  peer node IDs are determined. At this point we want to introduce
  node ID validation.

  The old connection remains in place because the gossip code does not
  react well from having its connection closed from "under it".

- zero given, cached with nonzero: will use the cached connection.

  This is suitable when gossip needs to verify e.g. the health of
  some remote node known only by its address. In this case it's OK
  to have it use the connection that is already established.

This flexibility suggests that it is possible for client components to
"opt out" of node ID validation by specifying a zero value, in places
other than those strictly necessary for gossip and CLI commands. In fact,
the situation is even more uncomfortable: it requires extra work
to set up the node ID and naive test code will be opting out of
validation implicitly, without clear feedback. This mis-design is
addressed by a subsequent commit.

Release note (bug fix): CockroachDB now performs fewer attempts to
communicate with the wrong node, when a node is restarted with another
node's address.
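
A minimal sketch of the connection-caching rules spelled out above, with simplified, hypothetical types; the real rpc.Context dials gRPC and enforces the node ID during heartbeats, which is elided here.

package rpc

import "sync"

type NodeID int

// Connection stands in for a cached gRPC connection.
type Connection struct {
    target       string
    remoteNodeID NodeID // 0 means the peer's node ID is not validated
}

type connKey struct {
    target string
    nodeID NodeID
}

type Context struct {
    mu    sync.Mutex
    conns map[connKey]*Connection
}

// GRPCDial dials without a node ID requirement. A cached connection to the
// same address, validated or not, is reused ("zero given, cached with
// nonzero" above).
func (ctx *Context) GRPCDial(target string) *Connection {
    ctx.mu.Lock()
    defer ctx.mu.Unlock()
    for key, conn := range ctx.conns {
        if key.target == target {
            return conn
        }
    }
    return ctx.dialLocked(target, 0)
}

// GRPCDialNode dials with a node ID requirement. An existing unvalidated
// connection (node ID zero) is left in place and a validated one is created
// alongside it ("nonzero NodeID given, but connection cached with node ID
// zero" above).
func (ctx *Context) GRPCDialNode(target string, nodeID NodeID) *Connection {
    ctx.mu.Lock()
    defer ctx.mu.Unlock()
    key := connKey{target: target, nodeID: nodeID}
    if conn, ok := ctx.conns[key]; ok {
        return conn
    }
    return ctx.dialLocked(target, nodeID)
}

func (ctx *Context) dialLocked(target string, nodeID NodeID) *Connection {
    if ctx.conns == nil {
        ctx.conns = make(map[connKey]*Connection)
    }
    conn := &Connection{target: target, remoteNodeID: nodeID}
    ctx.conns[connKey{target: target, nodeID: nodeID}] = conn
    return conn
}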
craig bot pushed a commit that referenced this pull request Apr 24, 2019
34197: server,rpc: validate node IDs in RPC heartbeats r=tbg a=knz

Fixes #34158.

Prior to this patch, it was possible for an RPC client to dial a node
ID and get a connection to another node instead. This is because the
mapping of node ID -> address may be stale, and a different node could
take the address of the intended node from "under" the dialer.

(See #34155 for a scenario.)

This happened to be "safe" in many cases where it matters because:

- RPC requests for distSQL are OK with being served on a different
  node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different
  node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the
  remote node because the store ID in the request would not match.

However this safety is largely accidental, and we should not work with
the assumption that any RPC request is safe to be mis-routed. (In
fact, we have not audited all the RPC endpoints and cannot establish
this safety exists throughout.)

This patch works to prevent these mis-routings by adding a check of
the intended node ID during RPC heartbeats (including the initial
heartbeat), when the intended node ID is known. A new API
`GRPCDialNode()` is introduced to establish such connections.

Release note (bug fix): CockroachDB now performs fewer attempts to
communicate with the wrong node, when a node is restarted with another
node's address.

36952: storage: deflake TestNodeLivenessStatusMap r=tbg a=knz

Fixes #35675.

Prior to this patch, this test would fail under `stressrace` after a few
dozen iterations.

With this patch, `stressrace` succeeds for thousands of iterations.

I have checked that the test logic is preserved: if I change one of
the expected statuses in `testData`, the test still fails properly.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Successfully merging this pull request may close these issues.

gossip: unexpected "n%d has been removed from the cluster" errors