
gossip: avoid removing nodes that get a new address #34155

Merged
merged 1 commit into cockroachdb:master from the 20190121-fix-node-restart branch on Jan 23, 2019

Conversation

@knz (Contributor) commented Jan 21, 2019

Fixes #34120.

K8s deployments make it possible for a node to get restarted using an
address previously attributed to another node, while the other node
is still alive
(for example, a re-shuffling of node addresses during
a rolling restart).

Prior to this patch, the gossip code was assuming that if a node
starts with an address previously attributed to another node, that
other node must be dead, and thus was (incorrectly) erasing that
node's entry, thereby removing it from the cluster.

This scenario can be reproduced like this:

  • start 4 nodes n1-n4
  • stop n3 and n4
  • restart n3 with n4's address

Prior to this patch, this scenario would yield "n4 removed from the
cluster" on the other nodes, and n3 would not restart properly. With the
patch, there is a period of time (until
server.time_until_store_dead) during which Raft is confused by not
finding n4 at n3's address, but the cluster otherwise operates
normally. After the store times out, n4 is properly marked as down and
the log spam stops.

Release note (bug fix): CockroachDB now enables re-starting a node at
an address previously allocated for another node.
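
To make the behavioral change concrete, here is a minimal sketch of the idea, using simplified, hypothetical types rather than the actual pkg/gossip code: when a descriptor arrives whose address was previously attributed to a different node, the other node's descriptor is kept rather than erased, and liveness (server.time_until_store_dead) decides whether that node is really dead.

// Illustrative sketch only; names are simplified and do not match pkg/gossip.
package gossip

import "log"

type NodeID int

type NodeDescriptor struct {
    NodeID  NodeID
    Address string
}

type Gossip struct {
    nodeDescs map[NodeID]NodeDescriptor // last known descriptor per node
}

// updateNodeAddress is called when a fresh node descriptor is gossiped.
func (g *Gossip) updateNodeAddress(desc NodeDescriptor) {
    for id, old := range g.nodeDescs {
        if id != desc.NodeID && old.Address == desc.Address {
            // Before this patch: the other node was assumed dead and its
            // descriptor was removed, which broke clusters whose addresses
            // get reshuffled (e.g. rolling restarts under Kubernetes).
            // After this patch: keep it. If that node is really dead,
            // liveness marks it dead after server.time_until_store_dead;
            // if it is alive, it will gossip its own new address shortly.
            log.Printf("n%d is now at %s, previously attributed to n%d; keeping n%d",
                desc.NodeID, desc.Address, id, id)
        }
    }
    g.nodeDescs[desc.NodeID] = desc
}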

@knz knz requested review from tbg, petermattis and a team January 21, 2019 17:57
@cockroach-teamcity (Member) commented:

This change is Reviewable

@knz (Contributor, Author) commented Jan 21, 2019

@petermattis can you hand-hold me on where to implement a test for this? Should this be a unit test, a roachtest, or something else entirely? (A CLI acceptance test, perhaps?)

Where should that test live? (Which directory?)

here's the test script I'm working with, for reference:

#! /bin/sh

C=$PWD/cockroach
d=$PWD/repro-34120

killall -9 cockroach cockroachshort

rm -rf "$d"
mkdir -p "$d"
cd "$d"

export COCKROACH_SCAN_MAX_IDLE_TIME=20ms

# Start cluster
for n in 1 2 3 4; do
    (
        mkdir -p d$n
        cd d$n
        pn=$(expr $n - 1)
        $C start --background --listen-addr=localhost:2600$n --http-addr=localhost:808$pn --insecure --join=localhost:26001
    )
    if test $n = 1; then
        # Init cluster
        $C init --host localhost:26001 --insecure
    fi
    # We sleep to ensure node IDs are generated in the same order as directories.
    sleep 1
done

# Make store death detected faster.
$C sql --host localhost:26001 --insecure --echo-sql -e "SET CLUSTER SETTING server.time_until_store_dead = '1m15s'"

# Create some ranges
(
    for i in `seq 1 10`; do
        echo "create table t$i(x int); insert into t$i(x) values(1);"
    done
) | $C sql --insecure --host localhost:26001

# Wait for cluster to stabilize
tail -f d1/cockroach-data/logs/cockroach.log &
tailpid=$!
while true; do
    sleep 2
    lastlines=$(tail -n 100 d1/cockroach-data/logs/cockroach.log | grep -v 'sending heartbeat' | grep -v 'received heartbeat' | tail -n 2 | grep 'runtime stats' | wc -l)
    if test $lastlines -lt 2; then
        continue
    fi
    break
done
kill -9 $tailpid
echo "OK"

# Start a load on all tables
(
    echo '\set show_times on'
    while true; do
        sleep 1
        for i in `seq 1 10`; do
            echo "insert into t$i values(1);"
        done
    done
) | $C sql --insecure --echo-sql --host localhost:26001 &>sql.log &

# Show what's going on in the first node's log
tail -f d1/cockroach-data/logs/cockroach.log &
tailpid=$!

# Stop the last two nodes.
$C quit --host localhost:26004 --insecure
$C quit --host localhost:26003 --insecure

# Restart with node shift
(
    cd d3
    $C start --background --listen-addr=localhost:26004 --http-addr=localhost:27004 --insecure
)

wait $tailpid

@knz (Contributor, Author) commented Jan 21, 2019

(Note that I'm waiting on CI to tell me which other tests need adjusting. I'll do this adjustment myself)

@knz (Contributor, Author) commented Jan 21, 2019

I'm a bit puzzled about why raft is confused. I'd expect the restarted node to ensure the other nodes have an updated address for it.

@petermattis (Collaborator) commented:

I'm a bit puzzled about why raft is confused. I'd expect the restarted node to ensure the other nodes have an updated address for it.

Perhaps the store-not-found error is not causing storage/raft_transport.go to create a new connection. Note that currently we don't actually do any check that the receiving node is the correct node, even though we do have that information. See RaftTransport.handleRaftRequest.
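
For illustration, here is a hedged sketch of the kind of recipient check being described as missing; the types and field names are simplified and hypothetical, not the actual storage.RaftTransport code.

package storage

import "fmt"

type NodeID int

// RaftMessageRequest carries the intended recipient, which the receiving
// node could verify but currently does not.
type RaftMessageRequest struct {
    ToNodeID NodeID
}

type RaftTransport struct {
    localNodeID NodeID
}

func (t *RaftTransport) handleRaftRequest(req *RaftMessageRequest) error {
    if req.ToNodeID != 0 && req.ToNodeID != t.localNodeID {
        // The sender dialed a stale address: this node is not the intended
        // recipient, so reject instead of silently processing the request.
        return fmt.Errorf("raft message intended for n%d arrived at n%d",
            req.ToNodeID, t.localNodeID)
    }
    // ... normal processing elided ...
    return nil
}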

@petermattis (Collaborator) commented:

PS I connected with @knz privately on testing.

@petermattis (Collaborator) commented:

Hmm, there is code in Store.HandleRaftResponse that claims we do close down the Raft connection when receiving a StoreNotFoundError. Perhaps that isn't working correctly.

@knz (Contributor, Author) commented Jan 21, 2019 via email

@@ -840,21 +840,8 @@ func (g *Gossip) updateNodeAddress(key string, content roachpb.Value) {
log.Infof(ctx, "removing n%d which was at same address (%s) as new node %v",
oldNodeID, desc.Address, desc)
g.removeNodeDescriptorLocked(oldNodeID)
Review comment (Member):

Why isn't this also gone?

Reply (Contributor, Author):

I figured that that gossip entry is known to be stale at that point so it may as well be removed from gossip until a fresh one is received.

@petermattis (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @petermattis and @tbg)


pkg/gossip/gossip.go, line 842 at r1 (raw file):

Previously, knz (kena) wrote…

I figured that that gossip entry is known to be stale at that point so it may as well be removed from gossip until a fresh one is received.

This isn't removing the gossip entry, but the entry from Gossip.nodeDescs. But that entry might be up to date. Consider the scenario where node 1 and node 2 swap addresses. Node 1 might start up and gossip its new address. This will update the Gossip.nodeDescs map on every receiving node, and remove the entry for node 2. When node 2 gossips its new address (node 1's old address), this line will remove the nodeDescs entry for node 1 which is not correct. We could fix this by only removing the nodeDescs entry if the address matches what is in bootstrapAddrs.
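
As a sketch of the fix suggested here (again with simplified, hypothetical types, re-declared so the snippet stands alone): removal would be gated on the stale address being one we only know as a bootstrap address.

package gossip

type NodeID int

type NodeDescriptor struct {
    NodeID  NodeID
    Address string
}

type Gossip struct {
    nodeDescs      map[NodeID]NodeDescriptor
    bootstrapAddrs map[string]struct{} // addresses this node was told to join through
}

// maybeRemoveConflictingDesc drops another node's descriptor only when the
// conflicting address is a known bootstrap address; otherwise the entry may
// be perfectly up to date (e.g. two nodes that swapped addresses), so it is
// left alone.
func (g *Gossip) maybeRemoveConflictingDesc(newDesc NodeDescriptor) {
    for id, old := range g.nodeDescs {
        if id == newDesc.NodeID || old.Address != newDesc.Address {
            continue
        }
        if _, ok := g.bootstrapAddrs[old.Address]; ok {
            delete(g.nodeDescs, id)
        }
    }
}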

@knz (Contributor, Author) commented Jan 21, 2019

Thanks for explaining.

@knz (Contributor, Author) commented Jan 22, 2019

I have modified the PR to make the tests pass. That may be sufficient for the customer who raised the issue.
However I am also working to resolve #34158 to make the fix more ironclad.

@tbg (Member) commented Jan 22, 2019

@petermattis I also re-analyzed why this code was introduced in the first place in #34120 (comment). Seems to me that it was just a completely incorrect fix.

@knz knz requested review from a team January 22, 2019 14:15
@knz (Contributor, Author) commented Jan 22, 2019

Reworked the PR; added a second commit with the fix for #34158.
I have verified that it makes the warnings from the test scenario completely go away (Raft finds itself happy nearly instantaneously).

The following warnings remain, however, emitted on n1 when n3 is restarted at n4's address:

W190122 14:12:54.422039 885 vendor/google.golang.org/grpc/clientconn.go:1440  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
W190122 14:12:54.442055 2372 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26003 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:26003: connect: connection refused". Reconnecting...
I190122 14:12:54.442164 2358 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc 127.0.0.1:26001->3 tripped: failed to grpc dial n3 at localhost:26003: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest
connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:26003: connect: connection refused"
I190122 14:12:54.442186 2358 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc 127.0.0.1:26001->3 event: BreakerTripped
I190122 14:12:54.442196 2358 rpc/nodedialer/nodedialer.go:95  [ct-client] unable to connect to n3: failed to grpc dial n3 at localhost:26003: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transpo
rt: Error while dialing dial tcp [::1]:26003: connect: connection refused"
W190122 14:12:55.448289 2372 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26003 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190122 14:12:55.448312 2372 vendor/google.golang.org/grpc/clientconn.go:953  Failed to dial localhost:26003: grpc: the connection is closing; please retry.
W190122 14:12:55.482671 1289 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26004 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190122 14:12:55.482698 1295 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: rpc error: code = Unavailable desc = transport is closing:
W190122 14:12:55.482708 1289 vendor/google.golang.org/grpc/clientconn.go:1440  grpc: addrConn.transportMonitor exits due to: context canceled
W190122 14:12:55.648681 2377 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26004 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:26004: connect: connection refused". Reconnecting...
I190122 14:12:55.648764 2314 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc 127.0.0.1:26001->4 tripped: failed to grpc dial n4 at localhost:26004: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest
connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:26004: connect: connection refused"
I190122 14:12:55.648777 2314 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc 127.0.0.1:26001->4 event: BreakerTripped
I190122 14:12:55.648784 2314 rpc/nodedialer/nodedialer.go:95  [ct-client] unable to connect to n4: failed to grpc dial n4 at localhost:26004: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transpo
rt: Error while dialing dial tcp [::1]:26004: connect: connection refused"
W190122 14:12:56.655188 2377 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26004 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190122 14:12:56.655201 2377 vendor/google.golang.org/grpc/clientconn.go:953  Failed to dial localhost:26004: grpc: the connection is closing; please retry.
I190122 14:12:56.720259 105 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc 127.0.0.1:26001->3 event: BreakerReset

I'd like to make these warnings go away, or become more descriptive. How come the error I added in the code ("client requested node ID doesn't match") is not printed by grpc? Does it not get transmitted back somehow?

@knz (Contributor, Author) commented Jan 22, 2019

Never mind - I found the RPC error I was expecting; it was a bit later in the log.

The warnings I pasted above are also expected, they come from the node that was shut down entirely.

@knz knz force-pushed the 20190121-fix-node-restart branch from 111c31b to 85671d5 Compare January 22, 2019 14:46
@knz (Contributor, Author) commented Jan 22, 2019

Ok so except for the missing roachtest/roachprod harness, which may take me a day or two, I think the bug fix is ready!

(Note: the PR does have unit tests at the level of the RPC dial logic and gossip. Perhaps we can decouple the bug fix and the more comprehensive acceptance test?).

@tbg (Member) left a comment

The GRPC logging is a mess. I think you should ignore it for now, but we should consider bumping it to ERROR:

grpcutil.SetSeverity(log.Severity_WARNING)

Filed #34165. I think that, for this PR, this leaves the circuit breaker events, which I would again massage separately (taking a node down today yields the same errors).
I would change this line:

name := fmt.Sprintf("rpc %v->%v", n.rpcContext.Config.Addr, nodeID)
so that it explicitly says the node ID (%v [n%d] instead of %v->%v, which is just confusing). The breaker should log only once per minute, so I am confused why you're seeing it log much more frequently.
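
For concreteness, the rename being asked for would look roughly like this (a standalone sketch; the exact replacement string and the example address are assumptions based on the wording above):

package main

import "fmt"

func main() {
    addr := "127.0.0.1:26001" // example dialer address (placeholder)
    nodeID := 3

    // Current breaker name: "%v->%v" renders two values the same way, which
    // reads ambiguously in the logs.
    current := fmt.Sprintf("rpc %v->%v", addr, nodeID)
    // Suggested: call out the remote node ID explicitly.
    suggested := fmt.Sprintf("rpc %v [n%d]", addr, nodeID)

    fmt.Println(current)   // rpc 127.0.0.1:26001->3
    fmt.Println(suggested) // rpc 127.0.0.1:26001 [n3]
}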

I think you don't need a migration for the changes to PingRequest since the field is optional and verified by the recipient. I.e. the field will be dropped silently, which works fine.

Other than these remarks the PR looks pretty good to me!

Reviewed 2 of 2 files at r2, 16 of 16 files at r3, 1 of 1 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/gossip/gossip_test.go, line 64 at r2 (raw file):

// TestGossipMoveNode verifies that if a node is moved to a new address, it
// gets properly updated in gossip (including that any other node that was

Stale comment.


pkg/rpc/context.go, line 668 at r3 (raw file):

remain


pkg/rpc/context.go, line 674 at r3 (raw file):

GRPDDialNode


pkg/rpc/context.go, line 788 at r3 (raw file):

			MaxOffsetNanos: maxOffsetNanos,
			ClusterID:      &clusterID,
			NodeID:         conn.remoteNodeID,

remoteNodeID won't change out from under us, though? The comment above suggests that.


pkg/rpc/context_test.go, line 1167 at r3 (raw file):

func BenchmarkGRPCDialNode(b *testing.B) {
	if testing.Short() {
		b.Skip("TODO: fix benchmark")

Reminder to self

@knz knz force-pushed the 20190121-fix-node-restart branch from 85671d5 to 99b5af2 Compare January 22, 2019 15:30
@knz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/gossip/gossip_test.go, line 64 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Stale comment.

Done.


pkg/rpc/context.go, line 668 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

remain

Done.


pkg/rpc/context.go, line 674 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

GRPDDialNode

Done.


pkg/rpc/context.go, line 788 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

remoteNodeID won't change out from under us, though? The comment above suggests that.

Done.


pkg/rpc/context_test.go, line 1167 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Reminder to self

I copy/pasted the benchmark from above; is there anything I should change in the TODO?

@knz (Contributor, Author) commented Jan 22, 2019

I would change this line:
cockroach/pkg/rpc/nodedialer/nodedialer.go
so that it explicitly says the node ID (%v [n%d] instead of %v->%v, which is just confusing).

Also done.

@knz (Contributor, Author) commented Jan 22, 2019

Out of curiosity: there's about 1 minute between the moment the node is restarted at its new address, and the moment grpc dials into it. Where is that 1 minute delay?

(The 3-second heartbeat ensures that the connections to both of the nodes that were shut down are dropped quickly; however, there must be another delay somewhere to pick up the restarted node. I don't know what that delay is.)

@tbg (Member) left a comment

Let's wait for @petermattis to give this a thorough look, but it looks good to me.

Also done.

I don't see it, though I might've missed it.

Out of curiosity: there's about 1 minute between the moment the node is restarted at its new address, and the moment grpc dials into it. Where is that 1 minute delay?

What do you mean? Are you perhaps confused by the circuit breaker logging here?

const logPerNodeFailInterval = time.Minute

A node should be contacted as soon as it's available (with maybe a few sec of delay).

Writing the roachtest will be a little awkward since they're really not made for this sort of thing (moving store directories around). I have wondered before whether some tests should just deploy a single-node bash script, but that will spiral out of control and not in a good way. @petermattis how do you think we should handle this? See #33120 for another thing worth automating that would be awkward to roachtestify.

Reviewed 1 of 16 files at r3, 3 of 3 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/rpc/context_test.go, line 1167 at r3 (raw file):

Previously, knz (kena) wrote…

I copy/pasted the benchmark from above; is there anything I should change in the TODO?

I would remove the copy you introduced.

@knz (Contributor, Author) commented Jan 22, 2019

That's not what I see. Instead I see this:

  1. time 0: node n3 gets restarted at n4's previous address
  2. time 0 + epsilon: grpc errors on existing connections from n1 and n2 to n3/n4 - this is expected
  3. time 0 + ~1m: n1 receives the error "intended node ID 4 does not match actual node ID 3" - n1 has tried to connect to n4 (the dead node) at its old address, now taken by n3.
  4. time 0 + ~1m: n1 starts streaming snapshots to n3 - this seems independent from the previous step

Questions:

  • I don't care about the delay between steps 2 and 3, although I don't understand it. I had expected a reconnect to n4 (now dead) to happen immediately after the connection was lost at step 2.
  • The delay between steps 2 and 4 bothers me. I'd have expected n1 to establish a connection to n3 (restarted, fine albeit at a new address) within seconds, as soon as an updated address was received (which should be within seconds, thanks to gossip).

@knz knz force-pushed the 20190121-fix-node-restart branch 2 times, most recently from 43987af to f5042aa Compare January 22, 2019 22:21
@knz (Contributor, Author) left a comment

I don't see it, though I might've missed it.

forgot to push

What do you mean? Are you perhaps confused by the circuit breaker logging here?

see my previous comment above

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/rpc/context_test.go, line 1167 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I would remove the copy you introduced.

Done.

@petermattis (Collaborator) left a comment

Writing the roachtest will be a little awkward since they're really not made for this sort of thing (moving store directories around). I have wondered before whether some tests should just deploy a single-node bash script, but that will spiral out of control and not in a good way. @petermattis how do you think we should handle this?

My suggestion is to just move the store directories around. roachprod put already does scp between remote nodes in treedist mode. There might be a little plumbing, but it should be possible to roachprod run <cluster> -- scp <src> <dest>. Or perhaps this would be better as roachprod cp <cluster> <node>:<src> <node>:<dest>. Regardless, I think this can be done in a follow-on PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/rpc/context.go, line 671 at r6 (raw file):

// a new one is created without a node ID requirement.
func (ctx *Context) GRPCDial(target string) *Connection {
	return ctx.GRPCDialNode(target, 0)

So gossip will use a different connection than non-gossip? I think this effectively doubles the number of internal network connections. Is that a problem?

Rather that adding the remoteNodeID to the map key, I was imagining that the Connection would contain a remoteNodeID atomic that is populated by the first PingResponse containing a NodeID.


pkg/rpc/context.go, line 680 at r6 (raw file):

// a connection already existed with a different node ID requirement
// it is first dropped and a new one with the proper node ID
// requirement is created.

This last sentence doesn't match the code, unless I'm misunderstanding the code. It seems like an existing connection under a different remote node ID is left untouched.


pkg/rpc/heartbeat.go, line 99 at r6 (raw file):

		nodeID = hs.nodeID.Get()
	}
	if args.NodeID != 0 && nodeID != 0 && args.NodeID != nodeID {

Shouldn't it be an error if the local node doesn't have an ID yet (nodeID == 0) while the remote node requested a specific ID? I think this check should be args.NodeID != 0 && args.NodeID != nodeID.
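
A sketch of the stricter check being suggested, with simplified, hypothetical types rather than the actual pkg/rpc HeartbeatService: a ping that names a specific node ID is rejected whenever it does not match the local ID, including while the local ID is still unset.

package rpc

import "fmt"

type NodeID int

type PingRequest struct {
    NodeID NodeID // intended recipient; 0 means "no requirement"
}

type HeartbeatService struct {
    nodeID NodeID // the local node's ID; may still be 0 early in startup
}

// checkIntendedNodeID rejects pings whose requested node ID does not match
// the local node ID.
func (hs *HeartbeatService) checkIntendedNodeID(args *PingRequest) error {
    if args.NodeID != 0 && args.NodeID != hs.nodeID {
        return fmt.Errorf("client requested node ID %d, but this server is n%d",
            args.NodeID, hs.nodeID)
    }
    return nil
}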

@knz (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)


pkg/rpc/context.go, line 671 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

So gossip will use a different connection that non-gossip? I think this effectively doubles the number of internal network connections. Is that a problem?

Rather that adding the remoteNodeID to the map key, I was imagining that the Connection would contain a remoteNodeID atomic that is populated by the first PingResponse containing a NodeID.

Ok I'll try that.


pkg/rpc/heartbeat.go, line 99 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Shouldn't it be an error if the local node doesn't have an ID yet (nodeID == 0) while the remote node requested a specific ID? I think this check should be args.NodeID != 0 && args.NodeID != nodeID.

I have found (at least in some tests) that there are places where a HeartbeatService is instantiated without a valid node ID and I wasn't sure about that, hence this conditional. I'll change it back and report with details.

@ajwerner (Contributor) commented:

The breaker should log only once per minute, so I am confused why you're seeing it log much more frequently.

I added new logging inside the circuit breaker itself. That new logging does not use log.Every, though maybe it should. To use it we'd need to modify the calls here.

When adding that logging I wasn't overly concerned about generating too much noise. Perhaps I should have been.
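
For reference, rate-limiting that logging would look something like the following self-contained sketch; this is a generic illustration of the idea, not the CockroachDB log.Every API.

package breakerlog

import (
    "log"
    "sync"
    "time"
)

// everyN allows an event through at most once per interval, mirroring the
// kind of throttling discussed above.
type everyN struct {
    mu       sync.Mutex
    interval time.Duration
    last     time.Time
}

func (e *everyN) shouldLog() bool {
    e.mu.Lock()
    defer e.mu.Unlock()
    if now := time.Now(); now.Sub(e.last) >= e.interval {
        e.last = now
        return true
    }
    return false
}

var breakerEvents = everyN{interval: time.Minute}

// logBreakerEvent shows how a breaker callback might throttle its output.
func logBreakerEvent(name, event string) {
    if breakerEvents.shouldLog() {
        log.Printf("circuitbreaker: %s event: %s", name, event)
    }
}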

@knz knz force-pushed the 20190121-fix-node-restart branch 2 times, most recently from 3719d38 to 33afd7f Compare January 23, 2019 17:16
@knz (Contributor, Author) commented Jan 23, 2019

I'm going to fork the RPC change to a separate PR, so that the first change can be merged ahead of full resolution.

@petermattis (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)


pkg/rpc/context.go, line 696 at r8 (raw file):

			// ensure we're registering the connection we just created for
			// future use by these other dials.
			_, _ = ctx.conns.LoadOrStore(connKey{target, 0}, value)

One problem with this approach is that we won't remove this conn if there is a heartbeat or RPC error: we'll only remove the entry with the non-zero remoteNodeID. I think that can be fixed, but I'm not sure the complexity is worth it. I know I led you down this path, but thinking about it more I think I was in error. One thing my earlier thinking missed is that the number of gossip connections per node is limited. So we won't be doubling the number of connections, and even when we do, we're talking about a relatively small number of connections in the cluster until we reach very large cluster sizes.

I'd prefer to just remove this block of code and let there be 2 connections per node.

@knz (Contributor, Author) commented Jan 23, 2019

Done - I have rewound this PR to just the first commit; the other commit is now #34197.

Without the other PR #34197 we keep the log spam, for the reasons explained in that PR.

W190123 18:43:46.037282 318 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
W190123 18:43:46.040255 3813 storage/store.go:3744  [n1,s1,r9/1:/Table/1{3-4}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):4: store 4 was not found
W190123 18:43:46.040267 3826 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
W190123 18:43:46.053504 3828 storage/store.go:3744  [n1,s1,r9/1:/Table/1{3-4}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):4: store 4 was not found
W190123 18:43:46.053517 3800 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
W190123 18:43:46.058747 3830 storage/store.go:3744  [n1,s1,r9/1:/Table/1{3-4}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):4: store 4 was not found

@knz (Contributor, Author) commented Jan 23, 2019

One problem with this approach is ...
I'd prefer to just remove this block of code and let there be 2 connections per node.

Will do in the other PR

@petermattis (Collaborator) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz, @petermattis, and @tbg)

@knz (Contributor, Author) commented Jan 23, 2019

Ok then we'll defer the larger test work to a separate PR.

bors r+

craig bot pushed a commit that referenced this pull request Jan 23, 2019
34155: gossip: avoid removing nodes that get a new address r=knz a=knz

Fixes #34120.

K8s deployments make it possible for a node to get restarted using an
address previously attributed to another node, *while the other node
is still alive* (for example, a re-shuffling of node addresses during
a rolling restart).

Prior to this patch, the gossip code was assuming that if a node
starts with an address previously attributed to another node, that
other node must be dead, and thus was (incorrectly) *erasing* that
node's entry, thereby removing it from the cluster.

This scenario can be reproduced like this:

- start 4 nodes n1-n4
- stop n3 and n4
- restart n3 with n4's address

Prior to this patch, this scenario would yield "n4 removed from the
cluster" on the other nodes, and n3 would not restart properly. With the
patch, there is a period of time (until
`server.time_until_store_dead`) during which Raft is confused by not
finding n4 at n3's address, but the cluster otherwise operates
normally. After the store times out, n4 is properly marked as down and
the log spam stops.

Release note (bug fix): CockroachDB now enables re-starting a node at
an address previously allocated for another node.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
craig bot (@craig, Contributor) commented Jan 23, 2019

Build succeeded

@craig craig bot merged commit 5bce267 into cockroachdb:master Jan 23, 2019
@knz knz deleted the 20190121-fix-node-restart branch January 23, 2019 19:27
knz added a commit to knz/cockroach that referenced this pull request Apr 15, 2019
Prior to this patch, it was possible for an RPC client to dial a node
ID and get a connection to another node instead. This is because the
mapping of node ID -> address may be stale, and a different node could
take the address of the intended node from "under" the dialer.

(See cockroachdb#34155 for a scenario.)

This happened to be "safe" in many cases where it matters because:

- RPC requests for distSQL are OK with being served on a different
  node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different
  node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the
  remote node because the store ID in the request would not match.

However this safety is largely accidental, and we should not work with
the assumption that any RPC request is safe to be mis-routed. (In
fact, we have not audited all the RPC endpoints and cannot establish
this safety exists throughout.)

This patch works to prevent these mis-routings by adding a check of
the intended node ID during RPC heartbeats (including the initial
heartbeat), when the intended node ID is known. A new API
`GRPCDialNode()` is introduced to establish such connections.

This behaves as follows:

- node ID zero given, no connection cached: creates new connection
  that doesn't validate NodeID.

  This is suitable for the initial GRPC handshake during gossip,
  before node IDs are known. It is also suitable for the CLI
  commands which do not care about which node they are talking to (and
  they do not know the node ID yet -- only the RPC address).

- nonzero NodeID given, but connection cached with node ID zero: opens
  new connection, leaves old connection in place (so dialing to node
  ID zero later still gives the unvalidated conn back.)

  This is suitable when setting up e.g. Raft clients after the
  peer node IDs are determined. At this point we want to introduce
  node ID validation.

  The old connection remains in place because the gossip code does not
  react well from having its connection closed from "under it".

- zero given, cached with nonzero: will use the cached connection.

  This is suitable when gossip needs to verify e.g. the health of
  some remote node known only by its address. In this case it's OK
  to have it use the connection that is already established.

This flexibility suggests that it is possible for client components to
"opt out" of node ID validation by specifying a zero value, in places
other than those strictly necessary for gossip and CLI commands. In fact,
the situation is even more uncomfortable: it requires extra work
to set up the node ID and naive test code will be opting out of
validation implicitly, without clear feedback. This mis-design is
addressed by a subsequent commit.

Release note (bug fix): CockroachDB now performs fewer attempts to
communicate with the wrong node, when a node is restarted with another
node's address.
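
A minimal sketch of the connection-caching rules spelled out above, with simplified, hypothetical types; the real rpc.Context dials gRPC and enforces the node ID during heartbeats, which is elided here.

package rpc

import "sync"

type NodeID int

// Connection stands in for a cached gRPC connection.
type Connection struct {
    target       string
    remoteNodeID NodeID // 0 means the peer's node ID is not validated
}

type connKey struct {
    target string
    nodeID NodeID
}

type Context struct {
    mu    sync.Mutex
    conns map[connKey]*Connection
}

// GRPCDial dials without a node ID requirement. A cached connection to the
// same address, validated or not, is reused ("zero given, cached with
// nonzero" above).
func (ctx *Context) GRPCDial(target string) *Connection {
    ctx.mu.Lock()
    defer ctx.mu.Unlock()
    for key, conn := range ctx.conns {
        if key.target == target {
            return conn
        }
    }
    return ctx.dialLocked(target, 0)
}

// GRPCDialNode dials with a node ID requirement. An existing unvalidated
// connection (node ID zero) is left in place and a validated one is created
// alongside it ("nonzero NodeID given, but connection cached with node ID
// zero" above).
func (ctx *Context) GRPCDialNode(target string, nodeID NodeID) *Connection {
    ctx.mu.Lock()
    defer ctx.mu.Unlock()
    key := connKey{target: target, nodeID: nodeID}
    if conn, ok := ctx.conns[key]; ok {
        return conn
    }
    return ctx.dialLocked(target, nodeID)
}

func (ctx *Context) dialLocked(target string, nodeID NodeID) *Connection {
    if ctx.conns == nil {
        ctx.conns = make(map[connKey]*Connection)
    }
    conn := &Connection{target: target, remoteNodeID: nodeID}
    ctx.conns[connKey{target: target, nodeID: nodeID}] = conn
    return conn
}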
craig bot pushed a commit that referenced this pull request Apr 24, 2019
34197: server,rpc: validate node IDs in RPC heartbeats r=tbg a=knz

Fixes #34158.

Prior to this patch, it was possible for an RPC client to dial a node
ID and get a connection to another node instead. This is because the
mapping of node ID -> address may be stale, and a different node could
take the address of the intended node from "under" the dialer.

(See #34155 for a scenario.)

This happened to be "safe" in many cases where it matters because:

- RPC requests for distSQL are OK with being served on a different
  node than intended (with potential performance drop);
- RPC requests to the KV layer are OK with being served on a different
  node than intended (they would route underneath);
- RPC requests to the storage layer are rejected by the
  remote node because the store ID in the request would not match.

However this safety is largely accidental, and we should not work with
the assumption that any RPC request is safe to be mis-routed. (In
fact, we have not audited all the RPC endpoints and cannot establish
this safety exists throughout.)

This patch works to prevent these mis-routings by adding a check of
the intended node ID during RPC heartbeats (including the initial
heartbeat), when the intended node ID is known. A new API
`GRPCDialNode()` is introduced to establish such connections.

Release note (bug fix): CockroachDB now performs fewer attempts to
communicate with the wrong node, when a node is restarted with another
node's address.

36952: storage: deflake TestNodeLivenessStatusMap r=tbg a=knz

Fixes #35675.

Prior to this patch, this test would fail under `stressrace` after a few
dozen iterations.

With this patch, `stressrace` succeeds for thousands of iterations.

I have checked that the test logic is preserved: if I change one of
the expected statuses in `testData`, the test still fails properly.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Successfully merging this pull request may close these issues.

gossip: unexpected "n%d has been removed from the cluster" errors