
gossip: unexpected "n%d has been removed from the cluster" errors #34120

Closed
petermattis opened this issue Jan 18, 2019 · 26 comments · Fixed by #34155
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1 High impact: many users impacted, serious risk of high unavailability or data loss

Comments

@petermattis
Collaborator

A customer reports a test cluster getting wedged (no queries going through) during simulation of a multi-node outage. The scenario is a 15-node cluster, configured with 15-way replication (confirmed that every zone config has 15-way replication). During the test, a set of "ping" queries is run against each node in the cluster. These queries are simple SELECT * FROM <table> LIMIT 1 for each of the customer's tables (~10). 5 of the nodes are taken down and left down for 5 minutes so that the cluster reports them as dead. These 5 nodes are then brought back up. Fairly quickly, 2 of these restarted nodes start showing up in the logs of every node with the error in the title. For example, n7 has been removed from the cluster. After several minutes of waiting, the "ping" queries suddenly stop returning. It appears to happen to one particular table.

It is unclear if the has been removed from the cluster errors are related to the wedge, but they are disconcerting as they absolutely shouldn't happen. The code which generates this error is Gossip.getNodeDescriptorLocked. The error is only returned if a node descriptor is present in the info store, but contains a zero node ID or an empty address. It is extremely curious that the node descriptor is not in the cached Gossip.nodeDescs map.
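To illustrate the surprising case, here is a stripped-down sketch of the shape of that check (simplified stand-in types, not the actual cockroach code): the descriptor key is present in the info store, but the decoded descriptor has a zero node ID or an empty address.

// Minimal sketch of the condition that produces "n%d has been removed from
// the cluster": an entry exists for the node, but it has been emptied out.
package main

import "fmt"

type NodeID int32

type NodeDescriptor struct {
	NodeID  NodeID
	Address string
}

// infoStore stands in for the gossiped info store keyed by node ID.
type infoStore map[NodeID]*NodeDescriptor

func getNodeDescriptor(is infoStore, id NodeID) (*NodeDescriptor, error) {
	desc, ok := is[id]
	if !ok {
		return nil, fmt.Errorf("unable to look up descriptor for n%d", id)
	}
	// An entry exists but has been "emptied": this is the surprising case.
	if desc.NodeID == 0 || desc.Address == "" {
		return nil, fmt.Errorf("n%d has been removed from the cluster", id)
	}
	return desc, nil
}

func main() {
	is := infoStore{7: {}} // n7 present but zeroed out
	if _, err := getNodeDescriptor(is, 7); err != nil {
		fmt.Println(err) // n7 has been removed from the cluster
	}
}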

This problem was seen on 2.1.3 (to be precise, the problem was seen on a custom 2.1.X binary which happens to be the same SHA as what became 2.1.3). The cluster has a relatively modest amount of data (a little over 100 ranges).

@petermattis petermattis added the C-investigation Further steps needed to qualify. C-label will change. label Jan 18, 2019
@petermattis petermattis self-assigned this Jan 18, 2019
@petermattis
Collaborator Author

Perusing the logs that the customer provided, on one node I see:

W190118 19:47:39.434514 284 server/status/recorder.go:388  [n1,summaries] n6 has been removed from the cluster

Looking at that time frame in the n6 logs I find:

I190118 19:47:31.747467 22 gossip/gossip.go:395  [n6] NodeDescriptor set to node_id:6 address:<network_field:"tcp" address_field:"cockroachdb-2.cockroachdb.default.svc.cluster.local:26257" > attrs:<> locality:<> ServerVersion:<major_val:2 minor_val:1 patch:0 unstable:0 > build_tag:"v2.1.2-35-g0c87b11" started_at:1547840851742669721
...
I190118 19:47:37.317456 24 storage/stores.go:261  [n6] wrote 14 node addresses to persistent storage
I190118 19:47:37.317484 24 gossip/gossip.go:808  [n6] removed n6 from gossip

This is super suspicious. A node should never remove itself from gossip like that.

@petermattis
Collaborator Author

I've reproduced the removed n* from gossip messages.

  1. Start a 9-node cluster
  2. Stop nodes 7-9
  3. Re-arrange the data directories so that node 7 uses node 9's directory, node 8 uses node 7's, and node 9 uses node 8's
  4. Restart nodes 7-9

Fairly quickly you'll see the removed n* from gossip messages in the logs. I also see this funkiness on the network latency page:

[screenshot of the network latency page, 2019-01-20]

I'm not sure if that means that n9 can't talk to anyone, or nothing can talk to n9. Regardless, it isn't good and seems to persist for a long time. In fact, I bet it persists until the node gossips its node descriptor again which occurs on a 1h interval.

I believe the problem is coming from this code in gossip/gossip.go:

	// If the new node's address conflicts with another node's address, then it
	// must be the case that the new node has replaced the previous one. Remove
	// it from our set of tracked descriptors to ensure we don't attempt to
	// connect to its previous identity (as came up in issue #10266).
	oldNodeID, ok := g.bootstrapAddrs[desc.Address]
	if ok && oldNodeID != unknownNodeID && oldNodeID != desc.NodeID {
		log.Infof(ctx, "removing n%d which was at same address (%s) as new node %v",
			oldNodeID, desc.Address, desc)
		g.removeNodeDescriptorLocked(oldNodeID)

		// Deleting the local copy isn't enough to remove the node from the gossip
		// network. We also have to clear it out in the infoStore by overwriting
		// it with an empty descriptor, which can be represented as just an empty
		// byte array due to how protocol buffers are serialized.
		// Calling addInfoLocked here is somewhat recursive since
		// updateNodeAddress is typically called in response to the infoStore
		// being updated but won't lead to deadlock because it's called
		// asynchronously.
		key := MakeNodeIDKey(oldNodeID)
		var emptyProto []byte
		if err := g.addInfoLocked(key, emptyProto, NodeDescriptorTTL); err != nil {
			log.Errorf(ctx, "failed to empty node descriptor for n%d: %s", oldNodeID, err)
		}
	}

This code is fairly old, arriving 2 years ago in #10544.

@petermattis
Collaborator Author

Something I forgot to mention in the original message: the customer is running on Kubernetes. The log files in the debug zip provide evidence that hostnames are moving around the cluster when nodes are rebooted. For example, the hostname cockroachdb-2 has been used by n1, n3, n6, n8, and n9 at different times.

@petermattis
Collaborator Author

Unsurprisingly, this is a problem on master too.

@petermattis
Collaborator Author

I think this is effectively creating an asymmetric network partition. The gossiped node descriptors are used to locate the address for a node. If the node descriptor is deleted from gossip, other nodes in the system won't be able to talk to the node, though the node will be able to talk to them. This likely will cause all sorts of badness in the system.

@petermattis
Collaborator Author

The node address conflict detection code exists for a good reason. See #10266. I might try reproducing that original problem and see if there is a different approach to remedy it.

Another option is more of a hack: whenever a node receives an info which deletes its node descriptor, it can re-gossip its node descriptor.
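A rough sketch of what that hack could look like (hypothetical hook name and simplified types; the real code would go through addInfoLocked and the info-store callbacks):

// Sketch: when a node sees an info update that empties its own node
// descriptor, it immediately gossips its descriptor again.
package main

import "fmt"

type NodeID int32

type NodeDescriptor struct {
	NodeID  NodeID
	Address string
}

type gossip struct {
	ownDesc NodeDescriptor
	infos   map[NodeID]NodeDescriptor
}

// onInfoUpdate is a hypothetical callback invoked when a node-descriptor info
// arrives. If someone emptied our own descriptor, put it back.
func (g *gossip) onInfoUpdate(id NodeID, desc NodeDescriptor) {
	g.infos[id] = desc
	if id == g.ownDesc.NodeID && (desc.NodeID == 0 || desc.Address == "") {
		fmt.Printf("n%d: own descriptor was emptied, re-gossiping it\n", g.ownDesc.NodeID)
		g.infos[id] = g.ownDesc // stands in for re-adding the proto to the info store
	}
}

func main() {
	g := &gossip{
		ownDesc: NodeDescriptor{NodeID: 6, Address: "cockroachdb-2:26257"},
		infos:   map[NodeID]NodeDescriptor{},
	}
	g.onInfoUpdate(6, NodeDescriptor{}) // simulate receiving the emptied descriptor
	fmt.Println(g.infos[6].Address)     // descriptor restored
}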

I need to think about this more.

Cc @bdarnell and @tbg to see if they have better thoughts.

@tbg
Member

tbg commented Jan 20, 2019

I have to look at this in more detail, but the "deleting from Gossip" code has always bothered me. Why exactly was #10266 such a big problem? A node (say n1) gets wiped, another node (say n2) takes its address, so the cluster may try to connect to n2 when it wants n1. But n1 is gone, so how can cluster health depend on connections to it? The real problem (though I didn't see this mentioned as such in #10266) is that it can actually be dangerous to send stuff to n2 when it was meant for n1 (unintentionally byzantine?)

We never look up addresses by nodeID, it's always the other way around. So, in theory, I tend to think that leaving the old entry in Gossip should work fine, though I would add a NodeID check -- the connection that is supposedly connecting to nX should actually be confirmed as connecting to nX.

In this issue, the problem becomes that we're removing a nodeID that is actually still active (albeit under a different address), and we unconditionally nuke (overwrite) it, right? That should also be solved if we removed the nuking (which seems a little too unconditional -- it's not even attempting to check which descriptor it overrides; that desc wouldn't have the bootstrapAddr any more in this case, though fixing that will only "mostly" fix the issue).
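Roughly the kind of guard I mean (simplified types, not the actual code): before nuking oldNodeID, look at the descriptor we currently have for it, and only overwrite it if it still claims the conflicting address.

// Sketch of a conditional removal: if the "old" node has since re-gossiped
// itself at a different address, it is still alive and should be left alone.
package main

import "fmt"

type NodeID int32

type NodeDescriptor struct {
	NodeID  NodeID
	Address string
}

func maybeRemoveConflicting(descs map[NodeID]NodeDescriptor, newDesc NodeDescriptor, oldNodeID NodeID) {
	old, ok := descs[oldNodeID]
	if !ok {
		return
	}
	if old.Address != newDesc.Address {
		// The old node is alive at a different address; do not remove it.
		fmt.Printf("n%d has moved to %s; leaving it alone\n", oldNodeID, old.Address)
		return
	}
	fmt.Printf("removing n%d which was at same address (%s) as new node n%d\n",
		oldNodeID, newDesc.Address, newDesc.NodeID)
	delete(descs, oldNodeID)
}

func main() {
	// n9 already re-gossiped itself at a new address before n7 showed up at its old one.
	descs := map[NodeID]NodeDescriptor{
		9: {NodeID: 9, Address: "localhost:26008"},
	}
	maybeRemoveConflicting(descs, NodeDescriptor{NodeID: 7, Address: "localhost:26009"}, 9)
}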

I'm not sure if that means that n9 can't talk to anyone, or nothing can talk to n9

One day I'll polish #33299 and then you will know.

@petermattis
Collaborator Author

I have to look at this in more detail, but the "deleting from Gossip" code has always bothered me. Why exactly was #10266 such a big problem? A node (say n1) gets wiped, another node (say n2) takes its address, so the cluster may try to connect to n2 when it wants n1. But n1 is gone, so how can cluster health depend on connections to it? The real problem (though I didn't see this mentioned as such in #10266) is that it can actually be dangerous to send stuff to n2 when it was meant for n1 (unintentionally byzantine?)

I don't know. I'm going to spend some time reproducing that scenario and see what badness emerges. It is possible that we've since fixed something else that was causing problems in this area. 2 years ago was pre-1.0.

@petermattis
Collaborator Author

PS The scenario described above reproduces the "missing address" log message on #29035 on 2.1.3, but not on master. I'm pretty sure one of my changes to be more defensive about never returning empty addresses is responsible. I'll track down which one and backport to 2.1.

@petermattis
Collaborator Author

We never look up addresses by nodeID, it's always the other way around. So, in theory, I tend to think that leaving the old entry in Gossip should work fine, though I would add a NodeID check -- the connection that is supposedly connecting to nX should actually be confirmed as connecting to nX.

Eh? We look up the address by nodeID all the time. See Gossip.GetNodeIDAddress and the usage of Gossip as an AddressResolver. Am I completely misunderstanding what you are saying?
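For reference, a stripped-down sketch of that NodeID -> address path (simplified types and signatures, not the real ones):

// Sketch: the dialer resolves a NodeID to an address via gossip's cached
// node descriptors; an emptied descriptor means there is nothing to dial.
package main

import (
	"errors"
	"fmt"
)

type NodeID int32

type NodeDescriptor struct {
	NodeID  NodeID
	Address string
}

// AddressResolver mirrors the shape of the resolver the dialer uses.
type AddressResolver func(NodeID) (string, error)

type gossip struct {
	nodeDescs map[NodeID]NodeDescriptor
}

func (g *gossip) GetNodeIDAddress(id NodeID) (string, error) {
	desc, ok := g.nodeDescs[id]
	if !ok || desc.NodeID == 0 || desc.Address == "" {
		return "", errors.New("unable to look up descriptor")
	}
	return desc.Address, nil
}

func main() {
	g := &gossip{nodeDescs: map[NodeID]NodeDescriptor{2: {NodeID: 2, Address: "localhost:26002"}}}
	var resolve AddressResolver = g.GetNodeIDAddress
	addr, err := resolve(2)
	fmt.Println(addr, err)
}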

@tbg
Member

tbg commented Jan 21, 2019

No, I just wrote the opposite of what I was trying to say. We always look up NodeID -> Address. Sorry about the confusion. Does the rest check out? If a node is gone, nobody should care about what we store under its NodeID, though any goroutines now knowing that node X is dead shouldn't accidentally talk to node Y thinking they've got node X.

@knz
Contributor

knz commented Jan 21, 2019

I was able to repro this with 4 nodes. Here's a script that repros the problem every time in under 2 minutes:

#! /bin/sh

C=$PWD/cockroach
d=$PWD/repro-34120

killall -9 cockroach cockroachshort

rm -rf "$d"
mkdir -p "$d"
cd "$d"

# Start cluster
for n in 1 2 3 4; do
    (
        mkdir -p d$n
        cd d$n
        $C start --background --listen-addr=localhost:2600$n --http-addr=localhost:2700$n --insecure --join=localhost:26001
    )
done
# Init cluster
$C init --host localhost:26001 --insecure

# Create some ranges
(
    for i in `seq 1 10`; do
        echo "create table t$i(x int); insert into t$i(x) values(1);"
    done
) | $C sql --insecure --host localhost:26001

# Wait for cluster to stabilize
tail -f d1/cockroach-data/logs/cockroach.log &
tailpid=$!
while true; do
    sleep 2
    lastlines=$(tail -n 2 d1/cockroach-data/logs/cockroach.log | grep 'runtime stats' | wc -l)
    if test $lastlines -lt 2; then
        continue
    fi
    break
done
kill -9 $tailpid
echo "OK"

# Stop the last two nodes.
$C quit --host localhost:26003 --insecure
$C quit --host localhost:26004 --insecure

# Restart with node shift
(
    cd d3
    $C start --background --listen-addr=localhost:26004 --http-addr=localhost:27004 --insecure
)

# Start a load on all tables
(
    while true; do
        sleep 1
        for i in `seq 1 10`; do
            echo "select * from t$i limit 1"
        done
    done
) | $C sql --insecure --host localhost:26001 &>sql.log &

exec tail -f d1/cockroach-data/logs/cockroach.log

@petermattis
Collaborator Author

Yes, I think the rest checks out. In this issue, the old node still exists, but we've nuked its address record. In #10266, the old node no longer existed. I'm not sure why that was problematic. I'm not able to reproduce the problem that was seen there.

The real problem (though I didn't see this mentioned as such in #10266) is that it can actually be dangerous to send stuff to n2 when it was meant for n1 (unintentionally byzantine?)

Why is this dangerous? BatchRequest.Header contains the target replica, including the store ID. If the request is directed to the wrong node, we'll get an error when trying to look up the store. Ditto for Raft messages. DistSQL might be an exception, but I think it should operate perfectly fine if requests are directed to the wrong node, though it is possible we'll get some odd error if the node addressing changes in the middle of handling a request.
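To illustrate why a misdirected request fails fast rather than doing damage, a minimal sketch (stand-in types for BatchRequest.Header and the per-node store map, not the real code):

// Sketch: the header names the target store, and a node that doesn't have
// that store refuses the request with a "store not found" error.
package main

import "fmt"

type StoreID int32

type BatchHeader struct {
	TargetStoreID StoreID
}

type node struct {
	stores map[StoreID]struct{}
}

func (n *node) handleBatch(h BatchHeader) error {
	if _, ok := n.stores[h.TargetStoreID]; !ok {
		return fmt.Errorf("store %d was not found", h.TargetStoreID)
	}
	// ... execute the request against the store ...
	return nil
}

func main() {
	n := &node{stores: map[StoreID]struct{}{2: {}}} // this node only has s2
	// A request intended for s1 landed here because the address moved.
	fmt.Println(n.handleBatch(BatchHeader{TargetStoreID: 1})) // store 1 was not found
}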

@petermattis
Collaborator Author

@knz Your bash skills are greater than mine. For posterity, here is what I was using to reproduce:

#!/bin/bash

set -euo pipefail

roachprod wipe local
roachprod start local --sequential --env=COCKROACH_SCAN_MAX_IDLE_TIME=20ms
roachprod sql local:1 -- -e "ALTER RANGE default CONFIGURE ZONE USING num_replicas = 9"
roachprod sql local:1 -- -e "ALTER RANGE meta CONFIGURE ZONE USING num_replicas = 9"
roachprod sql local:1 -- -e "ALTER RANGE system CONFIGURE ZONE USING num_replicas = 9"
roachprod sql local:1 -- -e "ALTER RANGE liveness CONFIGURE ZONE USING num_replicas = 9"
roachprod sql local:1 -- -e "ALTER DATABASE system CONFIGURE ZONE USING num_replicas = 9"
roachprod sql local:1 -- -e "ALTER TABLE system.public.jobs CONFIGURE ZONE USING num_replicas = 9"

while :; do
    n=$(roachprod sql local:1 -- \
                  -e "SELECT count(*) FROM crdb_internal.ranges WHERE array_length(replicas,1) != 9" \
                  --format csv | tail -1)
    echo "$n under-replicated ranges"
    if [ $n == 0 ]; then
        break
    fi
    sleep 1
done

roachprod sql local:1 -- -e "SET CLUSTER SETTING server.time_until_store_dead = '90s'"
roachprod stop local:7-9
# Re-arrange the node directories so that nodes start on different ports.
pushd ~/local
mv 9 tmp
mv 8 9
mv 7 8
mv tmp 7
popd
sleep 120
roachprod start local:7-9 --sequential --env=COCKROACH_SCAN_MAX_IDLE_TIME=20ms

@petermattis
Collaborator Author

And here is a script to try and reproduce the problem seen in #10266:

#!/bin/bash

set -euo pipefail

roachprod wipe local
roachprod start local --sequential
roachprod sql local:1 -- -e "CREATE DATABASE foo;"

while :; do
    n=$(roachprod sql local:1 -- \
                  -e "SELECT count(*) FROM crdb_internal.ranges WHERE array_length(replicas,1) != 3" \
                  --format csv | tail -1)
    echo "$n under-replicated ranges"
    if [ $n == 0 ]; then
        break
    fi
    sleep 1
done


roachprod stop local:3
roachprod wipe local:3
roachprod start local:3

for i in $(seq 1 3); do
    roachprod sql local:$i -- -e "SHOW DATABASES"
done

@knz
Contributor

knz commented Jan 21, 2019

Removing the offending check triggers the following log spam until the removed node/store is detected as dead:

W190121 17:14:21.687930 2493 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
W190121 17:14:21.886785 2580 storage/store.go:3744  [n1,s1] raft error: node 4 claims to not contain store 4 for replica (n4,s4):?: store 4 was not found

I guess it makes sense? Not sure whether it matters.

@knz
Contributor

knz commented Jan 21, 2019

Ok so it's worse than this. The spam above continues far beyond the store death timeout.

In fact, the old node never becomes dead and both the restarted node and the dead node are disconnected.

I think when the node starts it needs to refresh its record in KV? Will investigate.

@petermattis
Collaborator Author

We could make an effort to mitigate that spam. We could use that error as a signal to clear out the node addressing info in Gossip. Or we could change the offending check to clear the node address for oldNodeID if it matches the address in bootstrapAddrs.
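A minimal sketch of the first option (hypothetical hook and simplified types; just to show the shape of using the error as a signal to drop the stale address so it gets re-resolved from the next gossiped descriptor):

// Sketch: treat a "store X was not found" response from node Y as a signal
// that our cached address for Y is stale, and drop it.
package main

import (
	"fmt"
	"strings"
)

type NodeID int32

type addrCache map[NodeID]string

// onRaftSendError is a hypothetical callback for errors coming back on the
// Raft transport to nodeID.
func (c addrCache) onRaftSendError(nodeID NodeID, err error) {
	if strings.Contains(err.Error(), "was not found") {
		fmt.Printf("clearing stale address %q for n%d\n", c[nodeID], nodeID)
		delete(c, nodeID)
	}
}

func main() {
	c := addrCache{4: "localhost:26002"}
	c.onRaftSendError(4, fmt.Errorf("store 4 was not found"))
}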

@petermattis
Collaborator Author

In fact, the old node never becomes dead and both the restarted node and the dead node are disconnected.

Can you elaborate on this? What do you mean that the restarted node becomes disconnected?

I think when the node starts it needs to refresh its record in KV? Will investigate.

FYI, the node addressing is all based on gossiped info. KV isn't involved.

@knz
Contributor

knz commented Jan 21, 2019

So remember I am starting 4 nodes. In this experiment I have the following map:

Node ID | server number
n1      | 1
n2      | 3
n3      | 4
n4      | 2

Here's the gossip map after that from the perspective of n1 / server 1, with the offending check (quoted earlier in this thread) removed:

gossip client (0/3 cur/max conns)
gossip server (1/3 cur/max conns, infos 704/219 sent/received, bytes 180754B/62934B sent/received)
  4: localhost:26002 (59s)
gossip connectivity
  n1 [sentinel];
  n2 -> n4; n4 -> n1;

So the two nodes n2/server3 and n3/server4 have lost connectivity.

Also, the log spam does not stop, which means that the remaining nodes (n1/n4) are still attempting to talk to the other stores via a wrong address, in this instance n3. From n1's logs:

W190121 17:18:19.849946 2524 storage/store.go:3744  [n1,s1] raft error: node 3 claims to not contain store 3 for replica (n3,s3):?: store 3 was not found
W190121 17:18:19.849964 2522 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 3: store 3 was not found:

If I look at the log file from n3/server4 I see what I'm expecting (node is dead and wasn't restarted):

W190121 17:17:59.071044 208 storage/raft_transport.go:583  [n3] while processing outgoing Raft queue to node 2: rpc error: code = Unavailable desc = transport is closing:
W190121 17:17:59.071061 367 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26003 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190121 17:17:59.071080 367 vendor/google.golang.org/grpc/clientconn.go:1440  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
W190121 17:17:59.074234 650 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26003 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:26003: connect: connection refused". Reconnecting...
I190121 17:17:59.247856 709 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n3] circuitbreaker: rpc 127.0.0.1:26004->2 tripped: failed to grpc dial n2 at localhost:26003: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:26003: connect: connection refused"
I190121 17:17:59.247880 709 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n3] circuitbreaker: rpc 127.0.0.1:26004->2 event: BreakerTripped
I190121 17:17:59.247894 709 rpc/nodedialer/nodedialer.go:95  [ct-client] unable to connect to n2: failed to grpc dial n2 at localhost:26003: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [::1]:26003: connect: connection refused"
W190121 17:18:00.079127 650 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26003 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190121 17:18:00.079153 650 vendor/google.golang.org/grpc/clientconn.go:953  Failed to dial localhost:26003: grpc: the connection is closing; please retry.
I190121 17:18:00.124214 727 storage/store.go:1062  [drain] waiting for 6 replicas to transfer their lease away
I190121 17:18:00.136966 654 util/stop/stopper.go:536  quiescing; tasks left:
3      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
W190121 17:18:00.136989 23 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26004 0  <nil>}. Err :connection error: desc = "transport: failed to write client preface: io: read/write on closed pipe". Reconnecting...
W190121 17:18:00.136999 23 vendor/google.golang.org/grpc/clientconn.go:1440  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I190121 17:18:00.137080 654 util/stop/stopper.go:536  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
W190121 17:18:00.137084 449 storage/raft_transport.go:583  [n3] while processing outgoing Raft queue to node 1: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190121 17:18:00.137090 158 gossip/gossip.go:1491  [n3] no incoming or outgoing connections
I190121 17:18:00.137107 654 util/stop/stopper.go:536  quiescing; tasks left:
1      [async] closedts-subscription
W190121 17:18:00.137146 209 storage/raft_transport.go:583  [n3] while processing outgoing Raft queue to node 4: rpc error: code = Canceled desc = grpc: the client connection is closing:
I190121 17:18:00.138459 216 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n3] circuitbreaker: rpc 127.0.0.1:26004->4 tripped: failed to grpc dial n4 at localhost:26002: context canceled
I190121 17:18:00.138466 216 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n3] circuitbreaker: rpc 127.0.0.1:26004->4 event: BreakerTripped
I190121 17:18:00.138471 216 rpc/nodedialer/nodedialer.go:95  [n3] unable to connect to n4: failed to grpc dial n4 at localhost:26002: context canceled
I190121 17:18:00.138482 192 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n3] circuitbreaker: rpc 127.0.0.1:26004->1 tripped: failed to grpc dial n1 at localhost:26001: context canceled
I190121 17:18:00.138489 192 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n3] circuitbreaker: rpc 127.0.0.1:26004->1 event: BreakerTripped
I190121 17:18:00.138494 192 rpc/nodedialer/nodedialer.go:95  [n3] unable to connect to n1: failed to grpc dial n1 at localhost:26001: context canceled
I190121 17:18:00.138549 654 storage/engine/rocksdb.go:706  closing rocksdb instance at "/data/home/kena/cockroach/repro-34120/d4/cockroach-data/cockroach-temp904241024"
I190121 17:18:00.139556 654 storage/engine/rocksdb.go:706  closing rocksdb instance at "/data/home/kena/cockroach/repro-34120/d4/cockroach-data"

Then if I look at the log file from n2/server3:

I190121 17:18:00.860309 37 server/node.go:708  [n2] node connected via gossip and verified as part of cluster "f38ebd86-a16b-4576-a61b-7f19cfbf55f4"
I190121 17:18:00.860325 37 server/node.go:537  [n2] node=2: started with [<no-attributes>=/data/home/kena/cockroach/repro-34120/d3/cockroach-data] engine(s) and attributes []
I190121 17:18:00.860443 37 server/server.go:1542  [n2] starting http server at 127.0.0.1:27004 (use: localhost:27004)
I190121 17:18:00.860452 37 server/server.go:1544  [n2] starting grpc/postgres server at 127.0.0.1:26004
I190121 17:18:00.860460 37 server/server.go:1545  [n2] advertising CockroachDB node at localhost:26004
W190121 17:18:00.860475 37 jobs/registry.go:316  [n2] unable to get node liveness: node not in the liveness table
W190121 17:18:00.914795 332 storage/store.go:1465  [n2,s2,r6/2:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n2,s2):2 not lease holder; lease holder unknown
W190121 17:18:01.016642 332 storage/store.go:1465  [n2,s2,r6/2:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n2,s2):2 not lease holder; lease holder unknown
I190121 17:18:01.073116 175 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n2] circuitbreaker: rpc 127.0.0.1:26004->1 tripped: failed to resolve n1: unable to look up descriptor for n1
I190121 17:18:01.073117 168 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n2] circuitbreaker: rpc 127.0.0.1:26004->4 tripped: failed to resolve n4: unable to look up descriptor for n4
I190121 17:18:01.073143 168 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc 127.0.0.1:26004->4 event: BreakerTripped
I190121 17:18:01.073134 175 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc 127.0.0.1:26004->1 event: BreakerTripped
I190121 17:18:01.073167 168 rpc/nodedialer/nodedialer.go:95  [n2] unable to connect to n4: failed to resolve n4: unable to look up descriptor for n4
I190121 17:18:01.073183 175 rpc/nodedialer/nodedialer.go:95  [n2] unable to connect to n1: failed to resolve n1: unable to look up descriptor for n1
I190121 17:18:01.073186 174 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n2] circuitbreaker: rpc 127.0.0.1:26004->3 tripped: failed to resolve n3: unable to look up descriptor for n3
I190121 17:18:01.073194 174 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc 127.0.0.1:26004->3 event: BreakerTripped
I190121 17:18:01.073200 174 rpc/nodedialer/nodedialer.go:95  [n2] unable to connect to n3: failed to resolve n3: unable to look up descriptor for n3
W190121 17:18:01.223543 332 storage/store.go:1465  [n2,s2,r6/2:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n2,s2):2 not lease holder; lease holder unknown
I190121 17:18:01.402606 421 gossip/client.go:128  [n2] started gossip client to localhost:26002
I190121 17:18:01.586042 54 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc 127.0.0.1:26004->1 event: BreakerReset
W190121 17:18:01.589560 480 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):1: no handler registered for (n3,s3):2
W190121 17:18:01.619270 332 storage/store.go:1465  [n2,s2,r6/2:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n2,s2):2 not lease holder; lease holder unknown
I190121 17:18:01.623626 37 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc 127.0.0.1:26004->4 event: BreakerReset
I190121 17:18:01.624023 37 server/server.go:1610  [n2] done ensuring all necessary migrations have run
I190121 17:18:01.624033 37 server/server.go:1613  [n2] serving sql connections
[...]
I190121 17:18:01.638501 487 sql/event_log.go:135  [n2] Event: "node_restart", target: 2, info: {Descriptor:{NodeID:2 Address:localhost:26004 Attrs: Locality: ServerVersion:2.1-4 BuildTag:v2.2.0-alpha.20181217-712-g792e95bf11-dirty StartedAt:1548091080853848250 LocalityAddress:[] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} ClusterID:f38ebd86-a16b-4576-a61b-7f19cfbf55f4 StartedAt:1548091080853848250 LastUp:1548091079041784551}
W190121 17:18:01.645656 70 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):1: no handler registered for (n3,s3):3
W190121 17:18:01.944415 515 storage/raft_transport.go:281  unable to accept Raft message from (n4,s4):?: no handler registered for (n3,s3):?
W190121 17:18:02.043922 455 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):?: no handler registered for (n3,s3):?
W190121 17:18:02.070951 352 storage/node_liveness.go:516  [n2,hb] slow heartbeat took 1.2s
I190121 17:18:02.070966 352 storage/node_liveness.go:449  [n2,hb] heartbeat failed on epoch increment; retrying
W190121 17:18:02.248877 360 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):?: no handler registered for (n3,s3):?
I190121 17:18:02.461925 426 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n2] circuitbreaker: rpc 127.0.0.1:26004->3 event: BreakerReset
W190121 17:18:02.507531 332 storage/store.go:1465  [n2,s2,r6/2:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n2,s2):2 not lease holder; lease holder unknown
I190121 17:18:02.748205 484 server/server_update.go:63  [n2] failed attempt to upgrade cluster version, error: node 3 not running (UNAVAILABLE), cannot determine version
I190121 17:18:02.773712 163 gossip/gossip.go:1505  [n2] node has connected to cluster via gossip
I190121 17:18:02.773842 163 storage/stores.go:262  [n2] wrote 2 node addresses to persistent storage
W190121 17:18:02.947588 362 storage/raft_transport.go:281  unable to accept Raft message from (n4,s4):?: no handler registered for (n3,s3):?
W190121 17:18:03.048906 111 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):?: no handler registered for (n3,s3):?
W190121 17:18:03.145323 493 storage/raft_transport.go:281  unable to accept Raft message from (n4,s4):?: no handler registered for (n3,s3):?
W190121 17:18:03.353203 441 storage/raft_transport.go:281  unable to accept Raft message from (n4,s4):3: no handler registered for (n3,s3):4
W190121 17:18:03.359529 560 storage/raft_transport.go:281  unable to accept Raft message from (n2,s2):2: no handler registered for (n3,s3):4
W190121 17:18:03.359657 113 storage/store.go:3744  [n2,s2,r18/2:/Table/2{2-3}] raft error: node 3 claims to not contain store 3 for replica (n3,s3):4: store 3 was not found
W190121 17:18:03.359673 558 storage/raft_transport.go:583  [n2] while processing outgoing Raft queue to node 3: store 3 was not found:
W190121 17:18:03.360956 513 storage/raft_transport.go:281  unable to accept Raft message from (n2,s2):2: no handler registered for (n3,s3):4
W190121 17:18:03.361080 443 storage/store.go:3744  [n2,s2,r18/2:/Table/2{2-3}] raft error: node 3 claims to not contain store 3 for replica (n3,s3):4: store 3 was not found
W190121 17:18:03.361094 561 storage/raft_transport.go:583  [n2] while processing outgoing Raft queue to node 3: store 3 was not found:
W190121 17:18:03.443301 578 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):1: no handler registered for (n3,s3):2
W190121 17:18:03.445733 595 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):1: no handler registered for (n3,s3):2
W190121 17:18:03.842843 610 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):?: no handler registered for (n3,s3):?
W190121 17:18:04.045573 584 storage/raft_transport.go:281  unable to accept Raft message from (n4,s4):?: no handler registered for (n3,s3):?
W190121 17:18:04.045685 613 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):?: no handler registered for (n3,s3):?
W190121 17:18:04.205356 365 storage/raft_transport.go:281  unable to accept Raft message from (n2,s2):3: no handler registered for (n3,s3):2
[...]
W190121 17:20:30.764058 3919 storage/raft_transport.go:583  [n2] while processing outgoing Raft queue to node 3: store 3 was not found:
W190121 17:20:30.847088 4362 storage/raft_transport.go:281  unable to accept Raft message from (n4,s4):?: no handler registered for (n3,s3):?
W190121 17:20:30.879195 4407 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190121 17:20:30.880408 4407 vendor/google.golang.org/grpc/clientconn.go:953  Failed to dial localhost:26001: grpc: the connection is closing; please retry.
I190121 17:20:31.014124 345 server/status/runtime.go:464  [n2] runtime stats: 212 MiB RSS, 269 goroutines, 88 MiB/91 MiB/204 MiB GO alloc/idle/total, 20 MiB/29 MiB CGO alloc/total, 317.3 CGO/sec, 1.7/0.3 %(u/s)time, 0.0 %gc (1x), 411 KiB/415 KiB (r/w)net
W190121 17:20:31.046990 4484 storage/raft_transport.go:281  unable to accept Raft message from (n4,s4):?: no handler registered for (n3,s3):?
W190121 17:20:31.167005 4365 storage/raft_transport.go:281  unable to accept Raft message from (n2,s2):?: no handler registered for (n3,s3):?
W190121 17:20:31.168354 4059 storage/store.go:3744  [n2,s2] raft error: node 3 claims to not contain store 3 for replica (n3,s3):?: store 3 was not found
W190121 17:20:31.169719 4057 storage/raft_transport.go:583  [n2] while processing outgoing Raft queue to node 3: store 3 was not found:
W190121 17:20:31.366659 4438 storage/raft_transport.go:281  unable to accept Raft message from (n2,s2):?: no handler registered for (n3,s3):?
W190121 17:20:31.368247 4437 storage/store.go:3744  [n2,s2] raft error: node 3 claims to not contain store 3 for replica (n3,s3):?: store 3 was not found
W190121 17:20:31.369402 4176 storage/raft_transport.go:583  [n2] while processing outgoing Raft queue to node 3: store 3 was not found:
W190121 17:20:31.462744 4140 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:26001: connect: connection refused". Reconnecting...
W190121 17:20:31.663832 347 server/node.go:872  [n2,summaries] health alerts detected: {Alerts:[{StoreID:2 Category:METRICS Description:ranges.underreplicated Value:4 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
W190121 17:20:31.767965 4415 storage/raft_transport.go:281  unable to accept Raft message from (n2,s2):?: no handler registered for (n3,s3):?
W190121 17:20:31.770002 4062 storage/store.go:3744  [n2,s2] raft error: node 3 claims to not contain store 3 for replica (n3,s3):?: store 3 was not found
W190121 17:20:31.772561 4060 storage/raft_transport.go:583  [n2] while processing outgoing Raft queue to node 3: store 3 was not found:
W190121 17:20:31.842366 4120 storage/raft_transport.go:281  unable to accept Raft message from (n4,s4):?: no handler registered for (n3,s3):?
W190121 17:20:32.047633 4470 storage/raft_transport.go:281  unable to accept Raft message from (n4,s4):?: no handler registered for (n3,s3):?
W190121 17:20:32.166881 4486 storage/raft_transport.go:281  unable to accept Raft message from (n2,s2):?: no handler registered for (n3,s3):?
W190121 17:20:32.168810 4123 storage/store.go:3744  [n2,s2] raft error: node 3 claims to not contain store 3 for replica (n3,s3):?: store 3 was not found
W190121 17:20:32.171052 4121 storage/raft_transport.go:583  [n2] while processing outgoing Raft queue to node 3: store 3 was not found:
W190121 17:20:32.367514 4442 storage/raft_transport.go:281  unable to accept Raft message from (n2,s2):?: no handler registered for (n3,s3):?
W190121 17:20:32.369487 4126 storage/store.go:3744  [n2,s2] raft error: node 3 claims to not contain store 3 for replica (n3,s3):?: store 3 was not found
W190121 17:20:32.371439 4124 storage/raft_transport.go:583  [n2] while processing outgoing Raft queue to node 3: store 3 was not found:
W190121 17:20:32.463878 4140 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {localhost:26001 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190121 17:20:32.465262 4140 vendor/google.golang.org/grpc/clientconn.go:953  Failed to dial localhost:26001: grpc: the connection is closing; please retry.
I190121 17:20:32.492570 484 server/server_update.go:63  [n2] failed attempt to upgrade cluster version, error: node 1 not running (UNAVAILABLE), cannot determine version

@knz
Contributor

knz commented Jan 21, 2019

Ok, false alarm; the problem only lasts until server.time_until_store_dead. I thought it was longer than that because I incorrectly tried to set it lower than its default of 5 minutes (I thought I had set it to 5 seconds, but that config was silently rejected because the minimum is 1m15s).

After the store is detected dead, the cluster appears to work fine:

On the restarted node:

W190121 17:42:00.905337 2420 storage/raft_transport.go:281  unable to accept Raft message from (n2,s2):3: no handler registered for (n4,s4):2
W190121 17:42:00.914741 2375 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):1: no handler registered for (n4,s4):2
W190121 17:42:00.914780 2303 storage/raft_transport.go:281  unable to accept Raft message from (n2,s2):3: no handler registered for (n4,s4):2
W190121 17:42:00.919330 2437 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):1: no handler registered for (n4,s4):2
W190121 17:42:00.927325 2305 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):1: no handler registered for (n4,s4):2
W190121 17:42:00.931189 2423 storage/raft_transport.go:281  unable to accept Raft message from (n1,s1):1: no handler registered for (n4,s4):2
I190121 17:42:01.622319 2466 storage/store_snapshot.go:762  [n3,replicate,s3,r18/2:/Table/2{2-3}] sending preemptive snapshot d27cb678 at applied index 35
I190121 17:42:01.622497 2466 storage/store_snapshot.go:805  [n3,replicate,s3,r18/2:/Table/2{2-3}] streamed snapshot to (n1,s1):?: kv pairs: 11, log entries: 25, rate-limit: 8.0 MiB/sec, 0.00s
I190121 17:42:01.630593 2466 storage/replica_command.go:809  [n3,replicate,s3,r18/2:/Table/2{2-3}] change replicas (ADD_REPLICA (n1,s1):5): read existing descriptor r18:/Table/2{2-3} [(n4,s4):4, (n3,s3):2, (n2,s2):3, next=5, gen=0]
I190121 17:42:01.654296 2466 storage/replica_raft.go:372  [n3,s3,r18/2:/Table/2{2-3}] proposing ADD_REPLICA((n1,s1):5): updated=[(n4,s4):4 (n3,s3):2 (n2,s2):3 (n1,s1):5] next=6
I190121 17:42:01.656944 2444 storage/replica_command.go:809  [n3,replicate,s3,r18/2:/Table/2{2-3}] change replicas (REMOVE_REPLICA (n4,s4):4): read existing descriptor r18:/Table/2{2-3} [(n4,s4):4, (n3,s3):2, (n2,s2):3, (n1,s1):5, next=6, gen=0]
I190121 17:42:01.667929 2444 storage/replica_raft.go:372  [n3,s3,r18/2:/Table/2{2-3}] proposing REMOVE_REPLICA((n4,s4):4): updated=[(n1,s1):5 (n3,s3):2 (n2,s2):3] next=6
W190121 17:42:01.696874 2394 storage/raft_transport.go:281  unable to accept Raft message from (n3,s3):?: no handler registered for (n4,s4):?
W190121 17:42:01.697015 2485 storage/store.go:3744  [n3,s3] raft error: node 4 claims to not contain store 4 for replica (n4,s4):?: store 4 was not found
W190121 17:42:01.697032 2470 storage/raft_transport.go:583  [n3] while processing outgoing Raft queue to node 4: store 4 was not found:
I190121 17:42:01.779075 2472 storage/store_snapshot.go:762  [n3,replicate,s3,r28/2:/Table/{59-60}] sending preemptive snapshot 33661f19 at applied index 38
I190121 17:42:01.779249 2472 storage/store_snapshot.go:805  [n3,replicate,s3,r28/2:/Table/{59-60}] streamed snapshot to (n1,s1):?: kv pairs: 13, log entries: 28, rate-limit: 8.0 MiB/sec, 0.00s
I190121 17:42:01.782199 2472 storage/replica_command.go:809  [n3,replicate,s3,r28/2:/Table/{59-60}] change replicas (ADD_REPLICA (n1,s1):5): read existing descriptor r28:/Table/{59-60} [(n4,s4):4, (n3,s3):2, (n2,s2):3, next=5, gen=1]
I190121 17:42:01.791369 2472 storage/replica_raft.go:372  [n3,s3,r28/2:/Table/{59-60}] proposing ADD_REPLICA((n1,s1):5): updated=[(n4,s4):4 (n3,s3):2 (n2,s2):3 (n1,s1):5] next=6
I190121 17:42:01.794482 2476 storage/replica_command.go:809  [n3,replicate,s3,r28/2:/Table/{59-60}] change replicas (REMOVE_REPLICA (n4,s4):4): read existing descriptor r28:/Table/{59-60} [(n4,s4):4, (n3,s3):2, (n2,s2):3, (n1,s1):5, next=6, gen=1]
I190121 17:42:01.804934 2476 storage/replica_raft.go:372  [n3,s3,r28/2:/Table/{59-60}] proposing REMOVE_REPLICA((n4,s4):4): updated=[(n1,s1):5 (n3,s3):2 (n2,s2):3] next=6
I190121 17:42:01.958265 2449 storage/store_snapshot.go:762  [n3,replicate,s3,r9/3:/Table/1{3-4}] sending preemptive snapshot 30976eaf at applied index 279
I190121 17:42:01.959324 2449 storage/store_snapshot.go:805  [n3,replicate,s3,r9/3:/Table/1{3-4}] streamed snapshot to (n2,s2):?: kv pairs: 659, log entries: 133, rate-limit: 8.0 MiB/sec, 0.00s
I190121 17:42:01.963541 2449 storage/replica_command.go:809  [n3,replicate,s3,r9/3:/Table/1{3-4}] change replicas (ADD_REPLICA (n2,s2):4): read existing descriptor r9:/Table/1{3-4} [(n1,s1):1, (n4,s4):2, (n3,s3):3, next=4, gen=0]
I190121 17:42:01.977293 2449 storage/replica_raft.go:372  [n3,s3,r9/3:/Table/1{3-4}] proposing ADD_REPLICA((n2,s2):4): updated=[(n1,s1):1 (n4,s4):2 (n3,s3):3 (n2,s2):4] next=5
I190121 17:42:01.984935 2456 storage/replica_command.go:809  [n3,replicate,s3,r9/3:/Table/1{3-4}] change replicas (REMOVE_REPLICA (n4,s4):2): read existing descriptor r9:/Table/1{3-4} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=0]
I190121 17:42:01.998881 2456 storage/replica_raft.go:372  [n3,s3,r9/3:/Table/1{3-4}] proposing REMOVE_REPLICA((n4,s4):2): updated=[(n1,s1):1 (n2,s2):4 (n3,s3):3] next=5
I190121 17:42:02.313576 2490 storage/store_snapshot.go:762  [n3,replicate,s3,r22/3:/Table/5{3-4}] sending preemptive snapshot 2da6c207 at applied index 37
I190121 17:42:02.313740 2490 storage/store_snapshot.go:805  [n3,replicate,s3,r22/3:/Table/5{3-4}] streamed snapshot to (n1,s1):?: kv pairs: 13, log entries: 27, rate-limit: 8.0 MiB/sec, 0.00s
I190121 17:42:02.316398 2490 storage/replica_command.go:809  [n3,replicate,s3,r22/3:/Table/5{3-4}] change replicas (ADD_REPLICA (n1,s1):5): read existing descriptor r22:/Table/5{3-4} [(n4,s4):4, (n2,s2):2, (n3,s3):3, next=5, gen=1]
I190121 17:42:02.326787 2490 storage/replica_raft.go:372  [n3,s3,r22/3:/Table/5{3-4}] proposing ADD_REPLICA((n1,s1):5): updated=[(n4,s4):4 (n2,s2):2 (n3,s3):3 (n1,s1):5] next=6
I190121 17:42:02.328760 2397 storage/replica_command.go:809  [n3,replicate,s3,r22/3:/Table/5{3-4}] change replicas (REMOVE_REPLICA (n4,s4):4): read existing descriptor r22:/Table/5{3-4} [(n4,s4):4, (n2,s2):2, (n3,s3):3, (n1,s1):5, next=6, gen=1]
I190121 17:42:02.340572 2397 storage/replica_raft.go:372  [n3,s3,r22/3:/Table/5{3-4}] proposing REMOVE_REPLICA((n4,s4):4): updated=[(n1,s1):5 (n2,s2):2 (n3,s3):3] next=6
W190121 17:42:06.535917 159 server/node.go:872  [n3,summaries] health alerts detected: {Alerts:[{StoreID:3 Category:METRICS Description:ranges.underreplicated Value:3 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
I190121 17:42:06.577269 157 server/status/runtime.go:464  [n3] runtime stats: 192 MiB RSS, 286 goroutines, 121 MiB/61 MiB/204 MiB GO alloc/idle/total, 24 MiB/33 MiB CGO alloc/total, 856.2 CGO/sec, 2.2/0.6 %(u/s)time, 0.0 %gc (0x), 1.3 MiB/1.3 MiB (r/w)net
I190121 17:42:16.584439 157 server/status/runtime.go:464  [n3] runtime stats: 193 MiB RSS, 286 goroutines, 131 MiB/51 MiB/204 MiB GO alloc/idle/total, 24 MiB/34 MiB CGO alloc/total, 193.7 CGO/sec, 1.0/0.2 %(u/s)time, 0.0 %gc (0x), 224 KiB/227 KiB (r/w)net

On the other nodes in the cluster:

W190121 17:42:00.722502 4026 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
W190121 17:42:00.725213 4069 storage/store.go:3744  [n1,s1,r29/1:/Table/6{0-1}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):2: store 4 was not found
W190121 17:42:00.725226 3772 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
I190121 17:42:00.730614 4052 storage/replica_raft.go:372  [n1,s1,r29/1:/Table/6{0-1}] proposing ADD_REPLICA((n3,s3):4): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):3 (n3,s3):4] next=5
W190121 17:42:00.730955 3774 storage/store.go:3744  [n1,s1,r29/1:/Table/6{0-1}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):2: store 4 was not found
W190121 17:42:00.730965 3921 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
W190121 17:42:00.736556 4083 storage/store.go:3744  [n1,s1,r29/1:/Table/6{0-1}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):2: store 4 was not found
W190121 17:42:00.736574 4070 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
I190121 17:42:00.737023 4085 storage/store_snapshot.go:762  [n1,replicate,s1,r7/1:/Table/1{1-2}] sending preemptive snapshot a82de01a at applied index 47
I190121 17:42:00.737251 4085 storage/store_snapshot.go:805  [n1,replicate,s1,r7/1:/Table/1{1-2}] streamed snapshot to (n3,s3):?: kv pairs: 20, log entries: 37, rate-limit: 8.0 MiB/sec, 0.00s
I190121 17:42:00.744923 4085 storage/replica_command.go:809  [n1,replicate,s1,r7/1:/Table/1{1-2}] change replicas (ADD_REPLICA (n3,s3):4): read existing descriptor r7:/Table/1{1-2} [(n1,s1):1, (n2,s2):2, (n4,s4):3, next=4, gen=0]
W190121 17:42:00.755296 4015 storage/store.go:3744  [n1,s1,r29/1:/Table/6{0-1}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):2: store 4 was not found
W190121 17:42:00.755307 3886 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
I190121 17:42:00.761144 4085 storage/replica_raft.go:372  [n1,s1,r7/1:/Table/1{1-2}] proposing ADD_REPLICA((n3,s3):4): updated=[(n1,s1):1 (n2,s2):2 (n4,s4):3 (n3,s3):4] next=5
W190121 17:42:00.762334 4055 storage/store.go:3744  [n1,s1,r29/1:/Table/6{0-1}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):2: store 4 was not found
W190121 17:42:00.762345 4088 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
I190121 17:42:00.764086 4057 storage/store_snapshot.go:762  [n1,replicate,s1,r10/1:/Table/1{4-5}] sending preemptive snapshot 35fc05d3 at applied index 23
I190121 17:42:00.764190 4057 storage/store_snapshot.go:805  [n1,replicate,s1,r10/1:/Table/1{4-5}] streamed snapshot to (n2,s2):?: kv pairs: 9, log entries: 13, rate-limit: 8.0 MiB/sec, 0.00s
I190121 17:42:00.765587 4060 storage/replica_raftstorage.go:805  [n1,s1,r16/?:{-}] applying preemptive snapshot at index 33 (id=0e8a89e3, encoded size=4628, 1 rocksdb batches, 23 log entries)
I190121 17:42:00.767064 4057 storage/replica_command.go:809  [n1,replicate,s1,r10/1:/Table/1{4-5}] change replicas (ADD_REPLICA (n2,s2):4): read existing descriptor r10:/Table/1{4-5} [(n1,s1):1, (n4,s4):2, (n3,s3):3, next=4, gen=0]
I190121 17:42:00.769454 4060 storage/replica_raftstorage.go:811  [n1,s1,r16/?:/Table/2{0-1}] applied preemptive snapshot in 4ms [clear=0ms batch=0ms entries=0ms commit=4ms]
I190121 17:42:00.781507 4057 storage/replica_raft.go:372  [n1,s1,r10/1:/Table/1{4-5}] proposing ADD_REPLICA((n2,s2):4): updated=[(n1,s1):1 (n4,s4):2 (n3,s3):3 (n2,s2):4] next=5
I190121 17:42:00.786478 4100 storage/replica_command.go:809  [n1,replicate,s1,r25/1:/Table/5{6-7}] change replicas (REMOVE_REPLICA (n4,s4):3): read existing descriptor r25:/Table/5{6-7} [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4, next=5, gen=1]
I190121 17:42:00.798231 4100 storage/replica_raft.go:372  [n1,s1,r25/1:/Table/5{6-7}] proposing REMOVE_REPLICA((n4,s4):3): updated=[(n1,s1):1 (n3,s3):2 (n2,s2):4] next=5
I190121 17:42:00.813689 4134 storage/replica_command.go:809  [n1,replicate,s1,r14/1:/Table/1{8-9}] change replicas (REMOVE_REPLICA (n4,s4):2): read existing descriptor r14:/Table/1{8-9} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=0]
I190121 17:42:00.887490 4134 storage/replica_raft.go:372  [n1,s1,r14/1:/Table/1{8-9}] proposing REMOVE_REPLICA((n4,s4):2): updated=[(n1,s1):1 (n2,s2):4 (n3,s3):3] next=5
I190121 17:42:00.892425 4151 storage/replica_command.go:809  [n1,replicate,s1,r12/1:/Table/1{6-7}] change replicas (REMOVE_REPLICA (n4,s4):3): read existing descriptor r12:/Table/1{6-7} [(n1,s1):1, (n3,s3):2, (n4,s4):3, (n2,s2):4, next=5, gen=0]
I190121 17:42:00.903354 4151 storage/replica_raft.go:372  [n1,s1,r12/1:/Table/1{6-7}] proposing REMOVE_REPLICA((n4,s4):3): updated=[(n1,s1):1 (n3,s3):2 (n2,s2):4] next=5
I190121 17:42:00.912892 4155 storage/replica_command.go:809  [n1,replicate,s1,r29/1:/Table/6{0-1}] change replicas (REMOVE_REPLICA (n4,s4):2): read existing descriptor r29:/Table/6{0-1} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n3,s3):4, next=5, gen=1]
W190121 17:42:00.914875 4127 storage/store.go:3744  [n1,s1,r29/1:/Table/6{0-1}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):2: store 4 was not found
W190121 17:42:00.914884 4174 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
W190121 17:42:00.919438 4197 storage/store.go:3744  [n1,s1,r29/1:/Table/6{0-1}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):2: store 4 was not found
W190121 17:42:00.919450 4138 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
I190121 17:42:00.927170 4155 storage/replica_raft.go:372  [n1,s1,r29/1:/Table/6{0-1}] proposing REMOVE_REPLICA((n4,s4):2): updated=[(n1,s1):1 (n3,s3):4 (n2,s2):3] next=5
W190121 17:42:00.927435 4066 storage/store.go:3744  [n1,s1,r29/1:/Table/6{0-1}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):2: store 4 was not found
W190121 17:42:00.927448 4139 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
W190121 17:42:00.931291 3964 storage/store.go:3744  [n1,s1,r29/1:/Table/6{0-1}] raft error: node 4 claims to not contain store 4 for replica (n4,s4):2: store 4 was not found
W190121 17:42:00.931302 4211 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 4: store 4 was not found:
I190121 17:42:00.931497 4212 storage/replica_command.go:809  [n1,replicate,s1,r7/1:/Table/1{1-2}] change replicas (REMOVE_REPLICA (n4,s4):3): read existing descriptor r7:/Table/1{1-2} [(n1,s1):1, (n2,s2):2, (n4,s4):3, (n3,s3):4, next=5, gen=0]
I190121 17:42:00.942116 4212 storage/replica_raft.go:372  [n1,s1,r7/1:/Table/1{1-2}] proposing REMOVE_REPLICA((n4,s4):3): updated=[(n1,s1):1 (n2,s2):2 (n3,s3):4] next=5
I190121 17:42:00.951292 4029 storage/replica_command.go:809  [n1,replicate,s1,r10/1:/Table/1{4-5}] change replicas (REMOVE_REPLICA (n4,s4):2): read existing descriptor r10:/Table/1{4-5} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=0]
I190121 17:42:00.960100 4029 storage/replica_raft.go:372  [n1,s1,r10/1:/Table/1{4-5}] proposing REMOVE_REPLICA((n4,s4):2): updated=[(n1,s1):1 (n2,s2):4 (n3,s3):3] next=5
I190121 17:42:01.622801 4201 storage/replica_raftstorage.go:805  [n1,s1,r18/?:{-}] applying preemptive snapshot at index 35 (id=d27cb678, encoded size=4819, 1 rocksdb batches, 25 log entries)
I190121 17:42:01.629897 4201 storage/replica_raftstorage.go:811  [n1,s1,r18/?:/Table/2{2-3}] applied preemptive snapshot in 7ms [clear=0ms batch=0ms entries=0ms commit=7ms]
I190121 17:42:01.779506 4275 storage/replica_raftstorage.go:805  [n1,s1,r28/?:{-}] applying preemptive snapshot at index 38 (id=33661f19, encoded size=6155, 1 rocksdb batches, 28 log entries)
I190121 17:42:01.781267 4275 storage/replica_raftstorage.go:811  [n1,s1,r28/?:/Table/{59-60}] applied preemptive snapshot in 2ms [clear=0ms batch=0ms entries=0ms commit=2ms]
I190121 17:42:02.314010 4258 storage/replica_raftstorage.go:805  [n1,s1,r22/?:{-}] applying preemptive snapshot at index 37 (id=2da6c207, encoded size=6175, 1 rocksdb batches, 27 log entries)
I190121 17:42:02.315658 4258 storage/replica_raftstorage.go:811  [n1,s1,r22/?:/Table/5{3-4}] applied preemptive snapshot in 2ms [clear=0ms batch=0ms entries=0ms commit=1ms]
I190121 17:42:10.381851 469 gossip/gossip.go:557  [n1] gossip status (ok, 3 nodes)
gossip client (0/3 cur/max conns)
gossip server (1/3 cur/max conns, infos 946/322 sent/received, bytes 341976B/85157B sent/received)
  2: localhost:26002 (2m0s)

@knz
Contributor

knz commented Jan 21, 2019

I'm currently checking the query latency during the unavailability period.

@knz
Contributor

knz commented Jan 21, 2019

I verified (using a concurrent SQL client) that the query latencies never exceed ~1ms.

@tbg
Member

tbg commented Jan 21, 2019

Why is this dangerous?

I don't know that the Raft traffic is dangerous (it's not), but it irks me more generally that our RPC layer will hand out connections to the wrong nodes. That is bound to bite us at some point and seems unnecessary since a NodeDialer always knows which node to expect, and we have RPC handshakes that can determine the NodeID before handing out connections to clients.
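A sketch of the kind of NodeID check I mean (hypothetical handshake shape, not the real rpc package): the dialer knows which node it expects, and the handshake response reports who actually answered, so a connection to the "wrong" node can be rejected before it is handed out.

// Sketch: verify the remote node's identity before handing out the connection.
package main

import "fmt"

type NodeID int32

// handshakeResponse stands in for the heartbeat response that reports the
// remote node's identity.
type handshakeResponse struct {
	NodeID NodeID
}

func dialNode(expected NodeID, handshake func() handshakeResponse) error {
	resp := handshake()
	if resp.NodeID != expected {
		return fmt.Errorf("dialed address expecting n%d but reached n%d; refusing connection",
			expected, resp.NodeID)
	}
	return nil
}

func main() {
	// The address we dialed for n1 is now served by n2.
	err := dialNode(1, func() handshakeResponse { return handshakeResponse{NodeID: 2} })
	fmt.Println(err)
}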

@petermattis
Collaborator Author

I don't know that the Raft traffic is dangerous (it's not), but it irks me more generally that our RPC layer will hand out connections to the wrong nodes. That is bound to bite us at some point and seems unnecessary since a NodeDialer always knows which node to expect, and we have RPC handshakes that can determine the NodeID before handing out connections to clients.

Irks me too, though I'm not seeing an actual current problem. I'll file a separate issue about this.

@tbg
Member

tbg commented Jan 22, 2019

Revisiting #10266 in the context of #34155, I think the Raft transport badness may just have been a red herring rather than a cause of the problem. As long as a node has ranges with replicas on the down node, Raft messages will be sent to it, and they will end up at the wrong node and log these errors (but not do anything else).

I suspect the real problem here was that Gossip is also used to look up addresses for RPCs. Ideally we wouldn't be sending RPCs to a down node, but sometimes we have to try nodes blindly (to discover the leaseholder) or we have it cached (as the previous leaseholder). I suspect we just weren't gracefully handling the StoreNotFoundError that would presumably follow from sending an RPC to the wrong node, and returned it to clients (and eventually SQL).

If this is true, then I'm (even) more convinced that #34155 is a good change. We handle StoreNotFoundError in DistSender already:

case *roachpb.StoreNotFoundError, *roachpb.NodeUnavailableError:

The code has changed dramatically since #10266:

// Extract the detail so it can be included in the error
// message if this is our last replica.
//
// TODO(bdarnell): The last error is not necessarily the best
// one to return; we may want to remember the "best" error
// we've seen (for example, a NotLeaseHolderError conveys more
// information than a RangeNotFound).
err = call.Reply.Error.GoError()

In particular, StoreNotFound is not handled explicitly, even though we retry on all errors. But we also had poor eviction policies for the cache. I wouldn't be surprised if that's the culprit.

We should do some additional testing in this area, and we'll definitely want to silence (rate limit) these errors in the Raft transport. I'd also prefer to do #34158 sooner rather than later so that we don't establish byzantine connections in the first place.

@petermattis petermattis added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1 High impact: many users impacted, serious risk of high unavailability or data loss and removed C-investigation Further steps needed to qualify. C-label will change. labels Jan 23, 2019
craig bot pushed a commit that referenced this issue Jan 23, 2019
34155: gossip: avoid removing nodes that get a new address r=knz a=knz

Fixes #34120.

K8s deployments make it possible for a node to get restarted using an
address previously attributed to another node, *while the other node
is still alive* (for example, a re-shuffling of node addresses during
a rolling restart).

Prior to this patch, the gossip code was assuming that if a node
starts with an address previously attributed to another node, that
other node must be dead, and thus was (incorrectly) *erasing* that
node's entry, thereby removing it from the cluster.

This scenario can be reproduced like this:

- start 4 nodes n1-n4
- stop n3 and n4
- restart n3 with n4's address

Prior to this patch, this scenario would yield "n4 removed from the
cluster" in other nodes, and n3 was not restarting properly. With the
patch, there is a period of time (until
`server.time_until_store_dead`) during which Raft is confused to not
find n4 at n3's address, but where the cluster otherwise operates
normally. After the store times out, n4 is properly marked as down and
the log spam stops.

Release note (bug fix): CockroachDB now enables re-starting a node at
an address previously allocated for another node.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig craig bot closed this as completed in #34155 Jan 23, 2019