gossip: unexpected "n%d has been removed from the cluster" errors #34120
Perusing logs that the customer provided. On one node I see:
Looking at that time frame in the
This is super suspicious. A node should never remove itself from gossip like that. |
I've reproduced the
Fairly quickly you'll see the … I'm not sure if that means that … I believe the problem is coming from this code in
This code is fairly old, arriving 2 years ago in #10544. |
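For reference, a toy sketch of the kind of address-conflict removal being described (hypothetical names and types, not the actual pkg/gossip code): when a descriptor arrives whose address is already claimed by a different node ID, the older descriptor is deleted outright.

package main

import "fmt"

type nodeDescriptor struct {
	nodeID  int32
	address string
}

// infoStore is a stand-in for the gossiped node-descriptor cache.
type infoStore struct {
	nodeDescs map[int32]nodeDescriptor
}

// addNodeDescriptor gossips desc and deletes any *other* descriptor that
// currently claims the same address (the unconditional "nuke" under
// discussion). If that other node is still alive and merely moved to a new
// address, it is wrongly removed from the cluster view.
func (is *infoStore) addNodeDescriptor(desc nodeDescriptor) {
	for id, other := range is.nodeDescs {
		if id != desc.nodeID && other.address == desc.address {
			fmt.Printf("n%d has been removed from the cluster\n", id)
			delete(is.nodeDescs, id)
		}
	}
	is.nodeDescs[desc.nodeID] = desc
}

func main() {
	is := &infoStore{nodeDescs: map[int32]nodeDescriptor{
		3: {3, "localhost:26003"},
		4: {4, "localhost:26004"},
	}}
	// n3 restarts on the address n4 used to have: n4's descriptor is nuked.
	is.addNodeDescriptor(nodeDescriptor{3, "localhost:26004"})
	fmt.Println(is.nodeDescs) // n4 is gone even though it may still be alive
}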
Something I forgot to mention in the original message: the customer is running on Kubernetes. The log files in the debug zip provide evidence that hostnames are moving around the nodes in the cluster when the nodes are rebooted. For example, the hostname |
Unsurprisingly, this is a problem on master too. |
I think this is effectively creating an asymmetric network partition. The gossiped node descriptors are used to locate the address for a node. If the node descriptor is deleted from gossip, other nodes in the system won't be able to talk to the node, though the node will be able to talk to them. This likely will cause all sorts of badness in the system. |
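A tiny illustration of that asymmetry (toy model with made-up names): each node resolves peers through its own copy of the gossiped descriptors, so once n4's descriptor is gone, peers can no longer find an address for n4, while n4 still has everyone else's descriptors and can keep dialing out.

package main

import "fmt"

func resolve(nodeDescs map[int32]string, id int32) (string, bool) {
	addr, ok := nodeDescs[id]
	return addr, ok
}

func main() {
	// View on n1 after the conflict check deleted n4's descriptor:
	onN1 := map[int32]string{1: "localhost:26001", 2: "localhost:26002"}
	// n4 still holds descriptors for the other nodes:
	onN4 := map[int32]string{1: "localhost:26001", 2: "localhost:26002"}

	if _, ok := resolve(onN1, 4); !ok {
		fmt.Println("n1 has no address for n4: inbound traffic to n4 stops")
	}
	if addr, ok := resolve(onN4, 1); ok {
		fmt.Println("n4 can still dial n1 at", addr, "so outbound traffic keeps working")
	}
}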
The node address conflict detection code exists for a good reason. See #10266. I might try reproducing that original problem and see if there is a different approach to remedy it. Another option is more of a hack: whenever a node receives an info which deletes its node descriptor, it can re-gossip its node descriptor. I need to think about this more. |
I have to look at this in more detail, but the "deleting from Gossip" code has always bothered me. Why exactly was #10266 such a big problem? A node (say n1) gets wiped, another node (say n2) takes its address, so the cluster may try to connect to n2 when it wants n1. But n1 is gone, so how can cluster health depend on connections to it? The real problem (though I didn't see this mentioned as such in #10266) is that it can actually be dangerous to send stuff to n2 when it was meant for n1 (unintentionally byzantine?). We never look up addresses by nodeID, it's always the other way around. So, in theory, I tend to think that leaving the old entry in Gossip should work fine, though I would add a NodeID check -- the connection that is supposedly connecting to nX should actually be confirmed as connecting to nX.

In this issue, the problem becomes that we're removing a nodeID that is actually still active (albeit under a different address), and we unconditionally nuke (overwrite) it, right? That should also be solved if we removed the nuking (which seems a little too unconditional -- it's not even attempting to check which descriptor it overrides; that desc wouldn't have the bootstrapAddr any more in this case, though fixing that will only "mostly" fix the issue).
One day I'll polish #33299 and then you will know. |
I don't know. I'm going to spend some time reproducing that scenario and see what badness emerges. It is possible that we've since fixed something else that was causing problems in this area. 2 years ago was pre-1.0. |
Eh? We look up the address by nodeID all the time. See
No, I just wrote the opposite of what I was trying to say. We always look up NodeID -> Address. Sorry about the confusion. Does the rest check out? If a node is gone, nobody should care about what we store under its NodeID, though any goroutines not knowing that node X is dead shouldn't accidentally talk to node Y thinking they've got node X. |
I was able to repro this with 4 nodes. Here's a script that repros the problem all the time in under 2 minutes:
#! /bin/sh
C=$PWD/cockroach
d=$PWD/repro-34120
killall -9 cockroach cockroachshort
rm -rf "$d"
mkdir -p "$d"
cd "$d"
# Start cluster
for n in 1 2 3 4; do
(
mkdir -p d$n
cd d$n
$C start --background --listen-addr=localhost:2600$n --http-addr=localhost:2700$n --insecure --join=localhost:26001
)
done
# Init cluster
$C init --host localhost:26001 --insecure
# Create some ranges
(
for i in `seq 1 10`; do
echo "create table t$i(x int); insert into t$i(x) values(1);"
done
) | $C sql --insecure --host localhost:26001
# Wait for cluster to stabilize
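# (The check below leans on the node's periodic 'runtime stats' log line:
# once the two most recent log lines are both runtime-stats lines, nothing
# else has been logged recently, so the cluster is treated as settled.)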
tail -f d1/cockroach-data/logs/cockroach.log &
tailpid=$!
while true; do
sleep 2
lastlines=$(tail -n 2 d1/cockroach-data/logs/cockroach.log | grep 'runtime stats' | wc -l)
if test $lastlines -lt 2; then
continue
fi
break
done
kill -9 $tailpid
echo "OK"
# Stop the last two nodes.
$C quit --host localhost:26003 --insecure
$C quit --host localhost:26004 --insecure
# Restart with node shift
(
cd d3
$C start --background --listen-addr=localhost:26004 --http-addr=localhost:27004 --insecure
)
# Start a load on all tables
(
while true; do
sleep 1
for i in `seq 1 10`; do
echo "select * from t$i limit 1"
done
done
) | $C sql --insecure --host localhost:26001 &>sql.log &
exec tail -f d1/cockroach-data/logs/cockroach.log
|
Yes, I think the rest checks out. In this issue, the old node still exists, but we've nuked its address record. In #10266, the old node no longer existed. I'm not sure why that was problematic. I'm not able to reproduce the problem that was seen there.
Why is this dangerous? |
@knz Your bash skills are greater than mine. For posterity, here is what I was using to reproduce:
|
And here is a script to try and reproduce the problem seen in #10266:
|
Removing the offending check triggers the following log spam until the removed node/store is detected to be dead:
I guess it makes sense? Not sure whether it matters. |
Ok so it's worse than this. The spam above continues far beyond the store death timeout. In fact, the old node never becomes dead and both the restarted node and the dead node are disconnected. I think when the node starts it needs to refresh its record in KV? Will investigate. |
We could make an effort to mitigate that spam. We could use that error as a signal to clear out the node addressing info in |
Can you elaborate on this? What do you mean that the restarted node becomes disconnected?
FYI, the node addressing is all based on gossiped info. KV isn't involved. |
So remember I am starting 4 nodes. In this experiment I have the following map:
Here's the gossip map after that from the perspective of
So the two nodes …
And also the log spam does not stop, which means that the remaining nodes (n1/n4) are still attempting to talk to the remaining store(s) via a wrong address, in this instance …
If I look at the log file from
Then if I look at the log file from
|
Ok, false alarm; the problem only lasts until … After the store is detected dead, the cluster appears to work fine. On the restarted node:
On the other nodes in the cluster:
|
I'm currently checking the query latency during the unavailability period. |
I verified (using a concurrent SQL client) that the query latencies never exceed ~1ms. |
I don't know that the Raft traffic is dangerous (it's not), but it irks me more generally that our RPC layer will hand out connections to the wrong nodes. That is bound to bite us at some point and seems unnecessary since a NodeDialer always knows which node to expect, and we have RPC handshakes that can determine the NodeID before handing out connections to clients. |
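A minimal sketch of the kind of check being suggested (made-up names, not the actual rpc or nodedialer code): the dialer already knows which node ID it expects, so a handshake that reports the remote node's ID lets it refuse a connection that landed on the wrong node.

package main

import "fmt"

// handshakeResult is what the remote server would report about itself during
// connection setup (hypothetical).
type handshakeResult struct {
	nodeID int32
}

// dialNode connects to addr and verifies the remote node ID before handing
// the connection out, so an address reused by a different node is rejected
// instead of silently serving traffic meant for someone else.
func dialNode(expected int32, addr string, handshake func(string) handshakeResult) error {
	res := handshake(addr)
	if res.nodeID != expected {
		return fmt.Errorf("dialed %s expecting n%d, but it identifies as n%d",
			addr, expected, res.nodeID)
	}
	return nil
}

func main() {
	// Simulate n3 now listening on the address previously used by n4.
	handshake := func(string) handshakeResult { return handshakeResult{nodeID: 3} }
	if err := dialNode(4, "localhost:26004", handshake); err != nil {
		fmt.Println(err)
	}
}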
Irks me too, though I'm not seeing an actual current problem. I'll file a separate issue about this. |
Revisiting #10266 in the context of #34155, I think the Raft transport badness may just have been a red herring rather than a cause of the badness. As long as a node has replicas in common with the down node, Raft messages will be sent to it, and they will end up at the wrong node and log these errors (but not do anything else). I suspect the real problem here was that Gossip is also used to look up addresses for RPCs. Ideally we wouldn't be sending RPCs to a down node, but sometimes we have to try nodes blindly (to discover the leaseholder) or we have it cached (as the previous leaseholder). I suspect we just weren't gracefully handling the StoreNotFoundError that would presumably follow from sending an RPC to the wrong node, and returned it to clients (and eventually SQL). If this is true, then I'm (even) more convinced that #34155 is a good change. We handle …: cockroach/pkg/kv/dist_sender.go, line 1408 in 079d40a
The code has changed dramatically since #10266: cockroach/pkg/kv/dist_sender.go, lines 1008 to 1015 in 33c94fa
In particular, … We should do some additional testing in this area, and we'll definitely want to silence (rate limit) these errors in the Raft transport. I'd also prefer to do #34158 sooner rather than later so that we don't establish byzantine connections in the first place. |
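Roughly the shape of the behavior described above, as a simplified sketch (not the real DistSender): an error from one replica is remembered and the next replica is tried, so a request that hits the wrong node is retried elsewhere instead of bubbling straight up to the client.

package main

import (
	"errors"
	"fmt"
)

// sendToReplicas tries each replica in order and only fails if all of them do.
func sendToReplicas(replicas []string, send func(string) error) error {
	var lastErr error
	for _, r := range replicas {
		if err := send(r); err != nil {
			// e.g. a store-not-found or connection error from a node that no
			// longer hosts the range: remember it and move on.
			lastErr = err
			continue
		}
		return nil // this replica served the request
	}
	return fmt.Errorf("all replicas failed, last error: %w", lastErr)
}

func main() {
	replicas := []string{"n4 (stale address)", "n1", "n2"}
	send := func(r string) error {
		if r == "n4 (stale address)" {
			return errors.New("store not found on dialed node")
		}
		return nil
	}
	fmt.Println(sendToReplicas(replicas, send)) // <nil>: n1 served the request
}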
34155: gossip: avoid removing nodes that get a new address r=knz a=knz

Fixes #34120.

K8s deployments make it possible for a node to get restarted using an address previously attributed to another node, *while the other node is still alive* (for example, a re-shuffling of node addresses during a rolling restart).

Prior to this patch, the gossip code was assuming that if a node starts with an address previously attributed to another node, that other node must be dead, and thus was (incorrectly) *erasing* that node's entry, thereby removing it from the cluster.

This scenario can be reproduced like this:
- start 4 nodes n1-n4
- stop n3 and n4
- restart n3 with n4's address

Prior to this patch, this scenario would yield "n4 removed from the cluster" in other nodes, and n3 was not restarting properly. With the patch, there is a period of time (until `server.time_until_store_dead`) during which Raft is confused to not find n4 at n3's address, but where the cluster otherwise operates normally. After the store times out, n4 is properly marked as down and the log spam stops.

Release note (bug fix): CockroachDB now enables re-starting a node at an address previously allocated for another node.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
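A hedged sketch of the direction of that change (made-up names; the actual diff is in #34155): on an address conflict, keep the other node's descriptor instead of erasing it, and let liveness (the server.time_until_store_dead timeout) decide whether it is really dead.

package main

import "fmt"

// mergeNodeAddress records id -> addr without erasing another node that
// previously claimed addr; if that node is truly gone, liveness will mark it
// dead, and if it is alive it will re-gossip its new address shortly.
func mergeNodeAddress(nodeDescs map[int32]string, id int32, addr string) {
	for other, otherAddr := range nodeDescs {
		if other != id && otherAddr == addr {
			// Old behavior: delete(nodeDescs, other), i.e. "nX has been
			// removed from the cluster". New direction: keep the entry.
			fmt.Printf("n%d and n%d both claim %s; keeping both descriptors\n",
				other, id, addr)
		}
	}
	nodeDescs[id] = addr
}

func main() {
	descs := map[int32]string{3: "localhost:26003", 4: "localhost:26004"}
	mergeNodeAddress(descs, 3, "localhost:26004") // n3 restarts on n4's old address
	fmt.Println(descs)                            // n4's entry survives
}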
A customer reports a test cluster getting wedged (no queries going through) during simulation of a multi-node outage. The scenario is a 15-node cluster, configured with 15-way replication (confirmed that every zone config has 15-way replication). During the test, a set of "ping" queries is run against each node in the cluster. These queries are a simple

SELECT * FROM <table> LIMIT 1

for each of the customer's tables (~10). 5 of the nodes are taken down and left down for 5 minutes so that the cluster reports them as dead. These 5 nodes are then brought back up. Fairly quickly, 2 of these restarted nodes start showing up in the logs of every node with the error in the title, for example "n7 has been removed from the cluster". Several minutes of waiting, and the "ping" queries suddenly stop returning. It appears to happen to one particular table.

It is unclear if the "has been removed from the cluster" errors are related to the wedge, but they are disconcerting as they absolutely shouldn't happen. The code which generates this error is Gossip.getNodeDescriptorLocked. The error is only returned if a node descriptor is present in the info store, but contains a zero node ID or an empty address. It is extremely curious that the node descriptor is not in the cached Gossip.nodeDescs map.

This problem was seen on 2.1.3 (to be precise, the problem was seen on a custom 2.1.X binary which happens to be the same SHA as what became 2.1.3). The cluster has a relatively modest amount of data (a little over 100 ranges).
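For context, a toy version of the check described above (made-up shape; the real code is Gossip.getNodeDescriptorLocked): the error in the title fires only when a descriptor is found in the info store but carries a zero node ID or an empty address.

package main

import "fmt"

type nodeDescriptor struct {
	nodeID  int32
	address string
}

// getNodeDescriptor mirrors the described behavior: a missing descriptor is
// a plain lookup failure, while a present-but-zeroed descriptor produces the
// "removed from the cluster" error.
func getNodeDescriptor(infoStore map[int32]nodeDescriptor, id int32) (nodeDescriptor, error) {
	desc, ok := infoStore[id]
	if !ok {
		return nodeDescriptor{}, fmt.Errorf("unable to look up descriptor for n%d", id)
	}
	if desc.nodeID == 0 || desc.address == "" {
		return nodeDescriptor{}, fmt.Errorf("n%d has been removed from the cluster", id)
	}
	return desc, nil
}

func main() {
	store := map[int32]nodeDescriptor{7: {}} // present but zero-valued
	if _, err := getNodeDescriptor(store, 7); err != nil {
		fmt.Println(err) // n7 has been removed from the cluster
	}
}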