I'm testing memberlist in a ~4500-node cluster and seem to have found a nontrivial issue: during mass node restarts, some nodes don't reset their "incarnation" numbers to correct (high enough) values, which leaves some nodes believing that other nodes are dead for a long time while they actually aren't.
As I understand it, the scenario is the following:
You begin to restart the whole cluster
During the restart, node 1 happens to mark node 2 as dead at a large incarnation (say 100) and gossips it (deadNode 2, incarnation=100)
Node 2 restarts and does its initial sync with node 1, but node 1 has already restarted too and doesn't know about node 2 at all, so node 2 keeps its incarnation at 1
Node 3 restarts and happens to first receive the new metadata of the restarted node 2 with incarnation 1, and only AFTER that the deadNode gossip message for node 2 with incarnation 100
Node 3 tries to re-gossip this deadNode 2/100 message, but again it happens to resend it to nodes 4, 5 and 6, which are either restarting at that very moment (and so don't forward the gossip any further), OR have just restarted and don't know about node 2 at all (and thus ignore the deadNode 2 message)
Either way, the deadNode 2/100 message never reaches the newly restarted node 2, so node 2 never refutes it and never raises its incarnation number to 101
We end up with node 3 thinking that node 2 is dead while it actually isn't (the sketch below reproduces this)
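To make the race concrete, here is a minimal, self-contained Go sketch of the SWIM-style "higher incarnation wins" rules, seen from node 3's point of view. The types and functions are hypothetical simplifications, not memberlist's actual API, but they follow the same ordering rules (alive needs a strictly newer incarnation, dead wins on equal-or-newer, unknown dead nodes are dropped):

```go
package main

import "fmt"

// memberState is a hypothetical, simplified model of one peer's view of a
// member; it is not memberlist's actual data structure.
type memberState struct {
	incarnation uint32
	dead        bool
}

type membership map[string]*memberState

// handleAlive: an alive message wins only with a strictly newer incarnation
// (or when the member is unknown to us).
func (m membership) handleAlive(name string, inc uint32) {
	s, ok := m[name]
	if !ok {
		m[name] = &memberState{incarnation: inc}
		return
	}
	if inc > s.incarnation {
		s.incarnation = inc
		s.dead = false
	}
}

// handleDead: a dead message wins with an equal-or-newer incarnation, and is
// simply dropped when the member is unknown (so it is not re-gossiped).
func (m membership) handleDead(name string, inc uint32) {
	s, ok := m[name]
	if !ok {
		return
	}
	if inc >= s.incarnation {
		s.incarnation = inc
		s.dead = true
	}
}

func main() {
	node3 := membership{}

	// Node 3 first learns about the freshly restarted node 2 (incarnation 1),
	// then receives the stale deadNode message carrying incarnation 100.
	node3.handleAlive("node2", 1)
	node3.handleDead("node2", 100)

	// Node 2 never sees the deadNode 2/100 message, so it never refutes it by
	// re-announcing itself with incarnation 101; node 3 stays stuck like this:
	fmt.Printf("node2 as seen by node3: %+v\n", *node3["node2"])
	// prints: node2 as seen by node3: {incarnation:100 dead:true}
}
```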
I can't guarantee that this is the exact case I observed (due to the lack of logging in this library), but I'm almost sure it's at least very close.
The general idea is that the "incarnation" mechanism provides no guarantee during mass restarts.
My ideas about how it could be fixed:
Add retries/rechecks of dead nodes instead of relying on a single "deadNode" message being gossiped once and then refuted.
Use system time instead of the incarnation number (a sketch follows this list).
Mark a node as alive when it pings you.
Request an update from the given node if you receive an "aliveNode" message from it with a smaller incarnation number than the one you already know.
Re-broadcast deadNode messages even if we don't know about the given node at all.
At least add logging for all "deadNode" / "suspectNode" messages and for all …
A combination of all previous variants. :-)
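As a rough illustration of the second idea above (this is only a sketch of how it could look, not an existing memberlist option), a restarted node could seed its initial incarnation from the wall clock, so it always comes back above any incarnation it could have gossiped before the restart:

```go
package main

import (
	"fmt"
	"time"
)

// initialIncarnation derives the starting incarnation from wall-clock time,
// so a restarted process begins above anything it announced before the
// restart (as long as the clock doesn't jump backwards).
func initialIncarnation() uint32 {
	// An arbitrary recent cut-off keeps the value well inside uint32 range
	// for many years; the constant is purely illustrative.
	const epoch = 1700000000 // seconds since the Unix epoch, ~Nov 2023
	return uint32(time.Now().Unix() - epoch)
}

func main() {
	fmt.Println("initial incarnation:", initialIncarnation())
}
```

Refutations would still bump the number the usual way; the clock would only be used to pick a starting point that dominates stale deadNode messages left over from before the restart.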
Generally, it seems there's plenty of room for improvement in this library, but of course that requires active maintenance. Do you still maintain the library?