harden ClusterShardingLeavingSpec #5164
Conversation
adding debug logging to `ClusterShardingLeavingSpec`
Capturing more logs here, but I want to catch this when something fails, since a rebalanced actor is ending up where it shouldn't be.
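For context, a minimal sketch of the sort of logging change involved, assuming the spec's log level is raised via a HOCON override; the exact keys and wiring in the real spec may differ:

```csharp
using Akka.Configuration;

// Sketch only: raise the log level for the multi-node spec so shard handoff and
// rebalance messages show up in the captured output. In practice this would be
// combined with the spec's existing config via WithFallback.
var debugConfig = ConfigurationFactory.ParseString(@"
    akka.loglevel = DEBUG              # surface ShardRegion / ShardCoordinator debug logs
    akka.actor.debug.unhandled = on    # log any messages dropped during handoff
");
```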
This might have something to do with it. I'm working on fixing the logging here on the sender side so I can see what's going on.
Looks like the hand-off completed successfully from the POV of the exiting node (node 1):
Looks like the hand-off was also processed successfully at the call site of the new coordinator on node 2:
Seeing lots of these on node 2 though; it looks like the Persistence journal implementation might be at fault here:
There's a real bug here: a shard homed on a non-leaving node is being rebalanced, whereas only the shards belonging to the exiting node should be redistributed. Not all shards are "equal" when considered for rebalancing.
This happens as the coordinator on node 1 is leaving. Real bug.
Actually, I take that back: this happens BEFORE node 1 leaves, but after the …
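To state the invariant described above concretely, here is a hypothetical assertion sketch; the `before`/`after` dictionaries and the helper itself are illustrative, not the spec's actual code:

```csharp
using System.Collections.Generic;
using Akka.Actor;
using FluentAssertions;

// Hypothetical invariant check: after a node leaves, only the shards that were
// homed on the leaving node are allowed to change location.
static void AssertOnlyExitingNodesShardsMoved(
    IReadOnlyDictionary<string, Address> before,   // shard id -> host before the leave
    IReadOnlyDictionary<string, Address> after,    // shard id -> host after rebalance
    Address leavingNode)
{
    foreach (var kvp in before)
    {
        if (kvp.Value.Equals(leavingNode))
        {
            // Shards from the exiting node must be re-homed somewhere else.
            after[kvp.Key].Should().NotBe(leavingNode);
        }
        else
        {
            // Shards homed on surviving nodes must stay put.
            after[kvp.Key].Should().Be(kvp.Value);
        }
    }
}
```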
@@ -193,15 +193,7 @@ private void Join(RoleName from, RoleName to)
{
    Cluster.Join(Node(to).Address);
    StartSharding();
    Within(TimeSpan.FromSeconds(15), () =>
This should fix it. The problem was that the fourth node never got to start its sharding system until the very end, so it was possible for a rebalance to happen after the `ShardLocations` snapshot was taken, which would corrupt the state of the test.
Better to let the entire cluster form all at once so the shard distribution can happen concurrently as the cluster forms, rather than staggering it. This way the sharding system has to do all of its redistribution once, rather than 3 times (once for each join + barrier in the previous code).
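A rough sketch of the "let the entire cluster form at once" idea; the helper name, timeout, and exact wait condition are illustrative assumptions, not the PR's actual diff. The method is meant to live inside the spec class, and `nodes` is assumed to include the seed role itself:

```csharp
using System;
using System.Linq;
using Akka.Cluster;
using Akka.Remote.TestKit;
using FluentAssertions;

// Illustrative only: every node joins the same seed and starts sharding up front,
// then the spec waits for a single convergence point instead of doing
// join -> barrier -> join for each node in turn.
private void JoinAll(RoleName seed, params RoleName[] nodes)
{
    RunOn(() =>
    {
        Cluster.Join(Node(seed).Address);
        StartSharding();
    }, nodes);

    Within(TimeSpan.FromSeconds(30), () =>
    {
        // Wait until all members are Up before any shard locations are recorded.
        AwaitAssert(() =>
            Cluster.State.Members.Count(m => m.Status == MemberStatus.Up)
                .Should().Be(nodes.Length));
    });

    EnterBarrier("cluster-formed");
}
```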
OK, I think the issues with this spec should be cleared up going forward; this was by far the raciest spec in the entire test suite.