
harden `ClusterShardingLeavingSpec` #5164

Conversation

@Aaronontheweb (Member)

adding debug logging to `ClusterShardingLeavingSpec`
@Aaronontheweb (Member Author)

Capturing more logs here, but I want to catch this when something fails, since a rebalanced actor is ending up where it shouldn't be.

@Aaronontheweb (Member Author)

```
[WARNING][9/6/2021 11:32:20 PM][Thread 0028][akka://PersistentClusterShardingLeavingSpec/system/sharding/EntityCoordinator/singleton/coordinator] DeadLetter from [akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity#1991384042] to [akka://PersistentClusterShardingLeavingSpec/system/sharding/EntityCoordinator/singleton/coordinator]: <Received dead letter from [akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity#1991384042]: GracefulShutdownRequest([akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity#1991384042])>
```

This might have something to do with it. Working on fixing the logging on the sender side so I can see what's going on.
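
For context, `GracefulShutdownRequest` is what a shard region sends to the coordinator after being told to shut down gracefully. A minimal sketch of how that handoff is normally kicked off from user code (the `"Entity"` type name matches this spec; the helper class itself is illustrative):

```csharp
using Akka.Actor;
using Akka.Cluster.Sharding;

public static class ShardRegionShutdown
{
    // Ask the local "Entity" shard region to hand off all of its shards.
    // GracefulShutdown makes the region send GracefulShutdownRequest to the
    // coordinator - the message that was dead-lettered in the log above.
    public static void LeaveGracefully(ActorSystem system)
    {
        IActorRef region = ClusterSharding.Get(system).ShardRegion("Entity");
        region.Tell(GracefulShutdown.Instance);
    }
}
```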

@Aaronontheweb (Member Author) commented Sep 7, 2021

Looks like the handoff completed successfully from the POV of the exiting node (node 1):

```
[DEBUG][9/6/2021 11:32:20 PM][Thread 0023][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity] Entity: Shard [2] handoff complete
[DEBUG][9/6/2021 11:32:20 PM][Thread 0023][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity] Entity: Shard [5] handoff complete
[DEBUG][9/6/2021 11:32:20 PM][Thread 0023][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity] Entity: Shard [8] handoff complete
[DEBUG][9/6/2021 11:32:20 PM][Thread 0023][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity] Entity: Completed graceful shutdown of region.
```

@Aaronontheweb (Member Author) commented Sep 7, 2021

Looks like the hand-off was also processed successfully on the new coordinator's side, on node 2:

```
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$a] BeginHandOffAck for shard [5] received from [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity#1991384042]].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$a] Remaining shard regions: 1
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$a] BeginHandOffAck for shard [5] received from [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50667/system/sharding/Entity#825360255]].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$a] All shard regions acked, handing off shard [5].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$b] BeginHandOffAck for shard [2] received from [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity#1991384042]].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$b] Remaining shard regions: 1
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$b] BeginHandOffAck for shard [2] received from [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50667/system/sharding/Entity#825360255]].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$b] All shard regions acked, handing off shard [2].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$c] BeginHandOffAck for shard [8] received from [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity#1991384042]].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$c] Remaining shard regions: 1
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$c] BeginHandOffAck for shard [8] received from [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50667/system/sharding/Entity#825360255]].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$c] All shard regions acked, handing off shard [8].
```

@Aaronontheweb (Member Author)

Seeing lots of these on node 2, though; it looks like the Persistence journal implementation might be at fault here:

```
[DEBUG][9/6/2021 11:32:20 PM][Thread 0032][LocalActorRefProvider(akka://PersistentClusterShardingLeavingSpec)] Resolve of path sequence [/temp/z] failed
[DEBUG][9/6/2021 11:32:20 PM][Thread 0028][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator] Shard [2] deallocated after
[WARNING][9/6/2021 11:32:20 PM][Thread 0030][akka://PersistentClusterShardingLeavingSpec/temp/z] DeadLetter from [akka://PersistentClusterShardingLeavingSpec/deadLetters] to [akka://PersistentClusterShardingLeavingSpec/temp/z]: <Received dead letter from [akka://PersistentClusterShardingLeavingSpec/deadLetters]: WriteMessageSuccess<actorInstanceId: 1, message: Persistent<pid: /system/sharding/EntityCoordinator/singleton/coordinator, seqNr: 17, deleted: False, manifest: , sender: , payload: ShardHomeDeallocated(2), writerGuid: 21cd5ed1-2814-42c6-9f65-27b389a2d357>>>
```

@Aaronontheweb (Member Author)

There's a real bug here: a shard homed on a non-leaving node is being rebalanced, when only the shards belonging to the exiting node should be redistributed. Not all shards are "equal" when considered for rebalancing.
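
To make that rule concrete, here is a hypothetical decorator over `IShardAllocationStrategy` that enforces it by filtering the rebalance set down to shards hosted on leaving regions. The coordinator handles this internally; `LeavingFilterStrategy` and its `leavingAddresses` input are purely illustrative, not part of Akka.NET:

```csharp
using System.Collections.Immutable;
using System.Linq;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster.Sharding;

// Hypothetical decorator: only shards hosted on leaving regions may move.
public sealed class LeavingFilterStrategy : IShardAllocationStrategy
{
    private readonly IShardAllocationStrategy _inner;
    private readonly IImmutableSet<Address> _leavingAddresses; // assumed input

    public LeavingFilterStrategy(IShardAllocationStrategy inner, IImmutableSet<Address> leavingAddresses)
    {
        _inner = inner;
        _leavingAddresses = leavingAddresses;
    }

    public Task<IActorRef> AllocateShard(IActorRef requester, string shardId,
        IImmutableDictionary<IActorRef, IImmutableList<string>> currentShardAllocations)
        => _inner.AllocateShard(requester, shardId, currentShardAllocations);

    public async Task<IImmutableSet<string>> Rebalance(
        IImmutableDictionary<IActorRef, IImmutableList<string>> currentShardAllocations,
        IImmutableSet<string> rebalanceInProgress)
    {
        var candidates = await _inner.Rebalance(currentShardAllocations, rebalanceInProgress);

        // Collect the shards hosted on regions whose node is leaving; these
        // are the only legal rebalance candidates.
        var shardsOnLeavingRegions = currentShardAllocations
            .Where(kv => _leavingAddresses.Contains(kv.Key.Path.Address))
            .SelectMany(kv => kv.Value)
            .ToImmutableHashSet();

        return candidates.Intersect(shardsOnLeavingRegions);
    }
}
```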

@Aaronontheweb (Member Author)

```
[DEBUG][9/7/2021 11:07:19 AM][Thread 0028][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50774/system/sharding/EntityCoordinator/singleton/coordinator] Shard [1] deallocation completed successfully.
[DEBUG][9/7/2021 11:07:19 AM][Thread 0028][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50774/system/sharding/EntityCoordinator/singleton/coordinator] Shard [1] deallocated after
[DEBUG][9/7/2021 11:07:19 AM][Thread 0027][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50774/system/sharding/EntityCoordinator/singleton/coordinator] Shard [1] allocated at [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50780/system/sharding/Entity#730072660]]
```

This happens as the coordinator on node 1 is leaving. Real bug.

@Aaronontheweb (Member Author)

Actually, I take that back - this happens BEFORE node 1 leaves, but after the Locations state is recorded. The rebalance occurs during the initial distribution of shards because node 4 has nothing allocated on it yet. So the issue is that the earlier state captured by the test is wrong. It's not a bug in sharding - it's a problem with the test, and I think I can fix it.
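
For intuition: the classic `LeastShardAllocationStrategy` rebalances whenever the gap between the busiest and emptiest region reaches its rebalance threshold, so a freshly joined, empty node 4 is an immediate rebalance target. A simplified sketch of that trigger (the shard counts and threshold below are invented for the demo, not taken from the spec):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RebalanceDemo
{
    // Simplified form of LeastShardAllocationStrategy's trigger:
    // rebalance when (most-loaded - least-loaded) >= threshold.
    public static bool ShouldRebalance(IReadOnlyDictionary<string, int> shardsPerRegion, int threshold)
        => shardsPerRegion.Values.Max() - shardsPerRegion.Values.Min() >= threshold;

    public static void Main()
    {
        // Nodes 1-3 already host shards; node 4 just joined with none.
        var allocations = new Dictionary<string, int>
        {
            ["node1"] = 3, ["node2"] = 3, ["node3"] = 3, ["node4"] = 0
        };

        // With an illustrative threshold of 1, the empty node makes the
        // coordinator move shards immediately - even off non-leaving nodes.
        Console.WriteLine(ShouldRebalance(allocations, threshold: 1)); // True
    }
}
```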

```diff
@@ -193,15 +193,7 @@ private void Join(RoleName from, RoleName to)
{
    Cluster.Join(Node(to).Address);
    StartSharding();
    Within(TimeSpan.FromSeconds(15), () =>
```
@Aaronontheweb (Member Author)

This should fix it - the problem was that the fourth node never got to start its sharding system until the very end, so it was possible that a rebalance could happen after the ShardLocations snapshot was taken - which would corrupt the state of the test.

Better to let the entire cluster form all at once, so the shard distribution can happen concurrently as the cluster forms, rather than staggering it. This way the sharding system has to do all of its redistribution once, rather than three times (once for each join + barrier in the previous code).
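
Roughly, the reworked pattern looks like the sketch below - a sketch of the approach, not the exact diff. It assumes the spec's existing `StartSharding()` and the multi-node testkit members (`RunOn`, `EnterBarrier`, `Within`, `AwaitAssert`, `Cluster`, `Node`); `AwaitClusterUp` is a hypothetical helper name standing in for the spec's actual wait:

```csharp
private void Join(RoleName from, RoleName to)
{
    RunOn(() =>
    {
        Cluster.Join(Node(to).Address);
        StartSharding();
        // No per-join Within/AwaitAssert here: waiting for each node to
        // come Up one at a time is what staggered the shard distribution.
    }, from);
    EnterBarrier(from.Name + "-joined");
}

private void AwaitClusterUp(int expectedMembers)
{
    // One wait for the fully formed cluster, after all joins have fired.
    Within(TimeSpan.FromSeconds(15), () =>
    {
        AwaitAssert(() =>
        {
            var members = Cluster.State.Members;
            if (members.Count != expectedMembers ||
                members.Any(m => m.Status != MemberStatus.Up))
                throw new InvalidOperationException("cluster not fully up yet");
        });
    });
}
```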

Aaronontheweb marked this pull request as ready for review September 7, 2021 13:11
@Aaronontheweb (Member Author)

Ok, I think the issues with this spec should be cleared up going forward - this was by far the raciest spec in the entire test suite.

Aaronontheweb enabled auto-merge (squash) September 7, 2021 13:12
Aaronontheweb merged commit 2a3390c into akkadotnet:dev Sep 7, 2021
Aaronontheweb deleted the fix/MNTR-ClusterShardingLeavingSpec-racy branch September 7, 2021 13:40