
harden `ClusterShardingLeavingSpec` #5164

Conversation

@Aaronontheweb (Member)

adding debug logging to `ClusterShardingLeavingSpec`
@Aaronontheweb (Member Author)

Capturing more logs here, but I want to catch this when something fails, since a rebalanced actor is ending up where it shouldn't be.

@Aaronontheweb (Member Author)

```
[WARNING][9/6/2021 11:32:20 PM][Thread 0028][akka://PersistentClusterShardingLeavingSpec/system/sharding/EntityCoordinator/singleton/coordinator] DeadLetter from [akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity#1991384042] to [akka://PersistentClusterShardingLeavingSpec/system/sharding/EntityCoordinator/singleton/coordinator]: <Received dead letter from [akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity#1991384042]: GracefulShutdownRequest([akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity#1991384042])>
```

This might have something to do with it. Working on fixing the logging on the sender side so I can see what's going on.
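
For context, `GracefulShutdownRequest` is what a shard region sends to the coordinator after being told to shut down gracefully. A minimal sketch of how that handoff is normally kicked off from user code (the `"Entity"` type name matches this spec; the helper class itself is illustrative):

```csharp
using Akka.Actor;
using Akka.Cluster.Sharding;

public static class ShardRegionShutdown
{
    // Ask the local "Entity" shard region to hand off all of its shards.
    // GracefulShutdown makes the region send GracefulShutdownRequest to the
    // coordinator - the message that was dead-lettered in the log above.
    public static void LeaveGracefully(ActorSystem system)
    {
        IActorRef region = ClusterSharding.Get(system).ShardRegion("Entity");
        region.Tell(GracefulShutdown.Instance);
    }
}
```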

@Aaronontheweb (Member Author) commented Sep 7, 2021

Looks like the handoff completed successfully from the POV of the exiting node (node 1):

```
[DEBUG][9/6/2021 11:32:20 PM][Thread 0023][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity] Entity: Shard [2] handoff complete
[DEBUG][9/6/2021 11:32:20 PM][Thread 0023][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity] Entity: Shard [5] handoff complete
[DEBUG][9/6/2021 11:32:20 PM][Thread 0023][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity] Entity: Shard [8] handoff complete
[DEBUG][9/6/2021 11:32:20 PM][Thread 0023][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity] Entity: Completed graceful shutdown of region.
```

@Aaronontheweb (Member Author) commented Sep 7, 2021

Looks like the hand-off was also processed successfully on the new coordinator's side, on node 2:

```
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$a] BeginHandOffAck for shard [5] received from [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity#1991384042]].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$a] Remaining shard regions: 1
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$a] BeginHandOffAck for shard [5] received from [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50667/system/sharding/Entity#825360255]].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$a] All shard regions acked, handing off shard [5].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$b] BeginHandOffAck for shard [2] received from [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity#1991384042]].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$b] Remaining shard regions: 1
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$b] BeginHandOffAck for shard [2] received from [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50667/system/sharding/Entity#825360255]].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$b] All shard regions acked, handing off shard [2].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$c] BeginHandOffAck for shard [8] received from [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50663/system/sharding/Entity#1991384042]].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$c] Remaining shard regions: 1
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$c] BeginHandOffAck for shard [8] received from [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50667/system/sharding/Entity#825360255]].
[DEBUG][9/6/2021 11:32:20 PM][Thread 0026][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator/$c] All shard regions acked, handing off shard [8].
```

@Aaronontheweb (Member Author)

Seeing lots of these on node 2, though; it looks like the Persistence journal implementation might be at fault here:

```
[DEBUG][9/6/2021 11:32:20 PM][Thread 0032][LocalActorRefProvider(akka://PersistentClusterShardingLeavingSpec)] Resolve of path sequence [/temp/z] failed
[DEBUG][9/6/2021 11:32:20 PM][Thread 0028][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50661/system/sharding/EntityCoordinator/singleton/coordinator] Shard [2] deallocated after
[WARNING][9/6/2021 11:32:20 PM][Thread 0030][akka://PersistentClusterShardingLeavingSpec/temp/z] DeadLetter from [akka://PersistentClusterShardingLeavingSpec/deadLetters] to [akka://PersistentClusterShardingLeavingSpec/temp/z]: <Received dead letter from [akka://PersistentClusterShardingLeavingSpec/deadLetters]: WriteMessageSuccess<actorInstanceId: 1, message: Persistent<pid: /system/sharding/EntityCoordinator/singleton/coordinator, seqNr: 17, deleted: False, manifest: , sender: , payload: ShardHomeDeallocated(2), writerGuid: 21cd5ed1-2814-42c6-9f65-27b389a2d357>>>
```

@Aaronontheweb (Member Author)

There's a real bug here: a shard homed on a non-leaving node is being rebalanced, when only the shards belonging to the exiting node should be redistributed. Not all shards are "equal" when considered for rebalancing.
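
To make that rule concrete, here is a hypothetical decorator over `IShardAllocationStrategy` that enforces it by filtering the rebalance set down to shards hosted on leaving regions. The coordinator handles this internally; `LeavingFilterStrategy` and its `leavingAddresses` input are purely illustrative, not part of Akka.NET:

```csharp
using System.Collections.Immutable;
using System.Linq;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster.Sharding;

// Hypothetical decorator: only shards hosted on leaving regions may move.
public sealed class LeavingFilterStrategy : IShardAllocationStrategy
{
    private readonly IShardAllocationStrategy _inner;
    private readonly IImmutableSet<Address> _leavingAddresses; // assumed input

    public LeavingFilterStrategy(IShardAllocationStrategy inner, IImmutableSet<Address> leavingAddresses)
    {
        _inner = inner;
        _leavingAddresses = leavingAddresses;
    }

    public Task<IActorRef> AllocateShard(IActorRef requester, string shardId,
        IImmutableDictionary<IActorRef, IImmutableList<string>> currentShardAllocations)
        => _inner.AllocateShard(requester, shardId, currentShardAllocations);

    public async Task<IImmutableSet<string>> Rebalance(
        IImmutableDictionary<IActorRef, IImmutableList<string>> currentShardAllocations,
        IImmutableSet<string> rebalanceInProgress)
    {
        var candidates = await _inner.Rebalance(currentShardAllocations, rebalanceInProgress);

        // Collect the shards hosted on regions whose node is leaving; these
        // are the only legal rebalance candidates.
        var shardsOnLeavingRegions = currentShardAllocations
            .Where(kv => _leavingAddresses.Contains(kv.Key.Path.Address))
            .SelectMany(kv => kv.Value)
            .ToImmutableHashSet();

        return candidates.Intersect(shardsOnLeavingRegions);
    }
}
```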

@Aaronontheweb (Member Author)

```
[DEBUG][9/7/2021 11:07:19 AM][Thread 0028][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50774/system/sharding/EntityCoordinator/singleton/coordinator] Shard [1] deallocation completed successfully.
[DEBUG][9/7/2021 11:07:19 AM][Thread 0028][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50774/system/sharding/EntityCoordinator/singleton/coordinator] Shard [1] deallocated after
[DEBUG][9/7/2021 11:07:19 AM][Thread 0027][akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50774/system/sharding/EntityCoordinator/singleton/coordinator] Shard [1] allocated at [[akka.tcp://PersistentClusterShardingLeavingSpec@localhost:50780/system/sharding/Entity#730072660]]
```

This happens as the coordinator on node 1 is leaving. Real bug.

@Aaronontheweb (Member Author)

Actually, I take that back - this happens BEFORE node 1 leaves, but after the Locations state is recorded. The rebalance occurs during the initial distribution of shards because node 4 has nothing allocated on it yet. So the issue is that the earlier state captured by the test is wrong. It's not a bug in sharding - it's a problem with the test, and I think I can fix it.
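
For intuition: the classic `LeastShardAllocationStrategy` rebalances whenever the gap between the busiest and emptiest region reaches its rebalance threshold, so a freshly joined, empty node 4 is an immediate rebalance target. A simplified sketch of that trigger (the shard counts and threshold below are invented for the demo, not taken from the spec):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RebalanceDemo
{
    // Simplified form of LeastShardAllocationStrategy's trigger:
    // rebalance when (most-loaded - least-loaded) >= threshold.
    public static bool ShouldRebalance(IReadOnlyDictionary<string, int> shardsPerRegion, int threshold)
        => shardsPerRegion.Values.Max() - shardsPerRegion.Values.Min() >= threshold;

    public static void Main()
    {
        // Nodes 1-3 already host shards; node 4 just joined with none.
        var allocations = new Dictionary<string, int>
        {
            ["node1"] = 3, ["node2"] = 3, ["node3"] = 3, ["node4"] = 0
        };

        // With an illustrative threshold of 1, the empty node makes the
        // coordinator move shards immediately - even off non-leaving nodes.
        Console.WriteLine(ShouldRebalance(allocations, threshold: 1)); // True
    }
}
```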

```diff
@@ -193,15 +193,7 @@ private void Join(RoleName from, RoleName to)
{
    Cluster.Join(Node(to).Address);
    StartSharding();
    Within(TimeSpan.FromSeconds(15), () =>
```
@Aaronontheweb (Member Author)

This should fix it - the problem was that the fourth node never got to start its sharding system until the very end, so it was possible that a rebalance could happen after the ShardLocations snapshot was taken - which would corrupt the state of the test.

Better to let the entire cluster form all at once, so the shard distribution can happen concurrently as the cluster forms, rather than staggering it. This way the sharding system has to do all of its redistribution once, rather than three times (once for each join + barrier in the previous code).
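
Roughly, the reworked pattern looks like the sketch below - a sketch of the approach, not the exact diff. It assumes the spec's existing `StartSharding()` and the multi-node testkit members (`RunOn`, `EnterBarrier`, `Within`, `AwaitAssert`, `Cluster`, `Node`); `AwaitClusterUp` is a hypothetical helper name standing in for the spec's actual wait:

```csharp
private void Join(RoleName from, RoleName to)
{
    RunOn(() =>
    {
        Cluster.Join(Node(to).Address);
        StartSharding();
        // No per-join Within/AwaitAssert here: waiting for each node to
        // come Up one at a time is what staggered the shard distribution.
    }, from);
    EnterBarrier(from.Name + "-joined");
}

private void AwaitClusterUp(int expectedMembers)
{
    // One wait for the fully formed cluster, after all joins have fired.
    Within(TimeSpan.FromSeconds(15), () =>
    {
        AwaitAssert(() =>
        {
            var members = Cluster.State.Members;
            if (members.Count != expectedMembers ||
                members.Any(m => m.Status != MemberStatus.Up))
                throw new InvalidOperationException("cluster not fully up yet");
        });
    });
}
```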

Aaronontheweb marked this pull request as ready for review September 7, 2021 13:11
@Aaronontheweb (Member Author)

Ok, I think the issues with this spec should be cleared up going forward - this was by far the raciest spec in the entire test suite.

Aaronontheweb enabled auto-merge (squash) September 7, 2021 13:12
Aaronontheweb merged commit 2a3390c into akkadotnet:dev Sep 7, 2021
Aaronontheweb deleted the fix/MNTR-ClusterShardingLeavingSpec-racy branch September 7, 2021 13:40