ddata replicator stops but doesn't look like it can be restarted easily #5129 #5143

Merged

Conversation

andyfurnival (Contributor):
Wrapped the DistributedData Replicator in a BackoffSupervisor, so temporary durable store failures don't require a full system restart to recover.

I didn't add this directly into Replicator.Props, as doing so changed the behaviour of many of the multi-node tests, and the trade-off of reworking the tests to support a supervisor didn't make a whole lot of sense.

I had considered putting the supervisor directly on the LmdbDurableStore; however, we would still run the risk of losing the Replicator with no means of recovery (even when the durable store isn't used).

Fixed a slightly annoying missing end bracket in ddata logs for GSet.

@Aaronontheweb (Member) left a comment:

See my comment about breaking changes during upgrades

@@ -191,7 +191,7 @@ public override string ToString()
{
var sb = new StringBuilder("GSet(");
sb.AppendJoin(", ", Elements);

sb.Append(')');
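
As a quick illustration of the fix (a sketch, not code from this PR; it assumes the GSet<T>.Empty/Add API), the ToString output now closes the parenthesis:

using System;
using Akka.DistributedData;

// Illustrative only: with the closing parenthesis appended, a GSet rendered in
// ddata logs reads "GSet(a, b)" instead of the unbalanced "GSet(a, b".
var set = GSet<string>.Empty.Add("a").Add("b");
Console.WriteLine(set); // prints: GSet(a, b)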
Member: LGTM

@@ -83,10 +85,23 @@ public DistributedData(ExtendedActorSystem system)
 else
 {
     var name = config.GetString("name", null);
-    Replicator = system.ActorOf(Akka.DistributedData.Replicator.Props(_settings), name);
+    Replicator = system.ActorOf(GetSupervisedReplicator(_settings, name), name+"Supervisor");
Member: Looks simple enough

Contributor: Just to be clear, the new HOCON flag should switch between these two implementations, depending on its value, correct?

Member: Correct

}
}

private static Props GetSupervisedReplicator(ReplicatorSettings settings, string name) => BackoffSupervisor.Props(
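
The Props call above is cut off in the rendered diff. A minimal sketch of what the BackoffSupervisor wrapping can look like is below; the backoff timings and class name are illustrative assumptions, not the PR's actual values:

using System;
using Akka.Actor;
using Akka.DistributedData;
using Akka.Pattern;

// Sketch only: wraps the Replicator in a BackoffSupervisor so that when the child
// stops (e.g. after a durable store failure) it is recreated after an exponentially
// increasing delay, instead of staying dead until the ActorSystem is restarted.
internal static class ReplicatorSupervisionSketch
{
    public static Props GetSupervisedReplicator(ReplicatorSettings settings, string name) =>
        BackoffSupervisor.Props(
            Replicator.Props(settings),   // child props to (re)create
            name,                         // child actor name
            TimeSpan.FromSeconds(3),      // min backoff (assumed value)
            TimeSpan.FromSeconds(30),     // max backoff (assumed value)
            0.2);                         // random jitter factor (assumed value)
}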
Member:

The impact this has on the actor path should not affect replication - HOWEVER, when introducing Akka.NET v1.4.22 into a cluster that is running v1.4.21, the replication system will break temporarily due to:

https://github.com/akkadotnet/akka.net/blob/dev/src/contrib/cluster/Akka.DistributedData/Replicator.cs#L1104
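
For context on why the path matters (a hedged reading of the linked line, not a quote of it): remote Replicator instances are addressed by rebuilding the local Replicator's own actor path against the remote node's address, roughly along these lines:

using Akka.Actor;

// Hedged illustration only - see the linked Replicator.cs line for the real code.
// If an upgraded node hosts its Replicator at /system/ddataReplicatorSupervisor/ddataReplicator
// (using the default replicator name) while older nodes still expect /system/ddataReplicator,
// selections like this resolve to dead letters until every node runs the new path.
internal class ReplicaSelectionSketch : UntypedActor
{
    private ActorSelection Replica(Address address) =>
        Context.ActorSelection(Self.Path.ToStringWithAddress(address));

    protected override void OnReceive(object message) { }
}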

Member:
We should introduce a HOCON config setting:

akka.cluster.distributed-data.recreate-on-failure = off

When that setting is off, we don't use the BackoffSupervisor, and we will keep it off by default. That way this change can be introduced in a non-breaking fashion. When you want to use the BackoffSupervisor, set it to on.
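
A sketch of how the extension could gate on that flag (the variable names and default handling are assumptions; the PR's actual code may differ):

// Sketch only: reads the proposed flag from the distributed-data config section and
// only wraps the Replicator in the BackoffSupervisor when it is enabled, so existing
// clusters keep the old actor path (and stay wire-compatible) by default.
var name = config.GetString("name", null);
var recreateOnFailure = config.GetBoolean("recreate-on-failure"); // defaults to false
Replicator = recreateOnFailure
    ? system.ActorOf(GetSupervisedReplicator(_settings, name), name + "Supervisor")
    : system.ActorOf(Akka.DistributedData.Replicator.Props(_settings), name);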

andyfurnival (Contributor, Author):

I've made that change: it defaults to the original implementation, with a setting to enable the backoff supervisor.

@Aaronontheweb enabled auto-merge (squash) on July 14, 2021 at 14:44
@Aaronontheweb (Member):

I think the tests introduced here might be a little racy - they probably need to allow more time for the first seed node to self-join.
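
One way to reduce that race (a sketch assuming an Akka.TestKit/xUnit-based test class, not code from this PR) is to block until the first seed node has self-joined and reached Up before exercising the Replicator:

using System;
using System.Linq;
using Akka.Cluster;
using Xunit;

// Inside a TestKit-derived test class:
var cluster = Cluster.Get(Sys);
cluster.Join(cluster.SelfAddress); // first seed node joins itself

// Poll until this node shows up as MemberStatus.Up, instead of assuming the join
// completes within a fixed, short delay.
AwaitAssert(() =>
    Assert.True(cluster.State.Members.Any(m =>
        m.UniqueAddress == cluster.SelfUniqueAddress && m.Status == MemberStatus.Up)),
    TimeSpan.FromSeconds(10));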
