ddata replicator stops but doesn't look like it can be restarted easily #5129 #5143

Merged

Conversation

andyfurnival (Contributor):
Wrapped the DistributedData Replicator in a BackoffSupervisor, so temporary durable store failures don't require a full system restart to recover.

I didn't add this directly into Replicator.Props, as doing so changed the behaviour of many of the multi-node tests, and the trade-off of reworking the tests to support a supervisor didn't make a whole lot of sense.

I had considered putting the supervisor directly on the LmdbDurableStore; however, we would still run the risk of losing the Replicator with no means of recovery (even when the durable store isn't used).

Fixed a slightly annoying missing end bracket in ddata logs for GSet.

@Aaronontheweb (Member) left a comment:

See my comment about breaking changes during upgrades

@@ -191,7 +191,7 @@ public override string ToString()
{
var sb = new StringBuilder("GSet(");
sb.AppendJoin(", ", Elements);

sb.Append(')');
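
As a quick illustration of the fix (a sketch, not code from this PR; it assumes the GSet<T>.Empty/Add API), the ToString output now closes the parenthesis:

using System;
using Akka.DistributedData;

// Illustrative only: with the closing parenthesis appended, a GSet rendered in
// ddata logs reads "GSet(a, b)" instead of the unbalanced "GSet(a, b".
var set = GSet<string>.Empty.Add("a").Add("b");
Console.WriteLine(set); // prints: GSet(a, b)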
Member: LGTM

@@ -83,10 +85,23 @@ public DistributedData(ExtendedActorSystem system)
 else
 {
     var name = config.GetString("name", null);
-    Replicator = system.ActorOf(Akka.DistributedData.Replicator.Props(_settings), name);
+    Replicator = system.ActorOf(GetSupervisedReplicator(_settings, name), name+"Supervisor");
Member: Looks simple enough

Contributor: Just to be clear, the new HOCON flag should switch between these two implementations, depending on its value, correct?

Member: Correct

}
}

private static Props GetSupervisedReplicator(ReplicatorSettings settings, string name) => BackoffSupervisor.Props(
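
The Props call above is cut off in the rendered diff. A minimal sketch of what the BackoffSupervisor wrapping can look like is below; the backoff timings and class name are illustrative assumptions, not the PR's actual values:

using System;
using Akka.Actor;
using Akka.DistributedData;
using Akka.Pattern;

// Sketch only: wraps the Replicator in a BackoffSupervisor so that when the child
// stops (e.g. after a durable store failure) it is recreated after an exponentially
// increasing delay, instead of staying dead until the ActorSystem is restarted.
internal static class ReplicatorSupervisionSketch
{
    public static Props GetSupervisedReplicator(ReplicatorSettings settings, string name) =>
        BackoffSupervisor.Props(
            Replicator.Props(settings),   // child props to (re)create
            name,                         // child actor name
            TimeSpan.FromSeconds(3),      // min backoff (assumed value)
            TimeSpan.FromSeconds(30),     // max backoff (assumed value)
            0.2);                         // random jitter factor (assumed value)
}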
Member:

The impact this has on the actor path should not affect replication - HOWEVER, when introducing Akka.NET v1.4.22 into a cluster that is running v1.4.21, the replication system will break temporarily due to:

https://github.com/akkadotnet/akka.net/blob/dev/src/contrib/cluster/Akka.DistributedData/Replicator.cs#L1104
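
For context on why the path matters (a hedged reading of the linked line, not a quote of it): remote Replicator instances are addressed by rebuilding the local Replicator's own actor path against the remote node's address, roughly along these lines:

using Akka.Actor;

// Hedged illustration only - see the linked Replicator.cs line for the real code.
// If an upgraded node hosts its Replicator at /system/ddataReplicatorSupervisor/ddataReplicator
// (using the default replicator name) while older nodes still expect /system/ddataReplicator,
// selections like this resolve to dead letters until every node runs the new path.
internal class ReplicaSelectionSketch : UntypedActor
{
    private ActorSelection Replica(Address address) =>
        Context.ActorSelection(Self.Path.ToStringWithAddress(address));

    protected override void OnReceive(object message) { }
}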

Member:
We should introduce a HOCON config setting:

akka.cluster.distributed-data.recreate-on-failure = off

When that setting is off, we don't use the BackoffSupervisor, and we will keep it off by default. That way this change can be introduced in a non-breaking fashion. When you want to use the BackoffSupervisor, set it to on.
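
A sketch of how the extension could gate on that flag (the variable names and default handling are assumptions; the PR's actual code may differ):

// Sketch only: reads the proposed flag from the distributed-data config section and
// only wraps the Replicator in the BackoffSupervisor when it is enabled, so existing
// clusters keep the old actor path (and stay wire-compatible) by default.
var name = config.GetString("name", null);
var recreateOnFailure = config.GetBoolean("recreate-on-failure"); // defaults to false
Replicator = recreateOnFailure
    ? system.ActorOf(GetSupervisedReplicator(_settings, name), name + "Supervisor")
    : system.ActorOf(Akka.DistributedData.Replicator.Props(_settings), name);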

andyfurnival (Contributor, Author):

I've made that change: it defaults to the original implementation, with a setting to enable the backoff supervisor.

@Aaronontheweb enabled auto-merge (squash) on July 14, 2021 at 14:44
@Aaronontheweb (Member):

I think the tests introduced here might be a little racy - they probably need to allow more time for the first seed node to self-join.
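
One way to reduce that race (a sketch assuming an Akka.TestKit/xUnit-based test class, not code from this PR) is to block until the first seed node has self-joined and reached Up before exercising the Replicator:

using System;
using System.Linq;
using Akka.Cluster;
using Xunit;

// Inside a TestKit-derived test class:
var cluster = Cluster.Get(Sys);
cluster.Join(cluster.SelfAddress); // first seed node joins itself

// Poll until this node shows up as MemberStatus.Up, instead of assuming the join
// completes within a fixed, short delay.
AwaitAssert(() =>
    Assert.True(cluster.State.Members.Any(m =>
        m.UniqueAddress == cluster.SelfUniqueAddress && m.Status == MemberStatus.Up)),
    TimeSpan.FromSeconds(10));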
