-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ddata replicator stops but doesn't look like it can be restarted easily #5129 #5143
ddata replicator stops but doesn't look like it can be restarted easily #5129 #5143
Conversation
… store failures don't require a full system restart to recover. Fixed a slightly annoying missing end bracket in ddata logs for GSet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment about breaking changes during upgrades
@@ -191,7 +191,7 @@ public override string ToString() | |||
{ | |||
var sb = new StringBuilder("GSet("); | |||
sb.AppendJoin(", ", Elements); | |||
|
|||
sb.Append(')'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -83,10 +85,23 @@ public DistributedData(ExtendedActorSystem system) | |||
else | |||
{ | |||
var name = config.GetString("name", null); | |||
Replicator = system.ActorOf(Akka.DistributedData.Replicator.Props(_settings), name); | |||
Replicator = system.ActorOf(GetSupervisedReplicator(_settings, name), name+"Supervisor"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks simple enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, the new HOCON flag should switch between these 2 implementation, depending on its value, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct
} | ||
} | ||
|
||
private static Props GetSupervisedReplicator(ReplicatorSettings settings, string name) => BackoffSupervisor.Props( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The impact this has on the actor path should not affect replication - HOWEVER, when introducing Akka.NET v1.4.22 into a cluster that is running v1.4.21, the replication system will break temporarily due to:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should introduce a HOCON config setting:
akka.cluster.distributed-data.recreate-on-failure = off
When that setting is off, we don't use the BackoffSupervisor
- and we will keep that setting to off
by default. That way this change can be introduced in a non-breaking fashion. When you want to use the BackoffSupervisor
, set this setting to on
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done that change, to default to the original implementation, with a setting to enable backoff.
I think the tests introduced here might be a little racy - probably need to allow more time for the first seed node to self-join |
Wrapped DistributedData Replicator in a Backoff supervisor, so temporary durable store failures don't require a full system restart to recover.
I didn't add this directly into Replicator.Props as this changed the behaviour of many of the multi-node tests, and the trade-off of reworking tests for supporting a supervisor didn't make a whole lot of sense
I had considered putting the supervisor directly on the LmdbDurableStore, however we still run the risk of losing the Replicator with no means of recover (even when durable store isn't used)
Fixed a slightly annoying missing end bracket in ddata logs for GSet