Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Akka.Cluster.Tools.Singleton / Akka.Cluster.Sharding: fix duplicate shards caused by incorrect ClusterSingletonManager HandOver #7297

Merged
merged 2 commits into from
Jul 24, 2024

Conversation

Aaronontheweb
Copy link
Member

@Aaronontheweb Aaronontheweb commented Jul 23, 2024

Changes

Eliminates the source of #6793, which was caused by using the incorrect ordering methodology when it came to determining which ClusterSingletonManager to hand-over to during member state transitions.

close #6973
close #7196

Supersedes #7287 and #7197

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

Eliminates the source of akkadotnet#6793, which was caused by using the incorrect ordering methodology when it came to determining which `ClusterSingletonManager` to hand-over to during member state transitions.

close akkadotnet#6973
close akkadotnet#7196
Copy link
Member Author

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This resolves one of the nastiest bugs we've ever debugged - I knew it would be a stupid problem like this, but tracking it down was a herculean effort. Hundreds of hours spent between @Arkatufus and myself.

_memberAgeComparer = settings.ConsiderAppVersion
? MemberAgeOrdering.DescendingWithAppVersion
: MemberAgeOrdering.Descending;
_memberAgeComparer = Member.AgeOrdering;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the bug fix - descending order of members by age means we always get THE YOUNGEST at the front of the membersByAge list. Members MUST BE IN ASCENDING ORDER, OLDEST FIRST. I'm not sure how far back this bug goes - at least to v1.5. I'll add another comment after I've dug that up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer needed, sticking with DRY especially since Member.AgeOrdering is already thoroughly sanity checked via #7291

_memberAgeComparer = considerAppVersion
? MemberAgeOrdering.DescendingWithAppVersion
: MemberAgeOrdering.Descending;
_memberAgeComparer = Member.AgeOrdering;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uses the same fix as the SingletonProxy

else
{
Unhandled(message);
case ClusterEvent.CurrentClusterState state:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReSharper'd this to use a switch statement. No biggie.

@@ -71,11 +71,6 @@ public TestException(string message, Exception innerEx)
: base(message, innerEx)
{
}

protected TestException(SerializationInfo info, StreamingContext context)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obsolete, deleted it to remove a build warning.

@Aaronontheweb
Copy link
Member Author

Going to put this into our test lab with a private build and verify the fix overnight.

@Aaronontheweb
Copy link
Member Author

Looked at the history - apparently this has ALWAYS been a bug with Akka.Cluster.Tools, going all the way back to when the Cluster Singeton was first introduced #1530

@Aaronontheweb
Copy link
Member Author

We'll see if tests and the MNTR need to be adjusted, but our tracing system found definitive proof that the hand off was being screwed up, which lead to two DDataShardCoordinators running at the same time (hence #6973). Going to fire this up in the test lab and see if we can still reproduce that after this.

Copy link
Contributor

@Arkatufus Arkatufus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants