
improve Akka.Cluster / Akka.Remote DeadLetter logging #7149

Merged
merged 11 commits into akkadotnet:dev from remoting-deadletters
Apr 10, 2024

Conversation

@Aaronontheweb (Member) commented Apr 10, 2024

Changes

Various cleanup inside Akka.Cluster and Akka.Remote in order to make it easier to debug cluster gossip issues.

@Aaronontheweb (Member Author) commented:

Racy spec: ClusterSpec.A_cluster_must_leave_via_CoordinatedShutdownRun()

Xunit.Sdk.XunitException
Expected removed.PreviousStatus to equal MemberStatus.Exiting(3) by value, but found MemberStatus.Leaving(2).

With configuration:
- Use declared types and members
- Compare enums by value
- Match member by name (or throw)
- Without automatic conversion.
- Be strict about the order of items in byte arrays

   at FluentAssertions.Execution.XUnit2TestFramework.Throw(String message)
   at FluentAssertions.Execution.TestFrameworkProvider.Throw(String message)
   at FluentAssertions.Execution.CollectingAssertionStrategy.ThrowIfAny(IDictionary`2 context)
   at FluentAssertions.Equivalency.EquivalencyValidator.AssertEquality(EquivalencyValidationContext context)
   at FluentAssertions.Primitives.ObjectAssertions.BeEquivalentTo[TExpectation](TExpectation expectation, Func`2 config, String because, Object[] becauseArgs)
   at FluentAssertions.Primitives.ObjectAssertions.BeEquivalentTo[TExpectation](TExpectation expectation, String because, Object[] becauseArgs)
   at Akka.Cluster.Tests.ClusterSpec.A_cluster_must_leave_via_CoordinatedShutdownRun() in C:\Repositories\olympus\akka.net\src\core\Akka.Cluster.Tests\ClusterSpec.cs:line 479
   at Xunit.Sdk.TestInvoker`1.<>c__DisplayClass48_0.<<InvokeTestMethodAsync>b__1>d.MoveNext() in /_/src/xunit.execution/Sdk/Frameworks/Runners/TestInvoker.cs:line 276
--- End of stack trace from previous location ---
   at Xunit.Sdk.ExecutionTimer.AggregateAsync(Func`1 asyncAction) in /_/src/xunit.execution/Sdk/Frameworks/ExecutionTimer.cs:line 48
   at Xunit.Sdk.ExceptionAggregator.RunAsync(Func`1 code) in /_/src/xunit.core/Sdk/ExceptionAggregator.cs:line 90

Somehow, it looks like a state transition was missed between Leaving and Removed - the logs in the cluster daemon correctly record that this node is exiting:

[INFO][04/10/2024 18:19:08.362Z][Thread 0020][Cluster (akka://ClusterSpec2)] Cluster Node [2.11.1] - Node [akka.tcp://[email protected]:57312] is JOINING itself (with roles [], version [2.11.1]) and forming a new cluster
[INFO][04/10/2024 18:19:08.362Z][Thread 0020][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://[email protected]:57312] - is the new leader among reachable nodes (more leaders may exist)
[INFO][04/10/2024 18:19:08.362Z][Thread 0020][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://[email protected]:57312] - Leader is moving node [akka.tcp://[email protected]:57312] to [Up]
[INFO][04/10/2024 18:19:08.362Z][Thread 0013][akka.tcp://[email protected]:57312/system/cluster/core/daemon/downingProvider] This node is now the leader responsible for taking SBR decisions among the reachable nodes (more leaders may exist).
[INFO][04/10/2024 18:19:08.362Z][Thread 0020][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://[email protected]:57312] - Marked address [akka.tcp://[email protected]:57312] as [Leaving]
[INFO][04/10/2024 18:19:09.360Z][Thread 0013][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://[email protected]:57312] - Leader is moving node [akka.tcp://[email protected]:57312] to [Exiting]
[INFO][04/10/2024 18:19:09.360Z][Thread 0013][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://[email protected]:57312] - Exiting completed.
[INFO][04/10/2024 18:19:09.360Z][Thread 0013][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://[email protected]:57312] - Shutting down...
[INFO][04/10/2024 18:19:09.360Z][Thread 0013][Cluster (akka://ClusterSpec2)] Cluster Node [akka.tcp://[email protected]:57312] - Successfully shut down
[INFO][04/10/2024 18:19:09.360Z][Thread 0013][akka://ClusterSpec2/system/cluster/core/publisher] Message [PublishChanges] from [akka://ClusterSpec2/system/cluster/core/daemon#1459903719] to [akka://ClusterSpec2/system/cluster/core/publisher#1716656279] was not delivered. [1] dead letters encountered. If this is not an expected behavior then [akka://ClusterSpec2/system/cluster/core/publisher#1716656279] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. Message content: Akka.Cluster.InternalClusterAction+PublishChanges
[INFO][04/10/2024 18:19:09.360Z][Thread 0013][akka://ClusterSpec2/system/cluster/core/publisher] Message [PublishChanges] from [akka://ClusterSpec2/system/cluster/core/daemon#1459903719] to [akka://ClusterSpec2/system/cluster/core/publisher#1716656279] was not delivered. [2] dead letters encountered. If this is not an expected behavior then [akka://ClusterSpec2/system/cluster/core/publisher#1716656279] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. Message content: Akka.Cluster.InternalClusterAction+PublishChanges
[DEBUG][04/10/2024 18:19:09.362Z][Thread 0014][ActorSystem(ClusterSpec2)] System shutdown initiated
[INFO][04/10/2024 18:19:09.362Z][Thread 0105][remoting-terminator] Shutting down remote daemon.
[INFO][04/10/2024 18:19:09.362Z][Thread 0105][remoting-terminator] Remote daemon shut down; proceeding with flushing remote transports.
[DEBUG][04/10/2024 18:19:09.362Z][Thread 0063][ActorSystem(ClusterSpec-130)] System shutdown initiated
[INFO][04/10/2024 18:19:09.363Z][Thread 0013][remoting-terminator] Shutting down remote daemon.
[INFO][04/10/2024 18:19:09.363Z][Thread 0013][remoting-terminator] Remote daemon shut down; proceeding with flushing remote transports.
[INFO][04/10/2024 18:19:09.364Z][Thread 0014][remoting-terminator] Remoting shut down.
[DEBUG][04/10/2024 18:19:09.364Z][Thread 0013][EventStream] Shutting down: StandardOutLogger started
[DEBUG][04/10/2024 18:19:09.364Z][Thread 0013][EventStream] All default loggers stopped

That PublishChanges event that gets deadletter'd might be received AFTER this issue has occurred, so I'm not sure how significant that is.

@Aaronontheweb (Member Author) left a comment:

Detailed my changes

@@ -44,7 +44,6 @@ public string ShardId(object message)
=> message switch
{
int i => (i % 10).ToString(),
- ShardRegion.StartEntity se => (int.Parse(se.EntityId) % numberOfShards).ToString(),
Member Author:

Removed per analyzer warning

@@ -476,7 +476,7 @@ public async Task A_cluster_must_leave_via_CoordinatedShutdownRun()
await probe.ExpectMsgAsync<ClusterEvent.MemberLeft>();
// MemberExited might not be published before MemberRemoved
var removed = (ClusterEvent.MemberRemoved)await probe.FishForMessageAsync(m => m is ClusterEvent.MemberRemoved);
- removed.PreviousStatus.Should().BeEquivalentTo(MemberStatus.Exiting);
+ new [] {MemberStatus.Exiting, MemberStatus.Leaving}.Should().Contain(removed.PreviousStatus);
Member Author:

Fix for racy spec - we'd already implemented this for one of the other "node leaving" specs.

The problem is that the PublishChanges message gets DeadLetter'd between the Exiting and Removed statuses - the Removed status is generated, in this case, from the singleton cluster shutting itself down rather than from gossip messages. This is a timing issue that mostly affects the spec - it has no real-world impact. So we accept either of these two "terminating" statuses instead.

@@ -257,11 +257,11 @@ public void A_gossip_must_find_two_oldest_per_role_as_targets_for_Exiting_change
Member a8 = TestMember.Create(new Address("akka.tcp", "sys", "a8", 2552), MemberStatus.Exiting, ImmutableHashSet<string>.Empty.Add("role1"), upNumber: 8);
Member a9 = TestMember.Create(new Address("akka.tcp", "sys", "a9", 2552), MemberStatus.Exiting, ImmutableHashSet<string>.Empty.Add("role2"), upNumber: 9);

- IEnumerable<Member> theExiting = new Member[] { a5, a6 };
+ var theExiting = new Member[] { a5, a6 };
Member Author:

Cleaned this up to just use an IReadOnlyCollection<Member>, to avoid some of the multiple-enumeration problems GossipTargetsForExitingMembers was susceptible to.

@@ -2339,15 +2338,19 @@ public void LeaderActionsOnConvergence()
}

PublishMembershipState();
- GossipExitingMembersToOldest(changedMembers.Where(i => i.Status == MemberStatus.Exiting));
+ GossipExitingMembersToOldest(changedMembers.Where(i => i.Status == MemberStatus.Exiting).ToArray());
Member Author:

Capture an array and pass it here, in order to solve the multiple-enumeration problem. No real perf hit here - this only happens when leaders are converging on cluster membership status changes.

}

/// <summary>
/// Gossip the Exiting change to the two oldest nodes for quick dissemination to potential Singleton nodes
/// </summary>
/// <param name="exitingMembers"></param>
- private void GossipExitingMembersToOldest(IEnumerable<Member> exitingMembers)
+ private void GossipExitingMembersToOldest(IReadOnlyCollection<Member> exitingMembers)
Member Author:

Fix for the multi-enumeration problem.
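
For context, a minimal standalone sketch of the multiple-enumeration hazard (illustrative only, not the actual Akka.NET code; the demo names here are made up): a lazy LINQ query passed around as IEnumerable<T> re-runs its predicate every time the callee enumerates it, while materializing it once with ToArray() and accepting an IReadOnlyCollection<T> guarantees a single evaluation.

using System;
using System.Collections.Generic;
using System.Linq;

public static class MultipleEnumerationDemo
{
    private static int _predicateRuns;

    // Enumerates its argument twice, like a callee that checks Any() and then iterates.
    private static void Consume(IEnumerable<int> items)
    {
        if (items.Any())
        {
            foreach (var item in items) { }
        }
    }

    public static void Main()
    {
        IEnumerable<int> lazy = Enumerable.Range(1, 5).Where(i =>
        {
            _predicateRuns++;
            return i % 2 == 0;
        });

        Consume(lazy);
        Console.WriteLine(_predicateRuns); // 7: the Where predicate re-ran on the second enumeration

        _predicateRuns = 0;
        IReadOnlyCollection<int> materialized = Enumerable.Range(1, 5).Where(i =>
        {
            _predicateRuns++;
            return i % 2 == 0;
        }).ToArray(); // evaluated exactly once here

        Consume(materialized);
        Console.WriteLine(_predicateRuns); // 5: Consume just walks the cached array
    }
}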

@@ -711,7 +703,7 @@ private void Gated(bool writerTerminated, bool earlyUngateRequested)
// remote address at the EndpointManager level stopping this actor. In case the remote system becomes reachable
// again it will be immediately quarantined due to out-of-sync system message buffer and becomes quarantined.
// In other words, this action is safe.
- if (_bailoutAt != null && _bailoutAt.IsOverdue)
+ if (_bailoutAt is { IsOverdue: true })
Member Author:

ReSharper suggested some pattern matching here.

@@ -1195,7 +1187,7 @@ private void Initializing()
{
// Assert handle == None?
Context.Parent.Tell(
new ReliableDeliverySupervisor.GotUid((int)handle.ProtocolHandle.HandshakeInfo.Uid, RemoteAddress));
Member Author:

Unnecessary cast.

}

- private Receive ReceiveWithAlwaysLogging()
+ private bool ReceiveWithAlwaysLogging(object message)
Member Author:

Minor perf optimization - avoid allocating new behaviors here if we can help it.

}

- private Receive ReceiveWithMaxCountLogging()
+ private bool ReceiveWithMaxCountLogging(object message)
Member Author:

Refactored this to be just a method, instead of allocating a new Receive each time.
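
A minimal standalone sketch of the allocation difference (the Handler delegate and both method names below are made-up stand-ins for Akka.NET's receive-delegate shape, bool handler(object message); none of this is the actual EndpointWriter code): a factory method returns a fresh lambda, and therefore a fresh delegate, on every call, while a method that already matches the delegate signature can be converted once and reused.

using System;

public static class BehaviorAllocationDemo
{
    // Stand-in for a receive-style delegate: returns true when the message was handled.
    public delegate bool Handler(object message);

    // Old shape: every call allocates a new closure and a new delegate,
    // because the returned lambda captures the "prefix" argument.
    private static Handler HandlerFactory(string prefix)
    {
        return message => message is string s && s.StartsWith(prefix);
    }

    // New shape: the method itself matches the delegate signature,
    // so callers can create the delegate once and reuse it.
    private static bool HandleDirectly(object message)
    {
        return message is string;
    }

    public static void Main()
    {
        var a = HandlerFactory("sys");
        var b = HandlerFactory("sys");
        Console.WriteLine(ReferenceEquals(a, b)); // False - two separate allocations

        Handler cached = HandleDirectly;          // created once, reused from here on
        Console.WriteLine(cached("system"));      // True
        Console.WriteLine(cached(42));            // False
    }
}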

@@ -27,7 +27,6 @@ public MessageExtractor(int maxNumberOfShards) : base(maxNumberOfShards)
public override object EntityMessage(object message)
=> message switch
{
- ShardingEnvelope e => e.Message,
Member Author:

Resolving a build warning by removing automatically handled sharding messages.

@Aaronontheweb changed the title from "[WIP] improve Akka.Cluster / Akka.Remote DeadLetter logging" to "improve Akka.Cluster / Akka.Remote DeadLetter logging" on Apr 10, 2024

public override string ToString()
{
return $"GossipEnvelope(from={From}, to={To}, gossip={Gossip}, deadline={Deadline})";
Member Author:

This is the primary DeadLetter logging change I wanted to make here - it should make it much easier to understand which gossip messages were lost to the DeadLetter pile when things are shutting down inside cluster nodes.
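
The "Message content: ..." tail of the dead-letter log lines shown above is just the message's ToString(); without an override it is only the type name (e.g. Akka.Cluster.InternalClusterAction+PublishChanges). A standalone sketch of the difference the override makes, using hypothetical envelope types rather than the real GossipEnvelope:

using System;

// Hypothetical message types, only to show what ends up in a dead-letter log line.
public sealed class EnvelopeWithoutToString
{
    public string From { get; } = "akka.tcp://sys@node1:2552";
    public string To { get; } = "akka.tcp://sys@node2:2552";
}

public sealed class EnvelopeWithToString
{
    public string From { get; } = "akka.tcp://sys@node1:2552";
    public string To { get; } = "akka.tcp://sys@node2:2552";

    // Same idea as the GossipEnvelope.ToString() added in this PR:
    // surface the interesting fields so the log line is actually diagnostic.
    public override string ToString() => $"Envelope(from={From}, to={To})";
}

public static class DeadLetterToStringDemo
{
    public static void Main()
    {
        // Default ToString(): only the type name, which is what the logs above show.
        Console.WriteLine($"Message content: {new EnvelopeWithoutToString()}");
        // Overridden ToString(): the details needed to debug lost gossip.
        Console.WriteLine($"Message content: {new EnvelopeWithToString()}");
    }
}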

@Aaronontheweb marked this pull request as ready for review April 10, 2024 19:22
@Arkatufus (Contributor) left a comment:

LGTM

@Aaronontheweb merged commit 49fe208 into akkadotnet:dev Apr 10, 2024
7 of 12 checks passed
@Aaronontheweb deleted the remoting-deadletters branch April 10, 2024 20:32