Some persistent actors are stuck with RecoveryTimedOutException after circuit breaker opens #4265
Comments
And here's an extract from our HOCON:
I think I have an idea on how to reproduce this with SQL Server - it's tough to reproduce in this repository with Sqlite because I can't easily simulate a database outage there. It should be doable with SQL Server, though.
I once managed to trigger this scenario by simply executing a long-running query that locked the whole EventJournal table. After the query completed, persistent actors that had failed during its execution were still stuck, and I had to restart the service to be able to recover them.
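For anyone who wants to try this locally, here's a minimal sketch (not from the issue) of how such a lock could be simulated from C#. The connection string, the two-minute duration, and the use of Microsoft.Data.SqlClient are assumptions; the EventJournal table name matches the default Akka.Persistence.SqlServer settings.

```csharp
// Hypothetical helper: hold an exclusive lock on the EventJournal table so that
// journal reads/writes block long enough to trip the journal's circuit breaker.
using System;
using Microsoft.Data.SqlClient;

class JournalOutageSimulator
{
    static void Main()
    {
        // Assumed connection string; adjust to your environment.
        var connectionString = "Server=localhost;Database=AkkaPersistence;Integrated Security=true";

        using var connection = new SqlConnection(connectionString);
        connection.Open();

        // TABLOCKX + HOLDLOCK takes an exclusive table lock and keeps it until the
        // transaction ends; WAITFOR DELAY keeps the transaction open for two minutes.
        using var command = new SqlCommand(@"
            BEGIN TRANSACTION;
            SELECT TOP 1 * FROM EventJournal WITH (TABLOCKX, HOLDLOCK);
            WAITFOR DELAY '00:02:00';
            COMMIT TRANSACTION;", connection);
        command.CommandTimeout = 0; // don't let the client time out before the delay elapses

        Console.WriteLine("Locking EventJournal for 2 minutes...");
        command.ExecuteNonQuery();
        Console.WriteLine("Lock released.");
    }
}
```

Running this while persistent actors are recovering should starve the journal in roughly the way described above.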
@object https://github.com/akkadotnet/Akka.Persistence.SqlServer/blob/dev/src/Akka.Persistence.SqlServer.Tests/SqlServerFixture.cs - we can add a method to stop the Docker container we use for integration testing in here and a second method to start it again. That would be a pretty robust way of doing it.
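For illustration, a rough sketch (not the actual SqlServerFixture code) of what such stop/start helpers could look like using Docker.DotNet. The class and method names, the constructor parameters, and the assumption that the fixture already knows the container's id are all hypothetical.

```csharp
using System;
using System.Threading.Tasks;
using Docker.DotNet;
using Docker.DotNet.Models;

// Hypothetical fixture wrapper: stop the SQL Server container to simulate an
// outage, then start it again to see whether stuck actors ever recover.
public class SqlServerOutageFixture
{
    private readonly DockerClient _client;
    private readonly string _containerId;

    public SqlServerOutageFixture(Uri dockerEndpoint, string containerId)
    {
        _client = new DockerClientConfiguration(dockerEndpoint).CreateClient();
        _containerId = containerId;
    }

    // Simulate a database outage by stopping the SQL Server container.
    public Task StopSqlServerAsync() =>
        _client.Containers.StopContainerAsync(_containerId, new ContainerStopParameters());

    // Bring the database back so recovery can be retried.
    public Task StartSqlServerAsync() =>
        _client.Containers.StartContainerAsync(_containerId, new ContainerStartParameters());
}
```

Stopping the container while actors are replaying, waiting for the circuit breaker to open, and then starting it again should reproduce the stuck-actor behaviour described in this issue.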
This still stands: akkadotnet/Akka.Persistence.SqlServer#114 (comment)
@ismaelhamed that would do it.
@Aaronontheweb we are not using Docker for integration tests yet, but stopping a container looks like a way to reproduce this scenario.
I'm going to leave this issue open for the time being (edit: meaning my team and I can't get to it right away), as we're pretty tied up with the 1.4.0 release (trying to get a release candidate with a stable API shipped today) - but I think we can get this reproduced and patched in short order.
@ismaelhamed I'm still working on a reproduction for this bug, using
I've finally managed to allocate some time to investigate this one further. I made a couple of tests to trigger state recovery of 10,000 persistent actors. Here's what happens:

Test 1: Recover the state of 10K persistent actors in a non-clustered environment. The test creates an actor system and instantiates 10K actors. I ran this test multiple times and it seems to work without any errors.

Test 2: Recover the state of 10K persistent actors activated via cluster sharding, using the full version of the actors that spawn additional work. The persistent actors are spread across various shards, and the test sends activation requests via a Web API that uses cluster sharding to instantiate the persistent actors on their respective nodes. Some actors fail to recover their state; here's what gets logged:

akka://Oddjob/system/akka.persistence.journal.sql-server | Circuit Breaker is open; calls are failing fast

followed by a few hundred similar Akka.Persistence.RecoveryTimedOutException log entries. Some of these failed actors can be recovered later; others are stuck and require a restart of the nodes where they live. To investigate this further I can try two directions:
@Aaronontheweb @IgorFedchenko @ismaelhamed which approach do you think would be better to take first? Any other suggestions?
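Regarding Test 1 above, here's a minimal sketch (an assumption about the shape of the test, not the author's actual code) of recovering a large number of persistent actors outside the cluster: spawn N actors and wait for each to answer a message, which it can only do after replay finishes. The actor, its message protocol, and the 30-second timeout are made up for illustration; the journal configuration is assumed to be supplied via HOCON.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Persistence;

// A tiny persistent actor: it replays int events on recovery and answers any
// string message with its current state once recovery has completed.
public sealed class Counter : ReceivePersistentActor
{
    private readonly string _persistenceId;
    private int _state;

    public override string PersistenceId => _persistenceId;

    public Counter(string persistenceId)
    {
        _persistenceId = persistenceId;

        Recover<int>(delta => _state += delta);                   // replayed events
        Command<string>(_ => Sender.Tell(_state));                // commands are stashed until recovery completes
        Command<int>(delta => Persist(delta, d => _state += d));  // persist new increments
    }
}

public static class RecoveryLoadTest
{
    public static async Task Main()
    {
        // Journal configuration (e.g. Akka.Persistence.SqlServer HOCON) is assumed
        // to be loaded here; it is omitted from the sketch.
        using var system = ActorSystem.Create("RecoveryTest");

        var actors = Enumerable.Range(0, 10_000)
            .Select(i => system.ActorOf(Props.Create(() => new Counter($"counter-{i}")), $"counter-{i}"))
            .ToList();

        // Persistent actors stash commands until recovery finishes, so a successful
        // Ask means the actor has completed (or never needed) replay.
        var replies = actors.Select(a => a.Ask<int>("ping", TimeSpan.FromSeconds(30)));
        await Task.WhenAll(replies);

        Console.WriteLine("All 10,000 actors recovered.");
    }
}
```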
I experimented with this some more; a few more observations:
So it looks like I need to build custom versions of the Akka.Persistence DLLs with extra logging to better understand what's going on.
This could be caused by a bug in
This problem should be fixed by #4953; the new code only has a single code path for failures and it always increments the counter on each failure.
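For context, a toy illustration (not Akka.NET's actual CircuitBreaker implementation) of why a single failure path matters: if every failed call funnels through one handler that increments the failure counter, the breaker cannot miss failures and trips deterministically once the threshold is reached. All names below are hypothetical.

```csharp
using System;
using System.Threading.Tasks;

// Simplified breaker: one ExecuteAsync entry point, one OnFailure exit path.
public sealed class TinyCircuitBreaker
{
    private readonly int _maxFailures;
    private int _failureCount;
    public bool IsOpen { get; private set; }

    public TinyCircuitBreaker(int maxFailures) => _maxFailures = maxFailures;

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> body)
    {
        if (IsOpen)
            throw new InvalidOperationException("Circuit Breaker is open; calls are failing fast");

        try
        {
            var result = await body();
            _failureCount = 0;          // success resets the counter
            return result;
        }
        catch
        {
            OnFailure();                // the single failure path
            throw;
        }
    }

    private void OnFailure()
    {
        _failureCount++;
        if (_failureCount >= _maxFailures)
            IsOpen = true;
    }
}
```

If some failures bypassed OnFailure (e.g. a second error-handling branch that forgot to count), the breaker's state could get out of sync with the journal's actual health, which is the kind of inconsistency the linked fix addresses.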
This issue looks similar to #3870; however, it happens when using the latest version of Akka.NET.
OS: Windows Server 2016
Platform: .NET Core 3.1
Akka.NET packages: 1.4.0-beta14 (used in a cluster)
Scenario:
Akka.Persistence.SqlServer.Journal.BatchingSqlServerJournal raises an exception with the message "Circuit Breaker is open; calls are failing fast", most likely due to a temporary database outage
Attempts to recover the state of some persistent actors fail with RecoveryTimedOutException. Here's a typical sequence of events, taken from our log: