Quarkus Scheduler stops without trace #41240
Comments
/cc @brunobat (opentelemetry,tracing), @manovotn (scheduler), @mkouba (scheduler), @radcortez (opentelemetry,tracing)
Do you use the quarkus-quartz extension or the simple scheduler? In any case, I would start with a thread dump, taken e.g. by VisualVM.
If you enable TRACE logging for the scheduler, you should see the "Check triggers" message and also a separate message for each trigger fired. However, this is not very practical for a production environment.
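Enabling that logging might look like the following `application.properties` fragment. The logger category name is an assumption based on the extension's package, so verify it against your own log output:

```properties
# Assumed category for the Quarkus scheduler (verify against your logs).
# min-level is needed because TRACE is below the default minimum level.
quarkus.log.category."io.quarkus.scheduler".min-level=TRACE
quarkus.log.category."io.quarkus.scheduler".level=TRACE
```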
I use quarkus-scheduler. It's in production and running in an AWS Fargate task, so that would be a bit impractical.
I see, in that case a thread dump might be helpful.
Hello, similar issue here, running on Google Cloud with a native image. For now we replaced @Scheduled(concurrentExecution = Scheduled.ConcurrentExecution.SKIP, every = "1s") with a while(true) loop in a thread spawned from the constructor of a class annotated with @Startup, but we want to go back to a @Scheduled method.
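A minimal sketch of that workaround, assuming a hypothetical `doWork()` job and a fixed interval: a plain interruptible loop that swallows task failures, so a single bad run cannot kill the thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PollingLoop {
    static final AtomicInteger runs = new AtomicInteger();

    // Placeholder for the real job body
    static void doWork() {
        runs.incrementAndGet();
    }

    public static Thread start(long intervalMillis) {
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    doWork();
                } catch (Throwable t) {
                    // log and keep looping; a failure must not end the loop
                    System.err.println("job failed: " + t);
                }
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // exit the loop
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
        return worker;
    }

    public static void main(String[] args) throws Exception {
        Thread t = start(10);
        Thread.sleep(100);
        t.interrupt();
        t.join(1000);
        System.out.println("runs = " + runs.get());
    }
}
```

Unlike the SKIP-guarded @Scheduled method, this loop owns its own error handling, which is likely why it sidesteps the problem.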
So the best thing you could do in such a situation is to create a thread dump (as mentioned in #41240 (comment)). In Quarkus 3.12+, there should be a dedicated scheduler thread visible in the dump. It could be that the dedicated executor that runs the trigger checks is blocked.
We took thread dumps and heap dumps of three instances of our services. Each contains a single scheduled method with a trigger for every ten seconds, and that method did not execute for a week. Prior to that, the service had 4 additional days of uptime until the scheduler stopped working.

Threaddump

Running on Quarkus 3.8.5, the scheduler thread follows the …

Heapdump

Analysing the heap dump, we find that Quarkus never reset the …

Source

Looking at the source, we assume that something went "really wrong" (Lines 27 to 31 in 53b8748), which means that … (Lines 36 to 38 in 53b8748). For the same reason, the … (Lines 35 to 37 in 53b8748).
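If the state that is never reset really is a concurrent-execution guard, the suspected failure mode can be reduced to the following hypothetical plain-Java sketch (illustrative only, not the actual Quarkus source): the flag is only reset by a completion callback, so a synchronous throw from the delegate leaves it set forever and every later trigger is silently skipped.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

public class SkipGuardSketch {
    final AtomicBoolean running = new AtomicBoolean();
    int completedRuns;

    CompletionStage<Void> invoke(Supplier<CompletionStage<Void>> delegate) {
        if (running.compareAndSet(false, true)) {
            // If delegate.get() throws synchronously, whenComplete() is never
            // registered and `running` stays true: all future runs are skipped.
            return delegate.get().whenComplete((r, t) -> running.set(false));
        }
        return CompletableFuture.completedStage(null); // concurrent run skipped
    }

    public static void main(String[] args) {
        SkipGuardSketch guard = new SkipGuardSketch();
        try {
            guard.invoke(() -> { throw new IllegalStateException("boom"); });
        } catch (RuntimeException expected) {
            // the synchronous failure propagates out of invoke()
        }
        // a perfectly healthy delegate is now skipped forever
        guard.invoke(() -> {
            guard.completedRuns++;
            return CompletableFuture.completedStage(null);
        });
        System.out.println("completedRuns = " + guard.completedRuns); // prints "completedRuns = 0"
    }
}
```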
This behavior renders the comment on quarkus/extensions/scheduler/runtime/src/main/java/io/quarkus/scheduler/runtime/SimpleScheduler.java (Line 445 in 53b8748) …

Conclusion

From our perspective, …

We assume that the underlying exception might come from Hibernate Reactive, which suggests that the service may reuse the Hibernate session object across different threads, because the service logs that exception also from outside the scheduled method, but that's highly speculative. I'll rest my case 🙏 Perhaps the maintainers can weigh in on the following questions: …
If the analysis proves correct, the bug affects at least Quarkus 3.8.5 through 3.12.3. Our workaround consists of a new health check that fails the liveness probe if the scheduled method has not executed for 15 minutes, which then restarts Quarkus.
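That liveness workaround can be sketched in plain Java as a staleness check. The 15-minute threshold comes from the description above; class and method names are made up, and the actual health-check wiring (e.g. a SmallRye Health readiness/liveness check) is omitted:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

public class SchedulerStalenessCheck {
    static final Duration MAX_AGE = Duration.ofMinutes(15);
    static final AtomicReference<Instant> lastRun =
            new AtomicReference<>(Instant.now());

    // Call this at the end of every successful scheduled execution
    static void recordExecution() {
        lastRun.set(Instant.now());
    }

    // Wire this into the liveness probe; a failure makes the
    // orchestrator (Kubernetes, Fargate, ...) restart the instance
    static boolean isLive(Instant now) {
        return Duration.between(lastRun.get(), now).compareTo(MAX_AGE) <= 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(isLive(now));                              // prints "true"
        System.out.println(isLive(now.plus(Duration.ofMinutes(16)))); // prints "false"
    }
}
```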
@f4lco Thanks for your great analysis! Indeed, it looks like a bug in the …
That's a good point. I think that we should catch …
Another workaround is to catch the exception in the scheduled method, i.e. never throw an exception. You could also implement a CDI interceptor to catch the exception. @nicklasweasel Does your application contain …
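The "never throw an exception" workaround boils down to a tiny wrapper like the following hypothetical sketch; a CDI interceptor would do the same thing in an @AroundInvoke method:

```java
public class GuardedTask {
    // Wrap the real job so that no Throwable ever escapes the scheduled method
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) {
                // log instead of propagating; the scheduler never sees a failure
                System.err.println("scheduled task failed: " + t);
            }
        };
    }

    public static void main(String[] args) {
        guarded(() -> { throw new IllegalStateException("boom"); }).run();
        System.out.println("still alive"); // prints "still alive"
    }
}
```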
@f4lco Actually, I looked at the generated body of the invoker:

```java
public CompletionStage invokeBean(ScheduledExecution var1) {
    try {
        ArcContainer var2 = Arc.container();
        InjectableBean var3 = var2.bean("hHeTHx7NF2PL8PrZVcnWUI_dX80");
        ((CounterBean) var2.instance(var3).get()).cronJobWithExpressionInConfig();
        return CompletableFuture.completedStage((Object) null);
    } catch (Throwable var5) {
        return CompletableFuture.failedStage(var5);
    }
}
```

So an exception thrown from a scheduled method should not be a problem. I've tried to modify our …
In any case, I've created #42127 to make the Invoker's logic more robust. We should at least see the exception stack if something goes really wrong ;-).
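Without claiming this is what #42127 actually does, a defensive version of the guard would catch synchronous Throwables, reset the flag, and surface the failure as a failed stage instead of losing it (again a hypothetical plain-Java sketch, not the Quarkus source):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

public class RobustSkipGuard {
    final AtomicBoolean running = new AtomicBoolean();
    int completedRuns;

    CompletionStage<Void> invoke(Supplier<CompletionStage<Void>> delegate) {
        if (!running.compareAndSet(false, true)) {
            return CompletableFuture.completedStage(null); // skipped
        }
        try {
            return delegate.get().whenComplete((r, t) -> running.set(false));
        } catch (Throwable t) {
            running.set(false); // never leave the guard stuck
            return CompletableFuture.failedStage(t); // surface the failure
        }
    }

    public static void main(String[] args) {
        RobustSkipGuard guard = new RobustSkipGuard();
        guard.invoke(() -> { throw new IllegalStateException("boom"); });
        // the next trigger still fires because the guard was reset
        guard.invoke(() -> {
            guard.completedRuns++;
            return CompletableFuture.completedStage(null);
        });
        System.out.println("completedRuns = " + guard.completedRuns); // prints "completedRuns = 1"
    }
}
```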
Exactly, I was going to comment the same; that's my understanding, too. That's the part I do not understand: apparently there is more code sitting in between …
- related to quarkusio#41240 (cherry picked from commit 0e075ee)
@mkouba we were unable to collect hard evidence on the exception that causes the scheduler to halt.
Thanks. Let us know if you see something suspicious.
@f4lco sorry to drop into the conversation, but is this something reported to the Hibernate Reactive project and properly identified? Thanks!
Hello guys, in our case the solution was to implement a synchronized queue and allow concurrent @Scheduled executions to consume from that queue while checking each item in and out. We observe that some executions are left in a WAITING state, and simply after some time we verify whether the original request was committed to the data store or not. Unfortunately, we can't extract dumps or connect to the JVM, as the Kubernetes cluster is neither managed nor controlled by us. Just to let you know our workaround. Thanks a lot.
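A hypothetical reduction of that queue-based workaround (names and the check-in/check-out bookkeeping are assumptions, not the poster's actual code): items are checked in on submission and checked out once the work is committed, so stale entries can be detected later.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class WorkQueueSketch {
    static final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    static final Map<String, Boolean> committed = new ConcurrentHashMap<>();

    // Producer side: check the item in
    static void submit(String id) {
        committed.put(id, false);
        queue.add(id);
    }

    // Consumer side: each concurrent @Scheduled run drains the shared queue;
    // items left at false for too long indicate a stuck execution
    static void drainOnce() {
        String id;
        while ((id = queue.poll()) != null) {
            // ... perform the real work here ...
            committed.put(id, true); // check the item out
        }
    }

    public static void main(String[] args) {
        submit("a");
        submit("b");
        drainOnce();
        System.out.println(committed.get("a") + " " + committed.get("b")); // prints "true true"
    }
}
```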
@aferreiraguido could you by any chance dump the stack in the logs?
Hey @gsmet we will try our best; unfortunately, we are in a platform freeze, but once we get anything, we'll drop it over here. Please also note that we are now on 2.6.0 and GraalVM 21.3, also awaiting an update of the platform to newer versions.
Well, #23324 is an open bug report, with detailed descriptions, stack traces, and known workarounds. I am not sure what kind of info could be missing. I'll make sure to drop a comment so that the ticket gets recent activity, @gsmet.
@f4lco Did you observe anything specific? In any case, I don't think that the cause is connected to the scheduler extension.
I am sorry, we did not find anything specific. And I agree: I do not think the trigger of the fault behind the hanging job sits in the scheduler extension. The root cause, something exploding within the app or Quarkus, sits outside of the scheduler extension; prior to the patches, the scheduler extension handled the fallout badly, which resulted in the stuck job.
Nothing I guess ;-). Let's close this one and reopen if needed.
Describe the bug
This is a bit fluffy but I have a Quarkus Scheduler task running as a cron job every 10 minutes. It can run for days or weeks and then just suddenly stop with no trace in the logs. What would be the way to debug it in order to see if it's some sort of resource starvation?
Expected behavior
The scheduler keeps running (or fails with exceptions)
Actual behavior
The scheduler stops without trace
How to Reproduce?
No response
Output of `uname -a` or `ver`
No response
Output of `java -version`
OpenJDK 64-Bit Server VM (build 21.0.1+12-29, mixed mode, sharing)
Quarkus version or git rev
3.6.3
Build tool (ie. output of `mvnw --version` or `gradlew --version`)
No response
Additional information
No response