
Quarkus Scheduler stops without trace #41240

Closed
nicklasweasel opened this issue Jun 17, 2024 · 22 comments
Labels
area/scheduler kind/bug

Comments

@nicklasweasel

Describe the bug

This is a bit fluffy, but I have a Quarkus Scheduler task running as a cron job every 10 minutes. It can run for days or weeks and then suddenly stop, with no trace in the logs. What would be the way to debug this and see whether it's some sort of resource starvation?

Expected behavior

The scheduler keeps running (or fails with exceptions)

Actual behavior

The scheduler stops without trace

How to Reproduce?

No response

Output of uname -a or ver

No response

Output of java -version

OpenJDK 64-Bit Server VM (build 21.0.1+12-29, mixed mode, sharing)

Quarkus version or git rev

3.6.3

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response


quarkus-bot bot commented Jun 17, 2024

/cc @brunobat (opentelemetry,tracing), @manovotn (scheduler), @mkouba (scheduler), @radcortez (opentelemetry,tracing)

@mkouba
Contributor

mkouba commented Jun 17, 2024

Do you use the quarkus-scheduler or the quarkus-quartz extension?

In any case, I would start with a thread dump, taken e.g. with VisualVM. For quarkus-scheduler there's one thread that is used to check all the triggers and then dispatch the scheduled methods: blocking scheduled methods run on the default blocking executor, non-blocking scheduled methods run on the event loop, and blocking scheduled methods annotated with @RunOnVirtualThread run on a new virtual thread.
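
For illustration, a minimal sketch of the three kinds of scheduled methods (class and method names are illustrative, not from this issue):

import io.quarkus.scheduler.Scheduled;
import io.smallrye.common.annotation.RunOnVirtualThread;
import io.smallrye.mutiny.Uni;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class ExampleJobs {

    // Blocking method: dispatched to the default blocking (worker) executor.
    @Scheduled(cron = "0 */10 * * * ?")
    void blockingJob() {
        // do the work
    }

    // Non-blocking method: returning a Uni means it runs on the event loop.
    @Scheduled(every = "10m")
    Uni<Void> nonBlockingJob() {
        return Uni.createFrom().voidItem();
    }

    // Blocking method executed on a new virtual thread.
    @Scheduled(every = "10m")
    @RunOnVirtualThread
    void virtualThreadJob() {
        // do the work
    }
}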

@mkouba
Contributor

mkouba commented Jun 17, 2024

If you enable the TRACE logging for the scheduler:

quarkus.log.category."io.quarkus.scheduler".level=TRACE
quarkus.log.category."io.quarkus.scheduler".min-level=TRACE

You should see the "Check triggers" message and also a separate message for each trigger fired:

2024-06-17 09:21:05,000 TRACE [io.qua.sch.run.SimpleScheduler] (pool-8-thread-1) Check triggers at 2024-06-17T09:21:05.000058709+02:00[Europe/Prague]
2024-06-17 09:20:36,002 TRACE [io.qua.sch.run.SimpleScheduler] (pool-8-thread-1) CronTrigger [id=1_org.acme.scheduler.CounterBean#cronJobWithExpressionInConfig, cron=*/5 * * * * ?, gracePeriod=PT1S, timeZone=null] fired

However, this is not very practical for a production environment.

@nicklasweasel
Author

I use quarkus-scheduler.

And it's in production, running in an AWS Fargate task, so it would be a bit impractical.

@mkouba
Contributor

mkouba commented Jun 19, 2024

And it's in production and running in an AWS Fargate task so it would be a bit impractical

I see, in that case a thread dump might be helpful.
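
If attaching a tool such as VisualVM is not possible, a thread dump can also be captured from inside the JVM. A minimal sketch, assuming you expose it through a temporary endpoint or write it to the log (the class name is illustrative):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public final class ThreadDumps {

    private ThreadDumps() {
    }

    // Collects all live threads with their stack traces as plain text.
    public static String capture() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            out.append('"').append(info.getThreadName()).append("\" ")
                    .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }
}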

@aferreiraguido

Hello, similar issue here running on Google Cloud with a native image; for now we replaced @Scheduled(concurrentExecution = Scheduled.ConcurrentExecution.SKIP, every = "1s") with a while(true) loop in a Thread spawned within the constructor of a class annotated with @Startup (see the sketch below), but we want to go back to a @Scheduled method.
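
For reference, a minimal sketch of what that workaround can look like (class and method names and the 1-second pause are illustrative, not taken from the report):

import io.quarkus.runtime.Startup;
import jakarta.annotation.PreDestroy;
import jakarta.enterprise.context.ApplicationScoped;

@Startup
@ApplicationScoped
public class PollingLoop {

    private volatile boolean stopped;
    private final Thread worker = new Thread(this::run, "polling-loop");

    public PollingLoop() {
        worker.start();
    }

    private void run() {
        while (!stopped) {
            try {
                doWork(); // the logic that used to live in the @Scheduled method
                Thread.sleep(1_000L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            } catch (Exception e) {
                // log and keep looping so a single failure cannot stop the loop
            }
        }
    }

    void doWork() {
        // do the work
    }

    @PreDestroy
    void stop() {
        stopped = true;
        worker.interrupt();
    }
}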

@mkouba
Contributor

mkouba commented Jul 11, 2024

So the best thing you could do in such a situation is to create a thread dump (as mentioned in #41240 (comment)). In Quarkus 3.12+, there should be a thread with the quarkus-scheduler-trigger-check- prefix in its name (previously it was called something like pool-8-thread-1). This thread is used to check all triggers periodically.

It could be that the dedicated executor that runs the quarkus-scheduler-trigger-check- thread was terminated for some reason, or that the default blocking executor, which is used to execute blocking @Scheduled methods, is not able to run the submitted task. In either case, we would need more information.

@f4lco

f4lco commented Jul 24, 2024

We took thread dumps and heap dumps of three instances of our service. Each instance contains a single scheduled method with a trigger every ten seconds, yet the method did not execute for a week. Prior to that, the service had 4 additional days of uptime before the scheduler stopped working.

Threaddump

Running on Quarkus 3.8.5, the scheduler thread follows the pool-x-thread-y naming convention, and indeed we find a single thread with "pool" in its name:

"pool-5-thread-1" #29 prio=5 os_prio=0 cpu=50980.16ms elapsed=1120078.76s tid=0x000078e5582d08a0 nid=0x20 waiting on condition  [0x000078e540f0d000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base/Native Method)
    - parking to wait for  <0x00000000f60f82f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.base/LockSupport.java:252)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base/AbstractQueuedSynchronizer.java:1674)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base/ScheduledThreadPoolExecutor.java:1182)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base/ScheduledThreadPoolExecutor.java:899)
    at java.util.concurrent.ThreadPoolExecutor.getTask(java.base/ThreadPoolExecutor.java:1062)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1122)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:635)
    at java.lang.Thread.run(java.base/Thread.java:840)

Heapdump

Analysing the heap dump, we find that Quarkus never reset the running flag of the SkipConcurrentExecutionInvoker. The Quarkus scheduler assumed that the scheduled method had been executing for more than a week straight, which is unrealistic given that the scheduled method pulls at most 50 records from a database and performs an insignificant amount of work on each record.
We find more evidence when we look at the instance of IntervalTrigger. The interval trigger also has its "running" flag set to true, and a "start" date set to one week ago. Further, "lastFireTime" contains a current timestamp, which means the trigger did fire; however, the ConcurrentExecution.SKIP policy prevented the scheduler from creating a second invocation of the scheduled method next to the one it assumed was still executing.

Source

Looking at the source, we assume that something went "really wrong":

} catch (RuntimeException e) {
    // Just terminate the context and rethrow the exception if something goes really wrong
    requestContext.terminate();
    throw e;
} finally {

which means that DefaultInvoker throws an exception rather than creating a failed CompletionStage.
Without the CompletionStage, the SkipConcurrentExecutionInvoker cannot reset its "running" flag, because there is no whenComplete:

if (running.compareAndSet(false, true)) {
    return delegate.invoke(execution).whenComplete((r, t) -> running.set(false));
}

For the same reason, the StatusEmitterInvoker fails to produce a log statement containing the thrown exception, because there is no completion stage for the whenComplete:

return delegate.invoke(execution).whenComplete((v, t) -> {
    if (t != null) {
        LOG.errorf(t, "Error occurred while executing task for trigger %s", execution.getTrigger());
This behavior renders the comment on SimpleScheduler$ScheduledTask invalid.

Conclusio

From our perspective,

  • an exception - from Quarkus or the application - must not halt the scheduled method indefinitely
  • the application logs must contain any exception from the scheduled method
  • the application logs must contain evidence why the scheduled method failed to dispatch

We assume that the underlying exception might come from Hibernate Reactive, possibly because the service reuses the Hibernate session object across different threads; the service also logs that exception outside the scheduled method. But that's highly speculative.

I'll rest my case 🙏

Perhaps the maintainers can weigh in on the following questions:

  • is the analysis plausible? I mean, the main part stems from reading the source.
  • why would DefaultInvoker only check for RuntimeException? Are we sure that a checked exception will never propagate?
  • is creating a failed completion stage in DefaultInvoker a plausible fix?
  • why would DefaultInvoker rethrow the exception in the first place? Is there an advantage to the current approach?

If the analysis is correct, the bug affects at least Quarkus 3.8.5 through 3.12.3. Our workaround consists of a new health check that fails the liveness probe if the scheduled method has not executed for 15 minutes, which then restarts Quarkus (sketched below).
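
For reference, a minimal sketch of such a liveness check, assuming quarkus-smallrye-health is on the classpath (class name, check name, and the 15-minute threshold are illustrative):

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;
import org.eclipse.microprofile.health.Liveness;

import jakarta.enterprise.context.ApplicationScoped;

@Liveness
@ApplicationScoped
public class SchedulerHeartbeatCheck implements HealthCheck {

    private final AtomicReference<Instant> lastRun = new AtomicReference<>(Instant.now());

    // The scheduled method calls this at the end of every successful execution.
    public void recordExecution() {
        lastRun.set(Instant.now());
    }

    @Override
    public HealthCheckResponse call() {
        Duration sinceLastRun = Duration.between(lastRun.get(), Instant.now());
        boolean up = sinceLastRun.compareTo(Duration.ofMinutes(15)) < 0;
        return HealthCheckResponse.builder()
                .name("scheduled-job-heartbeat")
                .status(up)
                .withData("secondsSinceLastRun", sinceLastRun.toSeconds())
                .build();
    }
}

With a Kubernetes liveness probe pointed at the default /q/health/live endpoint, the container is restarted once the heartbeat goes stale.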

@mkouba
Contributor

mkouba commented Jul 25, 2024

@f4lco Thanks for your great analysis! Indeed, it looks like a bug in the SkipConcurrentExecutionInvoker/DefaultInvoker. I will verify the fix and prepare a PR.

why would DefaultInvoker only check for RuntimeException? Are we sure that a checked exception will never propagate?

That's a good point. I think that we should catch Throwable and return a failed stage.

Our workaround consists of a new health check, that fails the liveness probe if the scheduled method failed to execute for 15min, and then restarts Quarkus.

Another workaround is to catch the exception in the scheduled method, i.e. never throw an exception. You could also implement a CDI interceptor to catch the exception.
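
A minimal sketch of the first workaround, catching everything inside the scheduled method so no exception can escape the invoker (names are illustrative):

import io.quarkus.scheduler.Scheduled;
import jakarta.enterprise.context.ApplicationScoped;
import org.jboss.logging.Logger;

@ApplicationScoped
public class GuardedJob {

    private static final Logger LOG = Logger.getLogger(GuardedJob.class);

    @Scheduled(every = "10m", concurrentExecution = Scheduled.ConcurrentExecution.SKIP)
    void run() {
        try {
            doWork();
        } catch (Exception e) {
            // Log and swallow so a failure cannot leave the SKIP invoker stuck.
            LOG.error("Scheduled job failed", e);
        }
    }

    void doWork() {
        // do the work
    }
}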

@nicklasweasel Does your application contain @Scheduled(concurrentExecution = SKIP)?

@mkouba
Contributor

mkouba commented Jul 25, 2024


@f4lco Actually, I looked at the generated body of the DefaultInvoker#invokeBean() method and we do catch Throwable and return a failed stage:

public CompletionStage invokeBean(ScheduledExecution var1) {
      try {
         ArcContainer var2 = Arc.container();
         InjectableBean var3 = var2.bean("hHeTHx7NF2PL8PrZVcnWUI_dX80");
         ((CounterBean)var2.instance(var3).get()).cronJobWithExpressionInConfig();
         return CompletableFuture.completedStage((Object)null);
      } catch (Throwable var5) {
         return CompletableFuture.failedStage(var5);
      }
   }

So an exception thrown from a scheduled method should not be a problem. I've tried to modify our ConcurrentExecutionSkipTest: the StatusEmitterInvoker logs the failure and the SkipConcurrentExecutionInvoker works correctly.

@mkouba
Contributor

mkouba commented Jul 25, 2024

In any case, I've created #42127 to make the Invoker's logic more robust. We should at least see the exception stack if something goes really wrong ;-).

@f4lco

f4lco commented Jul 25, 2024

So an exception thrown from a scheduled method should not be a problem.

Exactly, I was going to comment the same; that's my understanding, too. That's the part I do not understand: apparently there is more code sitting between DefaultInvoker and the scheduled method which can throw. Without logging the error, we are flying blind; if I can get hold of the masked fault, I'll report back.

Thank you so much for the efforts on #42127 @mkouba ❤️

barreiro pushed a commit to barreiro/quarkus that referenced this issue Jul 25, 2024
holly-cummins pushed a commit to holly-cummins/quarkus that referenced this issue Jul 31, 2024
gsmet pushed a commit to gsmet/quarkus that referenced this issue Aug 1, 2024
@f4lco

f4lco commented Aug 8, 2024

@mkouba we were unable to collect hard evidence on the exception that causes the scheduler to halt.
However, we deployed an isolated changeset, which only contained a workaround for #23324, and the service (and the scheduled job) has run without a hiccup since then.
In our case, the workaround replaced the bugged delete-with-entity implementation from Reactive Hibernate with a simpler delete-by-id SQL query. As a result, the "org.hibernate.HibernateException: java.lang.IllegalStateException: HR000069" went away: #23324 (comment)
Wonder if it helps 🤷‍♂️

@mkouba
Contributor

mkouba commented Aug 9, 2024

@mkouba we were unable to collect hard evidence on the exception that causes the scheduler to halt. However, we deployed an isolated changeset, which only contained a workaround solution for #23324, and the service (and scheduled job) ran without hiccup since then. In our case, the workaround replaced the bugged delete-with-entity implementation from Reactive Hibernate with a simpler delete-by-id SQL query. As a result, the "org.hibernate.HibernateException: java.lang.IllegalStateException: HR000069" went away: #23324 (comment) wonder if it helps 🤷‍♂️

Thanks. Let us know if you see something suspicious.

@gsmet
Member

gsmet commented Aug 9, 2024

the workaround replaced the bugged delete-with-entity implementation from Reactive Hibernate

@f4lco sorry to drop into the conversation but is it something reported to the Hibernate Reactive project and properly identified? Thanks!

@aferreiraguido

Hello guys, in our case the solution was to implement a synchronized queue and allow concurrent @Scheduled executions to consume from that queue, checking each item in and out (see the sketch below). We observe that some executions are left in a WAITING state, and after some time we simply verify whether the original request was committed to the data store or not. Unfortunately, we can't extract dumps or connect to the JVM because the Kubernetes cluster is neither managed nor controlled by us. Just to let you know our workaround. Thanks a lot.
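
For reference, a rough sketch of one way such a queue-based workaround can look; this is only one reading of the description above, and the names, the 1-second period, and the single-item drain are illustrative:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import io.quarkus.scheduler.Scheduled;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class QueueDrainingJob {

    private final BlockingQueue<Runnable> work = new LinkedBlockingQueue<>();

    // Producers enqueue work items instead of running them inline.
    public void submit(Runnable task) {
        work.offer(task);
    }

    // Concurrent executions are allowed; each drains at most one item,
    // so one stuck execution cannot block the rest of the queue.
    @Scheduled(every = "1s")
    void drainOne() {
        Runnable task = work.poll();
        if (task != null) {
            task.run();
        }
    }
}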

@gsmet
Member

gsmet commented Aug 9, 2024

@aferreiraguido could you by any chance dump the stack in the logs?

@aferreiraguido

Hey @gsmet, we will try our best; unfortunately, we are in a platform freeze, but once we get anything, we'll drop it over here. Please also note that we are now on 2.6.0 and GraalVM 21.3, also awaiting a platform update to newer versions.

@f4lco

f4lco commented Aug 9, 2024

the workaround replaced the bugged delete-with-entity implementation from Reactive Hibernate

@f4lco sorry to drop into the conversation but is it something reported to the Hibernate Reactive project and properly identified? Thanks!

Well, #23324 is an open bug report, with detailed descriptions, stack traces, and known workarounds. I am not sure what kind of info could be missing. I'll make sure to drop a comment so that the ticket gets recent activity, @gsmet.

@mkouba
Contributor

mkouba commented Sep 24, 2024

if I can get hold of the masked fault, I'll report back.

@f4lco Did you observe anything specific?

In any case, I don't think that the cause is connected to the scheduler extension.

@f4lco

f4lco commented Oct 22, 2024

if I can get hold of the masked fault, I'll report back.

@f4lco Did you observe anything specific?

In any case, I don't think that the cause is connected to the scheduler extension.

I am sorry, we did not find anything specific. Yes, I can follow: I do not think the fault that triggers the hanging job sits in the scheduler extension. The root cause, something exploding within the app or Quarkus, sits outside of the scheduler extension, and prior to the patches the scheduler extension handled the fallout badly, which resulted in the stuck job.
Now that the patches are in, what would be missing to close this ticket?

@mkouba
Contributor

mkouba commented Oct 22, 2024

Now that the patches are in, what would be missing to close this ticket?

Nothing I guess ;-). Let's close this one and reopen if needed.

@mkouba mkouba closed this as completed Oct 22, 2024