
Quarkus Scheduler stops without trace #41240

Closed
nicklasweasel opened this issue Jun 17, 2024 · 22 comments
Labels
area/scheduler kind/bug

Comments

@nicklasweasel

Describe the bug

This is a bit fluffy, but I have a Quarkus Scheduler task running as a cron job every 10 minutes. It can run for days or weeks and then suddenly stop, with no trace in the logs. What would be the way to debug this and see whether it's some sort of resource starvation?

Expected behavior

The scheduler keeps running (or fails with exceptions)

Actual behavior

The scheduler stops without trace

How to Reproduce?

No response

Output of uname -a or ver

No response

Output of java -version

OpenJDK 64-Bit Server VM (build 21.0.1+12-29, mixed mode, sharing)

Quarkus version or git rev

3.6.3

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response


quarkus-bot bot commented Jun 17, 2024

/cc @brunobat (opentelemetry,tracing), @manovotn (scheduler), @mkouba (scheduler), @radcortez (opentelemetry,tracing)

@mkouba
Contributor

mkouba commented Jun 17, 2024

Do you use the quarkus-scheduler or the quarkus-quartz extension?

In any case, I would start with a thread dump, taken e.g. with VisualVM. For quarkus-scheduler there's one thread that is used to check all the triggers and then dispatch the scheduled methods: blocking scheduled methods run on the default blocking executor, non-blocking scheduled methods run on the event loop, and blocking scheduled methods annotated with @RunOnVirtualThread run on a new virtual thread.
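
For illustration, a minimal sketch of the three kinds of scheduled methods (class and method names are illustrative, not from this issue):

import io.quarkus.scheduler.Scheduled;
import io.smallrye.common.annotation.RunOnVirtualThread;
import io.smallrye.mutiny.Uni;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class ExampleJobs {

    // Blocking method: dispatched to the default blocking (worker) executor.
    @Scheduled(cron = "0 */10 * * * ?")
    void blockingJob() {
        // do the work
    }

    // Non-blocking method: returning a Uni means it runs on the event loop.
    @Scheduled(every = "10m")
    Uni<Void> nonBlockingJob() {
        return Uni.createFrom().voidItem();
    }

    // Blocking method executed on a new virtual thread.
    @Scheduled(every = "10m")
    @RunOnVirtualThread
    void virtualThreadJob() {
        // do the work
    }
}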

@mkouba
Contributor

mkouba commented Jun 17, 2024

If you enable the TRACE logging for the scheduler:

quarkus.log.category."io.quarkus.scheduler".level=TRACE
quarkus.log.category."io.quarkus.scheduler".min-level=TRACE

You should see the "Check triggers" message and also a separate message for each trigger fired:

2024-06-17 09:21:05,000 TRACE [io.qua.sch.run.SimpleScheduler] (pool-8-thread-1) Check triggers at 2024-06-17T09:21:05.000058709+02:00[Europe/Prague]
2024-06-17 09:20:36,002 TRACE [io.qua.sch.run.SimpleScheduler] (pool-8-thread-1) CronTrigger [id=1_org.acme.scheduler.CounterBean#cronJobWithExpressionInConfig, cron=*/5 * * * * ?, gracePeriod=PT1S, timeZone=null] fired

However, this is not very practical for a production environment.

@nicklasweasel
Author

I use quarkus-scheduler.

And it's in production, running in an AWS Fargate task, so it would be a bit impractical.

@mkouba
Contributor

mkouba commented Jun 19, 2024

And it's in production and running in an AWS Fargate task so it would be a bit impractical

I see, in that case a thread dump might be helpful.
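
If attaching a tool such as VisualVM is not possible, a thread dump can also be captured from inside the JVM. A minimal sketch, assuming you expose it through a temporary endpoint or write it to the log (the class name is illustrative):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public final class ThreadDumps {

    private ThreadDumps() {
    }

    // Collects all live threads with their stack traces as plain text.
    public static String capture() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            out.append('"').append(info.getThreadName()).append("\" ")
                    .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }
}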

@aferreiraguido

Hello, similar issue here running on Google Cloud with a native image; for now we replaced @Scheduled(concurrentExecution = Scheduled.ConcurrentExecution.SKIP, every = "1s") with a while(true) loop in a Thread spawned within the constructor of a class annotated with @Startup (see the sketch below), but we want to go back to a @Scheduled method.
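
For reference, a minimal sketch of what that workaround can look like (class and method names and the 1-second pause are illustrative, not taken from the report):

import io.quarkus.runtime.Startup;
import jakarta.annotation.PreDestroy;
import jakarta.enterprise.context.ApplicationScoped;

@Startup
@ApplicationScoped
public class PollingLoop {

    private volatile boolean stopped;
    private final Thread worker = new Thread(this::run, "polling-loop");

    public PollingLoop() {
        worker.start();
    }

    private void run() {
        while (!stopped) {
            try {
                doWork(); // the logic that used to live in the @Scheduled method
                Thread.sleep(1_000L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            } catch (Exception e) {
                // log and keep looping so a single failure cannot stop the loop
            }
        }
    }

    void doWork() {
        // do the work
    }

    @PreDestroy
    void stop() {
        stopped = true;
        worker.interrupt();
    }
}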

@mkouba
Contributor

mkouba commented Jul 11, 2024

So the best thing you could do in such a situation is to create a thread dump (as mentioned in #41240 (comment)). In Quarkus 3.12+, there should be a thread with the quarkus-scheduler-trigger-check- prefix in its name (previously it was called something like pool-8-thread-1). This thread is used to check all triggers periodically.

It could be that the dedicated executor that runs the quarkus-scheduler-trigger-check- thread was terminated for some reason, or that the default blocking executor, which is used to execute blocking @Scheduled methods, is not able to run the submitted task. In either case, we would need more information.

@f4lco

f4lco commented Jul 24, 2024

We took thread dumps and heap dumps of three instances of our service. Each instance contains a single scheduled method with a trigger every ten seconds, yet the method did not execute for a week. Prior to that, the service had 4 additional days of uptime before the scheduler stopped working.

Threaddump

Running on Quarkus 3.8.5, the scheduler thread follows the pool-x-thread-y naming convention, and indeed we find a single thread with "pool" in its name:

"pool-5-thread-1" #29 prio=5 os_prio=0 cpu=50980.16ms elapsed=1120078.76s tid=0x000078e5582d08a0 nid=0x20 waiting on condition  [0x000078e540f0d000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base/Native Method)
    - parking to wait for  <0x00000000f60f82f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.base/LockSupport.java:252)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base/AbstractQueuedSynchronizer.java:1674)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base/ScheduledThreadPoolExecutor.java:1182)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base/ScheduledThreadPoolExecutor.java:899)
    at java.util.concurrent.ThreadPoolExecutor.getTask(java.base/ThreadPoolExecutor.java:1062)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1122)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:635)
    at java.lang.Thread.run(java.base/Thread.java:840)

Heapdump

Analysing the heap dump, we find that Quarkus never reset the running flag of the SkipConcurrentExecutionInvoker. The Quarkus scheduler assumed that the scheduled method had been executing for more than a week straight, which is unrealistic given that the scheduled method pulls at most 50 records from a database and performs an insignificant amount of work on each record.
We find more evidence when we look at the instance of IntervalTrigger. The interval trigger also has its "running" flag set to true, and a "start" date set to one week ago. Further, "lastFireTime" contains a current timestamp, which means the trigger did fire; however, the ConcurrentExecution.SKIP policy prevented the scheduler from creating a second invocation of the scheduled method next to the one it assumed was still executing.

Source

Looking at the source, we assume that something went "really wrong":

} catch (RuntimeException e) {
    // Just terminate the context and rethrow the exception if something goes really wrong
    requestContext.terminate();
    throw e;
} finally {

which means that DefaultInvoker throws an exception rather than creating a failed CompletionStage.
Without the CompletionStage, the SkipConcurrentExecutionInvoker cannot reset its "running" flag, because there is no whenComplete:

if (running.compareAndSet(false, true)) {
    return delegate.invoke(execution).whenComplete((r, t) -> running.set(false));
}

For the same reason, the StatusEmitterInvoker fails to produce a log statement containing the thrown exception, because there is no completion stage for the whenComplete:

return delegate.invoke(execution).whenComplete((v, t) -> {
    if (t != null) {
        LOG.errorf(t, "Error occurred while executing task for trigger %s", execution.getTrigger());
This behavior renders the comment on SimpleScheduler$ScheduledTask invalid.

Conclusio

From our perspective,

  • an exception - from Quarkus or the application - must not halt the scheduled method indefinitely
  • the application logs must contain any exception from the scheduled method
  • the application logs must contain evidence why the scheduled method failed to dispatch

We assume that the underlying exception might come from Hibernate Reactive, possibly because the service reuses the Hibernate session object across different threads; the service also logs that exception outside the scheduled method. But that's highly speculative.

I'll rest my case 🙏

Perhaps the maintainers can weigh in on the following questions:

  • is the analysis plausible? I mean, the main part stems from reading the source.
  • why would DefaultInvoker only check for RuntimeException? Are we sure that a checked exception will never propagate?
  • is creating a failed completion stage in DefaultInvoker a plausible fix?
  • why would DefaultInvoker rethrow the exception in the first place? Is there an advantage to the current approach?

If the analysis is correct, the bug affects at least Quarkus 3.8.5 through 3.12.3. Our workaround consists of a new health check that fails the liveness probe if the scheduled method has not executed for 15 minutes, which then restarts Quarkus (sketched below).
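
For reference, a minimal sketch of such a liveness check, assuming quarkus-smallrye-health is on the classpath (class name, check name, and the 15-minute threshold are illustrative):

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;
import org.eclipse.microprofile.health.Liveness;

import jakarta.enterprise.context.ApplicationScoped;

@Liveness
@ApplicationScoped
public class SchedulerHeartbeatCheck implements HealthCheck {

    private final AtomicReference<Instant> lastRun = new AtomicReference<>(Instant.now());

    // The scheduled method calls this at the end of every successful execution.
    public void recordExecution() {
        lastRun.set(Instant.now());
    }

    @Override
    public HealthCheckResponse call() {
        Duration sinceLastRun = Duration.between(lastRun.get(), Instant.now());
        boolean up = sinceLastRun.compareTo(Duration.ofMinutes(15)) < 0;
        return HealthCheckResponse.builder()
                .name("scheduled-job-heartbeat")
                .status(up)
                .withData("secondsSinceLastRun", sinceLastRun.toSeconds())
                .build();
    }
}

With a Kubernetes liveness probe pointed at the default /q/health/live endpoint, the container is restarted once the heartbeat goes stale.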

@mkouba
Contributor

mkouba commented Jul 25, 2024

@f4lco Thanks for your great analysis! Indeed, it looks like a bug in the SkipConcurrentExecutionInvoker/DefaultInvoker. I will verify the fix and prepare a PR.

why would DefaultInvoker only check for RuntimeException? Are we sure that a checked exception will never propagate?

That's a good point. I think that we should catch Throwable and return a failed stage.

Our workaround consists of a new health check, that fails the liveness probe if the scheduled method failed to execute for 15min, and then restarts Quarkus.

Another workaround is to catch the exception in the scheduled method, i.e. never throw an exception. You could also implement a CDI interceptor to catch the exception.
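
A minimal sketch of the first workaround, catching everything inside the scheduled method so no exception can escape the invoker (names are illustrative):

import io.quarkus.scheduler.Scheduled;
import jakarta.enterprise.context.ApplicationScoped;
import org.jboss.logging.Logger;

@ApplicationScoped
public class GuardedJob {

    private static final Logger LOG = Logger.getLogger(GuardedJob.class);

    @Scheduled(every = "10m", concurrentExecution = Scheduled.ConcurrentExecution.SKIP)
    void run() {
        try {
            doWork();
        } catch (Exception e) {
            // Log and swallow so a failure cannot leave the SKIP invoker stuck.
            LOG.error("Scheduled job failed", e);
        }
    }

    void doWork() {
        // do the work
    }
}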

@nicklasweasel Does your application contain @Scheduled(concurrentExecution = SKIP)?

@mkouba
Contributor

mkouba commented Jul 25, 2024


@f4lco Actually, I looked at the generated body of the DefaultInvoker#invokeBean() method and we do catch Throwable and return a failed stage:

public CompletionStage invokeBean(ScheduledExecution var1) {
      try {
         ArcContainer var2 = Arc.container();
         InjectableBean var3 = var2.bean("hHeTHx7NF2PL8PrZVcnWUI_dX80");
         ((CounterBean)var2.instance(var3).get()).cronJobWithExpressionInConfig();
         return CompletableFuture.completedStage((Object)null);
      } catch (Throwable var5) {
         return CompletableFuture.failedStage(var5);
      }
   }

So an exception thrown from a scheduled method should not be a problem. I've tried to modify our ConcurrentExecutionSkipTest: the StatusEmitterInvoker logs the failure and the SkipConcurrentExecutionInvoker works correctly.

@mkouba
Contributor

mkouba commented Jul 25, 2024

In any case, I've created #42127 to make the Invoker's logic more robust. We should at least see the exception stack if something goes really wrong ;-).

@f4lco

f4lco commented Jul 25, 2024

So an exception thrown from a scheduled method should not be a problem.

Exactly, I was going to comment the same; that's my understanding, too. That's the part I do not understand: apparently there is more code sitting between DefaultInvoker and the scheduled method which can throw. Without logging the error, we are flying blind; if I can get hold of the masked fault, I'll report back.

Thank you so much for the efforts on #42127 @mkouba ❤️

barreiro pushed a commit to barreiro/quarkus that referenced this issue Jul 25, 2024
holly-cummins pushed a commit to holly-cummins/quarkus that referenced this issue Jul 31, 2024
gsmet pushed a commit to gsmet/quarkus that referenced this issue Aug 1, 2024
@f4lco

f4lco commented Aug 8, 2024

@mkouba we were unable to collect hard evidence on the exception that causes the scheduler to halt.
However, we deployed an isolated changeset, which only contained a workaround for #23324, and the service (and the scheduled job) has run without a hiccup since then.
In our case, the workaround replaced the bugged delete-with-entity implementation from Reactive Hibernate with a simpler delete-by-id SQL query. As a result, the "org.hibernate.HibernateException: java.lang.IllegalStateException: HR000069" went away: #23324 (comment)
Wonder if it helps 🤷‍♂️

@mkouba
Contributor

mkouba commented Aug 9, 2024

@mkouba we were unable to collect hard evidence on the exception that causes the scheduler to halt. However, we deployed an isolated changeset, which only contained a workaround solution for #23324, and the service (and scheduled job) ran without hiccup since then. In our case, the workaround replaced the bugged delete-with-entity implementation from Reactive Hibernate with a simpler delete-by-id SQL query. As a result, the "org.hibernate.HibernateException: java.lang.IllegalStateException: HR000069" went away: #23324 (comment) wonder if it helps 🤷‍♂️

Thanks. Let us know if you see something suspicious.

@gsmet
Member

gsmet commented Aug 9, 2024

the workaround replaced the bugged delete-with-entity implementation from Reactive Hibernate

@f4lco sorry to drop into the conversation but is it something reported to the Hibernate Reactive project and properly identified? Thanks!

@aferreiraguido

Hello guys, in our case the solution was to implement a synchronized queue and allow concurrent @Scheduled executions to consume from that queue, checking each item in and out (see the sketch below). We observe that some executions are left in a WAITING state, and after some time we simply verify whether the original request was committed to the data store or not. Unfortunately, we can't extract dumps or connect to the JVM because the Kubernetes cluster is neither managed nor controlled by us. Just to let you know our workaround. Thanks a lot.
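
For reference, a rough sketch of one way such a queue-based workaround can look; this is only one reading of the description above, and the names, the 1-second period, and the single-item drain are illustrative:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import io.quarkus.scheduler.Scheduled;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class QueueDrainingJob {

    private final BlockingQueue<Runnable> work = new LinkedBlockingQueue<>();

    // Producers enqueue work items instead of running them inline.
    public void submit(Runnable task) {
        work.offer(task);
    }

    // Concurrent executions are allowed; each drains at most one item,
    // so one stuck execution cannot block the rest of the queue.
    @Scheduled(every = "1s")
    void drainOne() {
        Runnable task = work.poll();
        if (task != null) {
            task.run();
        }
    }
}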

@gsmet
Member

gsmet commented Aug 9, 2024

@aferreiraguido could you by any chance dump the stack in the logs?

@aferreiraguido

Hey @gsmet, we will try our best; unfortunately, we are in a platform freeze, but once we get anything, we'll drop it over here. Please also note that we are now on 2.6.0 and GraalVM 21.3, also awaiting a platform update to newer versions.

@f4lco

f4lco commented Aug 9, 2024

the workaround replaced the bugged delete-with-entity implementation from Reactive Hibernate

@f4lco sorry to drop into the conversation but is it something reported to the Hibernate Reactive project and properly identified? Thanks!

Well, #23324 is an open bug report, with detailed descriptions, stack traces, and known workarounds. I am not sure what kind of info could be missing. I'll make sure to drop a comment so that the ticket gets recent activity, @gsmet.

@mkouba
Contributor

mkouba commented Sep 24, 2024

if I can get hold of the masked fault, I'll report back.

@f4lco Did you observe anything specific?

In any case, I don't think that the cause is connected to the scheduler extension.

@f4lco

f4lco commented Oct 22, 2024

if I can get hold of the masked fault, I'll report back.

@f4lco Did you observe anything specific?

In any case, I don't think that the cause is connected to the scheduler extension.

I am sorry, we did not find anything specific. Yes, I can follow: I do not think the fault that triggers the hanging job sits in the scheduler extension. The root cause, something exploding within the app or Quarkus, sits outside of the scheduler extension, and prior to the patches the scheduler extension handled the fallout badly, which resulted in the stuck job.
Now that the patches are in, what would be missing to close this ticket?

@mkouba
Contributor

mkouba commented Oct 22, 2024

Now that the patches are in, what would be missing to close this ticket?

Nothing I guess ;-). Let's close this one and reopen if needed.

@mkouba mkouba closed this as completed Oct 22, 2024