W-16941297: Scatter Gather timeout exception #14192

Open
wants to merge 20 commits into base: master

Conversation

@anandkarandikar (Contributor) commented Jan 31, 2025

Ticket

W-16941297

Cause

When Scatter Gather times out, the db connections opened by its routes stay open and the cleanup is never invoked. Events need to complete in order for their cleanup jobs to be called. In the customer scenario, a <db:select ... /> running SELECT SLEEP(10) never completes as an Event, so the database connection is never disposed of.

Other ideas

StreamingGhostBuster was considered as something that might already handle this, since its intention is to clean up unclosed streams and their related CursorStreamProvider objects. However, in this situation the reference is a strong reference, so StreamingGhostBuster wouldn't work.
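
To illustrate why a strong reference defeats this kind of GC-driven cleanup, here is a minimal generic Java sketch (not Mule code; it only assumes that StreamingGhostBuster reacts to references becoming unreachable, as described above):

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class GhostBusterSketch {
  public static void main(String[] args) throws InterruptedException {
    ReferenceQueue<Object> queue = new ReferenceQueue<>();
    Object stream = new Object();                          // stands in for the unclosed stream/cursor
    WeakReference<Object> tracked = new WeakReference<>(stream, queue);

    System.gc();
    // While 'stream' is strongly reachable, the reference is never enqueued,
    // so a queue-driven cleaner never gets a chance to run:
    System.out.println("cleanable while strongly held: " + (queue.poll() != null)); // false

    stream = null;                                         // drop the strong reference
    System.gc();                                           // GC is only a hint; usually enough here
    Thread.sleep(100);
    System.out.println("cleanable after release: " + (queue.poll() != null));       // typically true
  }
}

Because the timed-out route still holds a strong reference, a reachability-based cleaner never fires, which is why the fix completes the child contexts explicitly instead.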

Fix

  • For timeout handling, we work with the events that timed out and call .error(...) for those events.
  • Calling .error(...) without the Scheduler pool caused Scatter Gather to wait until the longest SLEEP(n) completed.
  • To alleviate this behavior, this change uses timeoutScheduler, making Scatter Gather time out as expected and present the composite routing exception messages to the user immediately while the SELECT SLEEP(n) queries continue to execute; once those complete, the .error(...) calls submitted to timeoutScheduler run.
  • The timeoutScheduler created as a cpuLightScheduler is incapable of handling nested Scatter Gathers or a large number of routes. This fix was tested with 70 routes, almost all of them timing out; changing to ioScheduler handled this scaling issue. (A condensed sketch of the resulting code follows this list.)
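
The sketch below condenses the change for readability. It reuses the names from the diffs in this PR (timeoutScheduler, schedulerService, pair, error); the surrounding context is assumed for illustration and is not the exact runtime code.

// Scheduler creation: an I/O pool instead of the CPU-light pool (see the diff below).
timeoutScheduler = schedulerService.ioScheduler(SchedulerConfig.config());

// Timeout handling: one error-completion task per timed-out child context, so the
// composite routing exception is returned to the caller right away while the slow
// routes finish and their resources (db connections, streams) are disposed.
EventContext context = pair.getFirst().getContext();
if (context instanceof AbstractEventContext) {
  ((AbstractEventContext) context).forEachChild(ctx -> timeoutScheduler
      .submit(() -> ctx.error(new MessagingException(pair.getFirst(), error.get().getCause()))));
}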

Test Coverage

  • There are existing tests in org/mule/runtime/core/internal/routing/forkjoin that exercise timeout events being raised.
  • A timeout scenario with Scatter Gather is also covered using test-extensions (marvel-extension) to mimic the delayed scenario. This is a similar approach to W-16941297: SG timeout issue mule-integration-tests#2634, but without needing the actual database in the picture, because we need to ensure that the underlying streams are closed.

- timeoutScheduler = schedulerService.cpuLightScheduler(SchedulerConfig.config()
+ timeoutScheduler = schedulerService.ioScheduler(SchedulerConfig.config()

Contributor

I would need a good justification for this change.
The tasks being submitted to that scheduler are better described as "CPU light" rather than "I/O intensive". If this change is needed for something to work, we definitely need to understand why, because at first glance it doesn't make sense.

Contributor Author

Would having an underlying db connection count as I/O?
We noticed that cpuLightScheduler doesn't work when Scatter Gather contains a lot of routes.

Contributor

I don't think that matters. What matters is what you are doing in the tasks that you submit to the scheduler, for example whether the tasks sleep or block on I/O a lot.
In this case I think the problem is that you are submitting just too many tasks, beyond the estimated capacity for the pool type.

@anandkarandikar (Contributor Author) commented Feb 6, 2025

That's possible. I've been scaling the SG with nested SGs, which may not really be the actual customer scenario. Given that the bound depends on the number of cores, it's possible my laptop wasn't able to handle that many CPU-bound tasks.

@@ -146,7 +146,7 @@ private void handleTimeoutExceptionIfPresent(Scheduler timeoutScheduler,
   EventContext context = pair.getFirst().getContext();
   if (context instanceof AbstractEventContext) {
     ((AbstractEventContext) context).forEachChild(ctx -> timeoutScheduler
-        .submit(() -> ctx.error(error.get().getCause())));
+        .submit(() -> ctx.error(new MessagingException(pair.getFirst(), error.get().getCause()))));

@anandkarandikar (Contributor Author) commented Feb 3, 2025

Despite working fine, the following failures appeared in the log:

ERROR 2025-01-28 17:26:29,170 [pool-5-thread-2] [processor: ; event: ] org.mule.runtime.core.privileged.processor.MessageProcessors: Uncaught exception in childContextResponseHandler
java.lang.ClassCastException: class java.util.concurrent.TimeoutException cannot be cast to class org.mule.runtime.core.privileged.exception.MessagingException (java.util.concurrent.TimeoutException is in module java.base of loader 'bootstrap'; org.mule.runtime.core.privileged.exception.MessagingException is in module [email protected] of loader jdk.internal.loader.Loader @10cc327a)
	at [email protected]/org.mule.runtime.core.privileged.processor.MessageProcessors.lambda$childContextResponseHandler$14(MessageProcessors.java:582) ~[mule-core-4.9.0-20241025.jar:?]
	at [email protected]/org.mule.runtime.core.internal.event.AbstractEventContext.signalConsumerSilently(AbstractEventContext.java:310) ~[?:?]
	at [email protected]/org.mule.runtime.core.internal.event.AbstractEventContext.receiveResponse(AbstractEventContext.java:210) ~[?:?]
	at [email protected]/org.mule.runtime.core.internal.event.AbstractEventContext.error(AbstractEventContext.java:189) ~[?:?]
	at [email protected]/org.mule.runtime.core.internal.routing.forkjoin.AbstractForkJoinStrategyFactory.lambda$handleTimeoutExceptionIfPresent$6(AbstractForkJoinStrategyFactory.java:173) ~[?:?]

Therefore, a MessagingException instance is created to wrap the cause.
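
For illustration only, the failure and the fix boil down to the sketch below; the handler's expectation is inferred from the stack trace, and event is hypothetical shorthand for pair.getFirst() from the diff above.

// Before the fix the child context was completed with the bare cause:
//   ctx.error(error.get().getCause());   // a java.util.concurrent.TimeoutException
// childContextResponseHandler then casts the signalled Throwable to MessagingException,
// which fails for the raw TimeoutException, as shown in the log above.
Throwable timeout = new TimeoutException("Scatter Gather route timed out");
// With the fix, the cause is wrapped first, so the expected type is satisfied and the
// original timeout is preserved as the cause ('event' stands in for pair.getFirst()):
MessagingException wrapped = new MessagingException(event, timeout);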

@anandkarandikar (Contributor Author)

--validate

Comment on lines +146 to +151
  EventContext context = pair.getFirst().getContext();
  if (context instanceof AbstractEventContext) {
    ((AbstractEventContext) context).forEachChild(ctx -> timeoutScheduler
        .submit(() -> ctx.error(new MessagingException(pair.getFirst(), error.get().getCause()))));
  }
}

Contributor

What about submitting just one task for all the child contexts?
Something like:

Suggested change:

  EventContext context = pair.getFirst().getContext();
  if (context instanceof AbstractEventContext) {
-   ((AbstractEventContext) context).forEachChild(ctx -> timeoutScheduler
-       .submit(() -> ctx.error(new MessagingException(pair.getFirst(), error.get().getCause()))));
+   timeoutScheduler
+       .submit(() -> ((AbstractEventContext) context).forEachChild(ctx -> ctx.error(new MessagingException(pair.getFirst(), error.get().getCause()))));
  }
}

Contributor Author

I have tried this change at a larger Scatter Gather scale (~70 db:select routes) with the following schedulers:

  • cpuLightScheduler
  • cpuIntensiveScheduler
  • ioScheduler

With each scheduler I see a large number of connections (~34) left unclosed.

However, with

((AbstractEventContext) context).forEachChild(ctx -> timeoutScheduler
    .submit(() -> ctx.error(new MessagingException(pair.getFirst(), error.get().getCause()))));

it handles those 70 connections.

Contributor

I would really like to understand why this is happening. How come scheduling 70 different tasks works but scheduling just one does not? What is it that we don't know?

Contributor Author

If we're purely considering the resources used, the nesting of SG routes that I've created may have caused recursive calls of error(...), which also deals with reentrant locks. Not sure if there's anything there, but just throwing ideas around.


Comment on lines +120 to +121
CountDownLatch latch = new CountDownLatch(1);
latch.await(delay, TimeUnit.MILLISECONDS);

Contributor

This latch is not useful because nobody is counting down on it... so it is basically the same as a sleep. I gave you some examples on Slack. I know that you are probably working on that right now, but I have to mark this in the review for completeness.
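
For reference, here is a minimal generic sketch of the difference (plain java.util.concurrent; someExecutor and doTheDelayedWork are placeholders, not marvel-extension code):

// As written: nothing ever calls countDown(), so await(delay, MILLISECONDS) simply
// blocks for the full delay, i.e. it behaves like Thread.sleep(delay).
CountDownLatch latch = new CountDownLatch(1);
latch.await(delay, TimeUnit.MILLISECONDS);

// A latch only adds value when another party releases it, for example:
CountDownLatch done = new CountDownLatch(1);
someExecutor.submit(() -> {
  doTheDelayedWork();   // placeholder for the delayed operation (the SELECT SLEEP(n) stand-in)
  done.countDown();     // releases the waiter as soon as the work finishes
});
boolean finishedInTime = done.await(delay, TimeUnit.MILLISECONDS);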
