Join task callBackDuration #3594

manan164 · 2023-04-26T16:57:32Z

Pull Request type

NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.

Changes in this PR

The implemented approach of exponential backoff in join task calbackDuration.

Describe the new behavior from this PR, and why it's needed
Issue #
#3436

v1r3n · 2023-04-27T07:17:27Z

cc: @james-deee

james-deee · 2023-05-02T17:55:19Z

core/src/main/java/com/netflix/conductor/core/execution/tasks/Join.java

+        if (index == 0) {
+            return Optional.of(0L);
+        }
+        return Optional.of(Math.min((long) Math.pow(2, index), defaultOffset));


What happens if someone has their systemTaskWorkerCallbackDuration to one second though? It will always still be 1 second, so this won't be an exponential backoff in that scenario. It feels like defaultOffset should be a configurable maximum offset or something like that. Maybe the default 30 seconds?

Hi @james-deee , I think if we are setting systemTaskWorkerCallbackDuration to 1 second then ideally we don't want the exponential backoff thing. Here we are trying to exponential backoff from 1 to 30 seconds.

@manan164 One hand, I see what you're saying, but in practice what we are seeing is that having 1 second doesn't alleviate the problem. The issue is that we have a lot of long running workflows (thus JOINs) in the queue. If we do NOT do an exponential backoff from our setting of 1 second, then our JOIN tasks don't get processed fast enough because the async task worker is trying to process say 10ks JOINs.......... which then makes JOINs that you want to process immediately be backed up and slowed down.

With my proposal, and with what we changed locally, fixes all of our issues with processing the async JOINS.

In our situation we use a duration of 1 second, but we have a cap of 30 seconds in place of that defaultOffset and it completely solves our issue. The result of it is that JOINs on fast running workflows get processed almost instantaneously, and JOINs in the long running workflows get pushed back to the max of 30 seconds..... which is what frees up the processing of the immediate ones.

I really think these are 2 different levers/values that need to be present. Just using 1 second always won't work in heavy workfload Conductor instances where there might be a LOT of joins to process.

Or maybe what you're saying is that we should be using a setting of 30 seconds for the systemTaskWorkerCallbackDuration always, and then it sounds like no one should ever use 1 second for the systemTaskWorkerCallbackDuration value.

Then, I think that this could get confusing, because it would feel like someone would want to set the value to 1 for fast processing, but they wouldnt know that actually using 30 seconds will get the fastest/best processing?

@manan164 maybe a compromise is to use a hardcoded 30 seconds for that defaultOffset. I really think it will be confusing and could cause issues for people to use 1 second for that setting then, because that doesn't actually give them the best performance.

Hi @james-deee , The expectation here is when we use systemTaskWorkerCallbackDuration to 30 seconds ideally we are fine with tasks getting checked every 30 seconds and getting completed but if tasks are getting checked at 1,2,4,8,16 seconds and get completed that should also be fine. But when we use systemTaskWorkerCallbackDuration to 1 seconds which means I want my task to get checked every second. Hope this answers the question.

shekarpcs · 2023-06-02T21:10:28Z

is it merged to the main?

franrull · 2023-06-08T11:59:53Z

Any estimate on when this can be expected to be merged? It is crippling the performance of our workflows 🙏🏼

shekarpcs · 2023-06-08T12:01:23Z

Exactly, our system is throttled. Database is also at high utilization.

…

On Thu, Jun 8, 2023, 8:00 AM Fran González Rull ***@***.***> wrote: Any estimate on when this can be expected to be merged? It is crippling the performance of our workflows 🙏🏼 — Reply to this email directly, view it on GitHub <#3594 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AXQ2C7SRMKFI5OV3GL2IU33XKG5EJANCNFSM6AAAAAAXMW2RXE> . You are receiving this because you commented.Message ID: ***@***.***>

VerstraeteBert · 2023-06-08T12:41:22Z

@manan164 @apanicker-nflx Can we get this in?

Sample working changes

a938c6d

manan164 marked this pull request as draft April 26, 2023 16:57

manan164 added 2 commits April 26, 2023 22:33

spotless

a86eba7

fix.

f29657b

apanicker-nflx approved these changes Apr 26, 2023

View reviewed changes

manan164 changed the title ~~[Do not merge] Join task callBackDuration~~ Join task callBackDuration Apr 27, 2023

manan164 marked this pull request as ready for review April 27, 2023 18:17

james-deee reviewed May 2, 2023

View reviewed changes

v1r3n mentioned this pull request May 30, 2023

Making JOIN Task Async failing some workflows #3528

Closed

v1r3n merged commit e11a3e7 into Netflix:main Jun 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Join task callBackDuration #3594

Join task callBackDuration #3594

manan164 commented Apr 26, 2023 •

edited

Loading

v1r3n commented Apr 27, 2023

james-deee May 2, 2023

manan164 May 5, 2023

james-deee May 5, 2023 •

edited

Loading

james-deee May 5, 2023

james-deee May 5, 2023

manan164 May 17, 2023

shekarpcs commented Jun 2, 2023

franrull commented Jun 8, 2023

shekarpcs commented Jun 8, 2023 via email

VerstraeteBert commented Jun 8, 2023

Join task callBackDuration #3594

Join task callBackDuration #3594

Conversation

manan164 commented Apr 26, 2023 • edited Loading

Changes in this PR

v1r3n commented Apr 27, 2023

james-deee May 2, 2023

Choose a reason for hiding this comment

manan164 May 5, 2023

Choose a reason for hiding this comment

james-deee May 5, 2023 • edited Loading

Choose a reason for hiding this comment

james-deee May 5, 2023

Choose a reason for hiding this comment

james-deee May 5, 2023

Choose a reason for hiding this comment

manan164 May 17, 2023

Choose a reason for hiding this comment

shekarpcs commented Jun 2, 2023

franrull commented Jun 8, 2023

shekarpcs commented Jun 8, 2023 via email

VerstraeteBert commented Jun 8, 2023

manan164 commented Apr 26, 2023 •

edited

Loading

james-deee May 5, 2023 •

edited

Loading