Skip to content
This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

Join task callBackDuration #3594

Merged
merged 3 commits into from
Jun 8, 2023
Merged

Join task callBackDuration #3594

merged 3 commits into from
Jun 8, 2023

Conversation

manan164
Copy link
Contributor

@manan164 manan164 commented Apr 26, 2023

Pull Request type

NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.

Changes in this PR

The implemented approach of exponential backoff in join task calbackDuration.

Describe the new behavior from this PR, and why it's needed
Issue #
#3436

@manan164 manan164 marked this pull request as draft April 26, 2023 16:57
@v1r3n
Copy link
Contributor

v1r3n commented Apr 27, 2023

cc: @james-deee

@manan164 manan164 changed the title [Do not merge] Join task callBackDuration Join task callBackDuration Apr 27, 2023
@manan164 manan164 marked this pull request as ready for review April 27, 2023 18:17
if (index == 0) {
return Optional.of(0L);
}
return Optional.of(Math.min((long) Math.pow(2, index), defaultOffset));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if someone has their systemTaskWorkerCallbackDuration to one second though? It will always still be 1 second, so this won't be an exponential backoff in that scenario. It feels like defaultOffset should be a configurable maximum offset or something like that. Maybe the default 30 seconds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @james-deee , I think if we are setting systemTaskWorkerCallbackDuration to 1 second then ideally we don't want the exponential backoff thing. Here we are trying to exponential backoff from 1 to 30 seconds.

Copy link
Contributor

@james-deee james-deee May 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@manan164 One hand, I see what you're saying, but in practice what we are seeing is that having 1 second doesn't alleviate the problem. The issue is that we have a lot of long running workflows (thus JOINs) in the queue. If we do NOT do an exponential backoff from our setting of 1 second, then our JOIN tasks don't get processed fast enough because the async task worker is trying to process say 10ks JOINs.......... which then makes JOINs that you want to process immediately be backed up and slowed down.

With my proposal, and with what we changed locally, fixes all of our issues with processing the async JOINS.

In our situation we use a duration of 1 second, but we have a cap of 30 seconds in place of that defaultOffset and it completely solves our issue. The result of it is that JOINs on fast running workflows get processed almost instantaneously, and JOINs in the long running workflows get pushed back to the max of 30 seconds..... which is what frees up the processing of the immediate ones.

I really think these are 2 different levers/values that need to be present. Just using 1 second always won't work in heavy workfload Conductor instances where there might be a LOT of joins to process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe what you're saying is that we should be using a setting of 30 seconds for the systemTaskWorkerCallbackDuration always, and then it sounds like no one should ever use 1 second for the systemTaskWorkerCallbackDuration value.

Then, I think that this could get confusing, because it would feel like someone would want to set the value to 1 for fast processing, but they wouldnt know that actually using 30 seconds will get the fastest/best processing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@manan164 maybe a compromise is to use a hardcoded 30 seconds for that defaultOffset. I really think it will be confusing and could cause issues for people to use 1 second for that setting then, because that doesn't actually give them the best performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @james-deee , The expectation here is when we use systemTaskWorkerCallbackDuration to 30 seconds ideally we are fine with tasks getting checked every 30 seconds and getting completed but if tasks are getting checked at 1,2,4,8,16 seconds and get completed that should also be fine. But when we use systemTaskWorkerCallbackDuration to 1 seconds which means I want my task to get checked every second. Hope this answers the question.

@shekarpcs
Copy link

is it merged to the main?

@franrull
Copy link

franrull commented Jun 8, 2023

Any estimate on when this can be expected to be merged? It is crippling the performance of our workflows 🙏🏼

@shekarpcs
Copy link

shekarpcs commented Jun 8, 2023 via email

@VerstraeteBert
Copy link
Contributor

@manan164 @apanicker-nflx Can we get this in?

@v1r3n v1r3n merged commit e11a3e7 into Netflix:main Jun 8, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants