Poller Scaling Decisions #553
base: master
Conversation
message PollerScalingDecision {
  // How many pollers should be added or removed, if any. As of now, server only scales up or down
  // by 1. However, SDKs should allow for other values (while staying within defined min/max).
  int32 poller_delta = 1;
I would recommend an API that is not delta-based, as it is much less prone to race conditions. Something like target_poller_count.
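For concreteness, a hypothetical non-delta shape might look like the sketch below. The field name follows the suggestion above; the message shape and field number are illustrative only and not part of this PR:

message PollerScalingDecision {
  // Hypothetical alternative: the absolute number of pollers the SDK should aim to run
  // for this task queue, rather than a relative adjustment. Illustrative only.
  int32 target_poller_count = 1;
}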
We could do that, but the problem is it'd require a lot more internal communication among partitions to determine overall load. That can certainly be more accurate, but also has more overhead. This solution has produced some really great results while introducing effectively zero new overhead, which seems like a great place to be.
Per our discussion, updated the name/language here to make it clear this is a suggestion / about requests and not "pollers"
@@ -1733,6 +1738,8 @@ message PollNexusTaskQueueResponse {
  bytes task_token = 1;
  // Embedded request as translated from the incoming frontend request.
  temporal.api.nexus.v1.Request request = 2;
  // Server-advised information the SDK may use to adjust its poller count.
  temporal.api.sdk.v1.PollerScalingDecision poller_scaling_decision = 3;
What is this scaling decision based on? Are we looking at the in-memory queue length?
Check out the server review temporalio/temporal#7300
This looks good to me, but I'd like to request we don't merge until temporalio/temporal#7300 is approved by the right people just in case we change our mind on the API in that PR. (feel free to use feature branches all around if you think it may be an involved process)
// pollers.
message PollerScalingDecision {
  // How many poll requests to suggest should be added or removed, if any. As of now, server only
I don't know if this really needs spelling out here, but a question in my head after reading this:
Current "poller count" is configured for workflow task queues as one number, and the sdk splits that number between its sticky queue and the wf task queue. Will suggestions from both of those get applied to that one number, which will then be split between both as before? Or are there now two numbers internally?
No, there are two separate scaling controllers when this is enabled
@@ -1733,6 +1737,8 @@ message PollNexusTaskQueueResponse {
  bytes task_token = 1;
  // Embedded request as translated from the incoming frontend request.
  temporal.api.nexus.v1.Request request = 2;
  // Server-advised information the SDK may use to adjust its poller count.
  temporal.api.taskqueue.v1.PollerScalingDecision poller_scaling_decision = 3;
I didn't look at the implementation yet, but does the server report this? My question is that Nexus tasks aren't backlogged; they're only synchronous, so maybe the algorithm has to change slightly? It still makes sense to adjust pollers based on traffic.
Yeah, it doesn't at the moment. I'll need to see where I can fit that in.
READ BEFORE MERGING: All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team are NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted.
What changed?
Added a proto message that is optionally attached to task poll responses and carries data the SDK can use to decide whether pollers should be scaled up or down.
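For reference, the shape as quoted in the review thread above, consolidated in one place. Only the fields quoted in this thread are shown; field names, numbers, comment wording, and the package (an earlier revision used sdk.v1, a later one taskqueue.v1) reflect the revisions under review and may change before merge:

message PollerScalingDecision {
  // How many pollers should be added or removed, if any. As of now, server only scales up or down
  // by 1. However, SDKs should allow for other values (while staying within defined min/max).
  int32 poller_delta = 1;
}

message PollNexusTaskQueueResponse {
  bytes task_token = 1;
  // Embedded request as translated from the incoming frontend request.
  temporal.api.nexus.v1.Request request = 2;
  // Server-advised information the SDK may use to adjust its poller count.
  temporal.api.taskqueue.v1.PollerScalingDecision poller_scaling_decision = 3;
}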
Why?
Part of the worker management effort to simplify configuration of workers for users.
Breaking changes
Nope
Server PR
It's nonbreaking, but the PR is here: temporalio/temporal#7300