Poller Scaling Decisions #553

Open · wants to merge 2 commits into master

Conversation

Sushisource (Member)

READ BEFORE MERGING: All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted.

What changed?
Added a proto that is optionally attached to task responses and contains data telling the SDK whether pollers should be scaled up or down.

Why?
Part of the worker management effort to simplify configuration of workers for users.

Breaking changes
Nope

Server PR
It's nonbreaking, but the PR is here: temporalio/temporal#7300

message PollerScalingDecision {
// How many pollers should be added or removed, if any. As of now, server only scales up or down
// by 1. However, SDKs should allow for other values (while staying within defined min/max).
int32 poller_delta = 1;
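
A minimal sketch of how an SDK-side scaler might apply poller_delta while staying within a configured min/max, per the proto comment above. The pollerScaler type, its field names, and the bounds below are assumptions for illustration, not taken from any Temporal SDK.

package main

import "fmt"

// Hypothetical SDK-side scaler state; not actual Temporal SDK internals.
type pollerScaler struct {
    current, min, max int // desired poller count and its configured bounds
}

// applyDelta folds a server-suggested poller_delta into the desired count,
// clamping the result to [min, max] as the proto comment describes.
func (s *pollerScaler) applyDelta(delta int) int {
    next := s.current + delta
    if next < s.min {
        next = s.min
    }
    if next > s.max {
        next = s.max
    }
    s.current = next
    return s.current
}

func main() {
    s := &pollerScaler{current: 5, min: 1, max: 10}
    fmt.Println(s.applyDelta(+1)) // 6: server suggested adding one poller
    fmt.Println(s.applyDelta(-8)) // 1: a larger negative delta is clamped to min
}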
Member

I would recommend an API that is not delta-based, as that is much less prone to race conditions. Something like target_poller_count.

Member Author

We could do that, but the problem is it'd require a lot more internal communication among partitions to determine overall load. That can certainly be more accurate, but also has more overhead. This solution has produced some really great results while introducing effectively zero new overhead, which seems like a great place to be.
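
To make that trade-off concrete, here is a hypothetical and much-simplified contrast of the two response shapes; neither function is taken from temporalio/temporal#7300 or the matching service, and the heuristics are placeholders.

package scaling

// suggestDelta: a single task-queue partition can emit a +1/-1 nudge from its
// own local signals (e.g. whether it has a backlog), with no extra RPCs.
// Cheap, but two partitions acting on stale local views can both nudge the
// same worker in the same direction, which is the race concern raised above.
func suggestDelta(localBacklog int) int {
    if localBacklog > 0 {
        return +1
    }
    return -1
}

// suggestTarget: an absolute target_poller_count is only meaningful if it
// reflects load across all partitions, which means gathering their stats
// first; that cross-partition coordination is the overhead being avoided.
func suggestTarget(perPartitionBacklog []int) int {
    total := 0
    for _, backlog := range perPartitionBacklog {
        total += backlog
    }
    return 1 + total/100 // placeholder heuristic, not the real math
}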

Member Author

Per our discussion, I updated the name and wording here to make it clear this is a suggestion about poll requests, not "pollers".

@@ -1733,6 +1738,8 @@ message PollNexusTaskQueueResponse {
bytes task_token = 1;
// Embedded request as translated from the incoming frontend request.
temporal.api.nexus.v1.Request request = 2;
// Server-advised information the SDK may use to adjust its poller count.
temporal.api.sdk.v1.PollerScalingDecision poller_scaling_decision = 3;
Member

What is this scaling decision based on? Are we looking at the in-memory queue length?

Member Author

Check out the server review: temporalio/temporal#7300

cretz (Member) left a comment

This looks good to me, but I'd like to request we don't merge until temporalio/temporal#7300 is approved by the right people just in case we change our mind on the API in that PR. (feel free to use feature branches all around if you think it may be an involved process)

// pollers.
message PollerScalingDecision {
// How many poll requests to suggest should be added or removed, if any. As of now, server only
Member

I don't know if this really needs spelling out here, but a question in my head after reading this:

Current "poller count" is configured for workflow task queues as one number, and the SDK splits that number between its sticky queue and the workflow task queue. Will suggestions from both of those be applied to that one number, which is then split between both as before, or are there now two numbers internally?

Member Author

No, there are two separate scaling controllers when this is enabled
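
A minimal sketch of what two separate scaling controllers could mean on the SDK side, assuming a hypothetical workflowPollers type; the names here are for illustration only and do not reflect actual SDK internals.

package worker

type queueKind int

const (
    normalTaskQueue queueKind = iota
    stickyTaskQueue
)

// With poller scaling enabled there is no single poller number that gets
// split between the sticky and normal queues; each queue tracks its own count.
type workflowPollers struct {
    desired map[queueKind]int
}

// onScalingDecision applies a suggestion only to the queue whose poll response
// carried it, so the sticky and normal counts move independently.
// (Min/max clamping as in the earlier sketch is omitted here.)
func (w *workflowPollers) onScalingDecision(kind queueKind, delta int) {
    w.desired[kind] += delta
}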

temporal/api/taskqueue/v1/message.proto: outdated review thread (resolved)
@@ -1733,6 +1737,8 @@ message PollNexusTaskQueueResponse {
bytes task_token = 1;
// Embedded request as translated from the incoming frontend request.
temporal.api.nexus.v1.Request request = 2;
// Server-advised information the SDK may use to adjust its poller count.
temporal.api.taskqueue.v1.PollerScalingDecision poller_scaling_decision = 3;
Member

I didn't look at the implementation yet, but does the server report this? My concern is that nexus tasks aren't backlogged (they're only synchronous), so maybe the algorithm has to change slightly? It still makes sense to adjust pollers based on traffic.

Member Author

Yeah, it doesn't at the moment. I'll need to see where I can fit that in.

Co-authored-by: David Reiss <[email protected]>