Poller Scaling Decisions #553

Open · wants to merge 2 commits into master

Conversation

Sushisource (Member)

READ BEFORE MERGING: All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted.

What changed?
Added a proto that is optionally attached to task responses and contains data telling the SDK whether pollers should be scaled up or down.

Why?
Part of the worker management effort to simplify configuration of workers for users.

Breaking changes
Nope

Server PR
It's nonbreaking, but the PR is here: temporalio/temporal#7300

message PollerScalingDecision {
// How many pollers should be added or removed, if any. As of now, server only scales up or down
// by 1. However, SDKs should allow for other values (while staying within defined min/max).
int32 poller_delta = 1;
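
A minimal sketch of how an SDK-side scaler might apply poller_delta while staying within a configured min/max, per the proto comment above. The pollerScaler type, its field names, and the bounds below are assumptions for illustration, not taken from any Temporal SDK.

package main

import "fmt"

// Hypothetical SDK-side scaler state; not actual Temporal SDK internals.
type pollerScaler struct {
    current, min, max int // desired poller count and its configured bounds
}

// applyDelta folds a server-suggested poller_delta into the desired count,
// clamping the result to [min, max] as the proto comment describes.
func (s *pollerScaler) applyDelta(delta int) int {
    next := s.current + delta
    if next < s.min {
        next = s.min
    }
    if next > s.max {
        next = s.max
    }
    s.current = next
    return s.current
}

func main() {
    s := &pollerScaler{current: 5, min: 1, max: 10}
    fmt.Println(s.applyDelta(+1)) // 6: server suggested adding one poller
    fmt.Println(s.applyDelta(-8)) // 1: a larger negative delta is clamped to min
}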
Member

I would recommend an API that is not delta-based, as that is much less prone to race conditions. Something like target_poller_count.

Member Author

We could do that, but the problem is it'd require a lot more internal communication among partitions to determine overall load. That can certainly be more accurate, but also has more overhead. This solution has produced some really great results while introducing effectively zero new overhead, which seems like a great place to be.
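
To make that trade-off concrete, here is a hypothetical and much-simplified contrast of the two response shapes; neither function is taken from temporalio/temporal#7300 or the matching service, and the heuristics are placeholders.

package scaling

// suggestDelta: a single task-queue partition can emit a +1/-1 nudge from its
// own local signals (e.g. whether it has a backlog), with no extra RPCs.
// Cheap, but two partitions acting on stale local views can both nudge the
// same worker in the same direction, which is the race concern raised above.
func suggestDelta(localBacklog int) int {
    if localBacklog > 0 {
        return +1
    }
    return -1
}

// suggestTarget: an absolute target_poller_count is only meaningful if it
// reflects load across all partitions, which means gathering their stats
// first; that cross-partition coordination is the overhead being avoided.
func suggestTarget(perPartitionBacklog []int) int {
    total := 0
    for _, backlog := range perPartitionBacklog {
        total += backlog
    }
    return 1 + total/100 // placeholder heuristic, not the real math
}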

Member Author

Per our discussion, I updated the name and wording here to make it clear this is a suggestion about poll requests, not "pollers".

@@ -1733,6 +1738,8 @@ message PollNexusTaskQueueResponse {
bytes task_token = 1;
// Embedded request as translated from the incoming frontend request.
temporal.api.nexus.v1.Request request = 2;
// Server-advised information the SDK may use to adjust its poller count.
temporal.api.sdk.v1.PollerScalingDecision poller_scaling_decision = 3;
Member

What is this scaling decision based on? Are we looking at the in-memory queue length?

Member Author

Check out the server review: temporalio/temporal#7300

cretz (Member) left a comment

This looks good to me, but I'd like to request we don't merge until temporalio/temporal#7300 is approved by the right people just in case we change our mind on the API in that PR. (feel free to use feature branches all around if you think it may be an involved process)

// pollers.
message PollerScalingDecision {
// How many poll requests to suggest should be added or removed, if any. As of now, server only
Member

I don't know if this really needs spelling out here, but a question in my head after reading this:

Current "poller count" is configured for workflow task queues as one number, and the SDK splits that number between its sticky queue and the workflow task queue. Will suggestions from both of those be applied to that one number, which is then split between both as before, or are there now two numbers internally?

Member Author

No, there are two separate scaling controllers when this is enabled
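
A minimal sketch of what two separate scaling controllers could mean on the SDK side, assuming a hypothetical workflowPollers type; the names here are for illustration only and do not reflect actual SDK internals.

package worker

type queueKind int

const (
    normalTaskQueue queueKind = iota
    stickyTaskQueue
)

// With poller scaling enabled there is no single poller number that gets
// split between the sticky and normal queues; each queue tracks its own count.
type workflowPollers struct {
    desired map[queueKind]int
}

// onScalingDecision applies a suggestion only to the queue whose poll response
// carried it, so the sticky and normal counts move independently.
// (Min/max clamping as in the earlier sketch is omitted here.)
func (w *workflowPollers) onScalingDecision(kind queueKind, delta int) {
    w.desired[kind] += delta
}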

temporal/api/taskqueue/v1/message.proto: outdated review thread (resolved)
@@ -1733,6 +1737,8 @@ message PollNexusTaskQueueResponse {
bytes task_token = 1;
// Embedded request as translated from the incoming frontend request.
temporal.api.nexus.v1.Request request = 2;
// Server-advised information the SDK may use to adjust its poller count.
temporal.api.taskqueue.v1.PollerScalingDecision poller_scaling_decision = 3;
Member

I didn't look at the implementation yet, but does the server report this? My concern is that nexus tasks aren't backlogged (they're only synchronous), so maybe the algorithm has to change slightly? It still makes sense to adjust pollers based on traffic.

Member Author

Yeah, it doesn't at the moment. I'll need to see where I can fit that in.

Co-authored-by: David Reiss <[email protected]>