Implement Leader async aggregation. #3564

branlwyd · 2024-12-09T22:31:44Z

This change includes unit tests, but no integration tests -- those will need to come with the Helper async aggregation implementation, as without it we do not have anything to integration test against.

A few implementation notes:

* I renamed the report aggregation states to better match their functionality (IMO).
* If the Helper does not provide a Retry-After header, the Leader will poll each "processing" aggregation job at most once per minute.
* The Retry-After header can specify either a number of seconds or a specific date. Currently, we only support receiving a number of seconds.
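
To make the polling notes above concrete, here is a minimal sketch of the described behavior: a delay-seconds value is accepted, an HTTP-date value is rejected, and a missing header falls back to one minute. The names `DEFAULT_POLL_DELAY`, `RetryAfterError`, `parse_retry_after_seconds`, and `next_poll_delay` are hypothetical and chosen for illustration; they are not taken from this change, whose actual signatures may differ.

```rust
use std::time::Duration;

/// Delay between polls when the Helper sends no Retry-After header
/// (once per minute, per the implementation notes). Hypothetical name.
const DEFAULT_POLL_DELAY: Duration = Duration::from_secs(60);

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum RetryAfterError {
    /// The value did not parse as a whole number of seconds. HTTP-date
    /// values such as "Wed, 21 Oct 2015 07:28:00 GMT" end up here too.
    Unsupported(String),
}

/// Parses a Retry-After header value in its delay-seconds form only.
pub fn parse_retry_after_seconds(value: &str) -> Result<Duration, RetryAfterError> {
    value
        .trim()
        .parse::<u64>()
        .map(Duration::from_secs)
        .map_err(|_| RetryAfterError::Unsupported(value.to_string()))
}

/// Computes how long to wait before polling a "processing" job again.
/// A missing header yields the default; a present header must parse.
pub fn next_poll_delay(header: Option<&str>) -> Result<Duration, RetryAfterError> {
    match header {
        Some(value) => parse_retry_after_seconds(value),
        None => Ok(DEFAULT_POLL_DELAY),
    }
}
```

Under these assumptions, `next_poll_delay(None)` gives the one-minute default, while a date-formatted header surfaces as an error instead of being silently ignored.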

branlwyd · 2024-12-09T22:32:48Z

Part of #3436.


aggregator_core/src/datastore/models.rs


divergentdave · 2024-12-12T19:53:19Z

aggregator/src/aggregator/aggregation_job_driver.rs

+                .headers()
+                .get(RETRY_AFTER)
+                .map(parse_retry_after)
+                .transpose()?;


We should probably substitute a default poll delay and log a warning rather than fail if we can't parse the Retry-After header. (here and in the other methods)

branlwyd · 2024-12-12

I disagree here: I think if we get bad input, we should fail loudly rather than continuing on "incorrectly."
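
The two positions in this thread can be sketched side by side. The strict variant mirrors the `.map(...).transpose()?` shape in the diff above, where `Option<Result<_, _>>` becomes `Result<Option<_>, _>` so that a malformed header propagates as an error. The lenient variant substitutes a default and logs a warning. Everything here is an illustrative assumption: `parse_seconds` is a stand-in for the real parser, the function names are hypothetical, and `eprintln!` stands in for whatever logging the project actually uses.

```rust
use std::num::ParseIntError;
use std::time::Duration;

/// Hypothetical fallback delay (one minute).
const DEFAULT_POLL_DELAY: Duration = Duration::from_secs(60);

/// Stand-in parser for this sketch: delay-seconds form only.
fn parse_seconds(value: &str) -> Result<Duration, ParseIntError> {
    value.trim().parse::<u64>().map(Duration::from_secs)
}

/// Strict policy: a malformed header is an error the caller must handle.
/// An absent header is not an error and yields `Ok(None)`.
pub fn strict_retry_after(header: Option<&str>) -> Result<Option<Duration>, ParseIntError> {
    header.map(parse_seconds).transpose()
}

/// Lenient policy: a malformed header produces a warning and the default.
pub fn lenient_retry_after(header: Option<&str>) -> Duration {
    match header.map(parse_seconds) {
        Some(Ok(delay)) => delay,
        Some(Err(err)) => {
            eprintln!("warning: unparseable Retry-After header ({err}); using default");
            DEFAULT_POLL_DELAY
        }
        None => DEFAULT_POLL_DELAY,
    }
}
```

The trade-off is visible in the return types: the strict version forces every call site to decide what a bad header means, while the lenient version hides the problem behind a log line.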


divergentdave · 2024-12-12T20:24:38Z

aggregator/src/aggregator/aggregation_job_driver.rs

+                    report_aggregation_success_counter: self.aggregation_success_counter.clone(),
+                    aggregate_step_failure_counter: self.aggregate_step_failure_counter.clone(),
+                    aggregated_report_share_dimension_histogram: self
+                        .aggregated_report_share_dimension_histogram
+                        .clone(),


We should rethink Prometheus metrics in the context of AggregationJobResp::Processing responses as well. I haven't looked at the metrics in detail yet, but we may want additional counters, or new labels, in order to distinguish forward progress versus mere polling.

branlwyd · 2024-12-13 (edited)

I think this is wise, but these metrics only concern themselves with completed report aggregations & failed steps, both of which are agnostic to the aggregation mode. I think we can get pretty far by looking at the request verb & the response code [edit: in our standard HTTP metrics], though these metrics won't give full clarity: we can differentiate between a Leader's attempt to poll vs. an attempt to send a new step via the HTTP verb, but I don't think we can tell from these metrics whether the Helper responded with a processing or finished response. However, I strongly suspect that our current integrations will always use exactly one of asynchronous or synchronous responses, so this may be moot at the moment, at least for us.
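
As a rough illustration of the "new labels" idea raised in this thread, the sketch below counts Helper responses keyed by both the kind of request the Leader made and whether the Helper answered processing or finished. It deliberately uses a plain `HashMap` and not any real metrics library, and every name in it (`RequestKind`, `HelperOutcome`, `StepCounters`) is hypothetical; nothing like this is part of the change under review.

```rust
use std::collections::HashMap;

/// What the Leader was doing when it contacted the Helper.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum RequestKind {
    SendStep,
    Poll,
}

/// How the Helper answered.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum HelperOutcome {
    Processing,
    Finished,
}

/// In-memory counters labeled by (request kind, outcome).
#[derive(Debug, Default)]
pub struct StepCounters {
    counts: HashMap<(RequestKind, HelperOutcome), u64>,
}

impl StepCounters {
    /// Records one Helper response under its label pair.
    pub fn record(&mut self, kind: RequestKind, outcome: HelperOutcome) {
        *self.counts.entry((kind, outcome)).or_insert(0) += 1;
    }

    /// Returns the count for one label pair (zero if never recorded).
    pub fn get(&self, kind: RequestKind, outcome: HelperOutcome) -> u64 {
        self.counts.get(&(kind, outcome)).copied().unwrap_or(0)
    }

    /// Total responses in which the Helper reported a finished result,
    /// regardless of which kind of request triggered them.
    pub fn finished_responses(&self) -> u64 {
        self.counts
            .iter()
            .filter(|(key, _)| key.1 == HelperOutcome::Finished)
            .map(|(_, count)| *count)
            .sum()
    }
}
```

With labels like these, a high `(Poll, Processing)` count next to a low finished total would indicate polling without progress, which is the distinction that verb and status code alone cannot express.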

branlwyd added the allow-changed-migrations label ("Override the ci-migrations check to allow migrations that have changed.") Dec 9, 2024

branlwyd requested a review from a team as a code owner December 9, 2024 22:31

branlwyd force-pushed the bran/async-aggregation branch from d5191bc to f9f6a25 Compare December 9, 2024 22:33

branlwyd mentioned this pull request Dec 9, 2024

Implement DAP-13 #3436

Open

20 tasks

branlwyd force-pushed the bran/async-aggregation branch from f9f6a25 to 52fd456 Compare December 11, 2024 01:14

divergentdave approved these changes Dec 12, 2024


branlwyd enabled auto-merge (squash) December 13, 2024 00:15

PR review.

8dfe337

branlwyd force-pushed the bran/async-aggregation branch from 314c2eb to 8dfe337 Compare December 13, 2024 00:15

branlwyd merged commit e4aab44 into main Dec 13, 2024
8 checks passed

branlwyd deleted the bran/async-aggregation branch December 13, 2024 00:47
