page_service: getpage batching: refactor & minor fixes #9792
5535 tests run: 5309 passed, 0 failed, 226 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results
058b35f at 2024-11-21T11:30:41.601Z
## Problem

We don't take advantage of queue depth generated by the compute on the pageserver. We can process getpage requests more efficiently by batching them.

## Summary of changes

Batch up incoming getpage requests that arrive within a configurable time window (`server_side_batch_timeout`). Then process the entire batch via one `get_vectored` timeline operation. By default, no merging takes place.

## Testing

* **Functional**: #9792
* **Performance**: will be done in staging/pre-prod

# Refs

* #9377
* #9376

Co-authored-by: Christian Schwarz <[email protected]>
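For readers who want a concrete picture of the time-window batching described in this commit message, here is a minimal sketch. It is not the pageserver code: `GetPageRequest`, `read_batch`, and `max_batch_size` are hypothetical names, and a plain `std::sync::mpsc` channel stands in for the connection. Only the idea of `server_side_batch_timeout` (with no batching when it is unset) is taken from the text above.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a getpage request.
#[derive(Debug, Clone, PartialEq)]
pub struct GetPageRequest {
    pub block_no: u32,
}

/// Blocks until one request arrives, then keeps collecting requests until
/// the time window has elapsed, the batch is full, or the sender is gone.
/// With `server_side_batch_timeout == None` every batch has exactly one
/// request, mirroring "by default, no merging takes place".
/// Returns `None` once the channel is closed and drained.
pub fn read_batch(
    rx: &Receiver<GetPageRequest>,
    server_side_batch_timeout: Option<Duration>,
    max_batch_size: usize, // assumption: some upper bound on batch size
) -> Option<Vec<GetPageRequest>> {
    let first = rx.recv().ok()?;
    let mut batch = vec![first];
    let Some(window) = server_side_batch_timeout else {
        return Some(batch);
    };
    // The window starts when the first request of the batch arrives.
    let deadline = Instant::now() + window;
    while batch.len() < max_batch_size {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(req) => batch.push(req),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

In this sketch the caller would hand each returned batch to a single vectored read instead of issuing one read per request. The trade-off is the usual one for batching: a longer window allows bigger batches but adds up to that much latency to the first request in each batch.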
The steps in the test work in neon_local + psql but for some reason they don't work in the test. Asked compute team on Slack for help: https://neondb.slack.com/archives/C04DGM6SMTM/p1731952688386789
a149e89
to
15e21c7
This reverts commit aa695b2.
Checked it out locally and stepped through it a number of times. Looks correct, but have a look at the comments.
This PR adds two benchmarks to demonstrate the effect of server-side getpage request batching added in #9321.

For the CPU usage, I found that the `prometheus` crate's built-in CPU usage accounts the seconds at integer granularity. That's not enough if you reduce the target benchmark runtime for local iteration. So, add a new `libmetrics` metric and report that.

The benchmarks are disabled because [on our benchmark nodes, timer resolution isn't high enough](https://neondb.slack.com/archives/C059ZC138NR/p1732264223207449). They work (no statement about quality) on my bare-metal devbox. They will be refined and enabled once we find a fix. Candidates at time of writing are:

- #9822
- #9851

Refs:

- Epic: #9376
- Extracted from #9792
The changes in this PR will be largely replaced by the PR that is currently stacked on top. Closing this to avoid one merge-rebase-CI roundtrip.
This PR refactors the page_service server-side batching code that was recently added in #9377.

Changes:

- […] `handle_pagestream`, instead of in the `PageServerHandler`. This adds robustness because it systematically avoids a source of bugs.
- […] `next_batch`, call it `carry`.
- […] `&mut Option<Carry>` […] of in a local variable.
- […] `batch` back into the `self.next_batch` whenever we bail.
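
To illustrate the general "carry" idea mentioned in the change list, here is a small sketch. It is an assumption about the shape of the pattern, not the code in this PR: `Request`, its `lsn` field, the merge rule, and the `next_batch` function below are all hypothetical, and only the name `Carry` and the `&mut Option<Carry>` parameter come from the description. The point it shows is that a request which cannot join the current batch is handed back to the caller through the carry slot, so the handler struct itself holds no leftover state.

```rust
/// Hypothetical request type. Assumption for this sketch: two requests may
/// share a batch only if they have the same `lsn`.
#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub lsn: u64,
    pub block_no: u32,
}

/// Hypothetical stand-in for the `Carry` type named in the description:
/// a request that was read but did not fit into the previous batch.
#[derive(Debug, PartialEq)]
pub struct Carry(pub Request);

/// Builds the next batch. The carried-over request (if any) always becomes
/// the first element, so it can never be forgotten between calls.
pub fn next_batch(
    input: &mut impl Iterator<Item = Request>,
    carry: &mut Option<Carry>,
) -> Option<Vec<Request>> {
    let first = match carry.take() {
        Some(Carry(req)) => req,
        None => input.next()?,
    };
    let mut batch = vec![first];
    while let Some(req) = input.next() {
        if req.lsn == batch[0].lsn {
            batch.push(req);
        } else {
            // Does not fit: park it for the next call and stop.
            *carry = Some(Carry(req));
            break;
        }
    }
    Some(batch)
}
```

In this sketch the caller owns `let mut carry: Option<Carry> = None;` on its own stack and passes `&mut carry` on every call. If the caller bails out early, the slot is simply dropped together with the rest of the stack frame, so no cleanup of shared state is needed.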