page_service: getpage batching: refactor & minor fixes #9792

problame · 2024-11-18T19:27:01Z

This PR refactors the page_service server-side batching code that was recently added in
#9377.

Changes:

- Store carried-over batching state in a local variable inside `handle_pagestream` instead of in the `PageServerHandler`. This is more robust because it systematically avoids a source of bugs.
- Rename `next_batch` to `carry`.
- When starting a new batch read, take into account the time carried over from the previous call.
- Build the batch inside the `&mut Option<Carry>` instead of in a local variable.
  - Before, we would have to make sure that we move the batch back into `self.next_batch` whenever we bail.
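
The carry pattern described in the list above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the actual `page_service.rs` code: the `Carry` fields, the `Event` type, and the `read_batch` and `remaining_wait` helpers are all hypothetical names chosen for this example, and time is passed in explicitly to keep the sketch deterministic.

```rust
use std::time::{Duration, Instant};

/// Hypothetical carried-over state: the partially built batch plus the
/// time at which its first request arrived.
#[derive(Debug)]
struct Carry {
    pages: Vec<u32>,
    started_at: Instant,
}

/// Hypothetical input events; the real protocol messages differ.
enum Event {
    GetPage(u32),
    /// Stand-in for any condition that makes the read bail early.
    Interrupted,
}

/// How long the next batch read may still wait. Time already spent on the
/// carried batch is subtracted from the timeout.
fn remaining_wait(carry: &Option<Carry>, now: Instant, timeout: Duration) -> Duration {
    match carry {
        Some(c) => (c.started_at + timeout).saturating_duration_since(now),
        None => timeout,
    }
}

/// Consumes events and builds the batch directly inside `carry`.
/// On an early return the partial batch is still in `carry`, so there is
/// no "move it back" step that could be forgotten.
fn read_batch(
    carry: &mut Option<Carry>,
    events: impl IntoIterator<Item = Event>,
    now: Instant,
) -> Result<Vec<u32>, &'static str> {
    for event in events {
        match event {
            Event::GetPage(page) => carry
                .get_or_insert_with(|| Carry { pages: Vec::new(), started_at: now })
                .pages
                .push(page),
            Event::Interrupted => return Err("interrupted"),
        }
    }
    // Input exhausted: hand out the finished batch and clear the carry.
    Ok(carry.take().map(|c| c.pages).unwrap_or_default())
}

/// Usage: the carry is a local variable owned by the calling loop.
fn example() {
    let t0 = Instant::now();
    let timeout = Duration::from_millis(10);
    let mut carry: Option<Carry> = None;

    // The first read bails midway; the partial batch stays in `carry`.
    let first = read_batch(&mut carry, [Event::GetPage(1), Event::Interrupted], t0);
    assert!(first.is_err());

    // 4 ms later, only 6 ms of the window remain for the resumed batch.
    let t1 = t0 + Duration::from_millis(4);
    assert_eq!(remaining_wait(&carry, t1, timeout), Duration::from_millis(6));

    // The second read resumes the carried batch and completes it.
    let second = read_batch(&mut carry, [Event::GetPage(2)], t1).unwrap();
    assert_eq!(second, vec![1, 2]);
    assert!(carry.is_none());
}
```

Because the batch lives in the caller-owned `Option<Carry>`, every early-return path leaves the state consistent by construction, which is the robustness argument made above.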

github-actions · 2024-11-18T20:18:13Z

5535 tests run: 5309 passed, 0 failed, 226 skipped (full report)

Flaky tests (1)

Postgres 15

test_pull_timeline[True]: release-arm64

Code coverage* (full report)

functions: 31.4% (7951 of 25322 functions)
lines: 49.3% (63084 of 127831 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: 058b35f at 2024-11-21T11:30:41.601Z

## Problem

We don't take advantage of queue depth generated by the compute on the pageserver. We can process getpage requests more efficiently by batching them.

## Summary of changes

Batch up incoming getpage requests that arrive within a configurable time window (`server_side_batch_timeout`). Then process the entire batch via one `get_vectored` timeline operation. By default, no merging takes place.

## Testing

* **Functional**: #9792
* **Performance**: will be done in staging/pre-prod

# Refs

* #9377
* #9376

Co-authored-by: Christian Schwarz <[email protected]>
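
As a rough illustration of the time-window batching described in this commit message, the sketch below collects requests from a standard-library channel until the window elapses, then serves them with one call. It is an assumption-laden stand-in: the real pageserver is async and reads from a connection, `collect_batch` is a hypothetical helper, and the `get_vectored` here is a toy function that only borrows the name of the timeline operation.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Blocks for the first request, then keeps collecting until the batching
/// window elapses or the sender disconnects. `window` plays the role of
/// the configurable batch timeout.
fn collect_batch(rx: &Receiver<u32>, window: Duration) -> Vec<u32> {
    let mut batch = Vec::new();
    match rx.recv() {
        Ok(first) => batch.push(first),
        Err(_) => return batch, // sender gone, nothing to batch
    }
    // The window starts when the first request of the batch arrives.
    let deadline = Instant::now() + window;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(page) => batch.push(page),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}

/// Toy stand-in for a single vectored read that serves the whole batch.
fn get_vectored(pages: &[u32]) -> Vec<String> {
    pages.iter().map(|p| format!("page-{}", p)).collect()
}
```

A caller would loop over `collect_batch` and pass each non-empty batch to `get_vectored`, so that requests queued up by the compute within one window cost a single read operation instead of one per request.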

VladLazar

Checked it out locally and stepped through it a number of times. Looks correct, but have a look at the comments.

pageserver/src/page_service.rs

This PR adds two benchmarks to demonstrate the effect of server-side getpage request batching added in #9321.

For the CPU usage, I found that the `prometheus` crate's built-in CPU usage accounts the seconds at integer granularity. That's not enough if you reduce the target benchmark runtime for local iteration. So, add a new `libmetrics` metric and report that.

The benchmarks are disabled because [on our benchmark nodes, timer resolution isn't high enough](https://neondb.slack.com/archives/C059ZC138NR/p1732264223207449). They work (no statement about quality) on my bare-metal devbox.

They will be refined and enabled once we find a fix. Candidates at time of writing are:

- #9822
- #9851

Refs:

- Epic: #9376
- Extracted from #9792
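
To see why integer-second granularity hurts short benchmark runs, consider the arithmetic sketched below. The helper names and the sample numbers are made up for illustration and are not taken from the benchmark or from the `prometheus` crate.

```rust
use std::time::Duration;

/// CPU utilization over an interval: CPU time consumed divided by wall time.
fn utilization(cpu_before: Duration, cpu_after: Duration, wall: Duration) -> f64 {
    (cpu_after - cpu_before).as_secs_f64() / wall.as_secs_f64()
}

/// Truncates to whole seconds, mimicking a counter with integer granularity.
fn truncate_to_secs(d: Duration) -> Duration {
    Duration::from_secs(d.as_secs())
}

/// Hypothetical 3-second run in which the process CPU counter advances
/// from 10.2 s to 12.9 s. Returns (exact, coarse) utilization.
fn example() -> (f64, f64) {
    let before = Duration::from_millis(10_200);
    let after = Duration::from_millis(12_900);
    let wall = Duration::from_secs(3);
    let exact = utilization(before, after, wall); // about 0.90
    let coarse = utilization(truncate_to_secs(before), truncate_to_secs(after), wall); // about 0.67
    (exact, coarse)
}
```

With these illustrative numbers the truncated counter under-reports utilization by roughly a quarter, and the error grows as the run gets shorter, which is the motivation for a higher-resolution metric.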

problame · 2024-11-29T14:22:03Z

The changes in this PR will be largely replaced by

page_service: rewrite batching to work without a timeout #9851

which is currently stacked on top.

Closing this to avoid one merge-rebase-CI roundtrip.

This was referenced Nov 18, 2024

pageserver: batch get page requests and serve them with one vectored get #9377

Open

feat(page_service): timeout-based batching of requests #9321

Merged

problame linked an issue Nov 18, 2024 that may be closed by this pull request

pageserver: batch get page requests and serve them with one vectored get #9377

Open

Base automatically changed from vlad/pageserver-merge-get-page-requests to main November 18, 2024 20:24

problame changed the title ~~WIP: page_service: add basic testcase for merging~~ WIP: page_service: add basic benchmark Nov 18, 2024

problame mentioned this pull request Nov 18, 2024

Epic: get page throughput improvements #9376

Open

problame changed the title ~~WIP: page_service: add basic benchmark~~ WIP: page_service: add basic test & benchmark Nov 18, 2024

problame added 2 commits November 18, 2024 23:57

WIP: page_service: add basic testcase for merging

0689965

The steps in the test work in neon_local + psql but for some reason they don't work in the test. Asked compute team on Slack for help: https://neondb.slack.com/archives/C04DGM6SMTM/p1731952688386789

got it working and turn it more into a benchmark

15e21c7

problame force-pushed the problame/merge-getpage-test branch from a149e89 to 15e21c7 November 18, 2024 22:57

problame changed the title ~~WIP: page_service: add basic test & benchmark~~ WIP: page_service: add basic sequential scan benchmark Nov 18, 2024

problame added 2 commits November 19, 2024 18:40

compiles

61ff84a

fixes

911946a

problame changed the title ~~WIP: page_service: add basic sequential scan benchmark~~ WIP: batching: bugfixes & benchmark Nov 19, 2024

problame added 6 commits November 20, 2024 12:48

parametrize more test

5cc0059

switch back to tokio::time::sleep, to get the numbers

b974616

=> https://www.notion.so/neondatabase/benchmarking-notes-143f189e004780c4a630cb5f426e39ba?pvs=4#144f189e00478054b8a3e325735ffa19 => unacceptable

make it a proper benchmark

f2de5b5

collect CPU utilization

e80ce97

bench fixups

75041cb

page_service: add benchmark for batching

b695907

This PR adds a benchmark to demonstrate the effect of server-side getpage request batching added in #9321. Refs: - Epic: #9376 - Extracted from #9792

problame mentioned this pull request Nov 20, 2024

page_service: add benchmark for batching #9820

Merged

problame added 3 commits November 20, 2024 14:22

Revert "switch back to tokio::time::sleep, to get the numbers"

aa695b2

This reverts commit b974616.

Merge branch 'problame/batching-benchmark' into problame/merge-getpag…

88d52b3

…e-test

fixup whitespace stuff

b299eb1

problame changed the base branch from main to problame/batching-benchmark November 20, 2024 13:24

Revert "Revert "switch back to tokio::time::sleep, to get the numbers""

af95320

This reverts commit aa695b2.

problame changed the title ~~WIP: batching: bugfixes & benchmark~~ page_service: refactor & minor fixes for batching code Nov 20, 2024

problame requested a review from VladLazar November 20, 2024 13:36

problame changed the title ~~page_service: refactor & minor fixes for batching code~~ page_service: getpage batching: refactor & minor fixes Nov 20, 2024

problame added 2 commits November 21, 2024 11:16

high-resolution CPU usage

e82deb2

pytest.approx; #9820 (comment)

3375f28

problame marked this pull request as ready for review November 21, 2024 10:24

problame requested a review from a team as a code owner November 21, 2024 10:24

problame added 2 commits November 21, 2024 11:25

Merge remote-tracking branch 'origin/main' into problame/batching-ben…

ff0aa15

…chmark

Merge branch 'problame/batching-benchmark' into problame/merge-getpag…

058b35f

…e-test

VladLazar approved these changes Nov 21, 2024

pageserver/src/page_service.rs (three resolved review threads)

This was referenced Nov 21, 2024

page_service: unit-test batching logic #9834

Open

page_service: batching needless waits for unbatchable requests #9835

Closed

Base automatically changed from problame/batching-benchmark to main November 25, 2024 15:54

problame closed this Nov 29, 2024
