Improve hashtable prefetching #4585
Merged
Conversation
I had removed most but not all of the hashtable prefetching during development because I wasn't confident in the benchmarking results. However, I never revisited this once the benchmarking infrastructure improved and there were solid and stable results.

This factors the two interesting prefetch patterns I've seen for this style of hashtable into helpers that are always called, and provides macros that can be used during the build to configure exactly which prefetch strategies are enabled.

Benchmarking these and gaining confidence is very frustrating -- even now with the improved infrastructure, the noise is much higher than I would like. But it seems clear that *some* prefetching is a significant win. It also seems like enabling both results in too much prefetch traffic. And the entry group prefetch appears to be significantly more effective, both for the most interesting of the microbenchmarks and maybe most importantly for our compilation benchmarks. There, AMD is helped substantially and M1 seems to be helped some (although harder to measure).
danakj approved these changes on Nov 25, 2024.
bricknerb pushed a commit to bricknerb/carbon-lang that referenced this pull request on Nov 28, 2024.
AMD server benchmark numbers:

```
name                                              old cpu/op   new cpu/op   delta
BM_CompileAPIFileDenseDecls<Phase::Lex>/256       35.0µs ± 2%  34.2µs ± 2%  -2.40%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/1024       156µs ± 2%   151µs ± 2%  -3.18%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/4096       625µs ± 1%   605µs ± 1%  -3.22%  (p=0.000 n=19+18)
BM_CompileAPIFileDenseDecls<Phase::Lex>/16384     2.79ms ± 1%  2.69ms ± 2%  -3.67%  (p=0.000 n=17+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/65536     12.1ms ± 1%  11.6ms ± 1%  -4.30%  (p=0.000 n=17+18)
BM_CompileAPIFileDenseDecls<Phase::Lex>/262144    56.6ms ± 1%  53.8ms ± 1%  -5.00%  (p=0.000 n=18+17)
BM_CompileAPIFileDenseDecls<Phase::Parse>/256     61.1µs ± 2%  61.7µs ± 1%  +0.87%  (p=0.000 n=19+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/1024     288µs ± 1%   290µs ± 1%  +0.55%  (p=0.004 n=20+20)
BM_CompileAPIFileDenseDecls<Phase::Parse>/4096    1.16ms ± 1%  1.16ms ± 1%  -0.54%  (p=0.000 n=17+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/16384   4.98ms ± 1%  4.91ms ± 1%  -1.39%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/65536   20.9ms ± 1%  20.5ms ± 1%  -1.86%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/262144  92.1ms ± 1%  90.2ms ± 1%  -2.12%  (p=0.000 n=18+20)
BM_CompileAPIFileDenseDecls<Phase::Check>/256     1.16ms ± 2%  1.16ms ± 1%     ~    (p=0.931 n=19+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/1024    2.17ms ± 2%  2.16ms ± 1%     ~    (p=0.247 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/4096    6.07ms ± 1%  6.04ms ± 1%  -0.48%  (p=0.007 n=19+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/16384   22.4ms ± 1%  22.2ms ± 1%  -0.99%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/65536   93.3ms ± 1%  92.2ms ± 1%  -1.23%  (p=0.000 n=20+18)
BM_CompileAPIFileDenseDecls<Phase::Check>/262144   400ms ± 1%   391ms ± 1%  -2.15%  (p=0.000 n=20+18)
```