Improve hashtable prefetching #4585

Merged: 3 commits into carbon-language:trunk on Nov 26, 2024
Conversation

chandlerc (Contributor):
I had removed most but not all of the hashtable prefetching during development because I wasn't confident in the benchmarking results. However, I never revisited this once the benchmarking infrastructure improved and started producing solid, stable results.

This factors the two interesting prefetch patterns I've seen for this style of hashtable into helpers that are always called, and provides macros that can be set during the build to configure exactly which prefetch strategies are enabled.

Benchmarking these changes and gaining confidence in the results is very frustrating -- even now with the improved infrastructure, the noise is much higher than I would like. But it seems clear that *some* prefetching is a significant win, while enabling both strategies at once generates too much prefetch traffic. The entry group prefetch appears to be significantly more effective, both for the most interesting microbenchmarks and, perhaps most importantly, for our compilation benchmarks. There, AMD is helped substantially and M1 seems to be helped somewhat (although that is harder to measure).

AMD server benchmark numbers:

```
name                                              old cpu/op   new cpu/op   delta
BM_CompileAPIFileDenseDecls<Phase::Lex>/256       35.0µs ± 2%  34.2µs ± 2%  -2.40%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/1024       156µs ± 2%   151µs ± 2%  -3.18%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/4096       625µs ± 1%   605µs ± 1%  -3.22%  (p=0.000 n=19+18)
BM_CompileAPIFileDenseDecls<Phase::Lex>/16384     2.79ms ± 1%  2.69ms ± 2%  -3.67%  (p=0.000 n=17+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/65536     12.1ms ± 1%  11.6ms ± 1%  -4.30%  (p=0.000 n=17+18)
BM_CompileAPIFileDenseDecls<Phase::Lex>/262144    56.6ms ± 1%  53.8ms ± 1%  -5.00%  (p=0.000 n=18+17)
BM_CompileAPIFileDenseDecls<Phase::Parse>/256     61.1µs ± 2%  61.7µs ± 1%  +0.87%  (p=0.000 n=19+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/1024     288µs ± 1%   290µs ± 1%  +0.55%  (p=0.004 n=20+20)
BM_CompileAPIFileDenseDecls<Phase::Parse>/4096    1.16ms ± 1%  1.16ms ± 1%  -0.54%  (p=0.000 n=17+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/16384   4.98ms ± 1%  4.91ms ± 1%  -1.39%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/65536   20.9ms ± 1%  20.5ms ± 1%  -1.86%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/262144  92.1ms ± 1%  90.2ms ± 1%  -2.12%  (p=0.000 n=18+20)
BM_CompileAPIFileDenseDecls<Phase::Check>/256     1.16ms ± 2%  1.16ms ± 1%    ~     (p=0.931 n=19+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/1024    2.17ms ± 2%  2.16ms ± 1%    ~     (p=0.247 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/4096    6.07ms ± 1%  6.04ms ± 1%  -0.48%  (p=0.007 n=19+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/16384   22.4ms ± 1%  22.2ms ± 1%  -0.99%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/65536   93.3ms ± 1%  92.2ms ± 1%  -1.23%  (p=0.000 n=20+18)
BM_CompileAPIFileDenseDecls<Phase::Check>/262144   400ms ± 1%   391ms ± 1%  -2.15%  (p=0.000 n=20+18)
```

Review threads on common/raw_hashtable.h: resolved.
@chandlerc chandlerc added this pull request to the merge queue Nov 26, 2024
Merged via the queue into carbon-language:trunk with commit a2af7ad Nov 26, 2024
8 checks passed
@chandlerc chandlerc deleted the prefetch branch November 26, 2024 10:31
bricknerb pushed a commit to bricknerb/carbon-lang that referenced this pull request Nov 28, 2024