Improve hashtable prefetching #4585

Merged: 3 commits into carbon-language:trunk on Nov 26, 2024
Conversation

chandlerc (Contributor):
I had removed most but not all of the hashtable prefetching during development because I wasn't confident in the benchmarking results. However, I never revisited this once the benchmarking infrastructure improved and started producing solid, stable results.

This factors the two interesting prefetch patterns I've seen for this style of hashtable into helpers that are always called, and provides macros that can be set during the build to configure exactly which prefetch strategies are enabled.

Benchmarking these changes and gaining confidence in the results is very frustrating -- even now with the improved infrastructure, the noise is much higher than I would like. But it seems clear that *some* prefetching is a significant win, while enabling both strategies at once generates too much prefetch traffic. The entry group prefetch appears to be significantly more effective, both for the most interesting microbenchmarks and, perhaps most importantly, for our compilation benchmarks. There, AMD is helped substantially and M1 seems to be helped somewhat (although that is harder to measure).

AMD server benchmark numbers:

```
name                                              old cpu/op   new cpu/op   delta
BM_CompileAPIFileDenseDecls<Phase::Lex>/256       35.0µs ± 2%  34.2µs ± 2%  -2.40%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/1024       156µs ± 2%   151µs ± 2%  -3.18%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/4096       625µs ± 1%   605µs ± 1%  -3.22%  (p=0.000 n=19+18)
BM_CompileAPIFileDenseDecls<Phase::Lex>/16384     2.79ms ± 1%  2.69ms ± 2%  -3.67%  (p=0.000 n=17+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/65536     12.1ms ± 1%  11.6ms ± 1%  -4.30%  (p=0.000 n=17+18)
BM_CompileAPIFileDenseDecls<Phase::Lex>/262144    56.6ms ± 1%  53.8ms ± 1%  -5.00%  (p=0.000 n=18+17)
BM_CompileAPIFileDenseDecls<Phase::Parse>/256     61.1µs ± 2%  61.7µs ± 1%  +0.87%  (p=0.000 n=19+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/1024     288µs ± 1%   290µs ± 1%  +0.55%  (p=0.004 n=20+20)
BM_CompileAPIFileDenseDecls<Phase::Parse>/4096    1.16ms ± 1%  1.16ms ± 1%  -0.54%  (p=0.000 n=17+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/16384   4.98ms ± 1%  4.91ms ± 1%  -1.39%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/65536   20.9ms ± 1%  20.5ms ± 1%  -1.86%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/262144  92.1ms ± 1%  90.2ms ± 1%  -2.12%  (p=0.000 n=18+20)
BM_CompileAPIFileDenseDecls<Phase::Check>/256     1.16ms ± 2%  1.16ms ± 1%    ~     (p=0.931 n=19+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/1024    2.17ms ± 2%  2.16ms ± 1%    ~     (p=0.247 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/4096    6.07ms ± 1%  6.04ms ± 1%  -0.48%  (p=0.007 n=19+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/16384   22.4ms ± 1%  22.2ms ± 1%  -0.99%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/65536   93.3ms ± 1%  92.2ms ± 1%  -1.23%  (p=0.000 n=20+18)
BM_CompileAPIFileDenseDecls<Phase::Check>/262144   400ms ± 1%   391ms ± 1%  -2.15%  (p=0.000 n=20+18)
```

Review threads on common/raw_hashtable.h: resolved.
@chandlerc chandlerc added this pull request to the merge queue Nov 26, 2024
Merged via the queue into carbon-language:trunk with commit a2af7ad Nov 26, 2024
8 checks passed
@chandlerc chandlerc deleted the prefetch branch November 26, 2024 10:31
bricknerb pushed a commit to bricknerb/carbon-lang that referenced this pull request Nov 28, 2024