Improve hashtable prefetching (carbon-language#4585)
I had removed most, but not all, of the hashtable prefetching during
development because I wasn't confident in the benchmarking results.
However, I never revisited this after the benchmarking infrastructure
improved and started producing solid, stable results.

This factors the two interesting prefetch patterns I've seen for this
style of hashtable into helpers that are always called, and provides
macros that can be used during the build to configure exactly which
prefetch strategies are enabled.
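
As a rough sketch of the shape of this (illustrative only; the real
helpers are methods on the hashtable's view type in
`common/raw_hashtable.h`, shown in the diff below, and the free-function
name here is made up for exposition), each strategy is an always-called
helper whose body is selected by a build macro, e.g. by defining
`-DCARBON_ENABLE_PREFETCH_METADATA=1` as part of the build:

```
// Illustrative sketch only, not the actual patch.
#ifndef CARBON_ENABLE_PREFETCH_METADATA
#define CARBON_ENABLE_PREFETCH_METADATA 0
#endif

// An always-called helper: the prefetch is compiled in or out based on the
// macro, so call sites never need any conditional compilation of their own.
inline auto PrefetchMetadataExample(const void* metadata) -> void {
  if constexpr (CARBON_ENABLE_PREFETCH_METADATA) {
    // Read prefetch (rw=0) with low temporal locality (locality=1): we expect
    // a brief use of the metadata and then a return to application code.
    __builtin_prefetch(metadata, /*rw=*/0, /*locality=*/1);
  }
}
```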

Benchmarking these strategies and gaining confidence in the results is
very frustrating -- even now, with the improved infrastructure, the
noise is much higher than I would like. But it seems clear that *some*
prefetching is a significant win, and that enabling both strategies
produces too much prefetch traffic. The entry group prefetch appears to
be significantly more effective, both for the most interesting of the
microbenchmarks and, maybe most importantly, for our compilation
benchmarks. There, AMD is helped substantially and M1 seems to be
helped somewhat (although that is harder to measure).

AMD server benchmark numbers:
```
name                                              old cpu/op   new cpu/op   delta
BM_CompileAPIFileDenseDecls<Phase::Lex>/256       35.0µs ± 2%  34.2µs ± 2%  -2.40%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/1024       156µs ± 2%   151µs ± 2%  -3.18%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/4096       625µs ± 1%   605µs ± 1%  -3.22%  (p=0.000 n=19+18)
BM_CompileAPIFileDenseDecls<Phase::Lex>/16384     2.79ms ± 1%  2.69ms ± 2%  -3.67%  (p=0.000 n=17+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/65536     12.1ms ± 1%  11.6ms ± 1%  -4.30%  (p=0.000 n=17+18)
BM_CompileAPIFileDenseDecls<Phase::Lex>/262144    56.6ms ± 1%  53.8ms ± 1%  -5.00%  (p=0.000 n=18+17)
BM_CompileAPIFileDenseDecls<Phase::Parse>/256     61.1µs ± 2%  61.7µs ± 1%  +0.87%  (p=0.000 n=19+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/1024     288µs ± 1%   290µs ± 1%  +0.55%  (p=0.004 n=20+20)
BM_CompileAPIFileDenseDecls<Phase::Parse>/4096    1.16ms ± 1%  1.16ms ± 1%  -0.54%  (p=0.000 n=17+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/16384   4.98ms ± 1%  4.91ms ± 1%  -1.39%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/65536   20.9ms ± 1%  20.5ms ± 1%  -1.86%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Parse>/262144  92.1ms ± 1%  90.2ms ± 1%  -2.12%  (p=0.000 n=18+20)
BM_CompileAPIFileDenseDecls<Phase::Check>/256     1.16ms ± 2%  1.16ms ± 1%    ~     (p=0.931 n=19+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/1024    2.17ms ± 2%  2.16ms ± 1%    ~     (p=0.247 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/4096    6.07ms ± 1%  6.04ms ± 1%  -0.48%  (p=0.007 n=19+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/16384   22.4ms ± 1%  22.2ms ± 1%  -0.99%  (p=0.000 n=20+19)
BM_CompileAPIFileDenseDecls<Phase::Check>/65536   93.3ms ± 1%  92.2ms ± 1%  -1.23%  (p=0.000 n=20+18)
BM_CompileAPIFileDenseDecls<Phase::Check>/262144   400ms ± 1%   391ms ± 1%  -2.15%  (p=0.000 n=20+18)
```
chandlerc authored and bricknerb committed Nov 28, 2024
1 parent a66d2bf commit 4b136fb
Showing 1 changed file with 58 additions and 6 deletions.
64 changes: 58 additions & 6 deletions common/raw_hashtable.h
@@ -127,6 +127,23 @@
// order of this observation is also not guaranteed.
namespace Carbon::RawHashtable {

// Which prefetch strategies are enabled can be controlled via macros so that
// builds can experiment with different configurations.
//
// Currently, benchmarking on both modern AMD and ARM CPUs seems to indicate
// that entry group prefetching is more beneficial than metadata prefetching,
// but that the benefit degrades when both are enabled. This determined our
// current defaults: metadata prefetch disabled, entry group prefetch enabled.
//
// Override these by defining them as part of the build explicitly to either `0`
// or `1`. If left undefined, the defaults will be supplied.
#ifndef CARBON_ENABLE_PREFETCH_METADATA
#define CARBON_ENABLE_PREFETCH_METADATA 0
#endif
#ifndef CARBON_ENABLE_PREFETCH_ENTRY_GROUP
#define CARBON_ENABLE_PREFETCH_ENTRY_GROUP 1
#endif

// If allocating storage, allocate a minimum of one cacheline of group metadata
// or a minimum of one group, whichever is larger.
constexpr ssize_t MinAllocatedSize = std::max<ssize_t>(64, MaxGroupSize);
@@ -405,6 +422,29 @@ class ViewImpl {
EntriesOffset(alloc_size_));
}

  // Prefetch the metadata prior to probing. This lets us overlap as much of
  // the memory access latency as possible with the hashing of a key or other
  // latency-bound work that precedes probing.
auto PrefetchMetadata() const -> void {
if constexpr (CARBON_ENABLE_PREFETCH_METADATA) {
// Prefetch with a "low" temporal locality as we're primarily expecting a
// brief use of the metadata and then to return to application code.
__builtin_prefetch(metadata(), /*read*/ 0, /*low-locality*/ 1);
}
}

  // Prefetch an entry group. This prefetches for reading as it is primarily
  // expected to be used on the probing path, and a subsequent write isn't
  // especially slowed down by a read prefetch. We don't want to synthesize
  // writes unless we *know* we're going to write.
static auto PrefetchEntryGroup(const EntryT* entry_group) -> void {
if constexpr (CARBON_ENABLE_PREFETCH_ENTRY_GROUP) {
// Prefetch with a "low" temporal locality as we're primarily expecting a
// brief use of the entries and then to return to application code.
__builtin_prefetch(entry_group, /*read*/ 0, /*low-locality*/ 1);
}
}

ssize_t alloc_size_;
Storage* storage_;
};
@@ -522,6 +562,9 @@ class BaseImpl {
return alloc_size() == small_alloc_size();
}

  // Wrapper for `ViewImplT::PrefetchMetadata`; see that method for details.
auto PrefetchStorage() const -> void { view_impl_.PrefetchMetadata(); }

auto Construct(Storage* small_storage) -> void;
auto Destroy() -> void;
auto CopySlotsFrom(const BaseImpl& arg) -> void;
@@ -688,9 +731,7 @@ template <typename InputKeyT, typename InputValueT, typename InputKeyContextT>
template <typename LookupKeyT>
auto ViewImpl<InputKeyT, InputValueT, InputKeyContextT>::LookupEntry(
LookupKeyT lookup_key, KeyContextT key_context) const -> EntryT* {
-  // Prefetch with a "low" temporal locality as we're primarily expecting a
-  // brief use of the storage and then to return to application code.
-  __builtin_prefetch(storage_, /*read*/ 0, /*low-locality*/ 1);
+  PrefetchMetadata();

ssize_t local_size = alloc_size_;
CARBON_DCHECK(local_size > 0);
@@ -707,15 +748,20 @@ auto ViewImpl<InputKeyT, InputValueT, InputKeyContextT>::LookupEntry(
do {
ssize_t group_index = s.index();

+    // Load the group's metadata and prefetch the entries for this group. The
+    // prefetch here helps hide key access latency while we're matching the
+    // metadata.
+    MetadataGroup g = MetadataGroup::Load(local_metadata, group_index);
+    EntryT* group_entries = &local_entries[group_index];
+    PrefetchEntryGroup(group_entries);
+
// For each group, match the tag against the metadata to extract the
// potentially matching entries within the group.
-    MetadataGroup g = MetadataGroup::Load(local_metadata, group_index);
auto metadata_matched_range = g.Match(tag);
if (LLVM_LIKELY(metadata_matched_range)) {
// If any entries in this group potentially match based on their metadata,
// walk each candidate and compare its key to see if we have definitively
// found a match.
-      EntryT* group_entries = &local_entries[group_index];
auto byte_it = metadata_matched_range.begin();
auto byte_end = metadata_matched_range.end();
do {
@@ -853,6 +899,7 @@ auto BaseImpl<InputKeyT, InputValueT, InputKeyContextT>::InsertImpl(
LookupKeyT lookup_key, KeyContextT key_context)
-> std::pair<EntryT*, bool> {
CARBON_DCHECK(alloc_size() > 0);
PrefetchStorage();

uint8_t* local_metadata = metadata();

@@ -877,11 +924,16 @@ auto BaseImpl<InputKeyT, InputValueT, InputKeyContextT>::InsertImpl(

for (ProbeSequence s(hash_index, alloc_size());; s.Next()) {
ssize_t group_index = s.index();

+    // Load the group's metadata and prefetch the entries for this group. The
+    // prefetch here helps hide key access latency while we're matching the
+    // metadata.
auto g = MetadataGroup::Load(local_metadata, group_index);
+    EntryT* group_entries = &local_entries[group_index];
+    ViewImplT::PrefetchEntryGroup(group_entries);
+
auto control_byte_matched_range = g.Match(tag);
if (control_byte_matched_range) {
-      EntryT* group_entries = &local_entries[group_index];
auto byte_it = control_byte_matched_range.begin();
auto byte_end = control_byte_matched_range.end();
do {