Hash table design #1

JacksonAllan · 2023-12-12T16:31:12Z

JacksonAllan
Dec 12, 2023
Maintainer

Hello fellow hash-table developers :) Hope you don’t mind the unsolicited mention.

I’ve created this discussion for two reasons:

To direct your attention to this repository and the associated Reddit post, via which I’ve just published a new performance-oriented hash table in C. As I mention in the post, this table is the product of about a year of part-time research and experimentation with different hash table designs, primarily with variations of Robin Hood, SIMD, in-table chaining, and combinations thereof. Ultimately, I reached the same general conclusion as Malte Skarupke, namely that in-table chaining is a very good general-purpose solution. However, my design is not a carbon copy of his Bytell design. Most prominently, I trade an extra byte of overhead per bucket (i.e. two bytes instead of one) for a performance boost across a range of benchmarks.
To open a space for communication among hash table experts for the sake of sharing experience and ideas in future. I, for one, have a relatively comprehensive benchmark of C (and some C++) hash tables that I hope to open-source and publish within the next few months and that I think will interest the users I've mentioned.

Thanks for reading!

Mentioned users:
@martinus: Developer of two popular Robin Hood-based C++ tables and author of two comprehensive sets of C++ hash table benchmarks.
@Tessil: Developer of a popular Robin Hood-based C++ table, among other tables, and author of a set hash table benchmarks.
@skarupke: Developer of Bytell and author of many posts and a presentation about developing fast hash tables in C++.
@attractivechaos: Developer of the extremely popular Klib/Khash library and author of many hash table-related blog posts.
@camel-cdr: Has experimented with SIMD-based tables in C.
@stclib: Developer of the popular STC C library, which includes an open-addressing hash table.
@P-p-H-d: Developer of the popular M*LIB C library, which includes an open-addressing hash table.
@fowles: One of the developers of Google’s Abseil/Swiss table and the author of two presentations about it.

martinus · 2023-12-12T16:58:31Z

martinus
Dec 12, 2023

I'd say the performance gold standard is currently boost unordered_flat_map

2 replies

cmazakas Dec 12, 2023

Yeah, @joaquintides deserved an easy mention!

JacksonAllan Dec 12, 2023
Maintainer Author

Boost was definitely the one that I was most trying to match. It has the best insert speed out of this group, and I couldn't figure out how it achieves such fast iteration. On my system, Boost also outperforms Abseil rather significantly. But there are a few caveats:

I believe that Boost isn't completely tombstone-free, so the tables I refer to as "hybrids" or "in-table chaining tables" (namely Bytell and Verstable) might have an edge when deletions are frequent.
As I mention in the preface to the benchmark, Boost is among the hash tables not reaching the 0.95 load factor here. While Boost achieves a nice flat line in most benchmarks, we can see it just starting to spike at high load factors in the benchmarks that involve failed lookups. Those spikes would go higher if the table were allowed to reach the 0.95 limit (I experimented with modifying Boost's code to this end, but it still rehashed early). In this regard, only the hybrid tables seem to handle load factors above 0.9 without much of a performance penalty (in relation to their own performance at lower load factors and outside the context of the insertion benchmarks).
The hybrid tables actually outperform Boost in some of the "Time to lookup 1000 existing elements with N elements in map" benchmarks, at least on my system.

martinus · 2023-12-12T17:48:55Z

martinus
Dec 12, 2023

Btw, I saw that you use my older robin_hood map in the benchmark. Could you switch to unordered_dense https://github.com/martinus/unordered_dense instead? It's newer with faster iteration.

16 replies

JacksonAllan Dec 15, 2023
Maintainer Author

@attractivechaos

Perhaps the cache is still hot, so the change doesn't make much difference.

Right. If we perform two lookups of the same key in succession, then the relevant parts of the metadata and bucket arrays should already be cached when the second lookup is performed. Hence, the cost incurred by the second lookup should only be that of the the hash function and the arithmetic performed by the lookup algorithm. The difference would probably be apparent if the hash function were expensive (e.g. string keys).

PS: just realized what you mean by "between each datapoint and the previous one". Actually in the current implementation, for each data point, the hash table grows from size zero. The way you described is better:

Sorry, I struggle to express myself clearly in this regard. Let me try to further explain my chain of thought: If, for each measurement, we record the peak memory usage between the very start of the benchmark and the current time, then I would expect each measurement to be at least as high as the previous one. But what we actually see on the graph is the datapoints jumping up and down (I’m assuming the line connecting datapoints indicates their chronological/key-count order). Hence, I concluded that we must be recording peak memory usage between the current measurement and the previous one, and the highest measurements occur whenever a rehash happened (i.e. two tables existed in parallel) during that interval. But after reading your comment, I’m not too sure.

As to load factor, it is complicated. The ideal load factor varies with implementations: a load factor ideal for one library may not be ideal for another. Due to this, I felt it is hard to control the load factor, so I decided to just evaluate the default load factor set by each developer.

Right, it’s a really difficult issue. My approach – as you’ve seen – is to set each table’s load factor high (0.95), measure frequently, and then try to compare entire plots rather than individual measurements, bearing in mind the range of load factors that each plot covers. But this approach has its own share of weaknesses and caveats. For example:

Not all hash tables follow a power-of-two growth policy. When we add tables that follow a prime-number growth policy, their plots are rather difficult to compare to the others because they're totally unaligned.
Some tables can’t reach the 0.95 load factor, as I mentioned earlier. Consequently, their plots don’t quite align with the others in any of the benchmarks. Worse, in the insertion benchmark, which is cumulative, these tables benefit from a twofold advantage: the high measurements that other tables incur in the 0.875+ range get replaced by low measurements taken from the 0.4375-0.475 range, in which the other tables never get measured. Another approach I considered is to set the max load factor for all tables to the lowest one they all support, but then the question becomes, why should we penalize the tables that can reasonably go into the highest range by omitting it from the benchmarks?
When we set the load factor to 0.95, the graphs only show performance in the range of 0.475 to 0.95. This is almost entirely outside the range in which some tables are even designed to operate. I’m relying on the assumption that a table’s performance at about 0.5 is similar to its performance at lower load factors. That’s probably true for the SIMD tables, but the tables that rearrange keys during inserts – most notably the Robin Hood tables but also Verstable, which may move one key and must hash it to do so – could well be noticeably faster on insert at load factors below 0.5.
Most users will just use the default load factor anyway...

As for your benchmark, if I'm thinking clearly, any one datapoint is particularly vulnerable to the load factor issue because we're measuring peak memory usage. If we measure two tables at the same key count, but that count triggered an extra growth/rehash in only one of them, then that table’s peak memory usage will measure much higher than the other’s – especially because of the parallel tables during the rehash – even if their actually memory efficiency is the same. I’m not sure how to solve this issue completely. But generally, I think that the more frequently you measure, the clearer the big performance picture (for memory usage if not also speed) will be from your graph.

martinus Dec 15, 2023

@joaquintides Thanks for the hint. Nonetheless, both boost and unordered_dense stall with this line.

I just saw that your hash function doesn't actually hash: https://github.com/attractivechaos/udb2/blob/master/common.c#L69-L72

Many hash tables (boost, abseil, unordered_dense) require high quality hash functions. Especially when you set is_avalanching to mark hashes as being of high quality. Ideally, use the default hash.

martinus Dec 15, 2023

Thanks @JacksonAllan for the benchmark! They are very insightful. It would be interesting to somehow mark the 80% fullness in the graphs, to see where the hashtables would actually prefer to resize.

JacksonAllan Dec 15, 2023
Maintainer Author

Many hash tables (boost, abseil, unordered_dense) require high quality hash functions.

The same could be said for Verstable. It stores a 4-bit fragment (a.k.a. "fingerprint") of each key's hash code and uses that data during lookups to skip most unmatching key checks as it traverses the chain of keys belonging the home bucket. Those four bits are the highest bits of the 64-bit hash code, so by returning a 32-bit value, hash_fn essentially disables this functionality, even if those 32 bits are effectively random because the keys are themselves random or the hash function is actually hashing. Verstable can plod along happily in this state, but it will incur more cache misses as it taps into the buckets array more often to do more key comparisons — not to mention the cost of the actual key comparison function. Perhaps it too, like Boost and unordered_dense, should by default apply an extra mixing step to protect against bad hash functions.

Abseil, Boost, F14 (which I didn't include because it doesn't support MinGW), and your unordered_dense all break up the hash code in a similar fashion. If I recall correctly, they all use the low bits for the "fragment" rather than the high bits. But I guess any table that splits hash codes thusly will be vulnerable to some extent, and that's why @attractivechaos is finding that Boost and unordered_dense "stall" when he disables their internal mixers.

It would be interesting to somehow mark the 80% fullness in the graphs, to see where the hashtables would actually prefer to resize.

The range (at least for unordered_dense) on display is 0.475 to 0.95 and the x-axis is linear, so 0.8 should fall about 70% of the way between any trough and subsequent peak, if my math is correct. But I'll do a lower-load-factor run with both your maps anyway and share the results. That should give a clearer picture of how they perform in the range in which they were intended to operate.

attractivechaos Dec 15, 2023

@JacksonAllan With my plot, I usually think a library is represented by a cloud of points. We evaluate the relative positions of clouds to compare libraries. It is not straightforward to compare individual points as you said. Nevertheless, I also found the shape of each curve looks weird probably due to interferences between different runs using the same heap. I just created udb3 hopefully to alleviate the issue. This time, data are inserted to/deleted from a single table with run time and peak memory measured at 11 checkpoints by default. The zigzag shape is caused by rehashing.

Also thank you for explaining why is_avalanching is not working in my case. I tried a few other hash functions with boost. I can confirm is_avalanching helps performance. Nonetheless, because the lower 32 bits are already random and populating higher 32 bits adds extra CPU cycles, the overall performance is not noticeably improved anyway. I couldn't get the desired performance out of unordered_dense. I will leave it out for now.

joaquintides · 2023-12-12T17:52:39Z

joaquintides
Dec 12, 2023

Hi Jackson, thanks for your effort and for your commitment to open innovation! Happy to be part of the discussion. I couldn't find the benchmark code anywhere, maybe you can publish it?

1 reply

JacksonAllan Dec 12, 2023
Maintainer Author

The plan was to eventually publish the benchmark along with an article comparing about a dozen hash tables (most of them in C). I was planning do that after integrating Verstable in to my C generic container library, but since there's some interest in the benchmarks now, I think it will be better to at least share them privately, if not publicly, as soon as possible.

JacksonAllan · 2023-12-12T18:30:55Z

JacksonAllan
Dec 12, 2023
Maintainer Author

Just a quick addition: Reddit seems to be hiding the comment I made actually introducing the hash table in that thread. Most of the information in the hidden comment is repeated here or in the repo's README, but for the sake of completeness, I've recreated it on Pastebin.

0 replies

camel-cdr · 2023-12-12T20:19:30Z

camel-cdr
Dec 12, 2023

Hi, looks great! I ran it against my hash map fuzzer and didn't find any problems. It's basically structured like your test against stl, but using the fuzzers input instead of rand for "randomness".
At some point I'll come around to touching my hash table implementation again, then I'll probably take a closer look at yours. It'll definitely be in my benchmarks.

0 replies

P-p-H-d · 2023-12-13T19:06:06Z

P-p-H-d
Dec 13, 2023

Hi Jackson and thanks for the invitation. I have included Verstable into M*LIB benchmark

3 replies

tylov Jan 17, 2024

Hi P-p-H-d,

Thanks for making this comparison benchmark suite!
It's often surprising how different relative results are on different hardware/platforms.
On my desktop PC with a AMD 5700X cpu, STC beats STL in every tests:

Note 1: For dict[rand_get()] = rand_get() , it's undefined which rand_get() is evaluated first, and can make checksum wrong.
Shockingly, with Mingw64 13.2 -std=g++14, the evaluation is actually opposite of gcc 13.1 on linux!!
Note 2: I am using a different sort algo than the default vec:
#define i_TYPE vec_float, float
#define i_more // extend template param life to next include
#include "stc/vec.h"
#include "stc/algo/quicksort.h"
... vec_float_quicksort(&a1);

I can make a PR, where I fix issues, and add some missing benchmarks for STC.

Linux/WSL:

STL (gcc 13.1 WSL):
List time 465.33 ms for n = 10000000 [r=1717165952]
Array time 644.24 ms for n = 100000000 [r=2242657024]
Rbtree time 592.48 ms for n = 1000000 [r=2959499360]
dict(m) time 773.11 ms for n = 1000000 [r=626692160]
dictBig time 945.28 ms for n = 1000000 [r=500000]
Sort time 737.53 ms for n = 10000000 [r=444]

STC (gcc 13.1 WSL):
List time 337.46 ms for n = 10000000 [r=1717165952]
Array time 262.89 ms for n = 100000000 [r=2242657024]
Rbtree time 478.42 ms for n = 1000000 [r=2959499360]
dict time 25.96 ms for n = 1000000 [r=626692160]
dictBig time 444.03 ms for n = 1000000 [r=500000]
Sort time 732.08 ms for n = 10000000 [r=444]

Windows:

STL (gcc 13.2 mingw64):
List time 1315.99 ms for n = 10000000 [r=1717165952]
Array time 833.55 ms for n = 100000000 [r=2242657024]
Rbtree time 668.78 ms for n = 1000000 [r=2959499360]
dict(m) time 697.05 ms for n = 1000000 [r=2332807200]
dictBig time 1146.88 ms for n = 1000000 [r=500000]
Sort time 768.45 ms for n = 10000000 [r=444]

STC (gcc 13.2 mingw64):
List time 1293.27 ms for n = 10000000 [r=1717165952]
Array time 515.29 ms for n = 100000000 [r=2242657024]
Rbtree time 453.45 ms for n = 1000000 [r=2959499360]
dict time 19.16 ms for n = 1000000 [r=626692160]
dictBig time 930.47 ms for n = 1000000 [r=500000]
Sort time 731.96 ms for n = 10000000 [r=444]

P-p-H-d Jan 19, 2024

Hi tylov ,

Thank you for your report and results.

Don't hesitate to make a PR to add more stuff.
Note 1: I have moved the bench directory to https://github.com/P-p-H-d/c-stl-comparison : it is more at home there.
Note 2: If you want to bench your new sort algorithm, it would be better to add another entry for this new bench (so that we can compare the old and the new sort).

tylov Jan 19, 2024

Hi, great. I have added a few more bench tests, and I'll keep both sort algos). I also added binary_search and lower_bound test for both STC and STL. I'll prepare a PR during the weekend.

JacksonAllan · 2024-01-17T01:48:16Z

JacksonAllan
Jan 17, 2024
Maintainer Author

Hi guys. Sorry for the belated follow-up.

@martinus
@cmazakas
@attractivechaos
@joaquintides
@camel-cdr
@P-p-H-d

I have completely rewritten my benchmarks to rely on templates and C++20 concepts, rather than preprocessors macros, as the modularity/extendibility mechanism. I’ve created a private repo and invited everyone listed above to access it. I’ll be gradually adding other C hash table libraries and refining the code and documentation. Eventually, I’ll publicize the repo and publish a write-up.

I also added khash (Should I instead include khashl?).

I’d also like share several new results:

A few substantive changes to the benchmarks:

I’ve now included both of Martinus’ Robin Hood maps.
The uint64_t key, 256-bit struct value benchmarks have become 64-bit integer key, 448-bit value benchmarks. In other words, because these benchmarks are supposed to measure how the maps perform when copying buckets and traversing the buckets array is expensive, I’ve upped the size of each bucket to 64 bytes so that one cache miss is incurred per bucket traversal, assuming a 64-byte cache line.
The version of Verstable included is the current development version, not the currently published version.

@attractivechaos

I’ve added a pull request to your benchmark to give Verstable the same minimal multiplicative hash function that unordered_dense applies internally by default to protect against weak user-supplied hash functions. As I mentioned earlier, this is necessary to activate its hash code “fragment” or “fingerprint” mechanism, without which it incurs a significant performance hit. I understand if you'd prefer not to make this change – it could be unfair to allow only one library to use a custom hash function. I’m considering adding this functionality to the library as an opt-out mechanism, as in Boost and unordered_dense, but I have mixed feelings about this approach (unlike in C++ maps that follow the std::unordered_map API, users wanting to use a custom hash function with Verstable will necessarily have to read the documentation, which explicitly states that the hash function should provide entropy in the high bits).

Also, note that effectively excluding the cost of the hash function from the benchmarks (by using an identity hash function) could have some perhaps unexpected influence on the results. Some tables – including unordered_dense and Verstable – sometimes re-hash existing keys, so they will get those rehashes for free. But this issue mainly affects simple linear probing hash tables that use backshift deletion without storing hash codes or home bucket indices. At higher load factors, such tables must rehash many keys on deletion, so their deletion performance deteriorates rather spectacularly when the hash function is expensive. STC, in particular, suffers from this problem (@tylov).

5 replies

martinus Jan 17, 2024

Thanks for this benchmark updates! I wonder why abseil is so slow. I think you will get different (better) results for it with NDEBUG, like so:

g++ -I. -std=c++20 -O3 -DNDEBUG -Wall -Wpedantic main.cpp -o build/out

I just gave it a try and the results are quite different.

JacksonAllan Jan 18, 2024
Maintainer Author

@martinus

I just gave it a try

I'm happy you actually got a results file because I realized today that no file is outputted if the executable isn't run from the master directory! As a temporary workaround until I find the best cross-platform means of ensuring that the results are written to the results directory, I've updated the code to output the file to the current working directory.

I think you will get different (better) results for it with NDEBUG

Done, and the results for Abseil are indeed better now. The fundamental problem here is probably that I'm embedding Abseil directly into the project in order to circumvent Abseil's build system—which probably includes NDEBUG by default—because I couldn't be bothered trying to figure it out. I think absl::flat_hash_map isn't really designed to be extracted and used in a standalone manner.

Here's the new results I get with NDEBUG:

My assessment is that Abseil's performance in the insert nonexisting is essentially "fixed". Its performance in benchmarks involving successful lookups is much better, but it still lags pretty far behind Boost. Its performance in benchmarks involving unsuccessful lookups seem unchanged (but was already good). Its performance in the iteration benchmarks has vastly improved but, once again, still lags behind Boost. In summary, if we want a C++ SIMD map, I can't see a compelling reason to choose Abseil over Boost unless we happen to already be using other parts of the Abseil library.

As for why Abseil appears to be slower than Boost, I'm not sure, but I have a stab-in-the-dark theory. One issue that I encountered when working on my own SIMD maps is that while using SSE to scan 16 buckets for an empty or matching metadatum is much faster than checking 16 buckets individually, it is slower than checking one or even two individually. This is significant because 50-65% of keys end up in their home bucket, depending on whether we employ a mechanism to maximize this quality. To get the performance I expected with SIMD maps, I found that I needed to capitalize on the aforementioned fact by doing a preliminary individual check of lookup key's home bucket, before doing any SSE scanning. I don't remember @joaquintides saying anything about this matter in his presentation, but if Boost is doing such a check and Abseil isn't, that could explain why it's outperforming Abseil.

On a side note, running the benchmarks with lower key counts has led to an interesting observation: while Verstable appears to slightly outperform Boost in successful integer-key lookups at high key counts, this is not necessarily true at low key counts. In particular, in the 32-bit integer key, 32-bit value: Time to lookup 1000 existing keys with N keys in the map benchmark for just 200,000 keys, Boost clearly outperforms Verstable. This result is significant because most users' hash tables probably don't contain millions of keys. It could be an indication that while Verstable is better at minimizing cache misses on lookup, Boost's lookup algorithm is computationally more efficient.

attractivechaos Jan 18, 2024

I also added khash (Should I instead include khashl?).

Thank you, @JacksonAllan! khashl is largely the next version of khash, but it hardcodes the load factor to 75%, which doesn't work in your setting. You don't need to evaluate khashl. For a context, both khash and khashl are simple basic implementations without advanced strategies. khash uses quadratic probing with tombstones. khashl uses linear probing without tombstones and it always applies fibonacci hashing to guard against bad hash functions. By the way, you may consider to evaluate an ensemble of hash tables. I use this strategy to store billions of small keys and find it to have a good speed-memory balance without completely preventing insertion during rehashing.

joaquintides Jan 18, 2024

@JacksonAllan:

Here's the new results I get with NDEBUG:

Are you defining NDEBUG for all containers?

As for why Abseil appears to be slower than Boost, I'm not sure, but I have a stab-in-the-dark theory. One issue that I encountered when working on my own SIMD maps is that while using SSE to scan 16 buckets for an empty or matching metadatum is much faster than checking 16 buckets individually, it is slower than checking one or even two individually. This is significant because 50-65% of keys end up in their home bucket, depending on whether we employ a mechanism to maximize this quality. To get the performance I expected with SIMD maps, I found that I needed to capitalize on the aforementioned fact by doing a preliminary individual check of lookup key's home bucket, before doing any SSE scanning. I don't remember @joaquintides saying anything about this matter in his presentation, but if Boost is doing such a check and Abseil isn't, that could explain why it's outperforming Abseil.

Ok, there's much to discuss here :

I don't know why Abseil does not use this pre-checking technique you describe. They may have measured and found out not to there be a gain, or they may have not thought about it, who knows.
Pre-checking can be beneficial for successful lookups, but it's certainly detrimental for unsuccessful lookups --you do the pre-checking and then the SSE scanning. So this is a tradeoff. Have you measured how much unsuccessful lookup is degraded in your container because of this?
Boost does not use pre-checking at lookup, and there's a reason this wouldn't work for us: unlike Abseil, Boost maps elements at group level rather than element level. A way to visualize this is to think of boost::unordered_flat_map has having 15 times fewer buckets, with each bucket being 15 times more populated than in the case of Abseil:

In the image you can see that elements get clustered: a cluster corresponds to elements being mapped to exactly the same bucket, and then being inserted at the first available slot starting with the bucket begin position. An implication of that is that most elements won't be located at the beginning of the group (their "natural" position, so to say) and pre-checking would fail most of the time.

As to why Boost is faster than Abseil, my analyses seem to suggest this is due to Boost using more bits for the reduced hash value stored in the metadata (7.989 vs 7 bits) plus the use of the so-called overflow byte, which dramatically speeds up unsuccessful lookups (this shows very clearly in your graphs). The thing is explained in more detail here.
By contrast, we do use pre-checking for iteration, taking advantage of the fact that, due to clustering, the probability that the next occupied slot is right next to the current one is very high. This shows also clearly in your graphs if you compare Boost and Abseil iteration.

JacksonAllan Jan 24, 2024
Maintainer Author

@joaquintides

Thanks for the detailed response, and sorry about the late reply!

Have you measured how much unsuccessful lookup is degraded in your container because of this?

To be clear, Verstable isn’t using SIMD – it’s essentially a coalesced hashing variant without the “coalescing” aspect (the idea has appeared in academic literature,¹ but AFAIK, Bytell and Verstable are the only existing implementations). So when I talk about what I did with SIMD and pre-checking the home bucket, I’m referring to my unpublished experiments with a number of SIMD-table variations mid last year.

For most open-addressing tables, the performance of unsuccessful lookups is dominated by excessive probing to find an empty bucket (or bucket whose probe length indicates that the lookup key could not exist in the table, in the case of Robin Hood). Therefore, I considered the cost of one extra check during unsuccessful lookups to be insignificant. However, Boost achieves amazing speed in unsuccessful lookups, probably thanks to the overflow-byte mechanism. In this case, the cost of that extra check might become noticeable.

Boost does not use pre-checking at lookup, and there's a reason this wouldn't work for us: unlike Abseil, Boost maps elements at group level rather than element level.

I did implement one or two Boost-esque maps, but I still used the pre-check mechanism. This I did by assigning each key a preferred bucket (hash % bucket-group-size) within its bucket group. On insert, the key would be inserted into that bucket if it was empty. Of course, this mechanism disrupts the clustering pattern shown in your image, and it’s now clear to me that this clustering is the reason that Boost achieves such good iteration performance. Also, I implemented the pre-check mechanism under the assumption that it is necessary to achieve good look-up performance, which Boost’s results in my benchmarks now show was incorrect.

Unfortunately, too much time has now passed for me to remember the exact results I achieved with all those experiments. I moved away from SIMD because I wanted “true” deletions (i.e. deletions without a residual impact on table performance that eventually triggers a complete rehash, whether in the form of tombstones or a Boost-esque overflow/bloom-filter byte).² But when I made that decision, I was only testing against Abseil – not Boost – and I wasn’t testing in the low-key-count (i.e. below 200, 000) range.

Right now, I’d like to see whether I can improve Verstable’s performance in that low-key-count range (or, more accurately, its performance when the entire table is already cached). I think this goal will be difficult because moving from one bucket to the next in the chain necessarily requires invoking the quadratic formula i * ( i + 1 ) / 2 (one multiplication, one addition, and one bit-shift). But I noticed that both Boost and @martinus’ unordered_dense make ample use of __builtin_expect—is that because it registered a perceivable performance gain?

"Scalable Hash Tables" (doctoral thesis), Tobias Maier, 2021, p. 40 briefly outlines the idea of preventing chains from merging by a evicting key not in its home bucket in order to make room for a key that belongs in the bucket. ↩
Verstable's performance isn't entirely unaffected by deletions. While a deletion always occurs at the end of the deletion key's chain and therefore maintains the density of that chain, the deletion will not cause another chain that could have used the vacated bucket to use it (at least until another key belonging to that chain is inserted). In other words, a deletion from one chain will not condense another chain. So we can imagine, for example, a pathological case in which we fill the table to the maximum load factor and then erase every key except for those belonging to the longest chain, leaving us with a table that is almost empty but has worst-case lookup performance (in terms of cache misses). In reality, this effect is probably insignificant: even at high key-counts and load factors, the longest chain only contains about 10 keys (avg. about 1.5), and in our pathological case, the performance will trend back towards the max-load average as new keys are inserted. Of course, the principle advantage here (over tombstones or overflow bytes) is that deletions never lead to rehashing the table. ↩

tylov · 2024-01-17T12:14:39Z

tylov
Jan 17, 2024

Thanks for this effort! I am aware of this problem with STC hmap when using high load factors, but it does store 7 bits of the hash. I have an old branch where I'm using Robin Hood in the STC hmap that I will resurrect and test with. I didn't use it as it didn't perform better, but it will probably help on this issue, and also work better with higher load factors in general.

5 replies

JacksonAllan Jan 17, 2024
Maintainer Author

I have an old branch where I'm using Robin Hood in the STC hmap that I will resurrect and test with. I didn't use it as it didn't perform better, but it will probably help on this issue, and also work better with higher load factors in general.

Robin Hood will definitely solve that issue as any sensible implementation will store probe counts, home bucket indices, or hash codes and therefore not have to rehash keys to determine their home bucket. Also, Robin Hood will back-shift far fewer keys per delete. When I get a chance, I'll do another run of the benchmarks with STC and the ~~three~~ four Robin Hood maps (Martinus' unordered_dense and unordered_map, tsl::robin_hood, and my CC). That should give you a general idea of how STC would perform if it adopted Robin Hood.

it does store 7 bits of the hash

I talked about (and experimented with) this matter fairly extensively with @camel-cdr, but his repo containing the discussion is private. The gist was that if you're storing the same bits of the hash code that determine the key's home bucket (probably the low bits), then 7 bits allows you to mathematically determine the home bucket of any key with a displacement lower than 128. So you either restrict displacement to 128 or reserve a value (e.g. all bits set) for keys with displacements that exceed this limit (i.e. whose home buckets can only be determined via rehashing). Of course, 128 is a low displacement limit for linear probing. If you could instead store 8 bits—i.e. instead of using one bit to mark empty buckets, reserve one value for empty buckets (in addition to the value reserved for keys exceeding the limit)—then that would be better.

One problem, of course, is that you're probably storing bits of the hash code (i.e. a hash fragment or fingerprint) in order to skip most key comparisons, and if you store the same bits that determine the home bucket, then keys with the same home bucket have the same hash fragment/fingerprint. Hence, I recall having the most success with a hybrid approach that stores different amounts of high bits and low bits depending on the displacement. Here's the snippet of code I shared with @camel-cdr (from this repo):

// The following functions each create a one-byte metadatum from a 64-bit hash code.
// The appropriate function depends on the displacement of the element: within 32 buckets of its home bucket, within 160
// buckets of its home bucket, or beyond 160 buckets of its home bucket.

// If the element is within 32 buckets of its home bucket, the metadatum consists of:
// * The 5 lowest bits of the hash code (which also denote an element's home bucket alignment with a granularity of 32).
//   in the lowest 5 bits.
// * The 2 highest bits of the hash code in the next-lowest 2 bits. These bits allow our metadata scanning to
//   differentiate between elements with the same home bucket, which are likely to be near each other.
// * An unset bit in the highest bit, which indicates that the element is within 32 buckets of its home bucket.
// Since the value 0x00 denotes an empty bucket and the value 0x01 denotes an element beyond 160 buckets of its home
// bucket, if the above rules result in either value then we set one of the two bits normally corresponding to the hash
// code's high bits.
static inline uint8_t make_metadatum_0_to_32( uint64_t hash )
{
  uint8_t metadatum = ( ( hash >> 56 ) & 0x60 ) | ( hash & 0x1F );
  metadatum |= !( metadatum >> 1 ) << 6;
  return metadatum;
}

// If the element is beyond 32 buckets of its home bucket but within 160, the metadatum consists of:
// * The 7 lowest bits of the hash code (which also denote an element's home bucket alignment with a granularity of 128,
//   or 160 when we consider that the element is at least 32 buckets away from it) in the lowest 7 bits.
// * A set bit in the highest bit, which indicates that the element is beyond 32 buckets of its home bucket but within
//   160.
static inline uint8_t make_metadatum_32_to_160( uint64_t hash )
{
  return hash | 0x80;
}

// If the element is beyond 160 buckets of its home bucket, we simply mark it with 0x01.
// Of course, this means that there is no way to differentiate between such elements based on their metadata, but this
// drawback is acceptable because these cases are rare.
static inline uint8_t make_metadatum_160_to_infinity( uint64_t hash )
{
  return 0x01;
}

The result was this, where map_x_5_see is employing the above approach and hat is using regular back-shift deletion (as in STC):

martinus Jan 17, 2024

Also, Robin Hood will back-shift far fewer keys per delete. When I get a chance, I'll do another run of the benchmarks with STC and the three Robin Hood maps (Martinus' unordered_dense and unordered_map and my CC).

Note that unordered_dense is relatively slow for erase because it keeps all the elements in a vector, and instead stores the index into that vector in an robin-hood based indexing array. So removing one element means replacing that element with the last in the vector, which needs 2 operations in the indexing array - one removal, and one lookup & index update.

tylov Jan 18, 2024

Yes, I am not sure I want to add this complication to the implementation. As mentioned, it is mostly a problem for deletion (which is less important than lookup and insertion), combined with a high max load factor - which is also not that important:

With 2x expansion and 90% max load, the average memory utilization is 50 + (90 - 50)/2 = 70%
With 2x expansion and 80% max load, the average memory utilization is 50 + (80 - 50)/2 = 65%
I.e., dynamic closed hashing is wasteful with memory by design - like dynamic vectors (75 - 83% avg util), but worse.

I will try to get the Robin Hood branch working again, but for comparison mainly.

JacksonAllan Jan 24, 2024
Maintainer Author

@tylov

Here are my benchmarking results for STC:

The Robin Hood maps are robin_hood (robin_hood::unordered_map), ankerl (ankerl::unordered_dense::map), tsl (tsl::robin_map), and CC (cc_map). Hopefully this gives you a general idea of how STC's performance might change if it adopted Robin Hood. Since the graphs are getting rather busy now, you may need hide the other maps (click on their labels) to make the comparison clear.

JacksonAllan Feb 12, 2024
Maintainer Author

@tylov

I’ve repeated the benchmarks with the new STC Robin Hood map (which I've labelled STC-RH). I had to make some superficial changes to the header to get it to work in parallel with the old STC map, which uses the same header guards and some of the same identifiers. He’s my results:

The other Robin Hood tables are ankerl and robin_hood (@martinus’ ankerl::unordered_dense::map and robin_hood::unordered_map) and tsl (tsl:robin_map), as before. However, CC is no longer a Robin Hood table; rather, the version shown here is the upcoming version and implements Verstable (i.e. in-table chaining with a hash fragment/fingerprint filter) within the constraints of its existing API.

My observations on STC-RH vs STC:

Insert nonexisting: STC-RH is slower, as expected.
Erase existing: STC-RH is faster and fixes the performance issue when the hash function is expensive.
Insert existing (i.e. replace): STC-RH is slower, especially for large buckets.
Erase nonexisting: STC-RH and STC are comparable; which one is faster depends on the bucket size.
Lookup existing: STC-RH is a little slower in general and much slower for large buckets.
Lookup nonexisting: Same as erase nonexisting.
Iteration: STC-RH and STC are about the same.

As for STC vs the other Robin Hood tables, martinus' tables seem noticeably faster in most cases. So maybe look to those for ideas for how to improve Robin Hood performance? In particular, both those tables store metadata (be it displacements or displacements, indices into a densely packed elements array, and hash fragments) in a separate array. If you're storing displacements inline with the keys and values, that might explain the performance difference, especially when the keys/values are large and therefore expensive to traverse.

@attractivechaos

We exchanged some comments a few months ago about the performance penalty of using void pointers as a pattern for implementing generic containers in C. Since CC—at least in the above graphs—now implements Verstable but relies on void pointers, rather than macro pseudo-templates, a CC vs Verstable comparison in these graphs might shed some interesting light on that matter. In this regard, CC seems to be performing on par with Verstable in most benchmarks. It's slower in the iteration benchmarks only because of its API constraints (since iterators in CC are just bucket pointers, the address of the corresponding metadatum has to be recalculated every iteration). Note, though, that CC's API macros are invisibly passing data type sizes and alignments, padding information, and hash and comparison function pointers into every function call, so the compiler has all the information it needs to fully optimize the hash table functions whenever it inlines them. Usually, when people implement void *-based containers, they store such type information and function pointers in the container at runtime, which obviously prevents many optimizations and thereby kills performance.

JacksonAllan · 2024-04-30T12:49:18Z

JacksonAllan
Apr 30, 2024
Maintainer Author

@attractivechaos
@camel-cdr
@joaquintides
@martinus
@P-p-H-d
@tylov

Hello again everyone :)

I’ve upload a draft of my article on my benchmarks here, if anyone would like to take a look. Suggestions are welcome and appreciated. In particular, read my description of your hash table in the Hash Tables Benchmarked section and my analysis of its results in the Analysis section and check that they are accurate and fair.

Joaquín, I included the image you shared earlier showing the cluttering of elements in boost::unordered_flat_map vs absl::flat_hash_map. Is that okay? Or should I recreate the images using on my own data?

Tyge, you were talking earlier about potentially replacing STC’s hash-table implementation with a Robin Hood hash table. Is that a definitive plan? Should I make a note about that in the article?

Finally, one perhaps interesting change to the benchmarks themselves is the addition of a heat map that consolidates the graphs' data into one figure and was inspired by what Joaquín did with the results of Martin’s benchmarks. Here’s what it looks for the 20,000,000-key benchmarks:

The red maxes out at 5.0 (five times slower than the best performer). I may add some little icons indicating which tables use tombstones or overflow bytes for erasure. As I mentioned earlier, boost does even better in the low-key-count benchmarks (which I think show performance when the tables are hotter in the cache).

I’ll eventually be sharing the article on Reddit and maybe some other social media.

15 replies

JacksonAllan May 6, 2024
Maintainer Author

@attractivechaos Great, thanks for the feedback! Will work on incorporating what I can.

Regarding some of the points you raised:

It would be good to mention the bucket size of each implementation. Some implementations use special values to mark empty slots and tombstones. They don't have memory overhead. Others have to store additional information. For example, std::unordered_map uses 8 bytes per bucket for the pointer. This overhead affects the total memory even if every library uses the same load factor.
…
It would be good to have a table to compare the features of each hash table.

This would be good, but there are some complications. For example, I think your statement that std::unordered_map uses eight bytes of overhead per bucket is being very kind to it. In reality, any node-based table has hidden, indeterminate memory costs: the per-node allocation header and padding, and the general fragmentation of available memory caused by many tiny allocations. Also, Absl, Boost, Verstable, and unordered_dense all have one to two bytes of overhead per bucket, except that there is padding for the keys and values that will differ depending on their types. Tsl has, I think, even more padding that will vary based on datatypes. Khash looks like it has two bits of overhead per bucket, but there is no key-value padding, so in many cases it will offer even better memory conservation than this figure suggests, in comparison to the other tables. And some libraries' documentation provides few details about their internals, which makes it harder to figure out their per-bucket memory overhead.

Other features are also hard to summarize in a table cell. For example, when it comes to “collision resolution”, Absl and Boost essentially use both linear and quadradic probing. Verstable/CC uses quadradic probing, except it only ever occurs during insertion of new keys.

I’ll have a go at making such a table and see how it turns out.

In general, it would be good to measure peak memory. A library using less memory can afford to have a smaller load factor, which often helps performance.

I think my early choice to make all the benchmarks run as part of the one program could make measuring the peak memory use per-table difficult. This choice might have been a design mistake. To limit the scope of the article, I basically deferred to the reader to figure out memory usage: “Secondly, the benchmarks do not measure memory usage. However, the memory usage of each open-addressing table, at least, can be estimated based on the maximum load factor it can reasonably tolerate and the size of the metadata it stores per bucket”. But for most tables, I haven’t (yet) given enough information about per-bucket metadata for this statement to be helpful.

rehashing techniques also affect memory.

As I understand, khash does something unusual to avoid having two tables existing in parallel while rehashing, so its peak memory usage should be lower than other tables. If that’s true, then I ought to mention this point in the article. However, after reviewing the explanation from 2008, I’m a bit skeptical:

Space efficient rehashing. Traditional rehashing requires to allocate one addition hash and move elements in the old hash to the new one. For most hash implementations, this means we need 50% extra working space to enlarge a hash. This is not necessary. In khash.h, only a new flags array is allocated on rehashing. Array keys and values are enlarged with realloc which does not claim more memory than the new hash. Keys and values are move from old positions to new positions in the same memory space. This strategy also helps to clear all buckets marked as deleted without changing the size of a hash. [Emphasis added]

The problem, as I see it, is that the statement that realloc will “not claim more memory than the new hash” is only true if the current allocation can be expanded. Otherwise, it needs to maintain the old allocation and a new allocation in parallel while it copies the old data over – similar to what we do manually when we resize hash tables in the conventional manner, except behind the scenes. I did a little test to see how often this occurs as we grow an allocation by powers of two and do no other allocations (so, the ideal scenario). The results seem to suggest that for >= 131072 bytes, realloc cannot expand the existing allocation and must always copy the data to a new allocation.

Deletion matters more when you have lots of tombstones. If you delete 50% of keys, a library with tombstones will be slower upon insertions/queries than a library without tombstones. In a worse case, a library with tombstones may need rehashing earlier. To emulate this process, you may consider to delete 50% of keys and then test the insertion speed.

At present, I just point out in the article that these benchmarks don’t show tombstone-related degradation. Again, I made this choice in order to limit the scope of the article, but I was already planning to add some marks to the tables that use tombstones on the heatmap to remind readers that their erasures aren’t necessarily doing the same job as the other tables’ erasures. This choice is actually rather disadvantageous to my own tables, which have fast deletions but don’t use tombstones. Maybe I can examine this issue in a supplemental, follow-up article.

When I iterate a hash table, I use one loop only. To measure the difference, I did a microbenchmark here. You can see the one-loop version is 3.9 times as fast.

I see what you mean. But I think there is an issue with your benchmark that is causing exaggerated results. This issue revolves around this line in the iteration loop:

z += kh_val(h, k), ++m;

Here, it looks like your intention was to accumulate the values into z in order to ensure that the value access doesn’t get optimized out. But because z is neither volatile nor used again, it gets optimized out anyway. The results I get with this line as is (Fast iteration: 0.683 ns/element; Slow iteration: 3.345 ns/element) are similar to the results you get and appear to be the same as the results I get when I replace the line with just ++m. But if I print z after each benchmark (or make it volatile), the results I get are less spectacular: Fast iteration: 2.928 ns/element; Slow iteration: 3.974 ns/element. And if I add the line z += kh_key(h,k); (since most other hash tables provide key access during iteration without extra cache misses), the results further converge (Fast iteration: 3.106 ns/element; Slow iteration: 3.887 ns/element). So the speed increase gained from eliminating the nested loop is rather significant (20-25%), but not on the order of 3-4x.

Finally, I get that people here are not interested in the idea of ensemble of hash tables, but I want to say this is one of the best ways to store billions of keys.

The issue for me is not so much disinterest as it is – once again – the need to set a limit on the scope of the article somewhere. Also, I would expect the relative performance of a table in an ensemble to be similar to its relative performance as an individual table. In other words, if table A is generally faster than table B, then an ensemble of tables A should be generally faster than an ensemble of tables B, right?

attractivechaos May 7, 2024

@JacksonAllan Thanks for the detailed explanation. You are right that the compiler has optimized out z += kh_val(h, k) and that changing the iteration loop has little effect on performance. My bad.

On peak memory, I observed unexpected pattern in my benchmark running on mac: boost and mlib use noticeably more memory on a deletion-heavy load; unordered_dense uses more memory than the rest of fast libraries. I then reran the benchmark on a linux machine, the peak memory looks ok now: boost, mlib, verstable, robin_hood and unordered_dense all use similar memory on both insertion-only and deletion-heavy loads. I don't know what in mac makes some libraries consume more memory apparently. On linux at least, I now agree with you that controlling the load factor is good enough for these fast libraries.

Both khash and mlib use realloc. Given ~16 million keys, realloc can avoid relocating memory, perhaps because the server has a lot more memory or the way linux manages memory. I don't know.

If it is difficult to measure memory in your current framework, there is not much point to evaluate ensemble of hash tables because this strategy only slows down a library.

martinus May 7, 2024

In my benchmark I use this function to get the peak RSS:
https://github.com/martinus/map_benchmark/blob/master/src/app/getRSS.cpp#L41
So basically just a getrusage call in Linux. Naturally you'll need to run benchmarks separately per process for this to work.

P-p-H-d May 7, 2024

Both khash and mlib use realloc. Given ~16 million keys, realloc can avoid relocating memory, perhaps because the server has a lot more memory or the way linux manages memory. I don't know.

On Linux at least, for a realloc sufficiently big (and page aligned), GLIBC won't move the data on reallocation to increase its size but request Linux to move the MMU pages allocation to a new area (keeping the same physical pages where the data are stored). This explains its performance.

attractivechaos May 7, 2024

@martinus Thank you! I was using getrusage on mac and linux as well. I will use your version on other systems. The strange thing on mac is the unexpected high memory for a few libraries but not the others.

@P-p-H-d Thanks a lot for the explanation. I read about this before but forgot the details. The mac realloc is less consistent. It relocates large memory from time to time. As a result, the peak rss of khash and mlib fluctuates on mac but the peak rss of other libraries is stable.

P-p-H-d · 2024-05-03T17:42:59Z

P-p-H-d
May 3, 2024

You’re talking about using the node-based DICT_DEF2 for the large-bucket benchmark, right? I’ll give it a try.
Well, this is the one. However, it has changed a lot recently as I was able to finish implementing its new design (it is also OA based now).

Note you will find difficulties writing a wrapper "find" function for it without hacking the library (I don't think it is possible to transform back the value pointer into an iterator with this data-structure, as some information is not given back to the user) - whereas it is always possible to convert the iterator into a pointer to value. If needed, I can provide a find method.

The problem is that some of the other libraries (boost, absl) also include node-based alternative tables that may perform better in that benchmark, and for the libraries that don't, we could achieve much the same just by opting to use pointers as the values. I think there is a tension between the goals of the article here: For assessing hash-table designs and how well they cope under different conditions, it doesn’t make much sense to substitute in a different design depending on the benchmark. But for helping readers make an educated choice about which library to use, it doesn’t make sense not to use the best-suited library provided by the library.

I agree with your points. Maybe the solution is to add a new entry if it is a different design.

The easiest way to do this would probably be to make each shim a separate compilation unit. But I’m not certain that the problem you described is the result of optimization because to optimize out lookup call, the compiler would have to recognize that the hash table cannot contain the key, which seems unlikely. Possibly, there is some bug. Would you mind sharing your compiler and the problem table that’s registering the strange result so that I can try to reproduce it (I was already suspicious of stb_ds’s sh, since it performs much better than the other stb_ds table)?

Sure. GCC 13.2.0 on Linux/x86-64
result_2024-05-02T18_57_25.zip

Is there a reason to suspect that iterating 5,000 elements would not give a fair representation of iteration speed?
I see no reason.

The lack of Y-axis scale prevents comparing a benchmark html result to another one.
You mean if readers wanted to see how the speed of two different hash-table operations compare (e.g. lookup existing vs lookup non-existing)?

This, and when you run the benchmark once again and compare the new result with the previous one.

EDIT: I click on reply and github post a new comment instead of a reply :)

2 replies

JacksonAllan May 5, 2024
Maintainer Author

Sorry for the delayed response.

The new DICT:

I added the newest version of DICT on M*LIB’s master branch to the benchmarks. From what I can see, the key-value pairs are stored in a separate array, and the buckets array stores indices into the key-value pairs array, along with hash codes. In this regard, the design is similar to @martinus’ unordered_dense and stb_ds.

To get around the problem of the pointer returned from _get being inconvertible to an iterator, I simply wrote my own _get_itr function. It’s the same as _get, except it returns an iterator:

static inline void mlib_dict_def2_##blueprint##_get_itr(                         \
  const mlib_dict_def2_##blueprint##_t map,                                      \
  blueprint::key_type const key,                                                 \
  mlib_dict_def2_##blueprint##_it_t itr                                          \
)                                                                                \
{                                                                                \
  M_D1CT_CONTRACT(map);                                                          \
  const m_index_t mask = map->mask;                                              \
  m_index_t hash = (m_index_t) blueprint::hash_key(key);                         \
  m_index_t p = hash & mask;                                                     \
  m_index_t s = 1;                                                               \
  /* We are likely to find the correct bucket first */                           \
  while (true) {                                                                 \
    if (M_LIKELY (hash == map->index[p].hash)) {                                 \
      m_index_t d = map->index[p].index;                                         \
      if (M_LIKELY(d >=2 && blueprint::cmpr_keys(map->data[d].pair.key, key))) { \
        itr->dict = map;                                                         \
        itr->index = p;                                                          \
        return;                                                                  \
      }                                                                          \
    }                                                                            \
    if (M_LIKELY (map->index[p].index == 0)) {                                   \
      mlib_dict_def2_##blueprint##_it_end( itr, map );                           \
      return;                                                                    \
    }                                                                            \
    p = (p + M_D1CT_OA_PROBING(s)) & mask;                                       \
  }                                                                              \
}                                                                                \

Here are the results on my machine for 20,000,000 keys (the new DICT appears as “M*LIB dict”).

Looks mostly good to me. In most benchmarks, the hash table is faster than DICT_OA for large buckets and for string keys, and almost as fast as DICT_OA for small buckets with integer keys. In comparison to the other tables, it has excellent insert speed and existing-lookup speed, but it isn’t very fast for unsuccessful lookups. Its iteration is also much slower than DICT_OA’s iteration and 2-3x slower than the slowest among the other open-addressing table. A few things you could consider:

DICT’s buckets (i.e. the elements of what you call the “index” array) store the entire hash code along with the index into the key-value pairs array. Hence, its overhead per bucket is 8 bytes by default, or 16 bytes if M_USE_64BITS_INDEX is enabled. That’s quite a lot compared to the one byte of overhead for absl, 1.07 for boost, and two for Verstable/CC. Martin mitigates this issue in unordered_dense by only storing one byte of the hash code and storing it as part of the same integer that contains the index into the key-value pairs array (the other bits are used for the index). Presumably, a one-byte hash fragment would allow you to skip 255 out of 256 key comparisons that would return false, although the time taken to grow/rehash the table would take a hit.
Although your store the key-value pairs in a separate array, you still iterate over the buckets (i.e. “index”) array, and that’s why we can’t convert a key-value pointer into an iterator. I find this choice a bit strange because perfect iteration speed is one of the biggest benefits of storing the key-value pairs in a separate array. The basis of the choice seems to be that you still want to use tombstones (which are found in the buckets/“index” array), so the key-value pairs might not be contiguous. I think the more common way to handle erasure in such a design is to move the last key-value pair in the key-value pairs array to the vacant slot and then look it up and update its index in the buckets array. So your erasures become slower (they require two lookups rather than one), but in return you get tombstone-less erasure and perfect iteration speed. This is, I think, what Martin does.

If I can fit this hash table into the article, then what’s its current status? As far as I can tell, it is not yet part of an official release, and the in-code and API documentation hasn’t been updated to reflect its new design (e.g. the in-code comments still call it a “chained dictionary”). So what would be the best way to specify its version – as the yet-to-be released v0.7.3, perhaps? And will the labels DICT and DICT_OA remain the same even though both tables now use open addressing?

The lookup-existing bug

Even with GCC 13.2.0 (under MinGW), I couldn’t reproduce the strange result for M*LIB DICT_OA in the lookup non-existing benchmarks. However, I did experience it with the new hash table and – I believe – fixed the problem. The issue was the way I was preventing the compiler from optimizing out the hash-table operation:

auto itr = shim< blueprint >::find( map, shuffled_unique_key< blueprint >( l ) );
do_not_optimize = *(unsigned char *)&itr;

do_not_optimize is declared as a volatile unsigned char. The idea here is that by reading the first byte of the iterator returned from find into a volatile variable, the compiler cannot optimize out the function call. However, in the case of M*LIB, the iterator struct (which is here wrapped in another struct) consists of a pointer to the hash table and a pointer to a key-value pair. After inlining, the compiler was able to determine that all paths of find set the first member – the hash-table pointer – to point to the table and, therefore, omit the call. Replacing the above two lines with the following lines fixes the issue for me (I've already uploaded the change):

volatile auto itr = shim< blueprint >::find( map, shuffled_unique_key< blueprint >( l ) );
(void)itr;

The lack of a scale on the Y-axis

I explained in my earlier response to Joaquín why I wanted to leave out a Y-axis scale. But I can see how the numbers could be useful under some circumstances (e.g. if someone wants to use the benchmarks as a development tool). I think a good solution here might be for me to just output the raw data in CSV or JSON format. Then the users could analyze it and plug it into any graphing software they like.

P-p-H-d May 9, 2024

Presumably, a one-byte hash fragment would allow you to skip 255 out of 256 key comparisons that would return false, although the time taken to grow/rehash the table would take a hit.

That was the initial plan, however performance measurement shown that the time taken to grow/rehash the table took too much a hit. This hash-map is primary designed for huge (key,value) where moving data or rehashing is costly.

to move the last key-value pair in the key-value pairs array to the vacant slot and then look it up and update its index in the buckets array. So your erasures become slower (they require two lookups rather than one), but in return you get tombstone-less erasure and perfect iteration speed

As you said, you will nearly double the erasure time. Moreover you still need tombstone, since tombstone are needed in the 'bucket' table not in the 'key-value' table to handle the 'empty-but-need-to-be-skipped' case.
Instead I plan to implement another hack for increasing iteration speed. Anyway I don't think iteration speed is that important for a hash map (for serialization for example)

If I can fit this hash table into the article, then what’s its current status?

I can release 0.7.3 if you want. the only difficulty is to find a release name %)

will the labels DICT and DICT_OA remain the same even though both tables now use open addressing?

Yes. I don't plan to break the API.

The lookup-existing bug [is no more]

Good !

P-p-H-d · 2024-05-09T18:18:55Z

P-p-H-d
May 9, 2024

For information, since I got some free time (at last !), I have added "bulk" update operations to M*LIB DICT OA that enable to set/get multiple key/values in one operation (as a WIP feature since I don't have enough perspective to know if it's good and it isn't formally tested). It tries to prefetch data to minimize the cache miss.

I have updated my own bench code and didn't see any gain at 1'000'000, but start seeing it at largest size (moreover it really depends on the CPU, the memory speed, the cache size). However I see a +30% gain at 10'000'000.

@JacksonAllan : Unfortunately, I don't see how I can integrate it in your benchmark, as the API is incompatible (note that I don't know if it makes sense).

@attractivechaos : I have updated the udb3 bench with this bulk interface. I get a nice gain with it. If you are interested I can provide a PR.

2 replies

joaquintides May 9, 2024

FWIW, bulk reads worked awesomely for us in boost::concurrent_flat_map:

https://bannalia.blogspot.com/2023/10/bulk-visitation-in-boostconcurrentflatm.html

P-p-H-d May 9, 2024

It is a very interesting article.

We have empirically confirmed that bulk visitation maxes at around bulk_visit_size = 16, and stabilizes beyond that.

I also empirically get the same value.

JacksonAllan · 2024-05-24T23:17:24Z

JacksonAllan
May 24, 2024
Maintainer Author

Hello again everyone :)

I’ve updated the draft of the article. No need to read it all again, but I will try to summarize the most significant changes, especially those based on the earlier feedback:

Updated to Boost 1.85.0, changed the Boost links, and dropped the remark about Boost being a “massive dependency”.
Moved the links to the lower-key-count benchmarks to the top of the “Results” section, rather than the end of it.
The descriptions of M*LIB’s tables now say that they use quadratic probing “by default”.
Added tombstone marks to heatmap and made sure that table descriptions mention the use of tombstones where applicable.
References to klib now refer specifically to khash.
Added note about khashl to the description of khash.
The raw data is now outputted as CSV file, for anyone who wants access to the numbers.
Added M*LIB’s DICT and @ktprime's emilib2.
Added my best attempt at a description of the memory overhead (based on each library’s documentation, associated reading materials, and actual code) to the description of each hash table. This was originally part of a table comparing the libraries, which I created in response to @attractivechaos’ suggestion. However, I ended up ditching the table because its content overlapped too much with the table descriptions and the article is already very long.
Added remark about Boost performing even better in the low-key-count benchmarks.
Added, in the conclusion, a recommendation for klib as the most memory-efficient option.

@attractivechaos
@joaquintides
@martinus
@P-p-H-d

I’ve thanked you all (and included GitHub links) at the end of the article for taking the time to offer feedback on the draft. Please just let me know if you’d prefer not to be mentioned, or if you’d prefer to be mentioned in a different way (e.g. name vs GitHub username).

@P-p-H-d

Sorry I didn’t respond earlier to your response to my question about the current status of M*LIB’s DICT. If you want to release it now/soon as 0.7.3, that’s fine – I’ll update the description. Otherwise, I’m happy to refer to it as something like “DICT from M*LIB v0.7.3 (upcoming)”, perhaps with a note that it is already available on the master branch.

4 replies

joaquintides May 25, 2024

Please just let me know if you’d prefer not to be mentioned, or if you’d prefer to be mentioned in a different way (e.g. name vs GitHub username).

I'm ok with your mentioning me as it is, and thank you for that :-)

P-p-H-d May 25, 2024

If you want to release it now/soon as 0.7.3, that’s fine

I have release V0.7.3. (I have huge patch waiting for being pushed, which changes the internal string format, so it was a good time to publish a release)

Please just let me know if you’d prefer not to be mentioned, or if you’d prefer to be mentioned in a different way (e.g. name vs GitHub username).

You can mention me with my real name (I use it on the commit and copyright)

I’ve updated the draft of the article

For "Unlike ankerl::unordered_dense, the table does not necessarily store key-value pairs contiguously because it relies on tombstones for erasure.", I think the cause is wrong. It is because it doesn't perform two lookups per erase: one for the element to delete, and one for the element that is moved onto the newly empty spot.

ktprime May 26, 2024

@JacksonAllan emilib2o use 1.25 byte metadata per bucket

JacksonAllan May 26, 2024
Maintainer Author

@P-p-H-d

I have release V0.7.3. (I have huge patch waiting for being pushed, which changes the internal string format, so it was a good time to publish a release)

Great! I’ve updated to the new version and updated the article’s text. I’ll also re-run the benchmarks and update the graphs/heatmap before sharing the article.

For "Unlike ankerl::unordered_dense, the table does not necessarily store key-value pairs contiguously because it relies on tombstones for erasure.", I think the cause is wrong. It is because it doesn't perform two lookups per erase: one for the element to delete, and one for the element that is moved onto the newly empty spot.

My language was pretty imprecise here. I've updated the relevant part of the description to:

Like ankerl::unordered_dense and stb_ds's hm and sh, it stores key-value pairs in an array separate from the buckets array. The buckets array stores indices into the key-value pairs array, along with hash codes. Unlike those two other tables, it does not store key-value pairs contiguously. This is because when erasing, it does not move the last key-value pair in the array backward to fill the gap created (a process that requires another lookup to update the index of the moved pair stored in the buckets array).

I also attempted to explain in the results section why DICT's iteration is slow compared to the other open-addressing tables, particularly those that store key-value pairs in a separate array:

This is because the potential presence of gaps in the key-value-pairs array precludes iterating over it directly and without reference to the buckets array, which stores eight or 16 bytes per bucket.

@ktprime

emilib2o use 1.25 byte metadata per bucket

Thanks for the correction! I think the file in your original PR request used a whole extra byte per bucket to store offsets? In any case, I’ve updated the article’s text to specify 1.25 bytes per bucket. I also removed my discussion of the offset/farthest-key mechanism, since I don’t have a very good understanding of it and the implementation details seem to be evolving at a rapid pace.

joaquintides · 2024-05-30T10:29:13Z

joaquintides
May 30, 2024

@JacksonAllan looks like you've finished your article. Planning on posting it on Reddit and other social media?

6 replies

joaquintides May 30, 2024

I'm afraid I posted to r/cpp some hours ago:

https://www.reddit.com/r/cpp/comments/1d418aw/an_extensive_benchmark_of_c_and_c_hash_tables/

Let me know if you'd rather I deleted it so that you can post it yourself according to your plan.

JacksonAllan May 30, 2024
Maintainer Author

Let me know if you'd rather I deleted it so that you can post it yourself according to your plan.

All good! I've just finished addressing some of the comments now.

joaquintides May 30, 2024

My apologies! I saw your post to r/C_programming after writing here and incorrectly assumed the starting pistol had already been shot. Congrats on the article. I see it's faring very well on Reddit, as it deserves.

P-p-H-d May 30, 2024

Congratulation on your article!

JacksonAllan May 31, 2024
Maintainer Author

@joaquintides

Actually, I think it was great that you posted it (thanks!). I'm not sure how kindly the moderators would take to an outsider coming in for the exclusive purpose of posting his own content.

Lots of good critique in the comments that could be used to revise the article somewhere down the track. In particular, sampling and presenting the data at a logarithmic scale might be nice and presumably eliminate the need for three sets of results. One issue here is that although doing so would skew the overall results (especially the heatmap) more in favor of the SIMD tables (Boost, Abseil, and emilib2) and Abseil in particular (which does far better in the 0-200,000-key benchmark than in the 20,000,000-key benchmarks), I could also skew the results back in favor of some of the other libraries (e.g. my own Verstable/CC, which actually overtake Boost in the critical existing-lookup benchmarks in the 20,000,000-key results) just by increasing the total key count tested (e.g. to 100 million or 1 trillion). In other words, just adopting a logarithmic scale doesn't make the data immune to manipulation to tell a particular story, which is a problem when the author has his own horses in the race. Worse, unifying the heatmaps would could obscure the low-key-count data, where the SIMD tables shine most brightly.

@P-p-H-d

Thanks!

P-p-H-d · 2024-05-30T21:05:43Z

P-p-H-d
May 30, 2024

I was looking for why M*LIB was so slow on non-existing keys, and the result was simple. It was simply due to the load factor which is was too high for its design. Theoretically a load factor 0.875 should be around 2.5 times slower than 0.7 according to the design with 8 iterations in the search loop on average, vs 2.4 for a load of 0.7. And I measure 8 iterations of the search loop on average with this benchmark. Sometimes I am just happy when the number matches :)

4 replies

JacksonAllan May 31, 2024
Maintainer Author

I was looking for why M*LIB was so slow on non-existing keys, and the result was simple. It was simply due to the load factor which is too high for its design.

I touch on this in the section of the article that discusses limitations of the benchmarks. All tables are being tested with the same maximum load factor, but the argument can certainly be made that some tables should be allowed to use a lower limit because they store minimal or no metadata. In your case, that’s DICT_OA (which stores no metadata), but not DICT (which stores a relatively large eight or sixteen bytes per bucket). Of course, the hole in this argument is that the memory cost of empty buckets will in many cases overshadow the cost of a byte or two of metadata per bucket.

Theoretically a load factor 0.875 should be around 2.5 times slower than 0.7 according to the design with 8 iterations in the search loop on average, vs 2.4 for a load of 0.7.

I don’t know how exactly you calculated these figures, but the graphs paint a slightly different picture:

For both M*LIB’s tables, the performance degradation as the load factor rises appears relatively linear, and the difference for unsuccessful lookups between 0.7 and 0.875 looks to be roughly 50% (i.e. they're 1.5 times slower at 0.875). So, assuming your above math on average probe counts is correct, the lower-than-expected difference could be—and I'm speculating wildly here—because the cost of the initial cache miss dominates, or because on average we have to probe a certain number of times before we start incurring more cache misses.

P-p-H-d Jun 1, 2024

So, assuming your above math on average probe counts is correct, the lower-than-expected difference could be—and I'm speculating wildly here—because the cost of the initial cache miss dominates

Yes, you are right. My point was without taking into account the prefetch of the cache-line (which is harder to modelize) and which improve the performance (as the first iterations of the loop are already present in the cache line).

I don’t know how exactly you calculated these figures,

For an hash-hable with a load factor q, there is "q" chance to get a filled bucket and "1-q" chance to read an empty bucket. Therefore for the search loop, we need to perform on average before getting an empty bucket:

sum( (k+1) * q^k * (1-q),k=0 .. infinity) = 1/(1-q)
iterations (if my maths are correct).

, but not DICT (which stores a relatively large eight or sixteen bytes per bucket)

To be exact, when DICT allocates 'N' metadata in this table, it allocates only 'N*q' space for the table storing 'key+value'. So it is not so much dependent of the load factor for the memory allocation and uses less memory than expected, specially for big key or value.

JacksonAllan Jun 1, 2024
Maintainer Author

To be exact, when DICT allocates 'N' metadata in this table, it allocates only 'N*q' space for the table storing 'key+value'.

Has q here changed from meaning the current load factor in your previous paragraph to the maximum load factor here? In other words, are you saying that when you grow the table to N buckets, you only allocate N * max_load_factor key-value slots (which makes perfect sense to me)? If so, my description of DICT's memory overhead in the article isn't very fair:

This table’s approximate memory overhead is eight bytes (or 16 bytes, for tables that can accommodate more than 232 key-value pairs), plus key-value padding, per bucket, in addition to the size of a key-value pair per vacant bucket. [Emphasis added]

If you can confirm, I'll update the wording. It should say something like "in addition to the size of a key-value pair per vacant slot in the key-value pairs array (the total number of slots is typically equal to the number of buckets multiplied by the maximum load factor)". I say "typically" here because a quick glance at the code suggests that frequent erasures can cause the two arrays to fall out of sync.

P-p-H-d Jun 2, 2024

Has q here changed from meaning the current load factor in your previous paragraph to the maximum load factor here? In other words, are you saying that when you grow the table to N buckets, you only allocate N * max_load_factor key-value slots (which makes perfect sense to me)?

Yes

Hash table design #1

JacksonAllan Dec 12, 2023 Maintainer

Replies: 14 comments · 65 replies

JacksonAllan Dec 12, 2023 Maintainer Author

JacksonAllan Dec 15, 2023 Maintainer Author

JacksonAllan Dec 15, 2023 Maintainer Author

JacksonAllan Dec 12, 2023 Maintainer Author

JacksonAllan Dec 12, 2023 Maintainer Author

JacksonAllan Jan 17, 2024 Maintainer Author

JacksonAllan Jan 18, 2024 Maintainer Author

JacksonAllan Jan 24, 2024 Maintainer Author

Footnotes

JacksonAllan Jan 17, 2024 Maintainer Author

JacksonAllan Jan 24, 2024 Maintainer Author

JacksonAllan Feb 12, 2024 Maintainer Author

JacksonAllan Apr 30, 2024 Maintainer Author

JacksonAllan May 6, 2024 Maintainer Author

JacksonAllan May 5, 2024 Maintainer Author

The new DICT:

The lookup-existing bug

The lack of a scale on the Y-axis

JacksonAllan May 24, 2024 Maintainer Author

JacksonAllan May 26, 2024 Maintainer Author

JacksonAllan May 30, 2024 Maintainer Author

JacksonAllan
Dec 12, 2023
Maintainer

Replies: 14 comments 65 replies

JacksonAllan Dec 12, 2023
Maintainer Author

JacksonAllan Dec 15, 2023
Maintainer Author

JacksonAllan Dec 15, 2023
Maintainer Author

JacksonAllan Dec 12, 2023
Maintainer Author

JacksonAllan
Dec 12, 2023
Maintainer Author

JacksonAllan
Jan 17, 2024
Maintainer Author

JacksonAllan Jan 18, 2024
Maintainer Author

JacksonAllan Jan 24, 2024
Maintainer Author

JacksonAllan Jan 17, 2024
Maintainer Author

JacksonAllan Jan 24, 2024
Maintainer Author

JacksonAllan Feb 12, 2024
Maintainer Author

JacksonAllan
Apr 30, 2024
Maintainer Author

JacksonAllan May 6, 2024
Maintainer Author

JacksonAllan May 5, 2024
Maintainer Author

JacksonAllan
May 24, 2024
Maintainer Author

JacksonAllan May 26, 2024
Maintainer Author

JacksonAllan May 30, 2024
Maintainer Author