Improve Termmap Indexing Performance +~30% #2058
Conversation
Force-pushed from 07b5fd5 to 241cc9b (Compare)
stacker/src/fastcmp.rs
Outdated
```rust
loop {
    if left.len() < SIZE || right.len() < SIZE {
        break;
    }
```

Suggested change:

```rust
assert_eq!(left.len(), right.len()); //< I assume this will go away after inlining?
while left.len() >= SIZE { //< I assume this one cannot go away after inlining? (but maybe LLVM is stronger than I think)
```
I confirmed with godbolt this does two checks with the current code.
This does not remove the bounds checks.
It seems the compiler can't deduce, from the same length plus the same increments on left and right, that the bounds checks can be removed.
hmm I checked with godbolt and that's not true...
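One pattern that reliably lets LLVM drop the per-iteration bounds checks is iterating with `chunks_exact`, whose chunk length is a compile-time constant. A minimal sketch of that idea (illustrative only, not the PR's actual fastcmp code; `SIZE = 8` is an assumption):

```rust
// Sketch: comparing byte slices in fixed-size chunks. `chunks_exact`
// gives the optimizer a provable chunk length, so the per-element
// bounds checks inside the loop can be elided.
const SIZE: usize = 8;

fn fast_eq(left: &[u8], right: &[u8]) -> bool {
    if left.len() != right.len() {
        return false;
    }
    let mut l = left.chunks_exact(SIZE);
    let mut r = right.chunks_exact(SIZE);
    for (lc, rc) in (&mut l).zip(&mut r) {
        // Each chunk is exactly SIZE bytes; comparing u64 words avoids
        // a byte-by-byte loop.
        if u64::from_le_bytes(lc.try_into().unwrap())
            != u64::from_le_bytes(rc.try_into().unwrap())
        {
            return false;
        }
    }
    // Compare the (< SIZE bytes) tails.
    l.remainder() == r.remainder()
}
```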
```rust
dst[0] = src[0];
if len >= 2 {
    double_copy_trick::<2>(src, dst);
```
I don't understand... Is it faster than this?

```rust
if len == 1 {
    dst[0] = src[0];
} else {
    // len >= 2
    double_copy_trick::<2>(src, dst);
}
```
I think both are equal performance-wise, but I didn't test much on this case.
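For readers following along, the overlapping-copy trick being discussed can be sketched like this (a hypothetical reconstruction; the PR's real `double_copy_trick` may differ in details):

```rust
// Hypothetical reconstruction of the overlapping-copy idea: for
// SIZE <= len <= 2 * SIZE, copying the first SIZE bytes and the last
// SIZE bytes covers the whole range with two fixed-size copies, which
// compile down to plain loads/stores instead of a memcpy call.
// The two copies may overlap in the middle; that is harmless since
// both copy the same source bytes.
fn double_copy_trick<const SIZE: usize>(src: &[u8], dst: &mut [u8]) {
    let len = src.len();
    debug_assert!(SIZE <= len && len <= 2 * SIZE);
    debug_assert!(dst.len() >= len);
    dst[..SIZE].copy_from_slice(&src[..SIZE]);
    dst[len - SIZE..len].copy_from_slice(&src[len - SIZE..]);
}
```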
```diff
@@ -284,31 +304,36 @@ impl ArenaHashMap {
     }
     let hash = self.get_hash(key);
     let mut probe = self.probe(hash);
     let mut bucket = probe.next_probe();
```
what is the goal of putting these outside of the loop?
To fetch the next bucket before the loop jmp. The comment for it is at the end of the loop.
The idea behind it is to allow unconditionally prefetching data (without the branch predictor)
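The pattern can be sketched like this (illustrative types and names, not tantivy's actual code; empty buckets are modeled here as zero entries):

```rust
// Illustrative software-pipelined probe loop: the next bucket index is
// computed before the loop's backward jump, so the load for the next
// iteration can start earlier.
struct Probe {
    pos: usize,
    mask: usize, // table length minus one; table length is a power of two
}

impl Probe {
    fn next_probe(&mut self) -> usize {
        let bucket = self.pos & self.mask;
        self.pos += 1;
        bucket
    }
}

// Returns the index of the first empty (zero) bucket hit by the probe
// sequence. The first `next_probe()` call is hoisted out of the loop,
// and the following one is issued before the loop's conditional exit.
fn find_empty(table: &[u64], probe: &mut Probe) -> usize {
    let mut bucket = probe.next_probe();
    loop {
        let kv = table[bucket];
        let next_bucket = probe.next_probe(); // fetched before the jump
        if kv == 0 {
            return bucket;
        }
        bucket = next_bucket;
    }
}
```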
What do you mean by unconditionally? This is not a conditional jump here.
Also... I would not be surprised if, for LLVM, the two are exactly the same...
Sorry, that part about the branch predictor didn't make sense. It helps the instruction prefetcher a little bit, I think.
The two versions are indeed the same, but only with unsafe access. With bounds checks, LLVM can't optimize the code as well.
Cool stuff!
stacker/src/memory_arena.rs
Outdated
```diff
 }

 impl Page {
     fn new(page_id: usize) -> Page {
         Page {
             page_id,
             len: 0,
-            data: vec![0u8; PAGE_SIZE].into_boxed_slice(),
+            data: Box::new([0u8; PAGE_SIZE]),
```
Did it help? If it does not, I'd rather avoid the stack allocation.
Both are heap allocations. It removes the bounds check in combination with the mask, since the size is known to the compiler.
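The interaction can be sketched as follows, assuming `PAGE_SIZE` is a power of two (illustrative code, not the crate's actual `Page`):

```rust
const PAGE_SIZE: usize = 1 << 20; // assumed power of two

struct Page {
    data: Box<[u8; PAGE_SIZE]>,
}

impl Page {
    fn get(&self, addr: usize) -> u8 {
        // `addr & (PAGE_SIZE - 1)` is provably < PAGE_SIZE, and the
        // array length PAGE_SIZE is part of the type, so the compiler
        // can drop the bounds check. With `Box<[u8]>` the length is
        // only known at run time and the check stays.
        self.data[addr & (PAGE_SIZE - 1)]
    }
}
```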
They both end up on the heap, but Rust will often allocate the array on the stack before moving it to the heap, cf. rust-lang/rust#53827
I don't see that issue looking at the assembly, but `vec![0u8; PAGE_SIZE]` is more efficient since it calls alloc_zeroed vs. alloc + memset.
I'll switch to `vec![0u8; PAGE_SIZE].into_boxed_slice().try_into().unwrap()`
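That conversion can be sketched as a small standalone demo (the page size here is illustrative):

```rust
const PAGE_SIZE: usize = 4096; // illustrative size

// `vec![0u8; N]` goes through alloc_zeroed, so no separate memset is
// needed. `try_into` then recovers the fixed-size array type from the
// boxed slice without copying; it fails only if the length does not
// match PAGE_SIZE.
fn new_page_data() -> Box<[u8; PAGE_SIZE]> {
    vec![0u8; PAGE_SIZE]
        .into_boxed_slice()
        .try_into()
        .expect("length matches PAGE_SIZE")
}
```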
Codecov Report

```
@@ Coverage Diff @@
##             main    #2058      +/-   ##
==========================================
+ Coverage   94.37%   94.39%   +0.01%
==========================================
  Files         319      321       +2
  Lines       59951    60190     +239
==========================================
+ Hits        56581    56817     +236
- Misses       3370     3373       +3
```
```rust
loop {
    let bucket = probe.next_probe();
    let kv: KeyValue = self.table[bucket];
    if kv.is_empty() {
```
In both cases the conditional jump is here, and it happens after fetching the KeyValue.
stacker/src/expull.rs
Outdated
```diff
@@ -108,59 +84,68 @@ impl<'a> ExpUnrolledLinkedListWriter<'a> {

 #[inline]
 pub fn extend_from_slice(&mut self, mut buf: &[u8]) {
     debug_assert!(buf.len() <= 32); // Check memcpy assumptions.
     debug_assert!(!buf.is_empty()); // Check memcpy assumptions.
```
```rust
debug_assert!(!buf.is_empty()); // Check memcpy assumptions.
```

What is this for?
`extend_from_slice` is called only for byte slices of 1-18 bytes (?), and the memcpy could be optimized for that, which I had tested.
I removed it, since I didn't want to add too many implicit assumptions. The debug_asserts are outdated.
Did you change something? I do not see any change yet.
I updated it now
I still see the debug_asserts
They are removed in 82ce324
stacker/src/memory_arena.rs
Outdated
```diff
 }

 #[inline]
 fn add_page(&mut self, len: usize) -> usize {
```
I don't understand the semantics of this method. Can you add some doc?
Why len, why usize?
Also, at 1MB a pop, I don't think optimizing this matters.
It's inlined into the hot path though. I refactored it a little bit.
I suspect inline is useless too.
Yes, I removed that in the refactoring. I didn't see any impact on performance, though.
This contains many small changes to improve Termmap performance. Most notably:

* Specialized byte compare and equality versions, instead of glibc calls.
* ExpUnrolledLinkedList no longer contains inline items.
* Allow comparing hashes only via a feature flag `compare_hash_only`: 64 bits should be enough with a good hash function to compare strings by their hashes instead of comparing the strings. Disabled by default.

```
CreateHashMap/alice/174693
    time:   [642.23 µs 643.80 µs 645.24 µs]
    thrpt:  [258.20 MiB/s 258.78 MiB/s 259.41 MiB/s]
    change: time:  [-14.429% -13.303% -12.348%] (p = 0.00 < 0.05)
            thrpt: [+14.088% +15.344% +16.862%]
    Performance has improved.

CreateHashMap/alice_expull/174693
    time:   [877.03 µs 880.44 µs 884.67 µs]
    thrpt:  [188.32 MiB/s 189.22 MiB/s 189.96 MiB/s]
    change: time:  [-26.460% -26.274% -26.091%] (p = 0.00 < 0.05)
            thrpt: [+35.301% +35.637% +35.981%]
    Performance has improved.

CreateHashMap/numbers_zipf/8000000
    time:   [9.1198 ms 9.1573 ms 9.1961 ms]
    thrpt:  [829.64 MiB/s 833.15 MiB/s 836.57 MiB/s]
    change: time:  [-35.229% -34.828% -34.384%] (p = 0.00 < 0.05)
            thrpt: [+52.403% +53.440% +54.390%]
    Performance has improved.
```
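The `compare_hash_only` idea can be sketched as a gated equality check (illustrative code, not the crate's actual implementation; the real crate gates this behind a Cargo feature flag, modeled here by a plain const):

```rust
// Illustrative: with a 64-bit hash and a good hash function, key
// equality can be approximated by hash equality alone, skipping the
// byte comparison at a small risk of collisions.
const COMPARE_HASH_ONLY: bool = false; // stands in for the feature flag

fn keys_equal(stored_hash: u64, probe_hash: u64, stored_key: &[u8], probe_key: &[u8]) -> bool {
    if stored_hash != probe_hash {
        return false;
    }
    if COMPARE_HASH_ONLY {
        // Accept a hash match as a key match.
        true
    } else {
        // Default: confirm with a real byte comparison.
        stored_key == probe_key
    }
}
```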