
Improve Termmap Indexing Performance +~30% #2058

Merged
merged 6 commits into main from termmap_perf
Jun 8, 2023

Conversation

@PSeitz PSeitz (Contributor) commented May 26, 2023

This contains many small changes to improve Termmap performance.
Most notably:

  • Specialized byte compare and equality routines instead of glibc calls.
  • ExpUnrolledLinkedList no longer contains inline items.

Allow comparing by hash only via the feature flag compare_hash_only:
with a good hash function, 64 bits should be enough to compare strings by
their hashes instead of byte-comparing the strings. Disabled by default.
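
For illustration, a minimal sketch of the compare_hash_only trade-off (names and signatures are assumptions of this sketch, not the PR's exact API):

    // Sketch only: with compare_hash_only enabled, key equality is
    // decided by the stored 64-bit hash alone; otherwise the hash is
    // a cheap pre-filter before the byte comparison.
    #[cfg(feature = "compare_hash_only")]
    fn keys_equal(stored_hash: u64, probe_hash: u64, _stored_key: &[u8], _probe_key: &[u8]) -> bool {
        // Relies on a good 64-bit hash: a collision between distinct
        // keys is assumed rare enough to ignore.
        stored_hash == probe_hash
    }

    #[cfg(not(feature = "compare_hash_only"))]
    fn keys_equal(stored_hash: u64, probe_hash: u64, stored_key: &[u8], probe_key: &[u8]) -> bool {
        stored_hash == probe_hash && stored_key == probe_key
    }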

Benchmark without compare_hash_only

CreateHashMap/alice/174693
                        time:   [642.23 µs 643.80 µs 645.24 µs]
                        thrpt:  [258.20 MiB/s 258.78 MiB/s 259.41 MiB/s]
                 change:
                        time:   [-14.429% -13.303% -12.348%] (p = 0.00 < 0.05)
                        thrpt:  [+14.088% +15.344% +16.862%]
                        Performance has improved.
CreateHashMap/alice_expull/174693
                        time:   [877.03 µs 880.44 µs 884.67 µs]
                        thrpt:  [188.32 MiB/s 189.22 MiB/s 189.96 MiB/s]
                 change:
                        time:   [-26.460% -26.274% -26.091%] (p = 0.00 < 0.05)
                        thrpt:  [+35.301% +35.637% +35.981%]
                        Performance has improved.
CreateHashMap/numbers_zipf/8000000
                        time:   [9.1198 ms 9.1573 ms 9.1961 ms]
                        thrpt:  [829.64 MiB/s 833.15 MiB/s 836.57 MiB/s]
                 change:
                        time:   [-35.229% -34.828% -34.384%] (p = 0.00 < 0.05)
                        thrpt:  [+52.403% +53.440% +54.390%]
                        Performance has improved.

Benchmark with compare_hash_only

CreateHashMap/alice/174693
                        time:   [503.08 µs 505.07 µs 507.22 µs]
                        thrpt:  [328.46 MiB/s 329.86 MiB/s 331.16 MiB/s]
                 change:
                        time:   [-37.334% -36.881% -36.399%] (p = 0.00 < 0.05)
                        thrpt:  [+57.231% +58.432% +59.576%]
                        Performance has improved.
CreateHashMap/alice_expull/174693
                        time:   [863.67 µs 873.21 µs 883.15 µs]
                        thrpt:  [188.64 MiB/s 190.79 MiB/s 192.90 MiB/s]
                 change:
                        time:   [-35.224% -34.352% -33.461%] (p = 0.00 < 0.05)
                        thrpt:  [+50.288% +52.328% +54.379%]
                        Performance has improved.

@PSeitz PSeitz force-pushed the termmap_perf branch 3 times, most recently from 07b5fd5 to 241cc9b on May 26, 2023 10:51
Comment on lines 49 to 52
loop {
    if left.len() < SIZE || right.len() < SIZE {
        break;
    }
@fulmicoton fulmicoton (Collaborator) May 29, 2023

Suggested change:
-    loop {
-        if left.len() < SIZE || right.len() < SIZE {
-            break;
-        }
+    assert_eq!(left.len(), right.len()); //< I assume this will go away after inlining?
+    while left.len() >= SIZE { //< I assume this one cannot go away after inlining? (but maybe LLVM is stronger than I think)

Collaborator

I confirmed with godbolt this does two checks with the current code.

Contributor Author

This does not remove the bounds checks.
It seems the compiler can't deduce, from the equal lengths and identical increments on left and right, that the bounds checks are redundant.

Collaborator

hmm I checked with godbolt and that's not true...
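
As an aside, an iterator-based formulation sidesteps the question entirely; a minimal sketch (illustrative only, not the PR's fastcmp code): chunks_exact and zip never index, so there are no bounds checks for LLVM to elide.

    // Illustrative sketch: compare 8-byte chunks as u64 values,
    // then compare the short remainders.
    fn fast_eq(left: &[u8], right: &[u8]) -> bool {
        const SIZE: usize = 8;
        if left.len() != right.len() {
            return false;
        }
        let mut l = left.chunks_exact(SIZE);
        let mut r = right.chunks_exact(SIZE);
        for (a, b) in (&mut l).zip(&mut r) {
            // One u64 load per side, one compare per 8 bytes.
            let a = u64::from_ne_bytes(a.try_into().unwrap());
            let b = u64::from_ne_bytes(b.try_into().unwrap());
            if a != b {
                return false;
            }
        }
        l.remainder() == r.remainder() // trailing < 8 bytes
    }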

stacker/src/fastcmp.rs (outdated conversation, resolved)
stacker/src/fastcmp.rs (outdated conversation, resolved)
Comment on lines +70 to +74
dst[0] = src[0];
if len >= 2 {
    double_copy_trick::<2>(src, dst);
}
@fulmicoton fulmicoton (Collaborator) May 29, 2023

I don't understand... Is it faster than this?

if len == 1 {
    dst[0] = src[0];
} else {
    // len >= 2
    double_copy_trick::<2>(src, dst);
}

Contributor Author

I think both are equal performance-wise, but I didn't test this case much.
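
For readers unfamiliar with the pattern, a minimal sketch of a double-copy trick of this shape (an assumption-laden sketch, not necessarily the PR's exact fastcpy code): any length in [SIZE, 2*SIZE] is covered by one fixed-size copy from the front and one from the back, which may overlap in the middle.

    // Copies a slice whose length is in [SIZE, 2 * SIZE] using exactly
    // two constant-size copies. The two copies may overlap in the
    // middle, which is fine because src and dst do not alias.
    #[inline]
    fn double_copy_trick<const SIZE: usize>(src: &[u8], dst: &mut [u8]) {
        debug_assert!(src.len() >= SIZE && src.len() <= 2 * SIZE);
        debug_assert_eq!(src.len(), dst.len());
        let len = src.len();
        dst[..SIZE].copy_from_slice(&src[..SIZE]);
        dst[len - SIZE..].copy_from_slice(&src[len - SIZE..]);
    }

The point of the constant-size copies is that each compiles to a couple of fixed-width loads and stores instead of a call into memcpy.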

@@ -284,31 +304,36 @@ impl ArenaHashMap {
        }
        let hash = self.get_hash(key);
        let mut probe = self.probe(hash);
        let mut bucket = probe.next_probe();
Collaborator

What is the goal of putting these outside of the loop?

Contributor Author

To fetch the next bucket before the loop jmp; the comment explaining it is at the end of the loop.
The idea is to allow unconditionally prefetching data (without involving the branch predictor)

Collaborator

What do you mean by unconditionally? This is not a conditional jump here.
Also...
I would not be surprised if, for LLVM, the two are exactly the same...

Contributor Author

Sorry, that part about the branch predictor didn't make sense. It helps the instruction prefetcher a little bit, I think.
The two versions are indeed the same, but only with unsafe access; with bounds checks LLVM can't optimize the code as well.

https://godbolt.org/z/x7xc8sTcY
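
Schematically, the rotated loop being discussed has this shape (illustrative fragment only; insert_at is a hypothetical stand-in, and the real method also handles hash and key matches inside the loop):

    let mut probe = self.probe(hash);
    let mut bucket = probe.next_probe();
    loop {
        let kv: KeyValue = self.table[bucket];
        if kv.is_empty() {
            return self.insert_at(bucket, hash, key);
        }
        // ... check kv's hash against `hash`, compare keys on match ...
        // Fetch the next bucket before the back-edge so the load of
        // the next table slot is issued before the jump.
        bucket = probe.next_probe();
    }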

@fulmicoton (Collaborator)

Cool stuff!

}

impl Page {
    fn new(page_id: usize) -> Page {
        Page {
            page_id,
            len: 0,
-           data: vec![0u8; PAGE_SIZE].into_boxed_slice(),
+           data: Box::new([0u8; PAGE_SIZE]),
Collaborator

Did it help? If it does not, I'd rather avoid the stack allocation.

Contributor Author

Both are heap allocations. It removes the bounds check in combination with the mask, since the size is known to the compiler

Collaborator

They both end up on the heap, but Rust will often allocate the array on the stack before moving it to the heap, cf. rust-lang/rust#53827.

Contributor Author

I don't see that issue when looking at the assembly, but vec![0u8; PAGE_SIZE] is more efficient since it calls alloc_zeroed vs. alloc + memset.

I'll switch to vec![0u8; PAGE_SIZE].into_boxed_slice().try_into().unwrap()

@codecov-commenter commented May 29, 2023

Codecov Report

Merging #2058 (4ccfd12) into main (7ee78bd) will increase coverage by 0.01%.
The diff coverage is 98.08%.


@@            Coverage Diff             @@
##             main    #2058      +/-   ##
==========================================
+ Coverage   94.37%   94.39%   +0.01%     
==========================================
  Files         319      321       +2     
  Lines       59951    60190     +239     
==========================================
+ Hits        56581    56817     +236     
- Misses       3370     3373       +3     
Impacted Files Coverage Δ
stacker/src/fastcpy.rs 94.00% <94.00%> (ø)
stacker/src/fastcmp.rs 99.09% <99.09%> (ø)
src/indexer/segment_writer.rs 97.78% <100.00%> (ø)
stacker/src/arena_hashmap.rs 97.50% <100.00%> (+1.25%) ⬆️
stacker/src/expull.rs 99.50% <100.00%> (+0.92%) ⬆️
stacker/src/memory_arena.rs 99.31% <100.00%> (+0.06%) ⬆️

... and 2 files with indirect coverage changes

loop {
    let bucket = probe.next_probe();
    let kv: KeyValue = self.table[bucket];
    if kv.is_empty() {
Collaborator

In both cases the conditional jump is here, and it happens after fetching the KeyValue.

@@ -108,59 +84,68 @@ impl<'a> ExpUnrolledLinkedListWriter<'a> {

    #[inline]
    pub fn extend_from_slice(&mut self, mut buf: &[u8]) {
        debug_assert!(buf.len() <= 32); // Check memcpy assumptions.
        debug_assert!(!buf.is_empty()); // Check memcpy assumptions.
@fulmicoton fulmicoton (Collaborator) May 31, 2023

Suggested change:
-        debug_assert!(!buf.is_empty()); // Check memcpy assumptions.

What is this for?

Contributor Author

extend_from_slice is called only for byte slices of length 1-18 (?), and the memcpy could be optimized for that, which I had tested.
I removed it, since I didn't want to add too many implicit assumptions.
The debug_asserts are outdated.

Collaborator

Did you change something? I do not see any change yet.

Contributor Author

I updated it now

@fulmicoton fulmicoton (Collaborator) Jun 1, 2023

I still see the debug_asserts

Contributor Author

They are removed in 82ce324.

stacker/src/expull.rs (outdated conversation, resolved)
}

#[inline]
fn add_page(&mut self, len: usize) -> usize {
Collaborator

I don't understand the semantics of this method. Can you add some doc?
Why len, why usize?

Collaborator

Also at 1MB a pop, I don't think optimizing this matters.

Contributor Author

It's inlined into the hot path, though. I refactored it a little bit.

Collaborator

I suspect inline is useless too.

Contributor Author

Yes, I removed that in the refactoring. I didn't see any impact on performance, though.

PSeitz added 6 commits June 1, 2023 14:32
@PSeitz PSeitz merged commit 27f2020 into main Jun 8, 2023
@PSeitz PSeitz deleted the termmap_perf branch June 8, 2023 09:13