-
-
Notifications
You must be signed in to change notification settings - Fork 709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Termmap Indexing Performance +~30% #2058
Conversation
07b5fd5
to
241cc9b
Compare
stacker/src/fastcmp.rs
Outdated
loop { | ||
if left.len() < SIZE || right.len() < SIZE { | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loop { | |
if left.len() < SIZE || right.len() < SIZE { | |
break; | |
} | |
assert_eq!(left.len(), right.len()); //< I assume this will go away after inlining? | |
while left.len() >= SIZE { //< I assume this one cannot go away after inlining? (but maybe LLVM is stronger than I think) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirmed with godbolt this does two checks with the current code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not remove the bounds checks.
It seems that the compiler can't deduce from same length + same increments on left and right to remove bounds checks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I checked with godbolt and that's not true...
dst[0] = src[0]; | ||
if len >= 2 { | ||
double_copy_trick::<2>(src, dst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand... Is it faster than?
if len == 1 {
dst[0] = src[0];
} else {
// len >= 2
double_copy_trick::<2>(src, dst);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think both are equal performance wise, but I didn't test much on this case
@@ -284,31 +304,36 @@ impl ArenaHashMap { | |||
} | |||
let hash = self.get_hash(key); | |||
let mut probe = self.probe(hash); | |||
let mut bucket = probe.next_probe(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the goal of putting these outside of the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fetch the next bucket before the loop jmp. The comment for it is at the end of the loop.
The idea behind it is to allow unconditionally prefetching data (without the branch predictor)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by unconditionally: this is not a conditional jump here.
also...
I would not be surprised if for LLVM, the two are exactly the same...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, that part with the branch predictor didn't make sense. It helps the instruction prefetcher a little bit I think.
The two version are indeed the same, but only with unsafe access. With bounds checks llvm can't optimize the code as well.
Cool stuff! |
stacker/src/memory_arena.rs
Outdated
} | ||
|
||
impl Page { | ||
fn new(page_id: usize) -> Page { | ||
Page { | ||
page_id, | ||
len: 0, | ||
data: vec![0u8; PAGE_SIZE].into_boxed_slice(), | ||
data: Box::new([0u8; PAGE_SIZE]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did it help? If it does not I'd rather avoid the stack allocation,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both are heap allocations. It removes the bounds check in combination with the mask, since the size is known to the compiler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They both end up on the heap, but Rust will often allocate the array on the stack before moving it to the heap, c.f. rust-lang/rust#53827
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see that issue looking at the assembly, but vec![0, PAGE_SIZE]
is more efficient since it calls alloc zeroed vs. alloc + memset
I"ll switch to vec![0u8; PAGE_SIZE].into_boxed_slice().try_into().unwrap()
Codecov Report
❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more. @@ Coverage Diff @@
## main #2058 +/- ##
==========================================
+ Coverage 94.37% 94.39% +0.01%
==========================================
Files 319 321 +2
Lines 59951 60190 +239
==========================================
+ Hits 56581 56817 +236
- Misses 3370 3373 +3
|
loop { | ||
let bucket = probe.next_probe(); | ||
let kv: KeyValue = self.table[bucket]; | ||
if kv.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in both case the condition jump is here, and happens after fetching KeyValue
stacker/src/expull.rs
Outdated
@@ -108,59 +84,68 @@ impl<'a> ExpUnrolledLinkedListWriter<'a> { | |||
|
|||
#[inline] | |||
pub fn extend_from_slice(&mut self, mut buf: &[u8]) { | |||
debug_assert!(buf.len() <= 32); // Check memcpy assumptions. | |||
debug_assert!(!buf.is_empty()); // Check memcpy assumptions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug_assert!(!buf.is_empty()); // Check memcpy assumptions. |
What is this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extend_from_slice
is called only for byte slices of 1-18 (?), and the memcpy could be optimized for that, which I had tested.
I removed it, since I didn't want to add too many implicit assumptions.
The debug_asserts are outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you change something? I do not see any change yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated it now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still see the debug_asserts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are removed 82ce324
stacker/src/memory_arena.rs
Outdated
} | ||
|
||
#[inline] | ||
fn add_page(&mut self, len: usize) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the semantics of this method. Can you add some doc?
Why len, why usize?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also at 1MB a pop, I don't think optimizing this matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's inlined to the hot path though. I refactored it a little bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect inline is useless too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I removed that in the refactoring. I didn't see any impact on performance though
This contains many small changes to improve Termmap performance. Most notably: * Specialized byte compare and equality versions, instead of glibc calls. * ExpUnrolledLinkedList to not contain inline items. Allow compare hash only via a feature flag compare_hash_only: 64bits should be enough with a good hash function to compare strings by their hashes instead of comparing the strings. Disabled by default CreateHashMap/alice/174693 time: [642.23 µs 643.80 µs 645.24 µs] thrpt: [258.20 MiB/s 258.78 MiB/s 259.41 MiB/s] change: time: [-14.429% -13.303% -12.348%] (p = 0.00 < 0.05) thrpt: [+14.088% +15.344% +16.862%] Performance has improved. CreateHashMap/alice_expull/174693 time: [877.03 µs 880.44 µs 884.67 µs] thrpt: [188.32 MiB/s 189.22 MiB/s 189.96 MiB/s] change: time: [-26.460% -26.274% -26.091%] (p = 0.00 < 0.05) thrpt: [+35.301% +35.637% +35.981%] Performance has improved. CreateHashMap/numbers_zipf/8000000 time: [9.1198 ms 9.1573 ms 9.1961 ms] thrpt: [829.64 MiB/s 833.15 MiB/s 836.57 MiB/s] change: time: [-35.229% -34.828% -34.384%] (p = 0.00 < 0.05) thrpt: [+52.403% +53.440% +54.390%] Performance has improved.
This contains many small changes to improve Termmap performance.
Most notably:
Allow compare hash only via a feature flag compare_hash_only:
64bits should be enough with a good hash function to compare strings by
their hashes instead of comparing the strings. Disabled by default
Benchmark without
compare_hash_only
Benchmark with
compare_hash_only