Evaluate using prefetching on slice's binary_search algorithm #37251
It seems there's no support for the prefetch intrinsic. |
One can probably work around that by declaring the LLVM intrinsic as an external function behind the appropriate feature gate. |
Do you know how I can enable that feature in the compiler crates? I declared an external function like this:
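(The actual snippet isn't preserved in this copy; a declaration along these lines, using the unstable link_llvm_intrinsics feature as an assumed mechanism, could look like the following.)

// Assumed reconstruction, not the original snippet: binds LLVM's prefetch
// intrinsic directly via the unstable link_llvm_intrinsics feature.
#![feature(link_llvm_intrinsics)]

extern "C" {
    // declare void @llvm.prefetch(i8* addr, i32 rw, i32 locality, i32 cache_type)
    #[link_name = "llvm.prefetch"]
    fn llvm_prefetch(data: *const i8, rw: i32, locality: i32, cache_type: i32);
}

// A call site would then look like:
//   unsafe { llvm_prefetch(ptr as *const i8, /*read*/ 0, /*locality*/ 3, /*data cache*/ 1) };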
and used it in binary_search. But when building other crates the compiler bails out with:
|
I've never enabled/disabled features in the compiler crates (in my own crates I just add the feature attribute to the root of the crate that uses it). Maybe @alexcrichton knows, or knows somebody who does. |
The error there looks unrelated to the prefetch intrinsic. |
Ok, I'll try to add the prefetch intrinsic. Was it intentionally left out? |
Nah it probably just hasn't gotten around to getting bound yet, very little is intentionally left out! |
Hmm, I'm stuck adding the prefetch intrinsic. I added it in:
But I still get the error:
Is there some bootstrapping necessary? |
The bootstrap compiler won't recognize the intrinsic, so you need to guard the definition with a stage cfg. A similar problem will show up when you try to use the intrinsic in binary_search. I guess the easiest way to solve that without code duplication will be to define a wrapper function. |
@rkruppe You mean #[cfg(not(stage0))]? |
Yes, I confused "which stage is the compiler" with "which stage are we compiling the source code for". |
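A minimal sketch of that guard-plus-wrapper idea, assuming the intrinsic ends up being called prefetch_read_data and that stage0 is the cfg in question (illustrative, not the actual patch):

// Inside libcore, where #![feature(intrinsics)] is already enabled.
#[cfg(not(stage0))]
extern "rust-intrinsic" {
    fn prefetch_read_data<T>(data: *const T, locality: i32);
}

// Wrapper so the same call sites build with both the bootstrap (stage0)
// compiler, which doesn't know the intrinsic, and the freshly built one.
#[inline(always)]
unsafe fn prefetch<T>(p: *const T) {
    #[cfg(not(stage0))]
    prefetch_read_data(p, 3); // locality 3 roughly corresponds to prefetcht0
    #[cfg(stage0)]
    let _ = p; // no-op on the bootstrap compiler
}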
Ok, I added the prefetch intrinsic and benchmarked a small stand-alone binsearch. https://gist.github.com/hirschenberger/dcf3fc6f2b7ddad08c889adfa8f5f1cf With this setup, I got the following results:
That's approx. 10%; I don't know whether it's worth it or whether this microbenchmark is significant. I think the compiler is already doing good work. Here's the output of
Update: I did some more tests and got the runtime down even further. The change prefetches the destination slice after every iteration. |
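For reference, the shape of that change, sketched against the split_at-style loop and an assumed prefetch_read_data(ptr, locality) intrinsic (not the exact gist code):

#![feature(core_intrinsics)] // nightly-only sketch
use core::cmp::Ordering::{Equal, Greater, Less};
use core::intrinsics::prefetch_read_data;

// After every split, prefetch the start of the slice chosen for the next
// iteration ("the destination slice").
fn binary_search_prefetch<T: Ord>(slice: &[T], x: &T) -> Result<usize, usize> {
    let mut base = 0usize;
    let mut s = slice;
    loop {
        let (head, tail) = s.split_at(s.len() >> 1);
        if tail.is_empty() {
            return Err(base);
        }
        match x.cmp(&tail[0]) {
            Less => s = head,                 // continue in the head
            Greater => {                      // continue in the tail
                base += head.len() + 1;
                s = &tail[1..];
            }
            Equal => return Ok(base + head.len()),
        }
        unsafe { prefetch_read_data(s.as_ptr(), 3) }; // *** the prefetch ***
    }
}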
@hirschenberger that's a 20% perf improvement! |
Some numbers using
|
@hirschenberger Is the benchmark using prefetching correctly? It looks like it always prefetches the element that is about to be accessed anyway. |
I also tested prefetching other locations; I tried several prefetching patterns and this one got the best results. |
Of course, because that's not actually prefetching anything in advance (the data is used almost immediately after the prefetch). In the SO answer the prefetch was performed upon entering the binary search loop, for both possible outcomes. How does that fare against the current implementation? Also, notice that since the benchmark performs a binary search for the same value at each iteration, the sequence of data accesses never changes, so they are likely to be in cache after the first iteration: the vector is 1e7 elements (~80 MB of u64s), far larger than any cache, but each search only ever touches the same ~23 elements. |
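Sketched with the same assumed prefetch_read_data hook as above, the strategy from the SO answer looks roughly like this (the offsets approximate the next probe for either outcome; illustration only, not the SO code):

// Assumes the same nightly setup (#![feature(core_intrinsics)]) as the earlier sketch.
// Returns Some(index of a match); insertion points are omitted for brevity.
fn binary_search_prefetch_both(s: &[u64], x: u64) -> Option<usize> {
    if s.is_empty() {
        return None;
    }
    let ptr = s.as_ptr();
    let mut base = 0usize;
    let mut size = s.len();
    while size > 1 {
        let half = size / 2;
        unsafe {
            // Start fetching the element each branch would probe next, so the
            // loads overlap with the comparison below instead of following it.
            core::intrinsics::prefetch_read_data(ptr.add(base + half / 2), 3);
            core::intrinsics::prefetch_read_data(ptr.add(base + half + half / 2), 3);
        }
        if s[base + half] <= x {
            base += half;
        }
        size -= half;
    }
    if s[base] == x { Some(base) } else { None }
}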
Here is a Gist with the LLVM-IR and ASM of the prefetching binsearch.
It performs worse: 120us vs. 82us.
I don't understand what you mean. Isn't the slice split in half at every iteration, sometimes using the head, sometimes the tail for further iteration, which makes it unpredictable for the HW-prefetcher and is the reason why manual prefetching is beneficial? |
The main loop in the gist is:

.LBB0_3:
subq %rdi, %rsi
je .LBB0_4 ; tail.is_empty()
leaq (%rax,%rdi,8), %rbx ; the address being accessed
; rax points to s[0], rdi is head.len() -> rbx points to tail[0]
cmpq $2222222, (%rbx) ; *** the memory access ***
movl $1, %ecx
cmovaq %r10, %rcx
movl $0, %edx
cmovneq %rcx, %rdx
cmpq $-1, %rdx
je .LBB0_8 ; Less, continue on tail
testq %rdx, %rdx
je .LBB0_11 ; Equal, exit loop
movq %rdi, %rsi ; s = head (implemented as s.len = head.len)
jmp .LBB0_9 ; Greater, continue on head
.p2align 4, 0x90
.LBB0_8:
leaq 1(%rdi,%r11), %r11
# tail[1..]
addq $8, %rbx ; add (the size of) one element to the start pointer
decq %rsi ; subtract 1 to the length
movq %rbx, %rax ; and move it to s instead of tail
.LBB0_9:
; rax is the start pointer of s, rsi its length
prefetcht0 (%rax) ; *** the prefetch ***
movq %rsi, %rdi
shrq %rdi
cmpq %rdi, %rsi
jae .LBB0_3 ; LLVM was unable to remove the bound checks
jmp .LBB0_10 ; panic on illegal access
.p2align 4, 0x90

Yes, in the general case the splitting is unpredictable and results in an unpredictable access pattern, in which the element probed next is the middle of the head half of the time and the middle of the tail the other half. I will try benchmarking on randomized data as soon as the prefetch intrinsic hits nightly. I hope that will point out in which cases the current prefetch strategy is superior to the one in the stackoverflow post. |
See also my comment here |
This paper about the branch predictor and cache effects on binary searches is worth reading. https://arxiv.org/abs/1509.05053 |
Triage: not sure if this optimization ever landed or not, though it seems the pre-requisites did. |
I think this can be closed after #45333 landed. |
Closing as fixed. |
This SO answer shows a 20% speed-up on binary_search with prefetching enabled, at the cost of some added complexity. We should consider adding this optimization if it turns out to be worth it.