
Evaluate using prefetching on slice's binary_search algorithm #37251

Closed
gnzlbg opened this issue Oct 18, 2016 · 24 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@gnzlbg
Contributor

gnzlbg commented Oct 18, 2016

This SO answer shows a 20% speed-up on binary_search with prefetching enabled, at the cost of:

  • polluting up to two cache lines when an element is found (the two prefetched elements go unused),
  • increasing code size with prefetch instructions.

We should consider adding this optimization if it turns out to be worth it.
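For concreteness, here is the strategy from the SO answer transliterated to Rust as a sketch. The prefetch_read wrapper is hypothetical: it assumes the nightly-only std::intrinsics::prefetch_read_data (behind #![feature(core_intrinsics)]) with the (data, locality) signature it had when it landed, so treat this as an illustration rather than a proposed patch:

#![feature(core_intrinsics)]

// Hypothetical wrapper; locality 3 means "keep in all cache levels"
// (this lowers to llvm.prefetch).
#[inline(always)]
unsafe fn prefetch_read<T>(p: *const T) {
    std::intrinsics::prefetch_read_data(p, 3);
}

// Binary search that, on each iteration, prefetches the two elements
// that could be compared on the *next* iteration (the midpoints of
// both halves), hiding one level of memory latency.
fn binary_search_prefetch(s: &[usize], x: usize) -> Option<usize> {
    let mut base = 0;
    let mut len = s.len();
    while len > 1 {
        let half = len / 2;
        unsafe {
            prefetch_read(s.as_ptr().offset((base + half / 2) as isize));
            prefetch_read(s.as_ptr().offset((base + half + half / 2) as isize));
        }
        if s[base + half] <= x {
            base += half;
        }
        len -= half;
    }
    if len == 1 && s[base] == x { Some(base) } else { None }
}

fn main() {
    let v: Vec<usize> = (0..1024).map(|i| i * 2).collect();
    assert_eq!(binary_search_prefetch(&v, 46), Some(23));
    assert_eq!(binary_search_prefetch(&v, 47), None);
}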

@steveklabnik steveklabnik added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed A-libs labels Mar 24, 2017
@hirschenberger
Contributor

It seems there's no support for the llvm.prefetch intrinsic in librustc_trans/intrinsic.rs. Is this intentional?

@gnzlbg
Contributor Author

gnzlbg commented Apr 12, 2017

One can probably work around that using the link_llvm_intrinsics feature.

@hirschenberger
Contributor

Do you know how I can enable the link_llvm_intrinsics feature when building the compiler, or do I have to enable it in every crate explicitly?

I declared an external function like this:

// Requires #![feature(link_llvm_intrinsics)] at the crate root.
extern {
    #[link_name = "llvm.prefetch"]
    fn prefetch(address: *const i8, rw: i32, locality: i32, cache_type: i32);
}

and used it in the binary_search implementation. But when building other crates the compiler bails out with:

Cannot invoke an intrinsic other than donothing, patchpoint or statepoint
  invoke void @llvm.prefetch(i8* %37, i32 0, i32 1, i32 1)
          to label %bb13 unwind label %cleanup

@gnzlbg
Contributor Author

gnzlbg commented Apr 13, 2017

Do you know how I can enable the link_llvm_intrinsics feature when building the compiler?

I've never enabled/disabled features in the compiler crates (in my own crates I just add the feature attribute to the root of the crate that uses it). Maybe @alexcrichton knows, or knows somebody who does.

@alexcrichton
Member

The error there looks unrelated to whether the link_llvm_intrinsics feature is enabled; that is specifically a codegen error. I'm not entirely sure why it shows up, but it may mean that support for the intrinsic needs to be added in librustc_trans/intrinsic.rs.

@hirschenberger
Contributor

OK, I'll try to add the prefetch intrinsic. Was it intentionally left out?

@alexcrichton
Member

Nah, it probably just hasn't been bound yet; very little is intentionally left out!

@hirschenberger
Contributor

Hmm, I'm stuck adding the prefetch intrinsic. I added it in:

  • libcore/intrinsics.rs
  • librustc_trans/context.rs
  • librustc_trans/intrinsic.rs
  • librustc_typeck/check/intrinsic.rs

But I still get the error:

error[E0093]: unrecognized intrinsic function: `prefetch`
   --> src/libcore/intrinsics.rs:569:5
    |
569 |     pub fn prefetch<T>(data: *const T, rw: i32, locality: i32, cache: i32);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ unrecognized intrinsic

Is there some bootstrapping necessary?

@hanna-kruppe
Contributor

The bootstrap compiler won't recognize the intrinsic, so you need to guard the definition with #[cfg(not(stage1))].

A similar problem will show up when you try to use the intrinsic in libcore. I guess the easiest way to solve that without code duplication is to define a #[cfg(stage1)] stub that does nothing.

@hirschenberger
Contributor

@rkruppe You mean stage0, right?

@hanna-kruppe
Contributor

Yes, I confused "which stage is the compiler" with "which stage are we compiling the source code for".
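Concretely, the guard (with the stage0 correction applied) would look something like this in libcore/intrinsics.rs; a sketch, not the exact patch:

extern "rust-intrinsic" {
    // Hidden from the stage0 bootstrap compiler, which does not know
    // the new intrinsic yet.
    #[cfg(not(stage0))]
    pub fn prefetch<T>(data: *const T, rw: i32, locality: i32, cache: i32);
}

// Do-nothing stub so code calling the intrinsic still compiles when
// built by the stage0 compiler.
#[cfg(stage0)]
pub unsafe fn prefetch<T>(_data: *const T, _rw: i32, _locality: i32, _cache: i32) {}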

@hirschenberger
Contributor

hirschenberger commented Apr 18, 2017

Ok, I added the prefetch intrinsic and benchmarked a small stand-alone binsearch.

https://gist.github.com/hirschenberger/dcf3fc6f2b7ddad08c889adfa8f5f1cf

With this setup, I got the following results:

  • Vanilla: test bench_binsearch ... bench: 105 ns/iter (+/- 0)
  • With prefetch: test bench_binsearch ... bench: 95 ns/iter (+/- 0)

That's approximately 10%; I don't know whether it's worth it or whether this microbenchmark is meaningful. I think the compiler is already doing a good job. Here's the output of perf for both versions:

  • Vanilla:
           320.577 L1-dcache-load-misses     #    0,46% of all L1-dcache hits   [66,26%]
        69.887.806 L1-dcache-loads                                              [67,23%]
            35.613 L1-dcache-prefetches                                         [66,60%]
  • With prefetching:
           190.783 L1-dcache-load-misses     #    0,14% of all L1-dcache hits   [67,13%]
       133.433.668 L1-dcache-loads                                              [67,00%]
            51.848 L1-dcache-prefetches                                         [65,94%]

Update:

I did some more tests and got the runtime down to: test bench_binsearch ... bench: 81 ns/iter (+/- 0)

The change prefetches the destination slice after every iteration.
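In terms of the libcore binary_search_by loop of the time (the split_at/tail[0] shape analyzed further down), the change amounts to something like the following sketch; prefetch_read is the same hypothetical wrapper as in the earlier snippet:

#![feature(core_intrinsics)]

use std::cmp::Ordering::*;

// Hypothetical wrapper around the nightly prefetch intrinsic.
#[inline(always)]
unsafe fn prefetch_read<T>(p: *const T) {
    std::intrinsics::prefetch_read_data(p, 3);
}

// Same shape as libcore's binary_search_by, plus one prefetch of the
// surviving sub-slice's start at the end of each iteration.
fn binary_search_prefetch<T: Ord>(slice: &[T], x: &T) -> Result<usize, usize> {
    let mut base = 0;
    let mut s = slice;
    loop {
        let (head, tail) = s.split_at(s.len() >> 1);
        if tail.is_empty() {
            return Err(base);
        }
        match tail[0].cmp(x) {
            Less => {
                base += head.len() + 1;
                s = &tail[1..];
            }
            Greater => s = head,
            Equal => return Ok(base + head.len()),
        }
        // Start pulling the next iteration's range toward the cache
        // while the loop-control code runs.
        unsafe { prefetch_read(s.as_ptr()) };
    }
}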

@gnzlbg
Contributor Author

gnzlbg commented Apr 18, 2017

@hirschenberger that's a 20% perf improvement!

@hirschenberger
Contributor

Some numbers using slice::binary_search:

  • Current nightly: test bench_binsearch ... bench: 149 ns/iter (+/- 1)
  • With prefetching: test bench_binsearch ... bench: 97 ns/iter (+/- 8)

@ranma42
Contributor

ranma42 commented Apr 26, 2017

@hirschenberger Is the benchmark using prefetching correctly?

It looks like it always prefetches s (which would be the first element of the range being considered) instead of something like s + (s.len() >> 1).
Also, I find it somewhat surprising that there is only one prefetch invocation: for binary search you usually prefetch both possible choices (see the SO answer from the OP).

@hirschenberger
Contributor

The s slice is prefetched for the split_at call in the next iteration. Prefetching s.as_ptr().offset((s.len() >> 1) as isize) gives slightly worse performance.

I also tested prefetching the head and tail splits as in the SO answer, but got no performance benefit. I think by using split_at, the two slices are already in the cache.

I tested several prefetching patterns and this one gave the best results.

@ranma42
Contributor

ranma42 commented Apr 26, 2017

The s slice is prefetched for the split_at call in the next iteration.

split_at does not access the contents of the s slice, so it should not be affected by the prefetching. The only place that should perform data accesses within the slice is tail[0].
Could you post the LLVM IR (and/or assembly output) in a gist?

Prefetching s.as_ptr().offset((s.len() >> 1) as isize) gives slightly worse performance.

Of course, because that's not actually prefetching anything in advance (the data is used almost immediately after the prefetch). In the SO answer the prefetch was performed upon entering the binary search loop on both possible outcomes. How does that fare against the current implementation?

Also, notice that since the benchmark performs a binary search for the same value on every iteration, the sequence of data accesses never changes, so the accessed elements are likely to be in cache after the first iteration: the vector has 1e7 usize elements, but a normal binary search will only touch ~24 of them (⌈log2(10^7)⌉ = 24), which easily fits even in a small L1 cache (prefetching would double this number).

@hirschenberger
Contributor

Here is a Gist with the LLVM-IR and ASM of the prefetching binsearch.

Of course, because that's not actually prefetching anything in advance (the data is used almost immediately after the prefetch). In the SO answer the prefetch was performed upon entering the binary search loop on both possible outcomes. How does that fare against the current implementation?

It performs worse: 120ns vs. 82ns.

Also, notice that since the benchmark performs a binary search for the same value on every iteration, the sequence of data accesses never changes, so the accessed elements are likely to be in cache after the first iteration: the vector has 1e7 usize elements, but a normal binary search will only touch ~24 of them (⌈log2(10^7)⌉ = 24), which easily fits even in a small L1 cache (prefetching would double this number).

I don't understand what you mean. Isn't the slice split in half at every iteration, sometimes continuing with the head and sometimes with the tail, which makes the access pattern unpredictable for the HW prefetcher and is the reason manual prefetching is beneficial?

@ranma42
Contributor

ranma42 commented Apr 26, 2017

The main loop in the gist is:

.LBB0_3:
	subq	%rdi, %rsi
	je	.LBB0_4      ; tail.is_empty()
	leaq	(%rax,%rdi,8), %rbx   ; the address being accessed
; rax points to s[0], rdi is head.len() -> rbx points to tail[0]
	cmpq	$2222222, (%rbx)      ; *** the memory access ***
	movl	$1, %ecx
	cmovaq	%r10, %rcx
	movl	$0, %edx
	cmovneq	%rcx, %rdx
	cmpq	$-1, %rdx
	je	.LBB0_8      ; Less, continue on tail
	testq	%rdx, %rdx
	je	.LBB0_11     ; Equal, exit loop
	movq	%rdi, %rsi ; s = head (implemented as s.len = head.len)
	jmp	.LBB0_9      ; Greater, continue on head
	.p2align	4, 0x90
.LBB0_8:
	leaq	1(%rdi,%r11), %r11
# tail[1..]
	addq	$8, %rbx ; add (the size of) one element to the start pointer
	decq	%rsi     ; subtract 1 to the length
	movq	%rbx, %rax ; and move it to s instead of tail
.LBB0_9:
; rax is the start pointer of s, rsi its length
	prefetcht0	(%rax)        ; *** the prefetch ***
	movq	%rsi, %rdi
	shrq	%rdi
	cmpq	%rdi, %rsi
	jae	.LBB0_3      ; LLVM was unable to remove the bound checks
	jmp	.LBB0_10     ; panic on illegal access
	.p2align	4, 0x90

(the # comments are mine).
As expected, the only memory accesses it performs are those from cmpq $2222222, (%rbx), but they are not at the address %rax. Instead, they are at (%rax,%rdi,8), i.e. at %rax + %rdi * 8.

Yes, in the general case the splitting is unpredictable and results in an unpredictable access pattern: half of the time the element at the middle of head is accessed, and half of the time the element at the middle of tail. It is not clear why prefetching the element at the beginning of the appropriate sub-slice would be beneficial (although for the last iterations it hardly matters, as they land on the same cache line).

I will try benchmarking on randomized data as soon as the prefetch intrinsic hits nightly. I hope that will show in which cases the current prefetch strategy is superior to the one in the Stack Overflow post.
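A randomized-target variant of the benchmark could look like this sketch (it assumes the rand crate with its old two-argument gen_range, plus the unstable test bench harness):

#![feature(test)]

extern crate rand;
extern crate test;

use rand::Rng;
use test::{black_box, Bencher};

#[bench]
fn bench_binsearch_random(b: &mut Bencher) {
    let v: Vec<usize> = (0..10_000_000).collect();
    let mut rng = rand::thread_rng();
    b.iter(|| {
        // A fresh target on each iteration, so the ~24 probed cache
        // lines differ between iterations instead of staying hot in L1.
        let x = rng.gen_range(0, v.len());
        black_box(v.binary_search(&x))
    });
}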

hirschenberger added a commit to hirschenberger/rust that referenced this issue Jun 1, 2017
@nagisa
Member

nagisa commented Jun 1, 2017

See also my comment here

@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 26, 2017
@khuey
Contributor

khuey commented Nov 23, 2017

This paper on branch-predictor and cache effects in binary search is worth reading: https://arxiv.org/abs/1509.05053

@steveklabnik
Member

Triage: not sure if this optimization ever landed or not, though it seems the prerequisites did.

@hirschenberger
Contributor

I think this can be closed after #45333 landed.
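For the record, assuming #45333 is the change that exposed LLVM's prefetch through the intrinsics module, calling it on nightly looks roughly like this (signature as of when it landed; it may have changed since):

#![feature(core_intrinsics)]

fn main() {
    let v: Vec<u32> = (0..64).collect();
    unsafe {
        // (data, locality): locality ranges over 0..=3, where 3 means
        // "keep in all cache levels"; this lowers to llvm.prefetch.
        std::intrinsics::prefetch_read_data(v.as_ptr(), 3);
    }
    assert_eq!(v.binary_search(&42), Ok(42));
}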

@jonas-schievink
Contributor

Closing as fixed.
