
Evaluate using prefetching on slice's binary_search algorithm #37251

Closed
gnzlbg opened this issue Oct 18, 2016 · 24 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@gnzlbg
Contributor

gnzlbg commented Oct 18, 2016

This SO answer shows a 20% speed-up on binary_search with prefetching enabled, at the cost of:

  • polluting up to two cache lines when an element is found (the two prefetched elements go unused),
  • increasing code size with prefetch instructions.

We should consider adding this optimization if it turns out to be worth it.
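For concreteness, here is the strategy from the SO answer transliterated to Rust as a sketch. The prefetch_read wrapper is hypothetical: it assumes the nightly-only std::intrinsics::prefetch_read_data (behind #![feature(core_intrinsics)]) with the (data, locality) signature it had when it landed, so treat this as an illustration rather than a proposed patch:

#![feature(core_intrinsics)]

// Hypothetical wrapper; locality 3 means "keep in all cache levels"
// (this lowers to llvm.prefetch).
#[inline(always)]
unsafe fn prefetch_read<T>(p: *const T) {
    std::intrinsics::prefetch_read_data(p, 3);
}

// Binary search that, on each iteration, prefetches the two elements
// that could be compared on the *next* iteration (the midpoints of
// both halves), hiding one level of memory latency.
fn binary_search_prefetch(s: &[usize], x: usize) -> Option<usize> {
    let mut base = 0;
    let mut len = s.len();
    while len > 1 {
        let half = len / 2;
        unsafe {
            prefetch_read(s.as_ptr().offset((base + half / 2) as isize));
            prefetch_read(s.as_ptr().offset((base + half + half / 2) as isize));
        }
        if s[base + half] <= x {
            base += half;
        }
        len -= half;
    }
    if len == 1 && s[base] == x { Some(base) } else { None }
}

fn main() {
    let v: Vec<usize> = (0..1024).map(|i| i * 2).collect();
    assert_eq!(binary_search_prefetch(&v, 46), Some(23));
    assert_eq!(binary_search_prefetch(&v, 47), None);
}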

@steveklabnik steveklabnik added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed A-libs labels Mar 24, 2017
@hirschenberger
Contributor

It seems there's no support for the llvm.prefetch intrinsic in librustc_trans/intrinsic.rs. Is this intentional?

@gnzlbg
Contributor Author

gnzlbg commented Apr 12, 2017

One can probably work around that using the link_llvm_intrinsics feature.

@hirschenberger
Contributor

Do you know how I can enable the link_llvm_intrinsics feature when building the compiler, or do I have to enable it in every crate explicitly?

I declared an external function like this:

// Requires #![feature(link_llvm_intrinsics)] at the crate root.
extern {
    #[link_name = "llvm.prefetch"]
    fn prefetch(address: *const i8, rw: i32, locality: i32, cache_type: i32);
}

and used it in the binary_search implementation. But when building other crates the compiler bails out with:

Cannot invoke an intrinsic other than donothing, patchpoint or statepoint
  invoke void @llvm.prefetch(i8* %37, i32 0, i32 1, i32 1)
          to label %bb13 unwind label %cleanup

@gnzlbg
Contributor Author

gnzlbg commented Apr 13, 2017

Do you know how I can enable the link_llvm_intrinsics feature when building the compiler?

I've never enabled/disabled features in the compiler crates (in my own crates I just add the feature attribute to the root of the crate that uses it). Maybe @alexcrichton knows, or knows somebody who does.

@alexcrichton
Member

The error there looks unrelated to whether the link_llvm_intrinsics feature is enabled; that is specifically a codegen error. I'm not entirely sure why it shows up, but it may mean that support for the intrinsic needs to be added in librustc_trans/intrinsic.rs.

@hirschenberger
Contributor

OK, I'll try to add the prefetch intrinsic. Was it intentionally left out?

@alexcrichton
Member

Nah, it probably just hasn't been bound yet; very little is intentionally left out!

@hirschenberger
Contributor

Hmm, I'm stuck adding the prefetch intrinsic. I added it in:

  • libcore/intrinsics.rs
  • librustc_trans/context.rs
  • librustc_trans/intrinsic.rs
  • librustc_typeck/check/intrinsic.rs

But I still get the error:

error[E0093]: unrecognized intrinsic function: `prefetch`
   --> src/libcore/intrinsics.rs:569:5
    |
569 |     pub fn prefetch<T>(data: *const T, rw: i32, locality: i32, cache: i32);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ unrecognized intrinsic

Is there some bootstrapping necessary?

@hanna-kruppe
Contributor

The bootstrap compiler won't recognize the intrinsic, so you need to guard the definition with #[cfg(not(stage1))].

A similar problem will show up when you try to use the intrinsic in libcore. I guess the easiest way to solve that without code duplication is to define a #[cfg(stage1)] stub that does nothing.

@hirschenberger
Contributor

@rkruppe You mean stage0, right?

@hanna-kruppe
Contributor

Yes, I confused "which stage is the compiler" with "which stage are we compiling the source code for".
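Concretely, the guard (with the stage0 correction applied) would look something like this in libcore/intrinsics.rs; a sketch, not the exact patch:

extern "rust-intrinsic" {
    // Hidden from the stage0 bootstrap compiler, which does not know
    // the new intrinsic yet.
    #[cfg(not(stage0))]
    pub fn prefetch<T>(data: *const T, rw: i32, locality: i32, cache: i32);
}

// Do-nothing stub so code calling the intrinsic still compiles when
// built by the stage0 compiler.
#[cfg(stage0)]
pub unsafe fn prefetch<T>(_data: *const T, _rw: i32, _locality: i32, _cache: i32) {}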

@hirschenberger
Contributor

hirschenberger commented Apr 18, 2017

Ok, I added the prefetch intrinsic and benchmarked a small stand-alone binsearch.

https://gist.github.com/hirschenberger/dcf3fc6f2b7ddad08c889adfa8f5f1cf

With this setup, I got the following results:

  • Vanilla: test bench_binsearch ... bench: 105 ns/iter (+/- 0)
  • With prefetch: test bench_binsearch ... bench: 95 ns/iter (+/- 0)

That's approximately 10%; I don't know whether it's worth it or whether this microbenchmark is meaningful. I think the compiler is already doing a good job. Here's the output of perf for both versions:

  • Vanilla:
           320.577 L1-dcache-load-misses     #    0,46% of all L1-dcache hits   [66,26%]
        69.887.806 L1-dcache-loads                                              [67,23%]
            35.613 L1-dcache-prefetches                                         [66,60%]
  • With prefetching:
           190.783 L1-dcache-load-misses     #    0,14% of all L1-dcache hits   [67,13%]
       133.433.668 L1-dcache-loads                                              [67,00%]
            51.848 L1-dcache-prefetches                                         [65,94%]

Update:

I did some more tests and got the runtime down to: test bench_binsearch ... bench: 81 ns/iter (+/- 0)

The change prefetches the destination slice after every iteration.
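In terms of the libcore binary_search_by loop of the time (the split_at/tail[0] shape analyzed further down), the change amounts to something like the following sketch; prefetch_read is the same hypothetical wrapper as in the earlier snippet:

#![feature(core_intrinsics)]

use std::cmp::Ordering::*;

// Hypothetical wrapper around the nightly prefetch intrinsic.
#[inline(always)]
unsafe fn prefetch_read<T>(p: *const T) {
    std::intrinsics::prefetch_read_data(p, 3);
}

// Same shape as libcore's binary_search_by, plus one prefetch of the
// surviving sub-slice's start at the end of each iteration.
fn binary_search_prefetch<T: Ord>(slice: &[T], x: &T) -> Result<usize, usize> {
    let mut base = 0;
    let mut s = slice;
    loop {
        let (head, tail) = s.split_at(s.len() >> 1);
        if tail.is_empty() {
            return Err(base);
        }
        match tail[0].cmp(x) {
            Less => {
                base += head.len() + 1;
                s = &tail[1..];
            }
            Greater => s = head,
            Equal => return Ok(base + head.len()),
        }
        // Start pulling the next iteration's range toward the cache
        // while the loop-control code runs.
        unsafe { prefetch_read(s.as_ptr()) };
    }
}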

@gnzlbg
Contributor Author

gnzlbg commented Apr 18, 2017

@hirschenberger that's a 20% perf improvement!

@hirschenberger
Contributor

Some numbers using slice::binary_search:

  • Current nightly: test bench_binsearch ... bench: 149 ns/iter (+/- 1)
  • With prefetching: test bench_binsearch ... bench: 97 ns/iter (+/- 8)

@ranma42
Contributor

ranma42 commented Apr 26, 2017

@hirschenberger Is the benchmark using prefetching correctly?

It looks like it always prefetches s (which would be the first element of the range being considered) instead of something like s + (s.len() >> 1).
Also, I find it somewhat surprising that there is only one prefetch invocation: for binary search you usually prefetch both possible choices (see the SO answer from the OP).

@hirschenberger
Contributor

The s slice is prefetched for the split_at call in the next iteration. Prefetching s.as_ptr().offset((s.len() >> 1) as isize) gives slightly worse performance.

I also tested prefetching the head and tail splits as in the SO answer, but got no performance benefit. I think by using split_at, the two slices are already in the cache.

I tested several prefetching patterns and this one gave the best results.

@ranma42
Contributor

ranma42 commented Apr 26, 2017

The s slice is prefetched for the split_at call in the next iteration.

split_at does not access the contents of the s slice, so it should not be affected by the prefetching. The only place that should perform data accesses within the slice is tail[0].
Could you post the LLVM IR (and/or assembly output) in a gist?

Prefetching s.as_ptr().offset((s.len() >> 1) as isize) gives slightly worse performance.

Of course, because that's not actually prefetching anything in advance (the data is used almost immediately after the prefetch). In the SO answer the prefetch was performed upon entering the binary search loop on both possible outcomes. How does that fare against the current implementation?

Also, notice that since the benchmark performs a binary search for the same value on every iteration, the sequence of data accesses never changes, so the accessed elements are likely to be in cache after the first iteration: the vector has 1e7 usize elements, but a normal binary search will only touch ~24 of them (⌈log2(10^7)⌉ = 24), which easily fits even in a small L1 cache (prefetching would double this number).

@hirschenberger
Contributor

Here is a Gist with the LLVM-IR and ASM of the prefetching binsearch.

Of course, because that's not actually prefetching anything in advance (the data is used almost immediately after the prefetch). In the SO answer the prefetch was performed upon entering the binary search loop on both possible outcomes. How does that fare against the current implementation?

It performs worse: 120ns vs. 82ns.

Also, notice that since the benchmark performs a binary search for the same value on every iteration, the sequence of data accesses never changes, so the accessed elements are likely to be in cache after the first iteration: the vector has 1e7 usize elements, but a normal binary search will only touch ~24 of them (⌈log2(10^7)⌉ = 24), which easily fits even in a small L1 cache (prefetching would double this number).

I don't understand what you mean. Isn't the slice split in half at every iteration, sometimes continuing with the head and sometimes with the tail, which makes the access pattern unpredictable for the HW prefetcher and is the reason manual prefetching is beneficial?

@ranma42
Contributor

ranma42 commented Apr 26, 2017

The main loop in the gist is:

.LBB0_3:
	subq	%rdi, %rsi
	je	.LBB0_4      ; tail.is_empty()
	leaq	(%rax,%rdi,8), %rbx   ; the address being accessed
; rax points to s[0], rdi is head.len() -> rbx points to tail[0]
	cmpq	$2222222, (%rbx)      ; *** the memory access ***
	movl	$1, %ecx
	cmovaq	%r10, %rcx
	movl	$0, %edx
	cmovneq	%rcx, %rdx
	cmpq	$-1, %rdx
	je	.LBB0_8      ; Less, continue on tail
	testq	%rdx, %rdx
	je	.LBB0_11     ; Equal, exit loop
	movq	%rdi, %rsi ; s = head (implemented as s.len = head.len)
	jmp	.LBB0_9      ; Greater, continue on head
	.p2align	4, 0x90
.LBB0_8:
	leaq	1(%rdi,%r11), %r11
# tail[1..]
	addq	$8, %rbx ; add (the size of) one element to the start pointer
	decq	%rsi     ; subtract 1 to the length
	movq	%rbx, %rax ; and move it to s instead of tail
.LBB0_9:
; rax is the start pointer of s, rsi its length
	prefetcht0	(%rax)        ; *** the prefetch ***
	movq	%rsi, %rdi
	shrq	%rdi
	cmpq	%rdi, %rsi
	jae	.LBB0_3      ; LLVM was unable to remove the bound checks
	jmp	.LBB0_10     ; panic on illegal access
	.p2align	4, 0x90

(the # comments are mine).
As expected, the only memory accesses it performs are those from cmpq $2222222, (%rbx), but they are not at the address %rax. Instead, they are at (%rax,%rdi,8), i.e. at %rax + %rdi * 8.

Yes, in the general case the splitting is unpredictable and results in an unpredictable access pattern: half of the time the element at the middle of head is accessed, and half of the time the element at the middle of tail. It is not clear why prefetching the element at the beginning of the appropriate sub-slice would be beneficial (although for the last iterations it hardly matters, as they land on the same cache line).

I will try benchmarking on randomized data as soon as the prefetch intrinsic hits nightly. I hope that will show in which cases the current prefetch strategy is superior to the one in the Stack Overflow post.
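A randomized-target variant of the benchmark could look like this sketch (it assumes the rand crate with its old two-argument gen_range, plus the unstable test bench harness):

#![feature(test)]

extern crate rand;
extern crate test;

use rand::Rng;
use test::{black_box, Bencher};

#[bench]
fn bench_binsearch_random(b: &mut Bencher) {
    let v: Vec<usize> = (0..10_000_000).collect();
    let mut rng = rand::thread_rng();
    b.iter(|| {
        // A fresh target on each iteration, so the ~24 probed cache
        // lines differ between iterations instead of staying hot in L1.
        let x = rng.gen_range(0, v.len());
        black_box(v.binary_search(&x))
    });
}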

hirschenberger added a commit to hirschenberger/rust that referenced this issue Jun 1, 2017
@nagisa
Member

nagisa commented Jun 1, 2017

See also my comment here

@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 26, 2017
@khuey
Contributor

khuey commented Nov 23, 2017

This paper on branch-predictor and cache effects in binary search is worth reading: https://arxiv.org/abs/1509.05053

@steveklabnik
Member

Triage: not sure if this optimization ever landed or not, though it seems the prerequisites did.

@hirschenberger
Contributor

I think this can be closed after #45333 landed.
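For the record, assuming #45333 is the change that exposed LLVM's prefetch through the intrinsics module, calling it on nightly looks roughly like this (signature as of when it landed; it may have changed since):

#![feature(core_intrinsics)]

fn main() {
    let v: Vec<u32> = (0..64).collect();
    unsafe {
        // (data, locality): locality ranges over 0..=3, where 3 means
        // "keep in all cache levels"; this lowers to llvm.prefetch.
        std::intrinsics::prefetch_read_data(v.as_ptr(), 3);
    }
    assert_eq!(v.binary_search(&42), Ok(42));
}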

@jonas-schievink
Contributor

Closing as fixed.
