
Adding support for the llvm prefetch intrinsic #41418

Merged (1 commit) on Jun 2, 2017

Conversation

hirschenberger
Contributor

@hirschenberger hirschenberger commented Apr 20, 2017

Optimize slice::binary_search by using prefetching.
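
For illustration, here is a simplified sketch of the idea (not the actual library diff): a binary search that prefetches the midpoints of both halves that could be inspected on the next iteration. It is written against the prefetch_read_data(data, locality) form the intrinsic ends up with later in this thread and requires a nightly compiler.

#![feature(core_intrinsics)]

use std::cmp::Ordering;
use std::intrinsics::prefetch_read_data;

// Simplified binary search that prefetches the midpoints of both halves that
// could be examined on the next iteration, so the load is already in flight
// whichever way the comparison goes.
pub fn binary_search_prefetch(s: &[u32], key: &u32) -> Result<usize, usize> {
    if s.is_empty() {
        return Err(0);
    }
    let mut base = 0usize;
    let mut size = s.len();
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        unsafe {
            // Locality 3 = keep the line in all cache levels. wrapping_add keeps
            // the pointer arithmetic defined even if the hint lands past the end;
            // a prefetch is only a hint and never faults.
            prefetch_read_data(s.as_ptr().wrapping_add(base + half / 2), 3);
            prefetch_read_data(s.as_ptr().wrapping_add(mid + half / 2), 3);
        }
        base = if s[mid] <= *key { mid } else { base };
        size -= half;
    }
    match s[base].cmp(key) {
        Ordering::Equal => Ok(base),
        Ordering::Less => Err(base + 1),
        Ordering::Greater => Err(base),
    }
}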

@rust-highfive
Collaborator

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

@bluetech

Just a note: the SO answer from the issue and your perf output also show:

Notice that we are doing twice as many L1 cache loads in the prefetch version. We're actually doing a lot more work but the memory access pattern is more friendly to the pipeline. This also shows the tradeoff. While this block of code runs faster in isolation, we have loaded a lot of junk into the caches and this may put more pressure on other parts of the application.

IIUC, this means:

  • More power usage?
  • More cache misses elsewhere? (due to 2x cache evictions from the binary search)

I think you should at least split this into 2 commits: one which adds the intrinsic, and one which does the change to binary search. Then the binary search change can be considered separately.

@hirschenberger
Contributor Author

You're right, splitting it up makes sense, as adding support for the intrinsic is not controversial.
But perhaps more opinions and expertise are needed on whether micro-optimizations with prefetching may harm other codepaths, and whether we want to optimize the stdlib with them at all.

I think it may use more power, but for a shorter period of time, so it evens out.

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Apr 20, 2017
@hirschenberger
Contributor Author

I've removed the binary_search usage of the prefetch intrinsic.

@shepmaster shepmaster added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 21, 2017
}

/// Empty bootstrap implementation for stage0 compilation
#[cfg(stage0)]
pub unsafe fn prefetch<T>(_data: *const T, _rw: i32, _locality: i32, _cache: i32) {
Member

Now that the binary_search use is gone, this can be removed, I think?

Contributor Author

@hirschenberger hirschenberger Apr 25, 2017

I was thinking of adding the binary_search optimization after this patch has landed. I'm sure there are other parts of the compiler that can benefit from prefetching.

@@ -565,8 +565,17 @@ extern "rust-intrinsic" {
pub fn atomic_umax_rel<T>(dst: *mut T, src: T) -> T;
pub fn atomic_umax_acqrel<T>(dst: *mut T, src: T) -> T;
pub fn atomic_umax_relaxed<T>(dst: *mut T, src: T) -> T;

#[cfg(not(stage0))]
pub fn prefetch<T>(data: *const T, rw: i32, locality: i32, cache: i32);
Member

This is a terrible API, even for an intrinsic. What’s the rw? What’s the locality? What’s the cache? What is the meaning of the values these arguments take?

Reading LLVM’s docs, it seems like rw could at least be a bool (I suppose LLVM uses i32 simply to allow themselves room for expansion – not a problem for us, because intrinsics are forever unstable). Similarly for cache, although there it is a bit harder to draw the line.
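
(For reference, LLVM's llvm.prefetch takes (address, rw, locality, cache type): rw is 0 for read and 1 for write, locality ranges from 0 (no temporal locality) to 3 (keep in all cache levels), cache type is 0 for the instruction cache and 1 for the data cache, and the last three must be compile-time constants. A call through the intrinsic as declared in the diff above would look roughly like the following; `ptr` is a placeholder for some *const T.)

unsafe {
    // Illustrative only: prefetch `ptr` for a read (rw = 0), with maximum
    // temporal locality (3), into the data cache (1).
    prefetch(ptr, 0, 3, 1);
}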

Contributor Author

Yes, you're right. I thought intrinsic APIs were supposed to map directly to the underlying LLVM API?
I can add an enum for cache and locality and make the rw flag a bool.

Member

@nagisa nagisa Apr 26, 2017

No, we do not expect intrinsics to map directly to the underlying LLVM API at all. In fact, doing that makes little sense given our strong desire to support multiple backends (Cranelift, maybe?).

If this wasn’t prefetch, I would have suggested adding a regular wrapper with a sane API, but I think this intrinsic might want to avoid having wrapper functions (just like transmute).

@nagisa
Member

nagisa commented Apr 24, 2017

I believe our position regarding intrinsics is that we avoid adding them unless there’s an immediate use case. binary_search was one such use case, but it has since been removed from the PR, leaving this intrinsic without an immediate use case again.

I suppose that in order to get this intrinsic into the compiler, some (real-world?) benchmark numbers will be required, even if for the most trivial use-case. If the goal is to eventually expose a stable interface to the intrinsic, I think it should be a part of this PR as well.

@hirschenberger
Contributor Author

In #37251 @alexcrichton replied to my question:

@hirschenberger
Ok, I'll try to add the prefetch intrinsic, was it intentionally left out?
@alexcrichton
Nah it probably just hasn't gotten around to getting bound yet, very little is intentionally left out!

@aturon
Member

aturon commented Apr 25, 2017

cc @rust-lang/libs, this is a perf improvement backed by some pretty solid numbers. Anyone have hesitations about using these intrinsics?

@sfackler
Member

SGTM

@alexcrichton
Member

@aturon but the usage of prefetches in binary searches was backed out of this PR?

@aturon
Member

aturon commented Apr 26, 2017

Oh, whoops, was going on the PR description and linked issue...

@hirschenberger
Contributor Author

Should I add the prefetch use in binary_search again?

@aturon
Member

aturon commented Apr 27, 2017

@hirschenberger I'm personally fine bundling them. (I think the original comment was just about splitting commits, not the PR).

@hirschenberger
Contributor Author

I added the binary_search optimization as a separate commit.

@ranma42
Contributor

ranma42 commented Apr 27, 2017

Given that apparently the intrinsic will come along with its usage, I did some additional benchmarking as per #37251 (comment).

The benchmark I used is based on the original one from #37251 (comment), but it tests some different combinations:

  • noprefetch does no prefetching at all
  • prefetchhead prefetches s.as_ptr() (as in the original benchmark)
  • prefetchmid prefetches s.as_ptr().wrapping_offset(s.len() as isize >> 1) (i.e. the pointer which will be &tail[0] during the next iteration)

It also tests two different search patterns:

  • fixed always searches for 22222 (as in the original benchmark)
  • rand uses a naive LCG to randomize the key being searched

My results are:

running 6 tests
test bench_binsearch_fixed_noprefetch   ... bench:          53 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchhead ... bench:          54 ns/iter (+/- 3)
test bench_binsearch_fixed_prefetchmid  ... bench:          46 ns/iter (+/- 0)
test bench_binsearch_rand_noprefetch    ... bench:          56 ns/iter (+/- 4)
test bench_binsearch_rand_prefetchhead  ... bench:          62 ns/iter (+/- 0)
test bench_binsearch_rand_prefetchmid   ... bench:          61 ns/iter (+/- 4)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

running 6 tests
test bench_binsearch_fixed_noprefetch   ... bench:          53 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchhead ... bench:          46 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchmid  ... bench:          46 ns/iter (+/- 0)
test bench_binsearch_rand_noprefetch    ... bench:          62 ns/iter (+/- 0)
test bench_binsearch_rand_prefetchhead  ... bench:          63 ns/iter (+/- 13)
test bench_binsearch_rand_prefetchmid   ... bench:          61 ns/iter (+/- 4)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

running 6 tests
test bench_binsearch_fixed_noprefetch   ... bench:          53 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchhead ... bench:          54 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchmid  ... bench:          54 ns/iter (+/- 6)
test bench_binsearch_rand_noprefetch    ... bench:          59 ns/iter (+/- 3)
test bench_binsearch_rand_prefetchhead  ... bench:          60 ns/iter (+/- 26)
test bench_binsearch_rand_prefetchmid   ... bench:          61 ns/iter (+/- 3)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

I think some more investigation is needed before choosing the prefetching policy.
A more reliable benchmark would be needed to evaluate the impact of the change, because this one seems to be affected by external factors (I believe the instability might be caused by the cache alignment of the vector).
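
For readers without access to the linked gist, a minimal sketch of the fixed/rand benchmark structure described above might look like the following (the haystack size, LCG constants, and key range are assumptions, not the original benchmark's values; #[bench] requires a nightly compiler):

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

const LEN: usize = 1 << 20; // assumed haystack size

fn haystack() -> Vec<u32> {
    (0..LEN as u32).map(|i| i * 2).collect() // even values only, so odd keys miss
}

#[bench]
fn bench_binsearch_fixed_noprefetch(b: &mut Bencher) {
    let v = haystack();
    b.iter(|| {
        let key: u32 = black_box(22222);
        black_box(v.binary_search(&key))
    });
}

#[bench]
fn bench_binsearch_rand_noprefetch(b: &mut Bencher) {
    let v = haystack();
    // Naive LCG so every iteration searches for a different key and the
    // search path is not trivially predictable.
    let mut state: u32 = 12345;
    b.iter(|| {
        state = state.wrapping_mul(1664525).wrapping_add(1013904223);
        let key = black_box(state % (2 * LEN as u32));
        black_box(v.binary_search(&key))
    });
}

// The prefetchhead/prefetchmid variants are identical except that they call a
// binary search routine which issues the corresponding prefetch on each iteration.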

@hirschenberger
Contributor Author

hirschenberger commented Apr 27, 2017

I tested your benchmark on my machine and got the following results:

running 6 tests
test bench_binsearch_fixed_noprefetch   ... bench:          87 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchhead ... bench:          81 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchmid  ... bench:          83 ns/iter (+/- 0)
test bench_binsearch_rand_noprefetch    ... bench:          92 ns/iter (+/- 0)
test bench_binsearch_rand_prefetchhead  ... bench:          93 ns/iter (+/- 3)
test bench_binsearch_rand_prefetchmid   ... bench:          98 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

running 6 tests
test bench_binsearch_fixed_noprefetch   ... bench:          86 ns/iter (+/- 8)
test bench_binsearch_fixed_prefetchhead ... bench:          82 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchmid  ... bench:          83 ns/iter (+/- 0)
test bench_binsearch_rand_noprefetch    ... bench:          96 ns/iter (+/- 14)
test bench_binsearch_rand_prefetchhead  ... bench:          96 ns/iter (+/- 9)
test bench_binsearch_rand_prefetchmid   ... bench:          98 ns/iter (+/- 9)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

running 6 tests
test bench_binsearch_fixed_noprefetch   ... bench:          87 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchhead ... bench:          81 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchmid  ... bench:          83 ns/iter (+/- 0)
test bench_binsearch_rand_noprefetch    ... bench:          96 ns/iter (+/- 0)
test bench_binsearch_rand_prefetchhead  ... bench:          93 ns/iter (+/- 0)
test bench_binsearch_rand_prefetchmid   ... bench:          95 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

But when I remove the branching for selecting the prefetch strategy, performance drops significantly for the non-prefetching code:

test bench_binsearch_fixed_prefetchhead ... bench:          81 ns/iter (+/- 0)
test bench_binsearch_rand_prefetchhead  ... bench:          93 ns/iter (+/- 0)

test bench_binsearch_fixed_prefetchmid  ... bench:          84 ns/iter (+/- 1)
test bench_binsearch_rand_prefetchmid   ... bench:          95 ns/iter (+/- 3)

test bench_binsearch_fixed_noprefetch   ... bench:         105 ns/iter (+/- 2)
test bench_binsearch_rand_noprefetch    ... bench:         101 ns/iter (+/- 0)

Disclaimer: my CPU is not of the newest generation (Intel(R) Xeon(R) CPU E5420 @ 2.50GHz).
I have also had strange performance results on other benchmarks; I'll run your bench on a more recent CPU asap.

@ranma42
Contributor

ranma42 commented Apr 27, 2017

The branching for the selection of prefetching strategy should not affect the benchmarks, as it is completely inlined by LLVM (you can easily check it with --emit=asm or by disassembling the output binary).

@hirschenberger
Contributor Author

Ok, on my Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz notebook there's really no clear performance pattern recognizable. I should avoid doing benchmarks on legacy hardware.

My proposal: I change the PR back to skip the binary_search prefetching, and we land just the prefetch intrinsic support. I'm sure people will find optimizations for it afterwards.

@ranma42
Contributor

ranma42 commented Apr 27, 2017

I think that might be a good idea. Unfortunately, for binary_search there are several factors that need to be taken into account in order to make the prefetching effective, such as the memory behaviour of the comparison function and the optimal prefetch distance.
It is probably easier to get "stable" performance improvements by doing prefetching in code that is not parametric over the type or the internal operations being performed.
For example, I would expect the prefetch intrinsic might improve the performance of functions that manage big triangular matrices (on Intel, hardware prefetching might be effective for rectangular matrices, but it is probably unable to handle triangular ones).
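
As a sketch of the access pattern described above: a lower-triangular matrix stored in packed row-major form has a row stride that grows by one element per row, which hardware prefetchers generally cannot predict, so a software prefetch of the next row's start can help. This again assumes the prefetch_read_data(data, locality) form from later in this thread and a nightly compiler.

#![feature(core_intrinsics)]

use std::intrinsics::prefetch_read_data;

// Lower-triangular matrix stored packed in row-major order:
// row i starts at offset i*(i+1)/2 and holds i+1 elements.
pub fn sum_triangular(packed: &[f64], n: usize) -> f64 {
    let mut total = 0.0;
    for i in 0..n {
        let row_start = i * (i + 1) / 2;
        // Hint the start of the *next* row into cache while summing this one.
        let next_start = (i + 1) * (i + 2) / 2;
        unsafe {
            prefetch_read_data(packed.as_ptr().wrapping_add(next_start), 3);
        }
        for j in 0..=i {
            total += packed[row_start + j];
        }
    }
    total
}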

@aturon
Member

aturon commented Apr 27, 2017

@hirschenberger Sure, that seems like the easiest way to make quick progress here.

@hirschenberger
Contributor Author

Ok, I removed the binary_search part, ready to go...

@aturon
Member

aturon commented Apr 29, 2017

OK, I'm taking this out of the libs team purview, but it still needs a compiler team member to sign off, and @nagisa still has outstanding issues on the API itself.

r? @nagisa

@rust-highfive rust-highfive assigned nagisa and unassigned aturon Apr 29, 2017
@carols10cents
Member

Quoting the questions @hirschenberger raised earlier:

I'm not obsessive about getting this PR landed. There are pros and cons to supporting the intrinsic. But we can do it like llvm: support software prefetching and assume that people using it know its implications and the warts of the different platforms.
Or we leave it out and miss some optimization possibilities.
What I would not do is wrap the function in a platform-dependent op or no-op depending on the effectiveness of the platform's instructions. That would be a maintenance monster.

I also don't know if the intrinsic is handled in a platform-dependent way in llvm, and they maintain this already?

@nagisa do you have thoughts on the questions @hirschenberger raised? Sounds like they need some direction here.

@carols10cents carols10cents added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels May 15, 2017
@nagisa
Member

nagisa commented May 15, 2017

I also don't know if the intrinsic is handled in a platform-dependent way in llvm

Yes. It might emit something other than an actual prefetch instruction on some architectures and/or targets (if a target does not support such an instruction, for example, it might simply decide to load the pointer into a register earlier). That’s a decision for the backend.

and they maintain this already?

Maintain what? A platform-specific list of special cases based on efficiency? Then the answer is no, I think. But again, this is something the backend could do if it had to.


Either way I do not see these answers affecting the implementation of this intrinsic much. What needs to happen before this can land is a signature improvement.

I would be fine taking this with a signature looking maybe like this:

fn prefetch_data<T>(data: *const T, write: bool, locality: i32);
fn prefetch_instruction<T>(data: *const T, write: bool, locality: i32); // does T even make sense here?

I think that’s the best we can do without going for an unnecessarily complex implementation.

The intrinsic should also have at least a blurb of documentation that mentions:

The write and locality arguments must be constant.

because that is mandated by LLVM. Misusing the intrinsic will likely cause an ICE, but I have no idea how to easily check for this precondition.

@alexcrichton
Member

@hirschenberger do you have thoughts on the API proposed by @nagisa?

@alexcrichton alexcrichton added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 18, 2017
@hirschenberger
Contributor Author

hirschenberger commented May 19, 2017

I find the true -> write, false -> read mapping a little clunky and C-ish. But if we don't want to introduce an RW-enum, it's ok.

We could also expand this further and add 4 functions:
prefetch_read_data,
prefetch_write_data,
prefetch_read_instruction,
prefetch_write_instruction
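
Combined with the locality argument from the earlier proposal, the declarations would presumably end up along these lines (a sketch, not the exact diff):

extern "rust-intrinsic" {
    /// `data` is the address to prefetch; `locality` ranges from 0 (no
    /// temporal locality) to 3 (keep in all cache levels) and must be a
    /// constant.
    pub fn prefetch_read_data<T>(data: *const T, locality: i32);
    pub fn prefetch_write_data<T>(data: *const T, locality: i32);
    pub fn prefetch_read_instruction<T>(data: *const T, locality: i32);
    pub fn prefetch_write_instruction<T>(data: *const T, locality: i32);
}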

@nagisa
Member

nagisa commented May 19, 2017

wfm.

@alexcrichton
Member

ping @hirschenberger just wanted to keep this on your radar! Did you want to apply the suggested updates?

@hirschenberger
Contributor Author

Yes, I will update the PR soon. Sorry for the delay.

@hirschenberger
Contributor Author

Ok, I split up the function and added codegen-tests and some comments.
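
For readers unfamiliar with rustc's codegen tests: they are FileCheck-based and live under src/test/codegen. A hypothetical test for one of the new intrinsics might look roughly like the following; the function name, CHECK patterns, and exact IR shown are illustrative rather than the PR's actual test.

// compile-flags: -C no-prepopulate-passes

#![crate_type = "lib"]
#![feature(core_intrinsics)]

use std::intrinsics::prefetch_read_data;

// CHECK-LABEL: @check_prefetch_read_data
#[no_mangle]
pub fn check_prefetch_read_data(data: &[i8]) {
    unsafe {
        // CHECK: call void @llvm.prefetch{{.*}}({{.*}}, i32 0, i32 3, i32 1)
        prefetch_read_data(data.as_ptr(), 3);
    }
}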

@hirschenberger
Contributor Author

@alexcrichton BTW, is it possible to check during typeck whether the locality parameter is a literal constant (is that the correct term)?

@alexcrichton
Member

Ah I'm not sure personally but others may know!

@nagisa
Member

nagisa commented Jun 1, 2017

@bors r+

@bors
Contributor

bors commented Jun 1, 2017

📌 Commit f83901b has been approved by nagisa

@nagisa
Member

nagisa commented Jun 1, 2017

@hirschenberger please remove “Fixes #issue” from the description as this only adds the intrinsic.

Forgot I can do it myself.

@bors
Contributor

bors commented Jun 1, 2017

⌛ Testing commit f83901b with merge 3b0b9d1...

@bors bors mentioned this pull request Jun 1, 2017
@bors
Contributor

bors commented Jun 1, 2017

💔 Test failed - status-travis

@Mark-Simulacrum
Member

@bors retry

  • spurious timeout failure

@bors
Contributor

bors commented Jun 2, 2017

⌛ Testing commit f83901b with merge 668e698...

bors added a commit that referenced this pull request Jun 2, 2017
Adding support for the llvm `prefetch` intrinsic

Optimize `slice::binary_search` by using prefetching.
@bors
Contributor

bors commented Jun 2, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: nagisa
Pushing 668e698 to master...

@bors bors merged commit f83901b into rust-lang:master Jun 2, 2017