
Adding support for the llvm prefetch intrinsic #41418

Merged (1 commit) on Jun 2, 2017

Conversation

hirschenberger
Contributor

@hirschenberger hirschenberger commented Apr 20, 2017

Optimize slice::binary_search by using prefetching.
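
For illustration, here is a simplified sketch of the idea (not the actual library diff): a binary search that prefetches the midpoints of both halves that could be inspected on the next iteration. It is written against the prefetch_read_data(data, locality) form the intrinsic ends up with later in this thread and requires a nightly compiler.

#![feature(core_intrinsics)]

use std::cmp::Ordering;
use std::intrinsics::prefetch_read_data;

// Simplified binary search that prefetches the midpoints of both halves that
// could be examined on the next iteration, so the load is already in flight
// whichever way the comparison goes.
pub fn binary_search_prefetch(s: &[u32], key: &u32) -> Result<usize, usize> {
    if s.is_empty() {
        return Err(0);
    }
    let mut base = 0usize;
    let mut size = s.len();
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        unsafe {
            // Locality 3 = keep the line in all cache levels. wrapping_add keeps
            // the pointer arithmetic defined even if the hint lands past the end;
            // a prefetch is only a hint and never faults.
            prefetch_read_data(s.as_ptr().wrapping_add(base + half / 2), 3);
            prefetch_read_data(s.as_ptr().wrapping_add(mid + half / 2), 3);
        }
        base = if s[mid] <= *key { mid } else { base };
        size -= half;
    }
    match s[base].cmp(key) {
        Ordering::Equal => Ok(base),
        Ordering::Less => Err(base + 1),
        Ordering::Greater => Err(base),
    }
}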

@rust-highfive
Collaborator

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

@bluetech

Just a note: the SO answer from the issue and your perf output also show:

Notice that we are doing twice as many L1 cache loads in the prefetch version. We're actually doing a lot more work but the memory access pattern is more friendly to the pipeline. This also shows the tradeoff. While this block of code runs faster in isolation, we have loaded a lot of junk into the caches and this may put more pressure on other parts of the application.

IIUC, this means:

  • More power usage?
  • More cache misses elsewhere? (due to 2x cache evictions from the binary search)

I think you should at least split this into 2 commits: one which adds the intrinsic, and one which does the change to binary search. Then the binary search change can be considered separately.

@hirschenberger
Contributor Author

You're right, splitting it up makes sense, as adding support for the intrinsic is not controversial.
But perhaps more opinions and expertise are needed on whether micro-optimizations with prefetching may harm other codepaths, and whether we want to optimize the stdlib with them at all.

I think it may use more power, but for a shorter period of time, so it evens out.

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Apr 20, 2017
@hirschenberger
Contributor Author

I've removed the binary_search usage of the prefetch intrinsic.

@shepmaster shepmaster added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 21, 2017
}

/// Empty bootstrap implementation for stage0 compilation
#[cfg(stage0)]
pub unsafe fn prefetch<T>(_data: *const T, _rw: i32, _locality: i32, _cache: i32) {
Member

Now that the binary_search use is gone, this can be removed, I think?

Contributor Author

@hirschenberger hirschenberger Apr 25, 2017

I was thinking of adding the binary_search optimization after this patch has landed. I'm sure there are other parts of the compiler that can benefit from prefetching.

@@ -565,8 +565,17 @@ extern "rust-intrinsic" {
pub fn atomic_umax_rel<T>(dst: *mut T, src: T) -> T;
pub fn atomic_umax_acqrel<T>(dst: *mut T, src: T) -> T;
pub fn atomic_umax_relaxed<T>(dst: *mut T, src: T) -> T;

#[cfg(not(stage0))]
pub fn prefetch<T>(data: *const T, rw: i32, locality: i32, cache: i32);
Member

This is a terrible API, even for an intrinsic. What’s the rw? What’s the locality? What’s the cache? What is the meaning of the values these arguments take?

Reading LLVM’s docs, it seems like rw could at least be a bool (I suppose LLVM uses i32 simply to allow themselves room for expansion – not a problem for us, because intrinsics are forever unstable). Similarly for cache, although there it is a bit harder to draw the line.
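
(For reference, LLVM's llvm.prefetch takes (address, rw, locality, cache type): rw is 0 for read and 1 for write, locality ranges from 0 (no temporal locality) to 3 (keep in all cache levels), cache type is 0 for the instruction cache and 1 for the data cache, and the last three must be compile-time constants. A call through the intrinsic as declared in the diff above would look roughly like the following; `ptr` is a placeholder for some *const T.)

unsafe {
    // Illustrative only: prefetch `ptr` for a read (rw = 0), with maximum
    // temporal locality (3), into the data cache (1).
    prefetch(ptr, 0, 3, 1);
}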

Contributor Author

Yes, you're right. I thought intrinsic APIs were supposed to map directly to the underlying LLVM API?
I can add an enum for cache and locality and make the rw flag a bool.

Member

@nagisa nagisa Apr 26, 2017

No, we do not expect intrinsics to map directly to the underlying LLVM API at all. In fact, doing that makes little sense given our strong desire to support multiple backends (Cranelift, maybe?).

If this wasn’t prefetch, I would have suggested adding a regular wrapper with a sane API, but I think this intrinsic might want to avoid having wrapper functions (just like transmute).

@nagisa
Member

nagisa commented Apr 24, 2017

I believe our position regarding intrinsics is that we avoid adding them unless there’s an immediate use case. binary_search was one such use case, but it has since been removed from the PR, leaving this intrinsic without an immediate use case again.

I suppose that in order to get this intrinsic into the compiler, some (real-world?) benchmark numbers will be required, even if for the most trivial use-case. If the goal is to eventually expose a stable interface to the intrinsic, I think it should be a part of this PR as well.

@hirschenberger
Contributor Author

In #37251 @alexcrichton replied to my question:

@hirschenberger
Ok, I'll try to add the prefetch intrinsic, was it intentionally left out?
@alexcrichton
Nah it probably just hasn't gotten around to getting bound yet, very little is intentionally left out!

@aturon
Member

aturon commented Apr 25, 2017

cc @rust-lang/libs, this is a perf improvement backed by some pretty solid numbers. Anyone have hesitations about using these intrinsics?

@sfackler
Member

SGTM

@alexcrichton
Member

@aturon but the usage of prefetches in binary searches was backed out of this PR?

@aturon
Member

aturon commented Apr 26, 2017

Oh, whoops, was going on the PR description and linked issue...

@hirschenberger
Contributor Author

Should I add the prefetch use in binary_search again?

@aturon
Member

aturon commented Apr 27, 2017

@hirschenberger I'm personally fine bundling them. (I think the original comment was just about splitting commits, not the PR).

@hirschenberger
Contributor Author

I added the binary_search optimization as a separate commit.

@ranma42
Contributor

ranma42 commented Apr 27, 2017

Given that apparently the intrinsic will come along with its usage, I did some additional benchmarking as per #37251 (comment).

The benchmark I used is based on the original one from #37251 (comment), but it tests some different combinations:

  • noprefetch does no prefetching at all
  • prefetchhead prefetches s.as_ptr() (as in the original benchmark)
  • prefetchmid prefetches s.as_ptr().wrapping_offset(s.len() as isize >> 1) (i.e. the pointer which will be &tail[0] during the next iteration)

It also tests two different search patterns:

  • fixed always searches for 22222 (as in the original benchmark)
  • rand uses a naive LCG to randomize the key being searched

My results are:

running 6 tests
test bench_binsearch_fixed_noprefetch   ... bench:          53 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchhead ... bench:          54 ns/iter (+/- 3)
test bench_binsearch_fixed_prefetchmid  ... bench:          46 ns/iter (+/- 0)
test bench_binsearch_rand_noprefetch    ... bench:          56 ns/iter (+/- 4)
test bench_binsearch_rand_prefetchhead  ... bench:          62 ns/iter (+/- 0)
test bench_binsearch_rand_prefetchmid   ... bench:          61 ns/iter (+/- 4)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

running 6 tests
test bench_binsearch_fixed_noprefetch   ... bench:          53 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchhead ... bench:          46 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchmid  ... bench:          46 ns/iter (+/- 0)
test bench_binsearch_rand_noprefetch    ... bench:          62 ns/iter (+/- 0)
test bench_binsearch_rand_prefetchhead  ... bench:          63 ns/iter (+/- 13)
test bench_binsearch_rand_prefetchmid   ... bench:          61 ns/iter (+/- 4)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

running 6 tests
test bench_binsearch_fixed_noprefetch   ... bench:          53 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchhead ... bench:          54 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchmid  ... bench:          54 ns/iter (+/- 6)
test bench_binsearch_rand_noprefetch    ... bench:          59 ns/iter (+/- 3)
test bench_binsearch_rand_prefetchhead  ... bench:          60 ns/iter (+/- 26)
test bench_binsearch_rand_prefetchmid   ... bench:          61 ns/iter (+/- 3)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

I think some more investigation is needed before choosing the prefetching policy.
A more reliable benchmark would be needed to evaluate the impact of the change, because this one seems to be affected by external factors (I believe the instability might be caused by the cache alignment of the vector).
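
For readers without access to the linked gist, a minimal sketch of the fixed/rand benchmark structure described above might look like the following (the haystack size, LCG constants, and key range are assumptions, not the original benchmark's values; #[bench] requires a nightly compiler):

#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

const LEN: usize = 1 << 20; // assumed haystack size

fn haystack() -> Vec<u32> {
    (0..LEN as u32).map(|i| i * 2).collect() // even values only, so odd keys miss
}

#[bench]
fn bench_binsearch_fixed_noprefetch(b: &mut Bencher) {
    let v = haystack();
    b.iter(|| {
        let key: u32 = black_box(22222);
        black_box(v.binary_search(&key))
    });
}

#[bench]
fn bench_binsearch_rand_noprefetch(b: &mut Bencher) {
    let v = haystack();
    // Naive LCG so every iteration searches for a different key and the
    // search path is not trivially predictable.
    let mut state: u32 = 12345;
    b.iter(|| {
        state = state.wrapping_mul(1664525).wrapping_add(1013904223);
        let key = black_box(state % (2 * LEN as u32));
        black_box(v.binary_search(&key))
    });
}

// The prefetchhead/prefetchmid variants are identical except that they call a
// binary search routine which issues the corresponding prefetch on each iteration.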

@hirschenberger
Contributor Author

hirschenberger commented Apr 27, 2017

I tested your benchmark on my machine and got the following results:

running 6 tests
test bench_binsearch_fixed_noprefetch   ... bench:          87 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchhead ... bench:          81 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchmid  ... bench:          83 ns/iter (+/- 0)
test bench_binsearch_rand_noprefetch    ... bench:          92 ns/iter (+/- 0)
test bench_binsearch_rand_prefetchhead  ... bench:          93 ns/iter (+/- 3)
test bench_binsearch_rand_prefetchmid   ... bench:          98 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

running 6 tests
test bench_binsearch_fixed_noprefetch   ... bench:          86 ns/iter (+/- 8)
test bench_binsearch_fixed_prefetchhead ... bench:          82 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchmid  ... bench:          83 ns/iter (+/- 0)
test bench_binsearch_rand_noprefetch    ... bench:          96 ns/iter (+/- 14)
test bench_binsearch_rand_prefetchhead  ... bench:          96 ns/iter (+/- 9)
test bench_binsearch_rand_prefetchmid   ... bench:          98 ns/iter (+/- 9)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

running 6 tests
test bench_binsearch_fixed_noprefetch   ... bench:          87 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchhead ... bench:          81 ns/iter (+/- 0)
test bench_binsearch_fixed_prefetchmid  ... bench:          83 ns/iter (+/- 0)
test bench_binsearch_rand_noprefetch    ... bench:          96 ns/iter (+/- 0)
test bench_binsearch_rand_prefetchhead  ... bench:          93 ns/iter (+/- 0)
test bench_binsearch_rand_prefetchmid   ... bench:          95 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

But when I remove the branching for selecting the prefetch strategy, performance drops significantly for the non-prefetching code:

test bench_binsearch_fixed_prefetchhead ... bench:          81 ns/iter (+/- 0)
test bench_binsearch_rand_prefetchhead  ... bench:          93 ns/iter (+/- 0)

test bench_binsearch_fixed_prefetchmid  ... bench:          84 ns/iter (+/- 1)
test bench_binsearch_rand_prefetchmid   ... bench:          95 ns/iter (+/- 3)

test bench_binsearch_fixed_noprefetch   ... bench:         105 ns/iter (+/- 2)
test bench_binsearch_rand_noprefetch    ... bench:         101 ns/iter (+/- 0)

Disclaimer: my CPU is not of the newest generation (Intel(R) Xeon(R) CPU E5420 @ 2.50GHz).
I have also had strange performance results on other benchmarks; I'll run your bench on a more recent CPU asap.

@ranma42
Contributor

ranma42 commented Apr 27, 2017

The branching for the selection of prefetching strategy should not affect the benchmarks, as it is completely inlined by LLVM (you can easily check it with --emit=asm or by disassembling the output binary).

@hirschenberger
Contributor Author

Ok, on my Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz notebook there's really no clear performance pattern recognizable. I should avoid doing benchmarks on legacy hardware.

My proposal: I change the PR back to skip the binary_search prefetching, and we land just the prefetch intrinsic support. I'm sure people will find optimizations for it afterwards.

@ranma42
Contributor

ranma42 commented Apr 27, 2017

I think that might be a good idea. Unfortunately, for binary_search there are several factors that need to be taken into account in order to make the prefetching effective, such as the memory behaviour of the comparison function and the optimal prefetch distance.
It is probably easier to get "stable" performance improvements by doing prefetching in code that is not parametric over the type or the internal operations being performed.
For example, I would expect the prefetch intrinsic might improve the performance of functions that manage big triangular matrices (on Intel, hardware prefetching might be effective for rectangular matrices, but it is probably unable to handle triangular ones).
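
As a sketch of the access pattern described above: a lower-triangular matrix stored in packed row-major form has a row stride that grows by one element per row, which hardware prefetchers generally cannot predict, so a software prefetch of the next row's start can help. This again assumes the prefetch_read_data(data, locality) form from later in this thread and a nightly compiler.

#![feature(core_intrinsics)]

use std::intrinsics::prefetch_read_data;

// Lower-triangular matrix stored packed in row-major order:
// row i starts at offset i*(i+1)/2 and holds i+1 elements.
pub fn sum_triangular(packed: &[f64], n: usize) -> f64 {
    let mut total = 0.0;
    for i in 0..n {
        let row_start = i * (i + 1) / 2;
        // Hint the start of the *next* row into cache while summing this one.
        let next_start = (i + 1) * (i + 2) / 2;
        unsafe {
            prefetch_read_data(packed.as_ptr().wrapping_add(next_start), 3);
        }
        for j in 0..=i {
            total += packed[row_start + j];
        }
    }
    total
}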

@aturon
Member

aturon commented Apr 27, 2017

@hirschenberger Sure, that seems like the easiest way to make quick progress here.

@hirschenberger
Contributor Author

Ok, I removed the binary_search part, ready to go...

@aturon
Member

aturon commented Apr 29, 2017

OK, I'm taking this out of the libs team purview, but it still needs a compiler team member to sign off, and @nagisa still has outstanding issues on the API itself.

r? @nagisa

@rust-highfive rust-highfive assigned nagisa and unassigned aturon Apr 29, 2017
@carols10cents
Member

Quoting the questions @hirschenberger raised earlier:

I'm not obsessive about getting this PR landed. There are pros and cons to supporting the intrinsic. But we can do it like llvm: support software prefetching and assume that people using it know its implications and the warts of the different platforms.
Or we leave it out and miss some optimization possibilities.
What I would not do is wrap the function in a platform-dependent op or no-op depending on the effectiveness of the platform's instructions. That would be a maintenance monster.

I also don't know if the intrinsic is handled in a platform-dependent way in llvm, and they maintain this already?

@nagisa do you have thoughts on the questions @hirschenberger raised? Sounds like they need some direction here.

@carols10cents carols10cents added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels May 15, 2017
@nagisa
Member

nagisa commented May 15, 2017

I also don't know if the intrinsic is handled in a platform-dependent way in llvm

Yes. It might emit something other than an actual prefetch instruction on some architectures and/or targets (if a target does not support such an instruction, for example, it might simply decide to load the pointer into a register earlier). That’s a decision for the backend.

and they maintain this already?

Maintain what? A platform-specific list of special cases based on efficiency? Then the answer is no, I think. But again, this is something the backend could do if it had to.


Either way I do not see these answers affecting the implementation of this intrinsic much. What needs to happen before this can land is a signature improvement.

I would be fine taking this with a signature looking maybe like this:

fn prefetch_data<T>(data: *const T, write: bool, locality: i32);
fn prefetch_instruction<T>(data: *const T, write: bool, locality: i32); // does T even make sense here?

I think that’s the best we can do without going for an unnecessarily complex implementation.

The intrinsic should also have at least a blurb of documentation that mentions:

The write and locality arguments must be constant.

because that is mandated by LLVM. Misusing the intrinsic will likely cause an ICE, but I have no idea how to easily check for this precondition.

@alexcrichton
Member

@hirschenberger do you have thoughts on the API proposed by @nagisa?

@alexcrichton alexcrichton added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 18, 2017
@hirschenberger
Contributor Author

hirschenberger commented May 19, 2017

I find the true -> write, false -> read mapping a little clunky and C-ish. But if we don't want to introduce an RW-enum, it's ok.

We could also expand this further and add 4 functions:
prefetch_read_data,
prefetch_write_data,
prefetch_read_instruction,
prefetch_write_instruction
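
Combined with the locality argument from the earlier proposal, the declarations would presumably end up along these lines (a sketch, not the exact diff):

extern "rust-intrinsic" {
    /// `data` is the address to prefetch; `locality` ranges from 0 (no
    /// temporal locality) to 3 (keep in all cache levels) and must be a
    /// constant.
    pub fn prefetch_read_data<T>(data: *const T, locality: i32);
    pub fn prefetch_write_data<T>(data: *const T, locality: i32);
    pub fn prefetch_read_instruction<T>(data: *const T, locality: i32);
    pub fn prefetch_write_instruction<T>(data: *const T, locality: i32);
}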

@nagisa
Member

nagisa commented May 19, 2017

wfm.

@alexcrichton
Member

ping @hirschenberger just wanted to keep this on your radar! Did you want to apply the suggested updates?

@hirschenberger
Contributor Author

Yes, I will update the PR soon. Sorry for the delay.

@hirschenberger
Contributor Author

Ok, I split up the function and added codegen-tests and some comments.
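
For readers unfamiliar with rustc's codegen tests: they are FileCheck-based and live under src/test/codegen. A hypothetical test for one of the new intrinsics might look roughly like the following; the function name, CHECK patterns, and exact IR shown are illustrative rather than the PR's actual test.

// compile-flags: -C no-prepopulate-passes

#![crate_type = "lib"]
#![feature(core_intrinsics)]

use std::intrinsics::prefetch_read_data;

// CHECK-LABEL: @check_prefetch_read_data
#[no_mangle]
pub fn check_prefetch_read_data(data: &[i8]) {
    unsafe {
        // CHECK: call void @llvm.prefetch{{.*}}({{.*}}, i32 0, i32 3, i32 1)
        prefetch_read_data(data.as_ptr(), 3);
    }
}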

@hirschenberger
Contributor Author

@alexcrichton BTW, is it possible to check during typeck whether the locality parameter is a literal constant (is that the correct term)?

@alexcrichton
Member

Ah I'm not sure personally but others may know!

@nagisa
Member

nagisa commented Jun 1, 2017

@bors r+

@bors
Contributor

bors commented Jun 1, 2017

📌 Commit f83901b has been approved by nagisa

@nagisa
Member

nagisa commented Jun 1, 2017

@hirschenberger please remove “Fixes #issue” from the description as this only adds the intrinsic.

Forgot I can do it myself.

@bors
Contributor

bors commented Jun 1, 2017

⌛ Testing commit f83901b with merge 3b0b9d1...

@bors bors mentioned this pull request Jun 1, 2017
@bors
Contributor

bors commented Jun 1, 2017

💔 Test failed - status-travis

@Mark-Simulacrum
Member

@bors retry

  • spurious timeout failure

@bors
Contributor

bors commented Jun 2, 2017

⌛ Testing commit f83901b with merge 668e698...

bors added a commit that referenced this pull request Jun 2, 2017
Adding support for the llvm `prefetch` intrinsic

Optimize `slice::binary_search` by using prefetching.
@bors
Contributor

bors commented Jun 2, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: nagisa
Pushing 668e698 to master...

@bors bors merged commit f83901b into rust-lang:master Jun 2, 2017