-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding support for the llvm prefetch
intrinsic
#41418
Conversation
r? @aturon (rust_highfive has picked a reviewer for you, use r? to override) |
Just a note: the SO answer from the issue and your perf output also show:
IIUC, this means:
I think you should at least split this into 2 commits: one which adds the intrinsic, and one which does the change to binary search. Then the binary search change can be considered separately. |
You're right, splitting it up makes sense as adding support for the intrinsic is not controversial. I think it may use more power but for a shorter period of time, so it equals out. |
7a32eae
to
27a2ca7
Compare
I've removed the |
src/libcore/intrinsics.rs
Outdated
} | ||
|
||
/// Empty bootstrap implementation for stage0 compilation | ||
#[cfg(stage0)] | ||
pub unsafe fn prefetch<T>(_data: *const T, _rw: i32, _locality: i32, _cache: i32) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that there’s no use for binary_search anymore, this can be removed I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought of adding the binary_search optimization after this patch landed. I'm sure there are other parts of the compiler which can benefit from prefetching.
src/libcore/intrinsics.rs
Outdated
@@ -565,8 +565,17 @@ extern "rust-intrinsic" { | |||
pub fn atomic_umax_rel<T>(dst: *mut T, src: T) -> T; | |||
pub fn atomic_umax_acqrel<T>(dst: *mut T, src: T) -> T; | |||
pub fn atomic_umax_relaxed<T>(dst: *mut T, src: T) -> T; | |||
|
|||
#[cfg(not(stage0))] | |||
pub fn prefetch<T>(data: *const T, rw: i32, locality: i32, cache: i32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a terrible API, even for an intrinsic. What’s the rw? What’s the locality? What’s the cache? What are the meaning of values these arguments take?
Reading the LLVM’s docs it seems like rw
could at least be a bool (I suppose LLVM uses i32 simply to allow themselves more expansion – not a problem for us because intriniscs are forever unstable). Similarly for cache
, although it is a bit harder to draw the line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right. I thought intrinsics APIs should directly map to the underlying llvm API?
I can add an Enum for cache and locality and make the rw-flag a bool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we do not expect intrinsics to map directly to the underlying LLVM API at all. In fact, doing that makes little sense given our strong desire to support multiple backends (Cranelift, maybe?).
If this wasn’t prefetch, I would have suggested adding a regular wrapper with a sane API, but I think this intrinsic might want to avoid having wrapper functions (just like transmute).
I believe our position regarding intrinsics is that we avoid adding these unless there’s an immediate use-case for them. I suppose that in order to get this intrinsic into the compiler, some (real-world?) benchmark numbers will be required, even if for the most trivial use-case. If the goal is to eventually expose a stable interface to the intrinsic, I think it should be a part of this PR as well. |
In #37251 @alexcrichton replied to my question:
|
cc @rust-lang/libs, this is a perf improvement backed by some pretty solid numbers. Anyone have hesitations about using these intrinsics? |
SGTM |
@aturon but the usage of prefetches in binary searches was backed out of this PR? |
Oh, whoops, was going on the PR description and linked issue... |
Should I add the prefetch use in binary_search again? |
@hirschenberger I'm personally fine bundling them. (I think the original comment was just about splitting commits, not the PR). |
I added the binary_search optimization as separate commit. |
Given that apparently the intrinsic will come along with its usage, I did some additional benchmarking as per #37251 (comment). The benchmark I used is based on the original one from #37251 (comment), but it tests some different combinations:
It also tests two different search patterns:
My results are:
I think some more investigation is needed before choosing the prefetching policy. |
I tested your benchmark on my machine and got the following results:
But when I remove the branching for the prefetch pattern, results drop significantly for the non-prefetching code.
Disclaimer: My CPU is not the newest generation |
The branching for the selection of prefetching strategy should not affect the benchmarks, as it is completely inlined by LLVM (you can easily check it with |
Ok, on my My proposal: I change the PR back to skip the binary_search prefetching and we land the prefetching intrinsic support? I'm sure people will find optimizations afterwards? |
7d8edc1
to
27a2ca7
Compare
I think that might be a good idea. Unfortunately for |
@hirschenberger Sure, that seems like the easiest way to make quick progress here. |
Ok, I removed the binary_search part, ready to go... |
@nagisa do you have thoughts on the questions @hirschenberger raised? Sounds like they need some direction here. |
Yes. It might emit something else than an actual prefetch instruction on some architectures and/or targets (if it e.g. does not support such an instruction, it might simply decide to load the pointer into register earlier). That’s the decision for the backend.
Maintain what? The platform-specific list of special-cases based on the efficiency? Then the answer is no, I think. But again, this is something that backend could do if it had to. Either way I do not see these answers affecting the implementation of this intrinsic much. What needs to happen before this can land is a signature improvement. I would be fine taking this with a signature looking maybe like this:
I think that’s the best we can do without going for an unnecessarily complex implementation. The intrinsic also should have at least a blurb of documentation that mentions:
because that is mandated by the LLVM. Mis-using the intrinsic will likely cause an ICE, but I have no idea how to easily check for this precondition. |
@hirschenberger do you have thoughts on the API proposed by @nagisa? |
I find the We could also expand this further and add 4 functions: |
wfm. |
ping @hirschenberger just wanted to keep this on your radar! Did you want to apply the suggested updates? |
Yes, I will update the PR soon. Sorry for the delay. |
27a2ca7
to
a50f02c
Compare
a50f02c
to
f83901b
Compare
Ok, I split up the function and added codegen-tests and some comments. |
@alexcrichton BTW, is it possible to typeck if the |
Ah I'm not sure personally but others may know! |
@bors r+ |
📌 Commit f83901b has been approved by |
Forgot I can do it myself. |
⌛ Testing commit f83901b with merge 3b0b9d1... |
💔 Test failed - status-travis |
@bors retry
|
Adding support for the llvm `prefetch` intrinsic Optimize `slice::binary_search` by using prefetching.
☀️ Test successful - status-appveyor, status-travis |
Optimize
slice::binary_search
by using prefetching.