major performance regression between Rust 1.50 and beta when using target-cpu=native #83027
Oh, also, I did try to find a smaller reproduction. Since the regression is ultimately rooted in the SIMD implementation found in the `memchr` crate, I tried this program:

```rust
use memchr::memchr;

fn main() {
    let haystack = "abcdefghijklmnopqrstuvwxyz".repeat(15);
    for _ in 0..100_000_000 {
        assert_eq!(None, memchr(b'@', haystack.as_bytes()));
    }
}
```

But both versions of the program inlined all the routines I would expect.
Can you check if beta (1.51) reproduces this regression? My immediate guess is that it's caused by the LLVM 12 upgrade, which landed in #81451. cc @rust-lang/wg-llvm
Yes, I am able to reproduce on beta too:
Needs MCVE because OP said #83027 (comment) did not reproduce the bug.
Could you describe full reproduction steps, including any custom options and features used when building ripgrep? Do you use […]? I couldn't reproduce the issue.
Also, are you compiling […]?
Ah!!! Thank you so much for mentioning […]! Some preliminaries for checking my environment:
Compile four different binaries: stable, stable + `target-cpu=native`, beta, and beta + `target-cpu=native`. Only beta+native has the performance regression.
And to show that only beta+native has the issue (the curl command for getting the subtitles is in my OP):
So given the new focus on […]:

Now compile two binaries: one with beta and one with beta and […].
And now run them:
I've run […].
What is your […]? Possible cause: #80749. Does this go away with […]?
@nagisa Broadwell:
What is the best way to fix it?
I've never tried using […].
Perhaps using a […].
Ah thanks for the link. I ran this:
But the regression remains:
I guess it helps in the strictest sense that it doesn't have the performance regression:
But I think what I meant was, "how do we not get a performance regression when using […]?"
Hmmm... Okay. cc @Amanieu Is there a more succinct/higher-level description of why #80749 is possibly the cause here? I guess what I mean to say is, what changed that stopped the inlining from happening here?
I'm happy to try and explain it here; I don't recall there being a good description of this elsewhere.

It is not valid for a function to be inlined into another if the feature sets differ between them. On x86_64 in particular this is exemplified by potentially differing ABIs and registers when a feature is available and when it isn't. As the features are tracked at a per-function level, LLVM is forced to disable inlining of such differing functions so that their features don't get lost.

The linked PR specifies an exact list of features that shall be applied to all functions that don't specify anything otherwise, so I suspect conflicts in memchr code occur quite naturally when there's interaction between SIMD and regular code. With that in mind I would've expected […].
@BurntSushi I can reproduce on skylake, including with the following:

```rust
#[cfg(target_arch = "x86")]
use std::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;
use std::intrinsics::transmute;

fn main() {
    #[target_feature(enable = "avx2")]
    unsafe fn test() {
        let a = _mm256_set_epi32(1, 1, 1, 1, 1, 1, 1, 1);
        let b = _mm256_set_epi32(2, 2, 2, 2, 2, 2, 2, 2);
        let e = _mm256_set_epi32(3, 3, 3, 3, 3, 3, 3, 3);
        let r = _mm256_add_epi32(a, b);
        assert_eq_m256i(e, r);
    }
    if is_x86_feature_detected!("avx2") {
        unsafe { test() }
    } else {
        panic!("avx2 feature not detected");
    }
}

#[target_feature(enable = "avx")]
pub unsafe fn assert_eq_m256i(a: __m256i, b: __m256i) {
    assert_eq!(transmute::<_, [u64; 4]>(a), transmute::<_, [u64; 4]>(b))
}
```

Building without […]:

```
$ objdump -d regress | grep as_i32x8
$ objdump -d regress-skylake | grep as_i32x8
$ objdump -d regress-native | grep as_i32x8
0000000000006960 <_ZN4core9core_arch3x868m256iExt8as_i32x817h6f0a02a3bdc3d3e7E>:
    6ade: e8 7d fe ff ff  callq 6960 <_ZN4core9core_arch3x868m256iExt8as_i32x817h6f0a02a3bdc3d3e7E>
    6b0a: e8 51 fe ff ff  callq 6960 <_ZN4core9core_arch3x868m256iExt8as_i32x817h6f0a02a3bdc3d3e7E>
```
@lqd Thanks! Hopefully that helps dig into this a bit more.
So just to be super precise, did you mean "differ" literally? As in, if I have a function compiled with just the […]? I'm assuming that you mean, "if the caller's feature set is not a superset of the function's, then the function cannot be inlined." If that assumption is wrong, then I think my mental model is broken.
Hmmm okay. So let me try to play this back to you in my own words to make sure I grok this. So let's pick a function that isn't getting inlined, say, […].

I think the key here is that functions like […].

So I guess what I don't quite grok is what it is about […].

I think your point above about these sorts of functions being tagged with […].
LLVM should inline based on subsets, not exact matches. If it's not then that's a bug. I can't reproduce with […]:

```rust
#[cfg(target_arch = "x86")]
use std::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

extern "C" {
    fn black_box(a: *const u8);
}

pub fn foo() {
    #[target_feature(enable = "avx2")]
    unsafe fn test() {
        let a = _mm256_set_epi32(1, 1, 1, 1, 1, 1, 1, 1);
        let b = _mm256_set_epi32(2, 2, 2, 2, 2, 2, 2, 2);
        let e = _mm256_set_epi32(3, 3, 3, 3, 3, 3, 3, 3);
        let r = _mm256_add_epi32(a, b);
        assert_eq_m256i(e, r);
    }
    if is_x86_feature_detected!("avx2") {
        unsafe { test() }
    } else {
        loop {}
    }
}

#[target_feature(enable = "avx")]
pub unsafe fn assert_eq_m256i(a: __m256i, b: __m256i) {
    black_box(&a as *const _ as *const _);
    black_box(&b as *const _ as *const _);
}
```
Note: we're also talking about this in https://rust-lang.zulipchat.com/#narrow/stream/247081-t-compiler.2Fperformance/topic/major.20performance.20regression.20between.20Rust.201.2E50.20and.20.2383027 and we have a repro that is easier to work with: https://godbolt.org/z/3vTv3s
I'm sorry for my confusing wording. It's not exactly superset, but when features are compatible. A subset/superset relationship does not always imply compatibility, though it usually does, and for x86_64, as far as I can tell, if the callee has a subset of features, it is compatible for inlining.
After some thinking I think what may be happening here is somewhat different. I'll output some LLVM IR in the further explanation as well as some Rust code. Everything (MCVE) together is in this godbolt. So… when a function such as this is compiled:

```rust
#[target_feature(enable = "avx2")]
pub unsafe fn _mm256_add_epi32(a: __m256i, b: __m256i) -> __m256i { ... }
```

It will translate to a function that looks a lot like this:

```llvm
define void @_mm256_add_epi32(%__m256i* %0, %__m256i* %1, %__m256i* %2) unnamed_addr #0 { ... }
attributes #0 = { ... "target-cpu"="skylake-avx512" "target-features"="+avx2" }
```

Similarly, when a function as such is compiled:

```rust
pub(crate) trait m256iExt: Sized {
    // ...
    // #[target_feature(default)]
    fn as_i32x8(self) -> i32x8 {
        unsafe { transmute(self.as_m256i()) }
    }
}
```

It will become a:

```llvm
define internal fastcc void @as_i32x8(<8 x i32>* %0, %__m256i* %1) unnamed_addr #0 { ... } ; uses global default target-features!
attributes #0 = { ... "target-cpu"="skylake-avx512" }
```

Now, AFAICT LLVM will not "combine" the per-function target features with the list of global features, but rather overwrite it. And so what ought to happen here is that we have a […].

Now, some further exploration with the godbolt example has shown some pretty weird behaviours, so I'm not exactly sure if what I'm saying is entirely correct. So I think my theory may be plausible to some extent, but also probably incorrect given the two weird behaviours above...
In short, there's at least one bug with […].
…ochenkov Adjust `-Ctarget-cpu=native` handling in cg_llvm

When cg_llvm encounters `-Ctarget-cpu=native` it computes an explicit set of features that applies to the target in order to correctly compile code for the host CPU (because e.g. `skylake` alone is not sufficient to tell if some of the instructions are available or not). However there were a couple of issues with how we did this.

Firstly, the order in which features were overridden wasn't quite right – conceptually you'd expect the `-Ctarget-cpu=native` option to override the features that are implicitly set by the target definition. However due to how other `-Ctarget-cpu` values are handled we must adopt the following order of priority:

* Features from `-Ctarget-cpu=*`; are overridden by
* Features implied by `--target`; are overridden by
* Features from `-Ctarget-feature`; are overridden by
* function-specific features.

Another problem was that the function-level `target-features` attribute would overwrite the entire set of the globally enabled features, rather than just the features that `#[target_feature(enable/disable)]` specified. With something like `-Ctarget-cpu=native` we'd end up in a situation wherein a function without a `#[target_feature(enable)]` annotation would have a broader set of features compared to a function with one such attribute. This turned out to be a cause of heavy run-time regressions in some code using these function-level attributes in conjunction with `-Ctarget-cpu=native`, for example.

With this PR rustc is more careful about specifying the entire set of features for functions that use `#[target_feature(enable/disable)]` or `#[instruction_set]` attributes. Sadly, testing the original reproducer for this behaviour is quite impossible – we cannot rely on `-Ctarget-cpu=native` to be anything in particular on developer or CI machines.

cc rust-lang#83027 `@BurntSushi`
Assigning […]. @rustbot label -I-prioritize +P-high
This does appear fixed by #83084!
Thanks again @nagisa and everyone who helped diagnose this problem. :-)
I'll just start with some reproduction steps that I'm hoping someone else will be able to reproduce. This assumes you've compiled ripgrep with Rust 1.50 to a binary named `rg-stable_1.50` and also compiled ripgrep with Rust nightly 2021-03-09 to a binary named `rg-nightly_2021-03-09` (alternatively, compile with the beta release, as I've reproduced the problem there in a subsequent comment):

Here is the relevant part of the profile I extracted by running the ripgrep compiled with nightly under `perf`:

The key difference between Rust nightly and stable is the fact that it looks like `i8x32::new` isn't being inlined. But it's not the only one. There are other functions showing up in the profile, like `core::core_arch::x86::m256iExt::as_i32x8`, that aren't being inlined either. These are trivial cast functions, and them not being inlined is likely a bug. (So an alternative title for this issue might be, "some trivial functions aren't getting inlined in hot code paths." But I figured I'd start with the actual problem I'm seeing in case my analysis is wrong.)

Initially I assumed that maybe something had changed in stdarch recently related to these code paths, but I don't see anything. So I'm a bit worried that perhaps something else changed that impacted inlining decisions, and this is an indirect effect. Alas, I'm stuck at this point and would love some help getting to the bottom of it.

It's possible, perhaps even likely, that this is related to #60637. I note that it is used to justify some `inline(always)` annotations, but `fn new` is left at just `#[inline]`.

Perhaps there is a quick fix where we need to go over some of the lower-level SIMD routines and make sure they're tagged with `inline(always)`. But really, it seems to me like these functions should be inlined automatically. I note that this doesn't look like a cross-crate problem that might typically be a reason for preventing inlining. In particular, `_mm256_setr_epi8` is being inlined (as one would expect), but the call to `i8x32::new` in its implementation is the thing not being inlined. So this seems pretty suspicious to me.

Apologies for not narrowing this down more. A good next step might be to find the specific version of nightly that introduced this problem.