Basic vectorization performance regression from 1.48.0 onwards #85265
Comments
Assigning a priority according to the WG-prioritization discussion on Zulip and removing I-prioritize.
@rustbot modify labels +P-high -I-prioritize |
I think this issue is caused by using i128 when passing "small" arguments in our ABI.
rustc generates code which passes these arrays as i128 values, and LLVM fails to replace them with xmm registers. |
Maybe we should use vector types for this instead of integers for small arrays. |
@AngelicosPhosphoros I've been looking at this issue to see if I could fix it, but it seems to me that i128s are never explicitly constructed inside of |
@JulianKnodt I don't remember where it is exactly but try to start from here:
|
Is there any progress with this? There's seemingly no way to fully mitigate this regression, not even by using intrinsics. |
I ended up not having time to work on this, sorry about that. |
I've done some analysis in #91447 (comment) of a similar issue, and what I've found here is that the return type is not ideal: using an out parameter fixes the part of the regression caused by the return type.

```rust
pub fn case_2(a: [f32; 4], b: [f32; 4], out: &mut [f32; 4]) {
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
}
```

```asm
example::case_1:
        movq    xmm0, rcx
        movq    xmm1, rdx
        punpcklqdq      xmm1, xmm0
        movq    xmm0, rsi
        movq    xmm2, rdi
        punpcklqdq      xmm2, xmm0
        addps   xmm2, xmm1
        movups  xmmword ptr [r8], xmm2
        ret
```

And using references for the inputs fixes the rest.

```rust
pub fn case_2(a: &[f32; 4], b: &[f32; 4], out: &mut [f32; 4]) {
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
}
```

```asm
example::case_2:
        movups  xmm0, xmmword ptr [rdi]
        movups  xmm1, xmmword ptr [rsi]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdx], xmm1
        ret
```
|
@Urgau Hi. I did play around with this workaround a while back as a temporary solution. Unfortunately, it wasn't very effective: in practice the regression came right back as soon as I called the workaround method, because the compiler simply optimized the indirection away, as it should. It also, obviously, doesn't work with standard traits such as Add or AddAssign, which is where I initially encountered the problem. Example code illustrating this:
Code

```rust
/// Ineffective workaround.
pub fn use_ref_add(a: Vec4, b: Vec4) -> Vec4 {
    let mut c = Vec4::default();
    Vec4::ref_add(&a, &b, &mut c);
    c
}

#[derive(Default)]
pub struct Vec4 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
    pub w: f32,
}

impl Vec4 {
    /// Ineffective workaround.
    pub fn ref_add(lhs: &Self, rhs: &Self, out: &mut Self) {
        out.x = lhs.x + rhs.x;
        out.y = lhs.y + rhs.y;
        out.z = lhs.z + rhs.z;
        out.w = lhs.w + rhs.w;
    }
}

impl std::ops::Add for Vec4 {
    type Output = Self;
    fn add(self, other: Self) -> Self {
        Self {
            x: self.x + other.x,
            y: self.y + other.y,
            z: self.z + other.z,
            w: self.w + other.w,
        }
    }
}

impl std::ops::AddAssign for Vec4 {
    fn add_assign(&mut self, rhs: Self) {
        self.x += rhs.x;
        self.y += rhs.y;
        self.z += rhs.z;
        self.w += rhs.w;
    }
}
```

Rust 1.47 - Ok: https://rust.godbolt.org/z/WdMYdT1rM

```asm
example::use_ref_add:
        mov     rax, rdi
        movups  xmm0, xmmword ptr [rsi]
        movups  xmm1, xmmword ptr [rdx]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdi], xmm1
        ret

example::Vec4::ref_add:
        movups  xmm0, xmmword ptr [rdi]
        movups  xmm1, xmmword ptr [rsi]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdx], xmm1
        ret

<example::Vec4 as core::ops::arith::Add>::add:
        mov     rax, rdi
        movups  xmm0, xmmword ptr [rsi]
        movups  xmm1, xmmword ptr [rdx]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdi], xmm1
        ret

<example::Vec4 as core::ops::arith::AddAssign>::add_assign:
        movups  xmm0, xmmword ptr [rsi]
        movups  xmm1, xmmword ptr [rdi]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdi], xmm1
        ret
```

Rust 1.57 - Regression: https://rust.godbolt.org/z/3zrcqc5E4

```asm
example::use_ref_add:
        movd    xmm0, esi
        mov     rax, rsi
        shld    rax, rdi, 32
        shr     rsi, 32
        movq    xmm1, rax
        movq    xmm2, rsi
        punpckldq       xmm2, xmm1
        movd    xmm1, edi
        movd    xmm3, edx
        addps   xmm3, xmm1
        movd    xmm1, ecx
        addss   xmm1, xmm0
        shrd    rdx, rcx, 32
        shr     rcx, 32
        movq    xmm0, rdx
        movq    xmm4, rcx
        punpckldq       xmm4, xmm0
        addps   xmm4, xmm2
        movd    eax, xmm3
        movd    ecx, xmm1
        movd    edx, xmm4
        shufps  xmm4, xmm4, 85
        movd    esi, xmm4
        shl     rsi, 32
        shl     rdx, 32
        or      rdx, rcx
        or      rax, rsi
        ret

example::Vec4::ref_add:
        movups  xmm0, xmmword ptr [rdi]
        movups  xmm1, xmmword ptr [rsi]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdx], xmm1
        ret

<example::Vec4 as core::ops::arith::Add>::add:
        movd    xmm0, esi
        mov     rax, rsi
        shld    rax, rdi, 32
        shr     rsi, 32
        movq    xmm1, rax
        movq    xmm2, rsi
        punpckldq       xmm2, xmm1
        movd    xmm1, edi
        movd    xmm3, edx
        addps   xmm3, xmm1
        movd    xmm1, ecx
        addss   xmm1, xmm0
        shrd    rdx, rcx, 32
        shr     rcx, 32
        movq    xmm0, rdx
        movq    xmm4, rcx
        punpckldq       xmm4, xmm0
        addps   xmm4, xmm2
        movd    eax, xmm3
        movd    ecx, xmm1
        movd    edx, xmm4
        shufps  xmm4, xmm4, 85
        movd    esi, xmm4
        shl     rsi, 32
        shl     rdx, 32
        or      rdx, rcx
        or      rax, rsi
        ret

<example::Vec4 as core::ops::arith::AddAssign>::add_assign:
        movq    xmm0, rdx
        movq    xmm1, rsi
        punpcklqdq      xmm1, xmm0
        movups  xmm0, xmmword ptr [rdi]
        addps   xmm0, xmm1
        movups  xmmword ptr [rdi], xmm0
        ret
```
|
Ok. Despite not knowing much about rustc internals, I spent the weekend bisecting and trying to figure things out. The offending commits are listed below. My initial impression is that 62fe055 is primarily at fault.

Pull #76986: 'Return values up to 128 bits in registers' and pull #79547: 'Pass arguments up to 2*usize by value' close issue #26494: 'Rust should use registers more aggressively'. After this, our [f32; 4] auto-vectorizations begin to fail. Note that [f32; 8] auto-vectorization is still ok. The core modification, minus the new debug checks, is here:

```rust
// Pass and return structures up to 2 pointers in size by value, matching `ScalarPair`.
// LLVM will usually pass these in 2 registers, which is more efficient than by-ref.
let max_by_val_size = Pointer.size(self) * 2;
let size = arg.layout.size;
if arg.layout.is_unsized() || size > max_by_val_size {
    arg.make_indirect();
} else {
    // We want to pass small aggregates as immediates, but using
    // a LLVM aggregate type for this leads to bad optimizations,
    // so we pick an appropriately sized integer type instead.
    arg.cast_to(Reg { kind: RegKind::Integer, size });
}
```

If we change `let max_by_val_size = Pointer.size(self) * 2` back to `let max_by_val_size = Pointer.size(self)`, remove the new debug checks, and rebuild rustc, our vectorization appears to return to normal as per 1.47.

I'm sure someone will know what's going on. My guess is that LLVM doesn't vectorize 128-bit register types well, we are using the wrong type, or we are obscuring crucial context/type information. No idea if this comment is relevant, or if I'm barking up the wrong tree, specifically the section leading up to:

The commits in question are intended to improve performance, but they break basic auto-vectorization, which is a big loss. Could we consider reverting them until this issue is ironed out? I'll shortly leave a comment in issue #26494 to get some feedback.

Regression bisects:
- Regression in nightly-2020-09-28
- Regression in nightly-2020-12-03
- Regression in nightly-2021-03-05
|
@shampoofactory Happy to find out that we came to the same conclusion. I had already experimented this weekend with that code on the Rust CI in #91507, and the results are not good for compile time. I also found out that:
PS: the problem is not LLVM or |
Strictly speaking, even 1.47 wasn't really ideal either. I had a few ABI-related things in my backlog when I was still working on rustc, but unfortunately didn't get to actually implement them. I might try to tackle them during the holidays, but can't promise anything.

If someone wants to work on this, the basic problem here is that we're not using the proper types at the LLVM IR level. The following is all from an x86_64 point of view.

In 1.47, the struct is passed by reference, which allows LLVM to use SIMD instructions for the addition, but forces the data to go through memory instead of registers. The function signature is:

```llvm
define void @"_ZN61_$LT$example..Vec4$u20$as$u20$core..ops..arith..AddAssign$GT$10add_assign17ha933a0951500a174E"(%Vec4* nocapture align 4 dereferenceable(16) %self, %Vec4* noalias nocapture readonly dereferenceable(16) %rhs) unnamed_addr #1 !dbg !33 {
```

In 1.57 the whole thing is hidden in an i128, which means that while the values are passed in registers, they're passed in the integer registers, not in the SSE registers. The function signature is:

```llvm
define void @"_ZN61_$LT$example..Vec4$u20$as$u20$core..ops..arith..AddAssign$GT$10add_assign17h9886770253471322E"(%Vec4* noalias nocapture align 4 dereferenceable(16) %self, i128 %0) unnamed_addr #1 !dbg !27 {
```

What should be used as the argument type at the LLVM IR level is a struct of two vectors of two floats, as the C equivalent uses when compiled with clang:

```llvm
define dso_local { <2 x float>, <2 x float> } @case_2(<2 x float> %0, <2 x float> %1, <2 x float> %2, <2 x float> %3) local_unnamed_addr #0 {
```

This allows the arguments to be passed in SSE registers and results in this assembler code:

```asm
addps   %xmm2, %xmm0
addps   %xmm3, %xmm1
retq
```

Note that functions using the |
@rustbot label -E-needs-bisection |
Performance results for pull #91719: '[Experiment] revert #26494':
|
IIRC the benchmark suite only tests the compiler's performance when checking/compiling code. The performance of the generated code is only measured indirectly (as the compiler compiles itself), but since these ABI changes can enable (or block) a number of optimizations, they might affect the LLVM part of those builds, so I'd take those results with a grain of salt. |
Yes, there is no perf test for actual Rust executables aside from the compiler. |
Fixed and with a codegen regression test in #94570. |
Hi all. The compiler output for the test cases below is handled efficiently in 1.47. However, it has since progressively unraveled into something quite awkward. This trend occurs on both x64 and ARM targets. All examples are compiled with `-C opt-level=3 -C lto=fat -C codegen-units=1`.

Code

- 1.47.0 - Ok: https://rust.godbolt.org/z/33WhfM3n8
- 1.48.0 - Regression case_1: https://rust.godbolt.org/z/bqEre5dx5
- 1.50.0 - Regression case_1 and case_2: https://rust.godbolt.org/z/Pj399ezPq
- 1.52.0 - Further regression case_2: https://rust.godbolt.org/z/KsvbW5vhW