Basic vectorization performance regression from 1.48.0 onwards #85265

Closed

shampoofactory opened this issue May 13, 2021 · 18 comments
Labels

A-codegen Area: Code generation
C-bug Category: This is a bug.
I-slow Issue: Problems and improvements with respect to performance of generated code.
P-high High priority
regression-from-stable-to-stable Performance or correctness regression from one stable version to another.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@shampoofactory
Contributor

Hi all. The compiler generates efficient output for the test cases below in 1.47. However, it has since progressively unraveled into something quite awkward. This trend occurs on both x64 and ARM targets. All examples are compiled with -C opt-level=3 -C lto=fat -C codegen-units=1.

Code

pub fn case_1(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [
        a[0] + b[0],
        a[1] + b[1],
        a[2] + b[2],
        a[3] + b[3],
    ]
}

pub fn case_2(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut c = [0.0; 4];
    for i in 0..4 {
        c[i] = a[i] + b[i];
    }
    c
}

1.47.0 - Ok

https://rust.godbolt.org/z/33WhfM3n8

example::case_1:
        mov     rax, rdi
        vmovups xmm0, xmmword ptr [rsi]
        vaddps  xmm0, xmm0, xmmword ptr [rdx]
        vmovups xmmword ptr [rdi], xmm0
        ret

example::case_2:
        mov     rax, rdi
        vmovups xmm0, xmmword ptr [rsi]
        vaddps  xmm0, xmm0, xmmword ptr [rdx]
        vmovups xmmword ptr [rdi], xmm0
        ret

1.48.0 - Regression case_1

https://rust.godbolt.org/z/bqEre5dx5

example::case_1:
        vmovss  xmm0, dword ptr [rdi]
        vaddss  xmm0, xmm0, dword ptr [rsi]
        vmovss  xmm1, dword ptr [rdi + 4]
        vaddss  xmm1, xmm1, dword ptr [rsi + 4]
        vmovsd  xmm2, qword ptr [rdi + 8]
        vmovsd  xmm3, qword ptr [rsi + 8]
        vaddps  xmm2, xmm2, xmm3
        vmovd   eax, xmm0
        vmovd   ecx, xmm1
        vextractps      esi, xmm2, 0
        vextractps      edx, xmm2, 1
        shl     rdx, 32
        or      rdx, rsi
        shl     rcx, 32
        or      rax, rcx
        ret

example::case_2:
        vmovups xmm0, xmmword ptr [rdi]
        vaddps  xmm0, xmm0, xmmword ptr [rsi]
        vmovq   rax, xmm0
        vpextrq rdx, xmm0, 1
        ret

1.50.0 - Regression case_1 and case_2

https://rust.godbolt.org/z/Pj399ezPq

example::case_1:
        vmovd   xmm0, esi
        shr     rsi, 32
        vmovd   xmm1, edi
        shr     rdi, 32
        vpinsrd xmm1, xmm1, edi, 1
        vmovd   xmm2, esi
        vmovd   xmm3, edx
        shr     rdx, 32
        vpinsrd xmm3, xmm3, edx, 1
        vaddps  xmm1, xmm1, xmm3
        vmovd   xmm3, ecx
        shr     rcx, 32
        vaddss  xmm0, xmm0, xmm3
        vmovd   xmm3, ecx
        vaddss  xmm2, xmm2, xmm3
        vmovd   edx, xmm0
        vmovd   eax, xmm2
        shl     rax, 32
        or      rdx, rax
        vxorps  xmm0, xmm0, xmm0
        vblendps        xmm0, xmm1, xmm0, 2
        vmovq   rcx, xmm0
        vmovshdup       xmm0, xmm1
        vmovq   rax, xmm0
        shl     rax, 32
        or      rax, rcx
        ret

example::case_2:
        vmovq   xmm0, rcx
        vmovq   xmm1, rdx
        vpunpcklqdq     xmm0, xmm1, xmm0
        vmovq   xmm1, rsi
        vmovq   xmm2, rdi
        vpunpcklqdq     xmm1, xmm2, xmm1
        vaddps  xmm0, xmm1, xmm0
        vmovq   rax, xmm0
        vpextrq rdx, xmm0, 1
        ret

1.52.0 - Further regression case_2

https://rust.godbolt.org/z/KsvbW5vhW

example::case_1:
        vmovd   xmm0, esi
        shr     rsi, 32
        vmovd   xmm1, edi
        shr     rdi, 32
        vpinsrd xmm1, xmm1, edi, 1
        vmovd   xmm2, esi
        vmovd   xmm3, edx
        shr     rdx, 32
        vpinsrd xmm3, xmm3, edx, 1
        vaddps  xmm1, xmm1, xmm3
        vmovd   xmm3, ecx
        shr     rcx, 32
        vaddss  xmm0, xmm0, xmm3
        vmovd   xmm3, ecx
        vaddss  xmm2, xmm2, xmm3
        vmovd   edx, xmm0
        vmovd   eax, xmm2
        shl     rax, 32
        or      rdx, rax
        vxorps  xmm0, xmm0, xmm0
        vblendps        xmm0, xmm0, xmm1, 1
        vmovq   rcx, xmm0
        vmovshdup       xmm0, xmm1
        vmovq   rax, xmm0
        shl     rax, 32
        or      rax, rcx
        ret

example::case_2:
        mov     rax, rsi
        shld    rax, rdi, 32
        vmovd   xmm0, esi
        shr     rsi, 32
        vmovq   xmm1, rax
        vmovq   xmm2, rsi
        vpunpckldq      xmm1, xmm2, xmm1
        vmovd   xmm2, ecx
        vaddss  xmm0, xmm0, xmm2
        vmovd   xmm2, edx
        shrd    rdx, rcx, 32
        shr     rcx, 32
        vmovq   xmm3, rdx
        vmovq   xmm4, rcx
        vpunpckldq      xmm3, xmm4, xmm3
        vmovd   xmm4, edi
        vaddps  xmm1, xmm1, xmm3
        vaddps  xmm2, xmm2, xmm4
        vextractps      eax, xmm2, 0
        vmovd   ecx, xmm0
        vextractps      edx, xmm1, 0
        vextractps      esi, xmm1, 1
        shl     rsi, 32
        shl     rdx, 32
        or      rdx, rcx
        or      rax, rsi
        ret
@shampoofactory shampoofactory added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels May 13, 2021
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label May 13, 2021
@LeSeulArtichaut LeSeulArtichaut added I-slow Issue: Problems and improvements with respect to performance of generated code. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. A-codegen Area: Code generation and removed regression-untriaged Untriaged performance or correctness regression. labels May 13, 2021
@Mark-Simulacrum Mark-Simulacrum added the E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc label May 14, 2021
@hameerabbasi
Contributor

Assigning a priority according to the WG-prioritization discussion on Zulip and removing I-prioritize.

@rustbot modify labels +P-high -I-prioritize

@rustbot rustbot added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels May 18, 2021
@AngelicosPhosphoros
Contributor

I think this issue is caused by using i128 when passing "small" arguments in our ABI.
It is somewhere here https://github.com/rust-lang/rust/tree/6f9df55a782c5373c75dfa23e6ba50f8b42318ef/compiler/rustc_target/src/abi/call
or here

fn classify_arg<'a, Ty, C>(

godbolt link

rustc generates code which passes these arrays as i128 values, and LLVM fails to replace them with xmm registers.
If we pass the arguments as references, the generated code is more or less OK.
godbolt
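
For illustration, a by-reference variant might look like the sketch below (hypothetical function name; note the by-value return may still be packed into integers, as discussed later in this thread):

pub fn case_1_ref(a: &[f32; 4], b: &[f32; 4]) -> [f32; 4] {
    // Passing by reference sidesteps the i128 argument cast: LLVM sees a
    // pointer to 16 contiguous bytes and can emit vmovups/vaddps as in 1.47.
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}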

@AngelicosPhosphoros
Contributor

Maybe we should use vector types instead of integers for these small arrays.
https://llvm.org/docs/LangRef.html#vector-type

@JulianKnodt
Contributor

@AngelicosPhosphoros I've been looking at this issue to see if I could fix it, but it seems to me that i128s are never explicitly constructed inside of rust/compiler/rustc_target/src/abi/call/x86_64.rs, so I was wondering where I could change the integer type generated for small arrays?

@AngelicosPhosphoros
Contributor

AngelicosPhosphoros commented Jul 30, 2021

@JulianKnodt I don't remember exactly where it is, but try starting from here:

@MSxDOS

MSxDOS commented Oct 29, 2021

Is there any progress with this? There's seemingly no way to fully mitigate this regression, not even by using intrinsics.
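
For reference, an intrinsics-based attempt looks something like this sketch (hypothetical function name, untested; the __m128 values still cross the Rust ABI boundary unless the function is inlined, which matches the observation above):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm_add_ps, _mm_loadu_ps, _mm_storeu_ps};

#[cfg(target_arch = "x86_64")]
pub fn case_1_sse(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    // SSE is baseline on x86_64, so these intrinsics are always available here.
    unsafe {
        let sum = _mm_add_ps(_mm_loadu_ps(a.as_ptr()), _mm_loadu_ps(b.as_ptr()));
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), sum);
        out
    }
}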

@JulianKnodt
Contributor

I ended up not having time to work on this, sorry about that.

@Urgau
Member

Urgau commented Dec 3, 2021

I've done some analysis in #91447 (comment) on a similar issue, and what I've found in this issue is that the return type is part of the problem: using an out parameter fixes the regression caused by the return type.

pub fn case_2(a: [f32; 4], b: [f32; 4], out: &mut [f32; 4]) {
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
}
example::case_1:
        movq    xmm0, rcx
        movq    xmm1, rdx
        punpcklqdq      xmm1, xmm0
        movq    xmm0, rsi
        movq    xmm2, rdi
        punpcklqdq      xmm2, xmm0
        addps   xmm2, xmm1
        movups  xmmword ptr [r8], xmm2
        ret

And using references for the inputs fixes the rest.

pub fn case_2(a: &[f32; 4], b: &[f32; 4], out: &mut [f32; 4]) {
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
}
example::case_2:
        movups  xmm0, xmmword ptr [rdi]
        movups  xmm1, xmmword ptr [rsi]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdx], xmm1
        ret

@shampoofactory
Contributor Author

shampoofactory commented Dec 3, 2021

@Urgau Hi. I did play around with this workaround a while back as a temporary solution. Unfortunately, it didn't work very effectively.

In practice, the regression still came back to haunt me as soon as I called the workaround method. The compiler just optimized it away, as it should.

It also, obviously, doesn't work with standard traits such as Add or AddAssign. This is where I initially encountered the problem.

Example code illustrating:

  • why the workaround isn't always effective, see 'use_ref_add'.
  • the regression with 'Add' and 'AddAssign'.

Code

/// Ineffective workaround.
pub fn use_ref_add(a: Vec4, b: Vec4) -> Vec4 {
    let mut c = Vec4::default();
    Vec4::ref_add(&a, &b, &mut c);
    c
}

#[derive(Default)]
pub struct Vec4 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
    pub w: f32,
}

impl Vec4 {
    /// Ineffective workaround.
    pub fn ref_add(lhs: &Self, rhs: &Self, out: &mut Self) {
        out.x = lhs.x + rhs.x;
        out.y = lhs.y + rhs.y;
        out.z = lhs.z + rhs.z;
        out.w = lhs.w + rhs.w;
    }
}

impl std::ops::Add for Vec4 {
    type Output = Self;

    fn add(self, other: Self) -> Self {
        Self { x: self.x + other.x, y: self.y + other.y, z: self.z + other.z, w: self.w + other.w }
    }
}


impl std::ops::AddAssign for Vec4 {
    fn add_assign(&mut self, rhs: Self) {
        self.x += rhs.x;
        self.y += rhs.y;
        self.z += rhs.z;
        self.w += rhs.w;
    }
}

Rust 1.47 - Ok

https://rust.godbolt.org/z/WdMYdT1rM

example::use_ref_add:
        mov     rax, rdi
        movups  xmm0, xmmword ptr [rsi]
        movups  xmm1, xmmword ptr [rdx]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdi], xmm1
        ret

example::Vec4::ref_add:
        movups  xmm0, xmmword ptr [rdi]
        movups  xmm1, xmmword ptr [rsi]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdx], xmm1
        ret

<example::Vec4 as core::ops::arith::Add>::add:
        mov     rax, rdi
        movups  xmm0, xmmword ptr [rsi]
        movups  xmm1, xmmword ptr [rdx]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdi], xmm1
        ret

<example::Vec4 as core::ops::arith::AddAssign>::add_assign:
        movups  xmm0, xmmword ptr [rsi]
        movups  xmm1, xmmword ptr [rdi]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdi], xmm1
        ret

Rust 1.57 - Regression

https://rust.godbolt.org/z/3zrcqc5E4

example::use_ref_add:
        movd    xmm0, esi
        mov     rax, rsi
        shld    rax, rdi, 32
        shr     rsi, 32
        movq    xmm1, rax
        movq    xmm2, rsi
        punpckldq       xmm2, xmm1
        movd    xmm1, edi
        movd    xmm3, edx
        addps   xmm3, xmm1
        movd    xmm1, ecx
        addss   xmm1, xmm0
        shrd    rdx, rcx, 32
        shr     rcx, 32
        movq    xmm0, rdx
        movq    xmm4, rcx
        punpckldq       xmm4, xmm0
        addps   xmm4, xmm2
        movd    eax, xmm3
        movd    ecx, xmm1
        movd    edx, xmm4
        shufps  xmm4, xmm4, 85
        movd    esi, xmm4
        shl     rsi, 32
        shl     rdx, 32
        or      rdx, rcx
        or      rax, rsi
        ret

example::Vec4::ref_add:
        movups  xmm0, xmmword ptr [rdi]
        movups  xmm1, xmmword ptr [rsi]
        addps   xmm1, xmm0
        movups  xmmword ptr [rdx], xmm1
        ret

<example::Vec4 as core::ops::arith::Add>::add:
        movd    xmm0, esi
        mov     rax, rsi
        shld    rax, rdi, 32
        shr     rsi, 32
        movq    xmm1, rax
        movq    xmm2, rsi
        punpckldq       xmm2, xmm1
        movd    xmm1, edi
        movd    xmm3, edx
        addps   xmm3, xmm1
        movd    xmm1, ecx
        addss   xmm1, xmm0
        shrd    rdx, rcx, 32
        shr     rcx, 32
        movq    xmm0, rdx
        movq    xmm4, rcx
        punpckldq       xmm4, xmm0
        addps   xmm4, xmm2
        movd    eax, xmm3
        movd    ecx, xmm1
        movd    edx, xmm4
        shufps  xmm4, xmm4, 85
        movd    esi, xmm4
        shl     rsi, 32
        shl     rdx, 32
        or      rdx, rcx
        or      rax, rsi
        ret

<example::Vec4 as core::ops::arith::AddAssign>::add_assign:
        movq    xmm0, rdx
        movq    xmm1, rsi
        punpcklqdq      xmm1, xmm0
        movups  xmm0, xmmword ptr [rdi]
        addps   xmm0, xmm1
        movups  xmmword ptr [rdi], xmm0
        ret

@shampoofactory
Contributor Author

shampoofactory commented Dec 7, 2021

OK. Despite not knowing much about Rust internals, I spent the weekend bisecting and trying to figure things out.

The offending commits are listed below. My initial impression is that 62fe055 is primarily at fault.

Pull #76986 ('Return values up to 128 bits in registers') and pull #79547 ('Pass arguments up to 2*usize by value') close issue #26494 ('Rust should use registers more aggressively'). After this, our [f32; 4] auto-vectorization begins to fail. Note that [f32; 8] auto-vectorization is still OK.

The core modification, minus the new debug checks, is here:

                // Pass and return structures up to 2 pointers in size by value, matching `ScalarPair`.
                // LLVM will usually pass these in 2 registers, which is more efficient than by-ref.
                let max_by_val_size = Pointer.size(self) * 2;
                let size = arg.layout.size;

                if arg.layout.is_unsized() || size > max_by_val_size {
                    arg.make_indirect();
                } else {
                    // We want to pass small aggregates as immediates, but using
                    // a LLVM aggregate type for this leads to bad optimizations,
                    // so we pick an appropriately sized integer type instead.
                    arg.cast_to(Reg { kind: RegKind::Integer, size });
                }

If we revert 'let max_by_val_size = Pointer.size(self) * 2' to 'let max_by_val_size = Pointer.size(self)', remove the new debug checks, and rebuild rustc, our vectorization appears to return to normal, as in 1.47.
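
For clarity, the reverted branch described above would read as follows (a sketch; the new debug checks are omitted, as noted):

                // Reverted: pass aggregates by value only up to 1 pointer in size.
                let max_by_val_size = Pointer.size(self);
                let size = arg.layout.size;

                if arg.layout.is_unsized() || size > max_by_val_size {
                    arg.make_indirect();
                } else {
                    arg.cast_to(Reg { kind: RegKind::Integer, size });
                }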

I'm sure someone will know what's going on. My guess is that LLVM doesn't vectorize 128-bit register types well, that we are using the wrong type, or that we are obscuring crucial context/type information.

No idea if this comment is relevant or if I'm barking up the wrong tree; specifically, the section leading up to:
'SIMD values are unconditionally passed through memory between all functions'.

The commits in question are intended to improve performance, but they break basic auto-vectorization, which is a big loss. Could we consider reverting them until this issue is ironed out?

I'll shortly leave a comment in issue #26494 to get some feedback.

Regression bisects

Regression in nightly-2020-09-28
2020-09-27UTC: Auto merge of #76986 - jonas-schievink:ret-in-reg, r=nagisa
62fe055 : Return values up to 128 bits in registers

Regression in nightly-2020-12-03
2020-12-02UTC: Auto merge of #79547 - erikdesjardins:byval, r=nagisa
a094ff9 : Pass arguments up to 2*usize by value

Regression in nightly-2021-03-05
2021-03-04UTC: Auto merge of #81451 - nikic:llvm-12, r=nagisa
4099208 : Upgrade to LLVM 12 #81451

@Urgau
Member

Urgau commented Dec 7, 2021

@shampoofactory Happy to find out that we came to the same conclusion. I already experimented this weekend with the code on Rust CI in #91507, and the results are not good for compile time.

I also found out that:

One thing to note here is that this affects the layout of function parameters, but if the function is inlined there are no parameters any more, and the cost of wrapping (before the function call) -> unwrapping (start of the function) and wrapping (end of the function) -> unwrapping (after the function call) in an integer completely disappears.
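
A minimal illustration of that point (hypothetical names, not from the issue):

pub fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

pub fn caller(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    // If add4 is inlined here, no i128 packing/unpacking happens at the call
    // boundary and the additions can still vectorize.
    add4(a, b)
}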

PS: the problem is not LLVM or 'let max_by_val_size = Pointer.size(self) * 2;' but 'arg.cast_to(Reg { kind: RegKind::Integer, size });', which casts the aggregate type to an integer regardless of what the actual layout contains. Reducing max_by_val_size just means that the first branch is taken instead of the second one.
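
To illustrate the alternative being pointed at: the cast could pick a register kind that matches the layout instead of always an integer. A sketch only, assuming RegKind::Vector is acceptable at this point and using a hypothetical is_all_floats helper:

                if is_all_floats(&arg.layout) {
                    // Hypothetical: all-float aggregates go in a vector register class.
                    arg.cast_to(Reg { kind: RegKind::Vector, size });
                } else {
                    arg.cast_to(Reg { kind: RegKind::Integer, size });
                }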

@dotdash
Contributor

dotdash commented Dec 9, 2021

Strictly speaking, even 1.47 wasn't ideal either. I had a few ABI-related things in my backlog when I was still working on rustc, but unfortunately didn't get to actually implement them. I might try to tackle them during the holidays, but can't promise anything. If someone wants to work on this, the basic problem here is that we're not using the proper types at the LLVM IR level. The following is all from an x86_64 point of view.

In 1.47, the struct is passed by reference, which allows LLVM to use SIMD instructions for the addition but forces the data to go through memory instead of registers. The function signature is:

define void @"_ZN61_$LT$example..Vec4$u20$as$u20$core..ops..arith..AddAssign$GT$10add_assign17ha933a0951500a174E"(%Vec4* nocapture align 4 dereferenceable(16) %self, %Vec4* noalias nocapture readonly dereferenceable(16) %rhs) unnamed_addr #1 !dbg !33 {

In 1.57, the whole thing is hidden in an i128, which means that while the values are passed in registers, they're passed in the integer registers, not the SSE registers. The function signature is:

define void @"_ZN61_$LT$example..Vec4$u20$as$u20$core..ops..arith..AddAssign$GT$10add_assign17h9886770253471322E"(%Vec4* noalias nocapture align 4 dereferenceable(16) %self, i128 %0) unnamed_addr #1 !dbg !27 {

What should be used as the argument type at the LLVM IR level is a struct of two vectors of two floats, as the C equivalent uses when compiling with clang:

  define dso_local { <2 x float>, <2 x float> } @case_2(<2 x float> %0, <2 x float> %1, <2 x float> %2, <2 x float> %3) local_unnamed_addr #0 {

This allows the arguments to be passed in SSE registers and results in this assembler code:

	addps	%xmm2, %xmm0
	addps	%xmm3, %xmm1
	retq

Note that functions using the extern "C" SysV ABI on x86_64 currently also produce similarly bad code for this, because it's using { double, double }. That's using the right registers, but the steps to convert from two floats to one double still trip up LLVM.
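
For concreteness, the extern "C" case described here corresponds to something like this Rust sketch (hypothetical type and function names):

#[repr(C)]
pub struct F32x4 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
    pub w: f32,
}

// SysV lowers this 16-byte struct as { double, double }: the right (SSE)
// registers, but the float<->double reinterpretation steps still trip up
// LLVM, as described above.
pub extern "C" fn add_c(a: F32x4, b: F32x4) -> F32x4 {
    F32x4 { x: a.x + b.x, y: a.y + b.y, z: a.z + b.z, w: a.w + b.w }
}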

@shampoofactory
Contributor Author

I've gathered some simple auto-vectorization cases here, together with compilation notes (mainly for myself). I've noted the cases that have regressed post-1.47 and those that were not efficient in 1.47 to begin with.

I've also reverted issue #26494's breaking changes here.

@shampoofactory
Contributor Author

@rustbot label -E-needs-bisection

@rustbot rustbot removed the E-needs-bisection Call for participation: This issue needs bisection: https://github.com/rust-lang/cargo-bisect-rustc label Dec 9, 2021
@shampoofactory
Contributor Author

Performance results for pull #91719: '[Experiment] revert #26494':

Finished benchmarking commit (9b34e70): comparison url.

Summary: This change led to very large relevant mixed results 🤷 in compiler performance.

* Very large improvement in instruction counts (up to -6.9% on `full` builds of `deeply-nested-async`)

* Moderate regression in instruction counts (up to 3.0% on `full` builds of `piston-image`)

@dotdash
Contributor

dotdash commented Dec 10, 2021

Performance results for pull #91719: '[Experiment] revert #26494':

Finished benchmarking commit (9b34e70): comparison url.
Summary: This change led to very large relevant mixed results 🤷 in compiler performance.

IIRC the benchmark suite only tests the compiler's performance when checking/compiling code. The performance of the generated code is only indirectly measured (as the compiler compiles itself), but since these ABI changes can enable (or block) a number of optimizations, they might affect the performance of the LLVM part of those builds, so I'd take those results with a grain of salt.

@workingjubilee
Member

Yes, there is no perf test for actual Rust executables aside from the compiler.

@jackh726 jackh726 added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 28, 2022
@workingjubilee
Member

Fixed and with a codegen regression test in #94570.
