Sub-optimal codegen: Unnecessarily dumping AVX registers to stack #71025

Open
ejmahler opened this issue Apr 11, 2020 · 2 comments
Labels
A-codegen  Area: Code generation
A-SIMD  Area: SIMD (Single Instruction Multiple Data)
C-enhancement  Category: An issue proposing an enhancement or a PR with one.
C-optimization  Category: An issue highlighting optimization opportunities or PRs implementing such
I-slow  Issue: Problems and improvements with respect to performance of generated code.
T-compiler  Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@ejmahler

I tried this code (example 1), in which we have a public function mutate_array that internally calls mutate_chunk:

use std::arch::x86_64::*;

#[inline(always)]
pub unsafe fn mutate_chunk(rows: [__m256d; 4]) -> [__m256d; 4] {
    [
        _mm256_permute2f128_pd(rows[0], rows[1], 0x20),
        _mm256_permute2f128_pd(rows[2], rows[3], 0x20),
        _mm256_permute2f128_pd(rows[0], rows[1], 0x31),
        _mm256_permute2f128_pd(rows[2], rows[3], 0x31),
    ]
}


#[target_feature(enable = "avx")]
pub unsafe fn mutate_array(input: *const f64, output: *mut f64) {
    let mut input_data = [_mm256_setzero_pd(); 4];

    for i in 0..4 {
        input_data[i] = _mm256_loadu_pd(input.add(4*i));
    }

    let output_data = mutate_chunk(input_data);

    for i in 0..4 {
        _mm256_storeu_pd(output.add(4*i), output_data[i]);
    }
}

This is a very stripped-down example of code that appears all over my project. We load data into AVX registers, do some sort of operation on the loaded data, then store it back to memory. The (more or less) optimal assembly for this example code is:

example::mutate_array:
        vmovups ymm0, ymmword ptr [rdi]
        vmovups ymm1, ymmword ptr [rdi + 32]
        vmovups ymm2, ymmword ptr [rdi + 64]
        vmovups ymm3, ymmword ptr [rdi + 96]
        vperm2f128      ymm4, ymm0, ymm1, 32
        vperm2f128      ymm5, ymm2, ymm3, 49
        vperm2f128      ymm0, ymm0, ymm1, 49
        vperm2f128      ymm1, ymm2, ymm3, 32
        vmovups ymmword ptr [rsi], ymm4
        vmovups ymmword ptr [rsi + 32], ymm1
        vmovups ymmword ptr [rsi + 64], ymm0
        vmovups ymmword ptr [rsi + 96], ymm5
        vzeroupper
        ret

4 loads, 4 permutes, 4 stores.

As you can see from the godbolt link, the actual generated assembly is quite a bit longer:

example::mutate_array:
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 288
        vmovups ymm0, ymmword ptr [rdi]
        vmovups ymm1, ymmword ptr [rdi + 32]
        vmovups ymm2, ymmword ptr [rdi + 64]
        vmovups ymm3, ymmword ptr [rdi + 96]
        vmovaps ymmword ptr [rsp + 96], ymm3
        vmovaps ymmword ptr [rsp + 64], ymm2
        vmovaps ymmword ptr [rsp + 32], ymm1
        vmovaps ymmword ptr [rsp], ymm0
        vmovaps ymm0, ymmword ptr [rsp]
        vmovaps ymm1, ymmword ptr [rsp + 32]
        vmovaps ymm2, ymmword ptr [rsp + 64]
        vmovaps ymm3, ymmword ptr [rsp + 96]
        vperm2f128      ymm4, ymm0, ymm1, 32
        vmovaps ymmword ptr [rsp + 128], ymm4
        vperm2f128      ymm4, ymm2, ymm3, 32
        vperm2f128      ymm0, ymm0, ymm1, 49
        vperm2f128      ymm1, ymm2, ymm3, 49
        vmovaps ymmword ptr [rsp + 160], ymm4
        vmovaps ymmword ptr [rsp + 192], ymm0
        vmovaps ymmword ptr [rsp + 224], ymm1
        vmovaps ymm0, ymmword ptr [rsp + 224]
        vmovups ymmword ptr [rsi + 96], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 192]
        vmovups ymmword ptr [rsi + 64], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 160]
        vmovups ymmword ptr [rsi + 32], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 128]
        vmovups ymmword ptr [rsi], ymm0
        mov     rsp, rbp
        pop     rbp
        vzeroupper
        ret

The second assembly block is the same as the first, except for the added reads and writes through rsp (i.e. the stack). It loads the 4 values from memory just fine, but before running the permutes it spills them to the stack and immediately reloads them. The same thing happens after the permutes: before writing the data to the output, it spills the results to the stack and immediately reloads them.

It's possible to nudge the compiler into generating the correct output by partially unrolling the input and output loops.

By changing the input loop

for i in 0..4 {
    input_data[i] = _mm256_loadu_pd(input.add(4*i));
}

to

for i in 0..2 {
    input_data[i*2] =   _mm256_loadu_pd(input.add(8*i));
    input_data[i*2+1] = _mm256_loadu_pd(input.add(8*i + 4));
}

we can see that the loop is functionally identical, but the compiler no longer writes the inputs to the stack (example 2).

We can apply the same treatment to the output loop, completely eliminating the stack reads and writes: example 3.
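
Concretely, the unrolled output loop looks roughly like this (a sketch of the change; example 3 has the full function):

for i in 0..2 {
    _mm256_storeu_pd(output.add(8*i),     output_data[i*2]);
    _mm256_storeu_pd(output.add(8*i + 4), output_data[i*2+1]);
}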

Without knowing anything about the internals of the compiler, I can imagine two possibilities here:

  1. Example 1 demonstrates a trivial missed optimization: the compiler unrolls the loop but fails to determine that it can eliminate the array. As a result, it more than doubles the instruction count of the function, tanking performance.
  2. Alternatively, something in the Rust standard requires all arrays to have an in-memory representation, so they aren't allowed to be optimized away entirely into register storage, even within a single function. If that's the case, then examples 2 and 3 demonstrate a code generation bug, because they clearly show the array's storage being completely optimized away.

Meta

rustc --version --verbose:

rustc 1.44.0-nightly (42abbd887 2020-04-07)
binary: rustc
commit-hash: 42abbd8878d3b67238f3611b0587c704ba94f39c
commit-date: 2020-04-07
host: x86_64-pc-windows-msvc
release: 1.44.0-nightly
LLVM version: 9.0
@ejmahler ejmahler added the C-bug Category: This is a bug. label Apr 11, 2020
@jonas-schievink jonas-schievink added A-codegen Area: Code generation A-SIMD Area: SIMD (Single Instruction Multiple Data) C-enhancement Category: An issue proposing an enhancement or a PR with one. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed C-bug Category: This is a bug. labels Apr 11, 2020
@workingjubilee
Member

workingjubilee commented Oct 5, 2021

This is not a failure of the "Rust standard" per se. Rather, rustc emitted miscompilations when we aggressively inlined this sort of thing, so the compiler was adjusted to take an overly cautious escape hatch: we normally pass all vector arguments via memory. This was done on the assumption that LLVM should be able to "see through" it and still inline these kinds of cases, so it is not entirely clear why this is happening nonetheless.

However, I do recommend not using raw pointers here if at all possible: Rust has much weaker non-aliasing assumptions about them, and optimizations can hinge on exploiting non-aliasing. It is plausible that using input: *const f64 and output: *mut f64 pessimizes this function somewhat compared to, e.g., using slice::get_unchecked and slice::iter_mut, or even just a function that demands stronger guarantees about its writes, e.g. ptr::copy_nonoverlapping.
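
For illustration only (untested, the function name is made up, mutate_chunk is reused from the original example, and the caller is assumed to guarantee at least 16 f64s on each side), a slice-based variant might look something like:

use std::arch::x86_64::*;

#[target_feature(enable = "avx")]
pub unsafe fn mutate_array_slices(input: &[f64], output: &mut [f64]) {
    // Assumed caller contract: input.len() >= 16 and output.len() >= 16.
    let mut input_data = [_mm256_setzero_pd(); 4];

    for i in 0..4 {
        // get_unchecked yields a &f64, which coerces to the *const f64 the intrinsic expects.
        input_data[i] = _mm256_loadu_pd(input.get_unchecked(4 * i));
    }

    let output_data = mutate_chunk(input_data);

    for i in 0..4 {
        _mm256_storeu_pd(output.get_unchecked_mut(4 * i), output_data[i]);
    }
}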

@workingjubilee workingjubilee added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Oct 5, 2021
@ejmahler
Author

ejmahler commented Oct 6, 2021

Unfortunately, slice operations aren't an option here, because the whole reason the code exists is to take advantage of AVX. The SIMD operations in this example are trivial in order to create a minimal repro, but the real-world code is very nontrivial and depends heavily on this register-based access. I do use wrapper types around raw pointers that provide a little more runtime safety, but again, that's not important for a minimal repro.
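
For context, the pointer wrappers I mean are roughly shaped like this (a simplified, hypothetical sketch, not the real code):

use std::arch::x86_64::*;

// Carries the length alongside the raw pointer so loads can be bounds-checked in debug builds.
pub struct RawSliceF64 {
    ptr: *const f64,
    len: usize,
}

impl RawSliceF64 {
    pub fn new(slice: &[f64]) -> Self {
        Self { ptr: slice.as_ptr(), len: slice.len() }
    }

    #[inline(always)]
    pub unsafe fn load(&self, index: usize) -> __m256d {
        debug_assert!(index + 4 <= self.len);
        _mm256_loadu_pd(self.ptr.add(index))
    }
}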

@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023