Sub-optimal codegen: Unnecessarily dumping AVX registers to stack #71025

Open
ejmahler opened this issue Apr 11, 2020 · 2 comments
Labels
A-codegen  Area: Code generation
A-SIMD  Area: SIMD (Single Instruction Multiple Data)
C-enhancement  Category: An issue proposing an enhancement or a PR with one.
C-optimization  Category: An issue highlighting optimization opportunities or PRs implementing such
I-slow  Issue: Problems and improvements with respect to performance of generated code.
T-compiler  Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@ejmahler

I tried this code (example 1), in which we have a public function mutate_array that internally calls mutate_chunk:

use std::arch::x86_64::*;

#[inline(always)]
pub unsafe fn mutate_chunk(rows: [__m256d; 4]) -> [__m256d; 4] {
    [
        _mm256_permute2f128_pd(rows[0], rows[1], 0x20),
        _mm256_permute2f128_pd(rows[2], rows[3], 0x20),
        _mm256_permute2f128_pd(rows[0], rows[1], 0x31),
        _mm256_permute2f128_pd(rows[2], rows[3], 0x31),
    ]
}


#[target_feature(enable = "avx")]
pub unsafe fn mutate_array(input: *const f64, output: *mut f64) {
    let mut input_data = [_mm256_setzero_pd(); 4];

    for i in 0..4 {
        input_data[i] = _mm256_loadu_pd(input.add(4*i));
    }

    let output_data = mutate_chunk(input_data);

    for i in 0..4 {
        _mm256_storeu_pd(output.add(4*i), output_data[i]);
    }
}

This is a very stripped-down example of code that appears all over my project. We load data into AVX registers, do some sort of operation on the loaded data, then store it back to memory. The (more or less) optimal assembly for this example code is:

example::mutate_array:
        vmovups ymm0, ymmword ptr [rdi]
        vmovups ymm1, ymmword ptr [rdi + 32]
        vmovups ymm2, ymmword ptr [rdi + 64]
        vmovups ymm3, ymmword ptr [rdi + 96]
        vperm2f128      ymm4, ymm0, ymm1, 32
        vperm2f128      ymm5, ymm2, ymm3, 49
        vperm2f128      ymm0, ymm0, ymm1, 49
        vperm2f128      ymm1, ymm2, ymm3, 32
        vmovups ymmword ptr [rsi], ymm4
        vmovups ymmword ptr [rsi + 32], ymm1
        vmovups ymmword ptr [rsi + 64], ymm0
        vmovups ymmword ptr [rsi + 96], ymm5
        vzeroupper
        ret

4 loads, 4 permutes, 4 stores.

As you can see from the godbolt link, the actual generated assembly is quite a bit longer:

example::mutate_array:
        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 288
        vmovups ymm0, ymmword ptr [rdi]
        vmovups ymm1, ymmword ptr [rdi + 32]
        vmovups ymm2, ymmword ptr [rdi + 64]
        vmovups ymm3, ymmword ptr [rdi + 96]
        vmovaps ymmword ptr [rsp + 96], ymm3
        vmovaps ymmword ptr [rsp + 64], ymm2
        vmovaps ymmword ptr [rsp + 32], ymm1
        vmovaps ymmword ptr [rsp], ymm0
        vmovaps ymm0, ymmword ptr [rsp]
        vmovaps ymm1, ymmword ptr [rsp + 32]
        vmovaps ymm2, ymmword ptr [rsp + 64]
        vmovaps ymm3, ymmword ptr [rsp + 96]
        vperm2f128      ymm4, ymm0, ymm1, 32
        vmovaps ymmword ptr [rsp + 128], ymm4
        vperm2f128      ymm4, ymm2, ymm3, 32
        vperm2f128      ymm0, ymm0, ymm1, 49
        vperm2f128      ymm1, ymm2, ymm3, 49
        vmovaps ymmword ptr [rsp + 160], ymm4
        vmovaps ymmword ptr [rsp + 192], ymm0
        vmovaps ymmword ptr [rsp + 224], ymm1
        vmovaps ymm0, ymmword ptr [rsp + 224]
        vmovups ymmword ptr [rsi + 96], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 192]
        vmovups ymmword ptr [rsi + 64], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 160]
        vmovups ymmword ptr [rsi + 32], ymm0
        vmovaps ymm0, ymmword ptr [rsp + 128]
        vmovups ymmword ptr [rsi], ymm0
        mov     rsp, rbp
        pop     rbp
        vzeroupper
        ret

The second assembly block is the same as the first, except for the added reads and writes through rsp (i.e. the stack). It loads the 4 values from memory just fine, but before running the permutes it spills them to the stack and immediately reloads them. The same thing happens after the permutes: before writing the data to the output, it spills the results to the stack and immediately reloads them.

It's possible to nudge the compiler into generating the correct output by partially unrolling the input and output loops.

By changing the input loop

for i in 0..4 {
    input_data[i] = _mm256_loadu_pd(input.add(4*i));
}

to

for i in 0..2 {
    input_data[i*2] =   _mm256_loadu_pd(input.add(8*i));
    input_data[i*2+1] = _mm256_loadu_pd(input.add(8*i + 4));
}

we can see that the loop is functionally identical, but the compiler no longer writes the inputs to the stack (example 2).

We can apply the same treatment to the output loop, completely eliminating the stack reads and writes: example 3.
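
Concretely, the unrolled output loop looks roughly like this (a sketch of the change; example 3 has the full function):

for i in 0..2 {
    _mm256_storeu_pd(output.add(8*i),     output_data[i*2]);
    _mm256_storeu_pd(output.add(8*i + 4), output_data[i*2+1]);
}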

Without knowing anything about the internals of the compiler, I can imagine two possibilities here:

  1. Example 1 demonstrates a trivial missed optimization: the compiler unrolls the loop but fails to determine that it can eliminate the array. As a result, it more than doubles the instruction count of the function, tanking performance.
  2. Alternatively, something in the Rust standard requires all arrays to have an in-memory representation, so they aren't allowed to be optimized away entirely into register storage, even within a single function. If that's the case, then examples 2 and 3 demonstrate a code generation bug, because they clearly show the array's storage being completely optimized away.

Meta

rustc --version --verbose:

rustc 1.44.0-nightly (42abbd887 2020-04-07)
binary: rustc
commit-hash: 42abbd8878d3b67238f3611b0587c704ba94f39c
commit-date: 2020-04-07
host: x86_64-pc-windows-msvc
release: 1.44.0-nightly
LLVM version: 9.0
@ejmahler ejmahler added the C-bug Category: This is a bug. label Apr 11, 2020
@jonas-schievink jonas-schievink added A-codegen Area: Code generation A-SIMD Area: SIMD (Single Instruction Multiple Data) C-enhancement Category: An issue proposing an enhancement or a PR with one. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed C-bug Category: This is a bug. labels Apr 11, 2020
@workingjubilee
Member

workingjubilee commented Oct 5, 2021

This is not a failure of the "Rust standard" per se. Rather, rustc emitted miscompilations when we aggressively inlined this sort of thing, so the compiler was adjusted to take an overly cautious escape hatch: we normally pass all vector arguments via memory. This was done on the assumption that LLVM should be able to "see through" it and still inline these kinds of cases, so it is not entirely clear why this is happening nonetheless.

However, I do recommend not using raw pointers here if at all possible: Rust has much weaker non-aliasing assumptions about them, and optimizations can hinge on exploiting non-aliasing. It is plausible that using input: *const f64 and output: *mut f64 pessimizes this function somewhat compared to, e.g., using slice::get_unchecked and slice::iter_mut, or even just a function that demands stronger guarantees about its writes, e.g. ptr::copy_nonoverlapping.
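
For illustration only (untested, the function name is made up, mutate_chunk is reused from the original example, and the caller is assumed to guarantee at least 16 f64s on each side), a slice-based variant might look something like:

use std::arch::x86_64::*;

#[target_feature(enable = "avx")]
pub unsafe fn mutate_array_slices(input: &[f64], output: &mut [f64]) {
    // Assumed caller contract: input.len() >= 16 and output.len() >= 16.
    let mut input_data = [_mm256_setzero_pd(); 4];

    for i in 0..4 {
        // get_unchecked yields a &f64, which coerces to the *const f64 the intrinsic expects.
        input_data[i] = _mm256_loadu_pd(input.get_unchecked(4 * i));
    }

    let output_data = mutate_chunk(input_data);

    for i in 0..4 {
        _mm256_storeu_pd(output.get_unchecked_mut(4 * i), output_data[i]);
    }
}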

@workingjubilee workingjubilee added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Oct 5, 2021
@ejmahler
Author

ejmahler commented Oct 6, 2021

Unfortunately, slice operations aren't an option here, because the whole reason the code exists is to take advantage of AVX. The SIMD operations in this example are trivial in order to create a minimal repro, but the real-world code is very nontrivial and depends heavily on this register-based access. I do use wrapper types around raw pointers that provide a little more runtime safety, but again, that's not important for a minimal repro.
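
For context, the pointer wrappers I mean are roughly shaped like this (a simplified, hypothetical sketch, not the real code):

use std::arch::x86_64::*;

// Carries the length alongside the raw pointer so loads can be bounds-checked in debug builds.
pub struct RawSliceF64 {
    ptr: *const f64,
    len: usize,
}

impl RawSliceF64 {
    pub fn new(slice: &[f64]) -> Self {
        Self { ptr: slice.as_ptr(), len: slice.len() }
    }

    #[inline(always)]
    pub unsafe fn load(&self, index: usize) -> __m256d {
        debug_assert!(index + 4 <= self.len);
        _mm256_loadu_pd(self.ptr.add(index))
    }
}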

@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023