Sub-optimal codegen: Unnecessarily dumping AVX registers to stack #71025
Labels
A-codegen
Area: Code generation
A-SIMD
Area: SIMD (Single Instruction Multiple Data)
C-enhancement
Category: An issue proposing an enhancement or a PR with one.
C-optimization
Category: An issue highlighting optimization opportunities or PRs implementing such
I-slow
Issue: Problems and improvements with respect to performance of generated code.
T-compiler
Relevant to the compiler team, which will review and decide on the PR/issue.
I tried this code (example 1), in which we have a public function `mutate_array` that internally calls `mutate_chunk`.
This is a very stripped-down example of code that appears all over my project. We load data into AVX registers, do some sort of operation on the loaded data, then store it back to memory. The (more or less) optimal assembly for this example code is:
4 loads, 4 permutes, 4 stores.
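The load, operate, store pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from example 1: the body of `mutate_chunk` (a permute that swaps the two 128-bit halves of each register), the 32-float layout, and the safe wrapper `try_mutate_array` are all assumptions introduced here.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Hypothetical per-chunk operation: one permute per register.
/// Control byte 0x01 swaps the two 128-bit halves of a 256-bit register.
#[cfg(target_arch = "x86_64")]
#[inline]
#[target_feature(enable = "avx")]
unsafe fn mutate_chunk(rows: [__m256; 4]) -> [__m256; 4] {
    [
        _mm256_permute2f128_ps::<0x01>(rows[0], rows[0]),
        _mm256_permute2f128_ps::<0x01>(rows[1], rows[1]),
        _mm256_permute2f128_ps::<0x01>(rows[2], rows[2]),
        _mm256_permute2f128_ps::<0x01>(rows[3], rows[3]),
    ]
}

/// Loads 4 AVX registers (32 floats), transforms them, and stores them back.
///
/// Safety: the CPU must support AVX, and `input` and `output` must each be
/// valid for 32 consecutive `f32` values.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
pub unsafe fn mutate_array(input: *const f32, output: *mut f32) {
    let mut input_data = [_mm256_setzero_ps(); 4];

    // Input loop: one unaligned load per iteration.
    for i in 0..4 {
        input_data[i] = _mm256_loadu_ps(input.add(8 * i));
    }

    let output_data = mutate_chunk(input_data);

    // Output loop: one unaligned store per iteration.
    for i in 0..4 {
        _mm256_storeu_ps(output.add(8 * i), output_data[i]);
    }
}

/// Hypothetical safe wrapper: returns `None` when AVX is unavailable.
pub fn try_mutate_array(input: &[f32; 32]) -> Option<[f32; 32]> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            let mut output = [0.0f32; 32];
            // SAFETY: AVX was detected and both arrays hold 32 floats.
            unsafe { mutate_array(input.as_ptr(), output.as_mut_ptr()) };
            return Some(output);
        }
    }
    None
}
```

Whether a particular compiler version spills the registers to the stack for this sketch has to be checked in the optimized assembly; the wrapper only makes the function safe to call from ordinary code.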
As you can see from the godbolt link, the actual generated assembly is quite a bit longer.
The second assembly block is the same as the first, except for the addition of reads and writes to `rsp` (i.e. the stack). It loads the 4 values from memory fine, but before running the permutes, it stores the values to `rsp`, then immediately reads them back. Same thing after the permutes: before writing the data to the output, it stores it to `rsp`, then immediately reads it back.
It's possible to nudge the compiler into generating the correct output by partially unrolling the input and output loops.
By changing the input loop to a partially unrolled version, we can see that the loop is functionally identical, but the compiler no longer writes the inputs to the stack (example 2).
We can apply the same treatment to the output loop, completely eliminating the stack reads and writes: example 3.
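One way the partial unrolling of both loops might look is sketched below. The unroll factor of two, the half-swapping permute in `mutate_chunk`, and the names `mutate_array_unrolled` and `try_mutate_array_unrolled` are assumptions for illustration; the exact code behind examples 2 and 3 is not reproduced here.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Hypothetical per-chunk operation: swap the 128-bit halves of each register.
#[cfg(target_arch = "x86_64")]
#[inline]
#[target_feature(enable = "avx")]
unsafe fn mutate_chunk(rows: [__m256; 4]) -> [__m256; 4] {
    [
        _mm256_permute2f128_ps::<0x01>(rows[0], rows[0]),
        _mm256_permute2f128_ps::<0x01>(rows[1], rows[1]),
        _mm256_permute2f128_ps::<0x01>(rows[2], rows[2]),
        _mm256_permute2f128_ps::<0x01>(rows[3], rows[3]),
    ]
}

/// Same work as a plain 4-iteration version, with both loops unrolled by two.
///
/// Safety: the CPU must support AVX, and `input` and `output` must each be
/// valid for 32 consecutive `f32` values.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
pub unsafe fn mutate_array_unrolled(input: *const f32, output: *mut f32) {
    let mut input_data = [_mm256_setzero_ps(); 4];

    // Input loop: two loads per iteration instead of one.
    for i in 0..2 {
        input_data[2 * i] = _mm256_loadu_ps(input.add(8 * (2 * i)));
        input_data[2 * i + 1] = _mm256_loadu_ps(input.add(8 * (2 * i + 1)));
    }

    let output_data = mutate_chunk(input_data);

    // Output loop: two stores per iteration instead of one.
    for i in 0..2 {
        _mm256_storeu_ps(output.add(8 * (2 * i)), output_data[2 * i]);
        _mm256_storeu_ps(output.add(8 * (2 * i + 1)), output_data[2 * i + 1]);
    }
}

/// Hypothetical safe wrapper: returns `None` when AVX is unavailable.
pub fn try_mutate_array_unrolled(input: &[f32; 32]) -> Option<[f32; 32]> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            let mut output = [0.0f32; 32];
            // SAFETY: AVX was detected and both arrays hold 32 floats.
            unsafe { mutate_array_unrolled(input.as_ptr(), output.as_mut_ptr()) };
            return Some(output);
        }
    }
    None
}
```

The loops still touch the same four registers in the same order, so the result is identical to the non-unrolled form; only the shape of the loop presented to the optimizer changes.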
Without knowing anything about the internals of the compiler, I can imagine two possibilities here:
Meta
`rustc --version --verbose`: