Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LLVM patch D14260 to improve SIMD code #14976

Merged

Conversation

eschnett
Copy link
Contributor

@eschnett eschnett commented Feb 7, 2016

Arch Robison proposed the patch http://reviews.llvm.org/D14260 "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via llvm_call in Julia. For an example package, see e.g. https://github.com/eschnett/SIMD.jl.

Some discussion relevant to this PR are in #5355. @ArchRobison, please comment.

Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here eschnett/SIMD.jl#1 (comment).

Without this patch, the loop kernel looks like (x86-64, AVX2 instructions):

    vunpcklpd   %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0]
    vunpcklpd   %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]
    vinsertf128 $1, %xmm3, %ymm1, %ymm1
    vmovupd 8(%rcx), %xmm2
    vinsertf128 $1, 24(%rcx), %ymm2, %ymm2
    vaddpd  %ymm2, %ymm1, %ymm1
    vpermilpd   $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
    vextractf128    $1, %ymm1, %xmm3
    vpermilpd   $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0]
Source line: 62
    vaddsd  (%rcx), %xmm0, %xmm0

Note that the SIMD vector is kept in register %ymm1, but is unnecessarily scalarized into registers %xmm{0,1,2,3} at the end of the kernel, and re-assembled in the beginning.

With this patch, the loop kernel looks like:

L192:
    vaddpd  (%rdx), %ymm1, %ymm1
Source line: 62
    addq    %rsi, %rdx
    addq    %rcx, %rdi
    jne L192

which is perfect.

@@ -658,6 +658,12 @@ $$(LLVM_SRC_DIR)/$1.patch-applied: $(LLVM_SRC_DIR)/configure | $$(SRCDIR)/$1.pat
echo 1 > $$@
LLVM_PATCH_LIST += $$(LLVM_SRC_DIR)/$1.patch-applied
endef
define LLVM_PATCH0
$$(LLVM_SRC_DIR)/$1.patch-applied: $(LLVM_SRC_DIR)/configure | $$(SRCDIR)/$1.patch
cd $$(LLVM_SRC_DIR) && patch -p0 < $$(SRCDIR)/$1.patch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just adjust the patch content so it can be applied consistently with all the others instead of duplicating application code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep the original LLVM patch without modification.

I can, of course, modify it -- but then it might be easier to just append it to one of the existing patches instead of introducing a new one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifying the existing patches would lead to the .patch-applied files indicating an incorrect state in everyone's existing build, so the new modifications would not get applied except in from-scratch builds of llvm.

Arch Robison proposed the patch <http://reviews.llvm.org/D14260> "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via `llvm_call` in Julia. For an example package, see e.g. <https://github.com/eschnett/SIMD.jl>.

Some discussion relevant to this PR are in JuliaLang#5355. @ArchRobison, please comment.

Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here <eschnett/SIMD.jl#1 (comment)>.

Without this patch, the loop kernel looks like (x86-64, AVX2 instructions):

```
vunpcklpd %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0]
vunpcklpd %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]
vinsertf128 $1, %xmm3, %ymm1, %ymm1
vmovupd 8(%rcx), %xmm2
vinsertf128 $1, 24(%rcx), %ymm2, %ymm2
vaddpd %ymm2, %ymm1, %ymm1
vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
vextractf128 $1, %ymm1, %xmm3
vpermilpd $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0]
Source line: 62
vaddsd (%rcx), %xmm0, %xmm0
```

Note that the SIMD vector is kept in register `%ymm1`, but is unnecessarily scalarized into registers `%xmm{0,1,2,3}` at the end of the kernel, and re-assembled in the beginning.

With this patch, the loop kernel looks like:

```
L192:
vaddpd	(%rdx), %ymm1, %ymm1
Source line: 62
addq	%rsi, %rdx
addq	%rcx, %rdi
jne	L192
```

which is perfect.
@eschnett eschnett force-pushed the eschnett/simd-llvm-optimizations branch from 23e0855 to 49d1019 Compare February 8, 2016 03:21
@eschnett
Copy link
Contributor Author

eschnett commented Feb 8, 2016

Addressed @tkelman's comments, rebased.

tkelman added a commit that referenced this pull request Feb 8, 2016
@tkelman tkelman merged commit f459262 into JuliaLang:master Feb 8, 2016
@eschnett eschnett deleted the eschnett/simd-llvm-optimizations branch February 8, 2016 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants