Add LLVM patch D14260 to improve SIMD code #14976

eschnett · 2016-02-07T15:29:10Z

Arch Robison proposed the patch http://reviews.llvm.org/D14260 "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via llvm_call in Julia. For an example package, see e.g. https://github.com/eschnett/SIMD.jl.

Some discussion relevant to this PR are in #5355. @ArchRobison, please comment.

Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here eschnett/SIMD.jl#1 (comment).

Without this patch, the loop kernel looks like (x86-64, AVX2 instructions):

    vunpcklpd   %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0]
    vunpcklpd   %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]
    vinsertf128 $1, %xmm3, %ymm1, %ymm1
    vmovupd 8(%rcx), %xmm2
    vinsertf128 $1, 24(%rcx), %ymm2, %ymm2
    vaddpd  %ymm2, %ymm1, %ymm1
    vpermilpd   $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
    vextractf128    $1, %ymm1, %xmm3
    vpermilpd   $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0]
Source line: 62
    vaddsd  (%rcx), %xmm0, %xmm0

Note that the SIMD vector is kept in register %ymm1, but is unnecessarily scalarized into registers %xmm{0,1,2,3} at the end of the kernel, and re-assembled in the beginning.

With this patch, the loop kernel looks like:

L192:
    vaddpd  (%rdx), %ymm1, %ymm1
Source line: 62
    addq    %rsi, %rdx
    addq    %rcx, %rdi
    jne L192

which is perfect.

tkelman · 2016-02-07T15:32:22Z

deps/Makefile

@@ -658,6 +658,12 @@ $$(LLVM_SRC_DIR)/$1.patch-applied: $(LLVM_SRC_DIR)/configure | $$(SRCDIR)/$1.pat
 	echo 1 > $$@
 LLVM_PATCH_LIST += $$(LLVM_SRC_DIR)/$1.patch-applied
 endef
+define LLVM_PATCH0
+$$(LLVM_SRC_DIR)/$1.patch-applied: $(LLVM_SRC_DIR)/configure | $$(SRCDIR)/$1.patch
+	cd $$(LLVM_SRC_DIR) && patch -p0 < $$(SRCDIR)/$1.patch


why not just adjust the patch content so it can be applied consistently with all the others instead of duplicating application code

I wanted to keep the original LLVM patch without modification.

I can, of course, modify it -- but then it might be easier to just append it to one of the existing patches instead of introducing a new one.

Modifying the existing patches would lead to the .patch-applied files indicating an incorrect state in everyone's existing build, so the new modifications would not get applied except in from-scratch builds of llvm.

@ArchRobison

Arch Robison proposed the patch <http://reviews.llvm.org/D14260> "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via `llvm_call` in Julia. For an example package, see e.g. <https://github.com/eschnett/SIMD.jl>. Some discussion relevant to this PR are in JuliaLang#5355. @ArchRobison, please comment. Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here <eschnett/SIMD.jl#1 (comment)>. Without this patch, the loop kernel looks like (x86-64, AVX2 instructions): ``` vunpcklpd %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0] vunpcklpd %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0] vinsertf128 $1, %xmm3, %ymm1, %ymm1 vmovupd 8(%rcx), %xmm2 vinsertf128 $1, 24(%rcx), %ymm2, %ymm2 vaddpd %ymm2, %ymm1, %ymm1 vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0] vextractf128 $1, %ymm1, %xmm3 vpermilpd $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0] Source line: 62 vaddsd (%rcx), %xmm0, %xmm0 ``` Note that the SIMD vector is kept in register `%ymm1`, but is unnecessarily scalarized into registers `%xmm{0,1,2,3}` at the end of the kernel, and re-assembled in the beginning. With this patch, the loop kernel looks like: ``` L192: vaddpd (%rdx), %ymm1, %ymm1 Source line: 62 addq %rsi, %rdx addq %rcx, %rdi jne L192 ``` which is perfect.

eschnett · 2016-02-08T03:22:33Z

Addressed @tkelman's comments, rebased.

Add LLVM patch D14260 to improve SIMD code

tkelman reviewed Feb 7, 2016
View reviewed changes

eschnett force-pushed the eschnett/simd-llvm-optimizations branch from 23e0855 to 49d1019 Compare February 8, 2016 03:21

tkelman added a commit that referenced this pull request Feb 8, 2016

Merge pull request #14976 from eschnett/eschnett/simd-llvm-optimizations

f459262

Add LLVM patch D14260 to improve SIMD code

tkelman merged commit f459262 into JuliaLang:master Feb 8, 2016

eschnett deleted the eschnett/simd-llvm-optimizations branch February 8, 2016 14:42

KristofferC mentioned this pull request Feb 9, 2016

Giant rewrite for Julia v0.5 JuliaDiff/ForwardDiff.jl#102

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add LLVM patch D14260 to improve SIMD code #14976

Add LLVM patch D14260 to improve SIMD code #14976

eschnett commented Feb 7, 2016

tkelman Feb 7, 2016

eschnett Feb 7, 2016

tkelman Feb 7, 2016

eschnett commented Feb 8, 2016

Add LLVM patch D14260 to improve SIMD code #14976

Add LLVM patch D14260 to improve SIMD code #14976

Conversation

eschnett commented Feb 7, 2016

tkelman Feb 7, 2016

Choose a reason for hiding this comment

eschnett Feb 7, 2016

Choose a reason for hiding this comment

tkelman Feb 7, 2016

Choose a reason for hiding this comment

eschnett commented Feb 8, 2016