-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add LLVM patch D14260 to improve SIMD code #14976
Add LLVM patch D14260 to improve SIMD code #14976
Conversation
@@ -658,6 +658,12 @@ $$(LLVM_SRC_DIR)/$1.patch-applied: $(LLVM_SRC_DIR)/configure | $$(SRCDIR)/$1.pat | |||
echo 1 > $$@ | |||
LLVM_PATCH_LIST += $$(LLVM_SRC_DIR)/$1.patch-applied | |||
endef | |||
define LLVM_PATCH0 | |||
$$(LLVM_SRC_DIR)/$1.patch-applied: $(LLVM_SRC_DIR)/configure | $$(SRCDIR)/$1.patch | |||
cd $$(LLVM_SRC_DIR) && patch -p0 < $$(SRCDIR)/$1.patch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just adjust the patch content so it can be applied consistently with all the others instead of duplicating application code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to keep the original LLVM patch without modification.
I can, of course, modify it -- but then it might be easier to just append it to one of the existing patches instead of introducing a new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modifying the existing patches would lead to the .patch-applied
files indicating an incorrect state in everyone's existing build, so the new modifications would not get applied except in from-scratch builds of llvm.
Arch Robison proposed the patch <http://reviews.llvm.org/D14260> "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via `llvm_call` in Julia. For an example package, see e.g. <https://github.com/eschnett/SIMD.jl>. Some discussion relevant to this PR are in JuliaLang#5355. @ArchRobison, please comment. Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here <eschnett/SIMD.jl#1 (comment)>. Without this patch, the loop kernel looks like (x86-64, AVX2 instructions): ``` vunpcklpd %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0] vunpcklpd %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0] vinsertf128 $1, %xmm3, %ymm1, %ymm1 vmovupd 8(%rcx), %xmm2 vinsertf128 $1, 24(%rcx), %ymm2, %ymm2 vaddpd %ymm2, %ymm1, %ymm1 vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0] vextractf128 $1, %ymm1, %xmm3 vpermilpd $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0] Source line: 62 vaddsd (%rcx), %xmm0, %xmm0 ``` Note that the SIMD vector is kept in register `%ymm1`, but is unnecessarily scalarized into registers `%xmm{0,1,2,3}` at the end of the kernel, and re-assembled in the beginning. With this patch, the loop kernel looks like: ``` L192: vaddpd (%rdx), %ymm1, %ymm1 Source line: 62 addq %rsi, %rdx addq %rcx, %rdi jne L192 ``` which is perfect.
23e0855
to
49d1019
Compare
Addressed @tkelman's comments, rebased. |
Add LLVM patch D14260 to improve SIMD code
Arch Robison proposed the patch http://reviews.llvm.org/D14260 "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via
llvm_call
in Julia. For an example package, see e.g. https://github.com/eschnett/SIMD.jl.Some discussion relevant to this PR are in #5355. @ArchRobison, please comment.
Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here eschnett/SIMD.jl#1 (comment).
Without this patch, the loop kernel looks like (x86-64, AVX2 instructions):
Note that the SIMD vector is kept in register
%ymm1
, but is unnecessarily scalarized into registers%xmm{0,1,2,3}
at the end of the kernel, and re-assembled in the beginning.With this patch, the loop kernel looks like:
which is perfect.