Allow mod and clamp to accept Vec #28
Conversation
The assertion error is on the equality check. I'll fix it.
Codecov Report

@@            Coverage Diff             @@
##           master      #28      +/-   ##
==========================================
- Coverage   55.25%   54.43%   -0.82%
==========================================
  Files          28       28
  Lines        2662     2669       +7
==========================================
- Hits         1471     1453      -18
- Misses       1191     1216      +25

Continue to review full report at Codecov.
Looks great! Ready to merge?
Mind bumping the minor version so I can go ahead and make another release?
I've upgraded most of the tests to …

julia> check_within_limits(tovector(@inferred(f(m1, vi2))), f.(xi3, xi2))
ERROR: promotion of types VectorizationBase.VecUnroll{3, 4, Int64, MM{4, 1, Int64}} and VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}} failed to change any arguments
Stacktrace:
[1] error(::String, ::String, ::String)
@ Base ./error.jl:42
[2] sametype_error(input::Tuple{VectorizationBase.VecUnroll{3, 4, Int64, MM{4, 1, Int64}}, VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}}})
@ Base ./promotion.jl:330
[3] not_sametype(x::Tuple{VectorizationBase.VecUnroll{3, 4, Int64, MM{4, 1, Int64}}, VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}}}, y::Tuple{VectorizationBase.VecUnroll{3, 4, Int64, MM{4, 1, Int64}}, VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}}})
@ Base ./promotion.jl:324
[4] promote
@ ./promotion.jl:307 [inlined]
[5] fld(x::VectorizationBase.VecUnroll{3, 4, Int64, MM{4, 1, Int64}}, y::VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}})
@ VectorizationBase ~/.julia/dev/VectorizationBase/src/special/misc.jl:5
[6] top-level scope
@ REPL[75]:1
julia> promote_type(typeof(m1), typeof(vi2))
VectorizationBase.VecUnroll{3, 4, Int64, V} where V<:VectorizationBase.AbstractSIMDVector{4, Int64}
julia> promote(m1, vi2)
ERROR: promotion of types VectorizationBase.VecUnroll{3, 4, Int64, MM{4, 1, Int64}} and VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}} failed to change any arguments
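For readers hitting this error for the first time: Base's promote only succeeds when promote_rule picks a common concrete type and convert actually changes the arguments to it; if convert leaves the concrete types untouched, Base raises exactly this "failed to change any arguments" error. A minimal sketch with hypothetical stand-in types (not the actual VectorizationBase internals):

julia> struct A; x::Int; end   # hypothetical stand-in for one VecUnroll flavor

julia> struct B; x::Int; end   # ...and the other

julia> Base.promote_rule(::Type{A}, ::Type{B}) = A   # pick a common concrete type

julia> Base.convert(::Type{A}, b::B) = A(b.x)        # convert must actually turn a B into an A

julia> promote(A(1), B(2))   # with both pieces defined, promotion succeeds
(A(1), A(2))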
Thanks, fixed that on master. I made the promotion of …

Great, thanks. I still get a few errors from …

Fixed and tested on master. Note that …

Although, even on AVX512 …
Perhaps we should replace the implementation with Integer -> Float -> Integer for the sake of SIMD?

julia> using VectorizationBase, BenchmarkTools
julia> vi1 = VectorizationBase.VecUnroll((
Vec(ntuple(_ -> rand(Int), VectorizationBase.pick_vector_width_val(Int))...),
Vec(ntuple(_ -> rand(Int), VectorizationBase.pick_vector_width_val(Int))...),
Vec(ntuple(_ -> rand(Int), VectorizationBase.pick_vector_width_val(Int))...)
));
julia> vi2 = VectorizationBase.VecUnroll((
Vec(ntuple(_ -> rand(Int), VectorizationBase.pick_vector_width_val(Int))...),
Vec(ntuple(_ -> rand(Int), VectorizationBase.pick_vector_width_val(Int))...),
Vec(ntuple(_ -> rand(Int), VectorizationBase.pick_vector_width_val(Int))...)
));
julia> mydiv(a,b) = trunc(Int, float(a) / float(b))
mydiv (generic function with 1 method)
julia> mydiv(vi1,vi2)
3 x Vec{8, Int64}
Vec{8,Int64}<0, 0, -1, 0, -5, 0, -15, 0>
Vec{8,Int64}<0, -3, 1, -1, 0, 0, 2, 0>
Vec{8,Int64}<1, 8, -4, 6, 1, 0, 0, 0>
julia> vi1 ÷ vi2
3 x Vec{8, Int64}
Vec{8,Int64}<0, 0, -1, 0, -5, 0, -15, 0>
Vec{8,Int64}<0, -3, 1, -1, 0, 0, 2, 0>
Vec{8,Int64}<1, 8, -4, 6, 1, 0, 0, 0>
julia> @benchmark mydiv($(Ref(vi1))[], $(Ref(vi2))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 13.583 ns (0.00% GC)
median time: 14.864 ns (0.00% GC)
mean time: 14.635 ns (0.00% GC)
maximum time: 27.249 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 998
julia> @benchmark ($(Ref(vi1))[] ÷ $(Ref(vi2))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 185.452 ns (0.00% GC)
median time: 185.821 ns (0.00% GC)
mean time: 186.234 ns (0.00% GC)
maximum time: 216.690 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 671

EDIT:

julia> @benchmark mydiv($(Ref(vi1))[], $(Ref(vi2))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 11.359 ns (0.00% GC)
median time: 11.398 ns (0.00% GC)
mean time: 11.563 ns (0.00% GC)
maximum time: 21.709 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
julia> @benchmark ($(Ref(vi1))[] ÷ $(Ref(vi2))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 154.994 ns (0.00% GC)
median time: 155.194 ns (0.00% GC)
mean time: 155.353 ns (0.00% GC)
maximum time: 172.370 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 803

Assembly:

julia> @code_native debuginfo=:none syntax=:intel mydiv(vi1, vi2)
.text
vcvtqq2pd zmm0, zmmword ptr [rsi]
vcvtqq2pd zmm1, zmmword ptr [rsi + 64]
vcvtqq2pd zmm2, zmmword ptr [rsi + 128]
vcvtqq2pd zmm3, zmmword ptr [rdx]
vdivpd zmm0, zmm0, zmm3
vcvtqq2pd zmm3, zmmword ptr [rdx + 64]
mov rax, rdi
vdivpd zmm1, zmm1, zmm3
vcvtqq2pd zmm3, zmmword ptr [rdx + 128]
vdivpd zmm2, zmm2, zmm3
vcvttpd2qq zmm0, zmm0
vcvttpd2qq zmm1, zmm1
vcvttpd2qq zmm2, zmm2
vmovaps zmmword ptr [rdi], zmm0
vmovaps zmmword ptr [rdi + 64], zmm1
vmovaps zmmword ptr [rdi + 128], zmm2
vzeroupper
ret
nop word ptr [rax + rax]

Note that some of these instructions (the final vmovaps) are just repacking the results into the returned VecUnroll.
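A hedged sketch of how one might keep this fast path safe (checked_fastdiv is a hypothetical helper, not part of VectorizationBase): take the float route only when both operands sit well inside the range Float64 represents exactly, and fall back to ordinary integer division otherwise.

julia> function checked_fastdiv(a::Int64, b::Int64)
           # conservative guard: both operands exactly representable as Float64
           if b != 0 && -2^52 <= a <= 2^52 && -2^52 <= b <= 2^52
               return trunc(Int64, Float64(a) / Float64(b))   # matches a ÷ b in this range
           else
               return a ÷ b   # exact; also preserves DivideError on b == 0
           end
       end
checked_fastdiv (generic function with 1 method)

julia> checked_fastdiv(-7, 2) == -7 ÷ 2
true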
Note that …
Computer with AVX2 but no AVX512:

julia> @benchmark mydiv($(Ref(vi1))[], $(Ref(vi2))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 66.249 ns (0.00% GC)
median time: 66.337 ns (0.00% GC)
mean time: 66.851 ns (0.00% GC)
maximum time: 109.525 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 978
julia> @benchmark ($(Ref(vi1))[] ÷ $(Ref(vi2))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 193.593 ns (0.00% GC)
median time: 194.165 ns (0.00% GC)
mean time: 196.425 ns (0.00% GC)
maximum time: 684.750 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 631

The new assembly for mydiv:

.text
mov rax, rdi
vmovapd xmm0, xmmword ptr [rsi]
vmovdqa xmm1, xmmword ptr [rsi + 16]
vpextrq rcx, xmm1, 1
vcvtsi2sd xmm2, xmm2, rcx
vmovdqa xmm3, xmmword ptr [rsi + 32]
vmovq rcx, xmm1
vcvtsi2sd xmm1, xmm4, rcx
vpextrq rcx, xmm0, 1
vcvtsi2sd xmm4, xmm4, rcx
vmovdqa xmm6, xmmword ptr [rsi + 48]
vmovq rcx, xmm0
vcvtsi2sd xmm0, xmm5, rcx
vpextrq rcx, xmm3, 1
vcvtsi2sd xmm7, xmm5, rcx
vunpcklpd xmm1, xmm1, xmm2 # xmm1 = xmm1[0],xmm2[0]
vmovq rcx, xmm3
vcvtsi2sd xmm2, xmm5, rcx
vpextrq rcx, xmm6, 1
vcvtsi2sd xmm3, xmm5, rcx
vunpcklpd xmm5, xmm0, xmm4 # xmm5 = xmm0[0],xmm4[0]
vmovq rcx, xmm6
vcvtsi2sd xmm0, xmm8, rcx
vunpcklpd xmm11, xmm2, xmm7 # xmm11 = xmm2[0],xmm7[0]
vmovdqa xmm4, xmmword ptr [rsi + 64]
vpextrq rcx, xmm4, 1
vunpcklpd xmm10, xmm0, xmm3 # xmm10 = xmm0[0],xmm3[0]
vcvtsi2sd xmm0, xmm8, rcx
vmovq rcx, xmm4
vcvtsi2sd xmm3, xmm8, rcx
vunpcklpd xmm8, xmm3, xmm0 # xmm8 = xmm3[0],xmm0[0]
vmovdqa xmm3, xmmword ptr [rsi + 80]
vpextrq rcx, xmm3, 1
vcvtsi2sd xmm4, xmm9, rcx
vmovq rcx, xmm3
vcvtsi2sd xmm3, xmm9, rcx
vunpcklpd xmm9, xmm3, xmm4 # xmm9 = xmm3[0],xmm4[0]
vmovdqa xmm3, xmmword ptr [rdx]
vmovdqa xmm4, xmmword ptr [rdx + 16]
vmovapd xmm0, xmmword ptr [rdx + 32]
vmovdqa xmm6, xmmword ptr [rdx + 48]
vpextrq rcx, xmm4, 1
vcvtsi2sd xmm2, xmm12, rcx
vmovq rcx, xmm4
vcvtsi2sd xmm4, xmm12, rcx
vunpcklpd xmm2, xmm4, xmm2 # xmm2 = xmm4[0],xmm2[0]
vpextrq rcx, xmm3, 1
vcvtsi2sd xmm7, xmm12, rcx
vdivpd xmm4, xmm1, xmm2
vmovq rcx, xmm3
vcvtsi2sd xmm1, xmm12, rcx
vpextrq rcx, xmm0, 1
vcvtsi2sd xmm2, xmm12, rcx
vunpcklpd xmm1, xmm1, xmm7 # xmm1 = xmm1[0],xmm7[0]
vmovq rcx, xmm0
vcvtsi2sd xmm0, xmm12, rcx
vdivpd xmm1, xmm5, xmm1
vunpcklpd xmm0, xmm0, xmm2 # xmm0 = xmm0[0],xmm2[0]
vpextrq rcx, xmm6, 1
vdivpd xmm3, xmm11, xmm0
vcvtsi2sd xmm0, xmm12, rcx
vmovq rcx, xmm6
vcvtsi2sd xmm2, xmm12, rcx
vunpcklpd xmm0, xmm2, xmm0 # xmm0 = xmm2[0],xmm0[0]
vmovdqa xmm2, xmmword ptr [rdx + 64]
vpextrq rcx, xmm2, 1
vcvtsi2sd xmm5, xmm12, rcx
vdivpd xmm6, xmm10, xmm0
vmovq rcx, xmm2
vcvtsi2sd xmm0, xmm12, rcx
vunpcklpd xmm0, xmm0, xmm5 # xmm0 = xmm0[0],xmm5[0]
vmovdqa xmm2, xmmword ptr [rdx + 80]
vpextrq rcx, xmm2, 1
vdivpd xmm0, xmm8, xmm0
vcvtsi2sd xmm5, xmm12, rcx
vmovq rcx, xmm2
vcvtsi2sd xmm2, xmm12, rcx
vunpcklpd xmm2, xmm2, xmm5 # xmm2 = xmm2[0],xmm5[0]
vdivpd xmm2, xmm9, xmm2
vcvttsd2si rcx, xmm1
vmovq xmm5, rcx
vpermilpd xmm1, xmm1, 1 # xmm1 = xmm1[1,0]
vcvttsd2si rcx, xmm1
vmovq xmm1, rcx
vpunpcklqdq xmm1, xmm5, xmm1 # xmm1 = xmm5[0],xmm1[0]
vcvttsd2si rcx, xmm4
vmovq xmm5, rcx
vpermilpd xmm4, xmm4, 1 # xmm4 = xmm4[1,0]
vcvttsd2si rcx, xmm4
vmovq xmm4, rcx
vpunpcklqdq xmm4, xmm5, xmm4 # xmm4 = xmm5[0],xmm4[0]
vcvttsd2si rcx, xmm6
vmovq xmm5, rcx
vpermilpd xmm6, xmm6, 1 # xmm6 = xmm6[1,0]
vcvttsd2si rcx, xmm6
vmovq xmm6, rcx
vpunpcklqdq xmm5, xmm5, xmm6 # xmm5 = xmm5[0],xmm6[0]
vcvttsd2si rcx, xmm3
vmovq xmm6, rcx
vpermilpd xmm3, xmm3, 1 # xmm3 = xmm3[1,0]
vcvttsd2si rcx, xmm3
vmovq xmm3, rcx
vpunpcklqdq xmm3, xmm6, xmm3 # xmm3 = xmm6[0],xmm3[0]
vcvttsd2si rcx, xmm2
vmovq xmm6, rcx
vpermilpd xmm2, xmm2, 1 # xmm2 = xmm2[1,0]
vcvttsd2si rcx, xmm2
vmovq xmm2, rcx
vpunpcklqdq xmm2, xmm6, xmm2 # xmm2 = xmm6[0],xmm2[0]
vcvttsd2si rcx, xmm0
vmovq xmm6, rcx
vpermilpd xmm0, xmm0, 1 # xmm0 = xmm0[1,0]
vcvttsd2si rcx, xmm0
vmovq xmm0, rcx
vpunpcklqdq xmm0, xmm6, xmm0 # xmm0 = xmm6[0],xmm0[0]
vmovdqa xmmword ptr [rdi + 16], xmm4
vmovdqa xmmword ptr [rdi], xmm1
vmovdqa xmmword ptr [rdi + 32], xmm3
vmovdqa xmmword ptr [rdi + 48], xmm5
vmovdqa xmmword ptr [rdi + 64], xmm0
vmovdqa xmmword ptr [rdi + 80], xmm2
ret
nop word ptr cs:[rax + rax]

But if we use Int32 instead:

julia> using VectorizationBase, BenchmarkTools; using VectorizationBase: AbstractSIMD
julia> mydiv(a::AbstractSIMD{W,I}, b::AbstractSIMD{W,I}) where {W,I<:Integer} = trunc(Int32, float(a) / float(b))
mydiv (generic function with 2 methods)
julia> vi1 = VectorizationBase.VecUnroll((
Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...),
Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...),
Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...)
));
julia> vi2 = VectorizationBase.VecUnroll((
Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...),
Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...),
Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...)
));
julia> mydiv(vi1,vi2)
3 x Vec{4, Int32}
Vec{4,Int32}<0, -14, 0, 0>
Vec{4,Int32}<0, 0, -1, -1>
Vec{4,Int32}<3, 0, -3, 0>
julia> vi1 ÷ vi2
3 x Vec{4, Int32}
Vec{4,Int32}<0, -14, 0, 0>
Vec{4,Int32}<0, 0, -1, -1>
Vec{4,Int32}<3, 0, -3, 0>
julia> @benchmark mydiv($(Ref(vi1))[], $(Ref(vi2))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 49.587 ns (0.00% GC)
median time: 49.595 ns (0.00% GC)
mean time: 50.290 ns (0.00% GC)
maximum time: 98.261 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 987
julia> @benchmark ($(Ref(vi1))[] ÷ $(Ref(vi2))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 69.621 ns (0.00% GC)
median time: 69.638 ns (0.00% GC)
mean time: 71.028 ns (0.00% GC)
maximum time: 133.334 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 975

And we have the clean SIMD asm:

.text
vcvtdq2pd ymm0, xmmword ptr [rsi]
vcvtdq2pd ymm1, xmmword ptr [rsi + 16]
vcvtdq2pd ymm2, xmmword ptr [rsi + 32]
vcvtdq2pd ymm3, xmmword ptr [rdx]
vdivpd ymm0, ymm0, ymm3
vcvtdq2pd ymm3, xmmword ptr [rdx + 16]
vdivpd ymm1, ymm1, ymm3
vcvtdq2pd ymm3, xmmword ptr [rdx + 32]
mov rax, rdi
vdivpd ymm2, ymm2, ymm3
vcvttpd2dq xmm0, ymm0
vcvttpd2dq xmm1, ymm1
vcvttpd2dq xmm2, ymm2
vmovapd xmmword ptr [rdi], xmm0
vmovapd xmmword ptr [rdi + 16], xmm1
vmovapd xmmword ptr [rdi + 32], xmm2
vzeroupper
ret
nop dword ptr [rax]
If you're willing to limit the effective range to +/- 2^31 on non-AVX512DQ platforms and +/- 2^53 given AVX512DQ (…)
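For context, a quick REPL check (mine, not from the thread) of where those two bounds come from: every Int32 fits exactly in a Float64, so the float path cannot drift within +/- 2^31, while an Int64 past 2^53 is already rounded before the division even happens.

julia> Float64(typemax(Int32)) == typemax(Int32)   # every Int32 is exact as a Float64
true

julia> a = Int64(2)^60 + 1;   # needs 61 significant bits; Float64 carries 53

julia> a ÷ 1
1152921504606846977

julia> trunc(Int, Float64(a) / Float64(1))   # a was rounded to 2^60 before dividing
1152921504606846976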
@mcabbott All tests pass.
Awesome, thanks! |
Great, all pass locally too, even the ones I had commented out. I'm surprised by these:

julia> mydiv(vi1,vi2)
3 x Vec{4, Int64}
Vec{4,Int64}<1, -1, 1, 6>
Vec{4,Int64}<1, 3, -8, -1>
Vec{4,Int64}<0, 0, 0, 0>
julia> vi1 ÷ vi2
3 x Vec{4, Int64}
Vec{4,Int64}<1, -1, 1, 6>
Vec{4,Int64}<1, 3, -8, -1>
Vec{4,Int64}<0, 0, 0, 0>
julia> @benchmark mydiv($(Ref(vi1))[], $(Ref(vi2))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 21.866 ns (0.00% GC)
median time: 22.157 ns (0.00% GC)
mean time: 22.309 ns (0.00% GC)
maximum time: 144.809 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 997
julia> @benchmark ($(Ref(vi1))[] ÷ $(Ref(vi2))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 94.274 ns (0.00% GC)
median time: 94.572 ns (0.00% GC)
mean time: 97.789 ns (0.00% GC)
maximum time: 826.215 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 951
julia> versioninfo()
Julia Version 1.7.0-DEV.149
Commit 5391030328* (2020-12-22 01:27 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.0 (ORCJIT, skylake)

and for scalars:

julia> @btime trunc(Int, a[] / b[]) setup=(a=Ref(rand(Int)); b=Ref(rand(Int)););
5.592 ns (0 allocations: 0 bytes)
julia> @btime div(a[], b[]) setup=(a=Ref(rand(Int)); b=Ref(rand(Int)););
9.473 ns (0 allocations: 0 bytes)

It seems that for many uses of …
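One caveat worth flagging about that scalar comparison (my note, not from the thread): with full-range rand(Int) operands an Int64 no longer fits exactly in a Float64, so trunc(Int, a / b) is not guaranteed to match div(a, b).

julia> a = typemax(Int) - 1; b = 3;

julia> trunc(Int, a / b) == div(a, b)   # a gets rounded to 2^63 before the division
false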
Wait, how do you get 2^31 here?
Okay, I'm surprised by the good performance you got with …

By using … If someone with AVX2 has a …
Do you think it'd be worth switching to these by default? That'd probably count as a breaking change.
Oh of course, 31 is pretty close to 32 :) but not to Float32. But curious that my non-AVX512 laptop does well on the Float64 division path then? I guess it is spending most of its time converting -- here is pure Float64 division:

…

And for Int32:

julia> mydiv(vi1,vi2)
3 x Vec{4, Int32}
Vec{4,Int32}<0, 0, -2, 1>
Vec{4,Int32}<1, -7, -1, 17>
Vec{4,Int32}<-2, 0, 4, 9>
julia> vi1 ÷ vi2
3 x Vec{4, Int32}
Vec{4,Int32}<0, 0, -2, 1>
Vec{4,Int32}<1, -7, -1, 17>
Vec{4,Int32}<-2, 0, 4, 9>
julia> @benchmark mydiv($(Ref(vi1))[], $(Ref(vi2))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 8.937 ns (0.00% GC)
median time: 8.955 ns (0.00% GC)
mean time: 9.028 ns (0.00% GC)
maximum time: 98.598 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
julia> @benchmark ($(Ref(vi1))[] ÷ $(Ref(vi2))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 32.898 ns (0.00% GC)
median time: 32.903 ns (0.00% GC)
mean time: 33.148 ns (0.00% GC)
maximum time: 138.323 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 994
I'd worry a bit that there might be e.g. cryptography users who care about large integers & overflow. But it would be nice if there was an easy way to opt in. Here and in Base really... maybe …
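One shape such an opt-in could take, sketched purely as an illustration (fastdiv is a hypothetical name, not an API of this package or Base): callers who know their operands fit in Int32, where the float path is always exact, ask for it explicitly, and ÷ keeps its current semantics for everyone else.

julia> fastdiv(a::Int32, b::Int32) = trunc(Int32, Float64(a) / Float64(b))   # explicit opt-in
fastdiv (generic function with 1 method)

julia> fastdiv(Int32(-7), Int32(2))
-3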
Mind sharing your assembly?
It should also be … You can also see it in my assembly from the benchmark:

.text
vcvtdq2pd ymm0, xmmword ptr [rsi]
vcvtdq2pd ymm1, xmmword ptr [rsi + 16]
vcvtdq2pd ymm2, xmmword ptr [rsi + 32]
vcvtdq2pd ymm3, xmmword ptr [rdx]
vdivpd ymm0, ymm0, ymm3
vcvtdq2pd ymm3, xmmword ptr [rdx + 16]
vdivpd ymm1, ymm1, ymm3
vcvtdq2pd ymm3, xmmword ptr [rdx + 32]
mov rax, rdi
vdivpd ymm2, ymm2, ymm3
vcvttpd2dq xmm0, ymm0
vcvttpd2dq xmm1, ymm1
vcvttpd2dq xmm2, ymm2
vmovapd xmmword ptr [rdi], xmm0
vmovapd xmmword ptr [rdi + 16], xmm1
vmovapd xmmword ptr [rdi + 32], xmm2
vzeroupper
ret
nop dword ptr [rax]

It loaded memory 16 bytes (128 bits) apart, but placed the results into full 256-bit registers. So that's where the 2^31 came from: truncating to Int32.
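Spelling out that width bookkeeping (my addition): four Int32 lanes occupy a 128-bit xmm load and widen to four Float64 lanes that fill a 256-bit ymm register, which is why the loads/stores are 16 bytes apart while the arithmetic uses the full ymm width.

julia> 4 * sizeof(Int32) * 8    # bits per xmm load (4 Int32 lanes)
128

julia> 4 * sizeof(Float64) * 8  # bits after widening to Float64 (one ymm register)
256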
Yeah, I've been thinking I should probably actually use …
I just dropped all flags, didn't add the fastmath bit yet!
With the Int64 vi1, as timed above: …

Sorry, this was a typo on my part, Int32. And …
As discussed here: JuliaSIMD/LoopVectorization.jl#174 (comment)

Doesn't yet test on VecUnroll, because if I try to include fld there it errors, which I think is caused by this Int32: …