Allow mod and clamp to accept Vec #28

Merged: 13 commits from mcabbott:modclamp into JuliaSIMD:master on Dec 24, 2020

Conversation

@mcabbott commented Dec 22, 2020

As discussed here JuliaSIMD/LoopVectorization.jl#174 (comment)

Doesn't yet test on VecUnroll, because if I try to include fld there,

julia> f = fld
fld (generic function with 11 methods)

julia> check_within_limits(tovector(@inferred(f(m1, i))), f.(xi3, i))
ERROR: AssertionError: sizeof(T1) == sizeof(T2)

which I think is caused by this Int32:

julia> x, y = promote(m1.data[2], 3)
(Vec{4,Int32}<1, 2, 3, 4>, Vec{4,Int64}<3, 3, 3, 3>)

julia> d = div(x, y)
Vec{4,Int64}<0, 0, 1, 1>

julia> d - (signbit(x ⊻ y) & (d * y != x))
ERROR: AssertionError: sizeof(T1) == sizeof(T2)

julia> convert(Vec{4, Int64}, m1.data[2])
Vec{4,Int32}<1, 2, 3, 4>
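
(For reference, the failing expression is the standard fld-from-div correction; here is a scalar sketch of the same identity, with myfld as an illustrative name, matching how Base defines integer fld:)

julia> myfld(x::T, y::T) where {T<:Integer} = (d = div(x, y); d - (signbit(x ⊻ y) & (d * y != x)))
myfld (generic function with 1 method)

julia> myfld(-7, 2), fld(-7, 2)
(-4, -4)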

@chriselrod (Member) left a comment

The assertion error is on the equality check. I'll fix it.

@codecov-io commented Dec 22, 2020

Codecov Report

Merging #28 (e106582) into master (87989d2) will decrease coverage by 0.81%.
The diff coverage is 90.90%.

@@            Coverage Diff             @@
##           master      #28      +/-   ##
==========================================
- Coverage   55.25%   54.43%   -0.82%     
==========================================
  Files          28       28              
  Lines        2662     2669       +7     
==========================================
- Hits         1471     1453      -18     
- Misses       1191     1216      +25     
Impacted Files Coverage Δ
src/special/misc.jl 84.61% <90.90%> (+51.28%) ⬆️
src/llvm_intrin/binary_ops.jl 33.33% <0.00%> (-20.00%) ⬇️
src/llvm_intrin/conversion.jl 63.33% <0.00%> (-6.67%) ⬇️
src/llvm_intrin/intrin_funcs.jl 60.28% <0.00%> (-4.31%) ⬇️
src/ranges.jl 38.95% <0.00%> (-0.59%) ⬇️
src/llvm_intrin/masks.jl 66.17% <0.00%> (-0.37%) ⬇️
src/fmap.jl 64.86% <0.00%> (+0.90%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 87989d2...e106582. Read the comment docs.

@chriselrod (Member) left a comment

Looks great! Ready to merge?

Mind bumping the minor version so I can go ahead and make another release?

@mcabbott (Author) commented Dec 22, 2020

I've upgraded most of the tests to VecUnroll now, but some still fail, not sure why yet:

julia> check_within_limits(tovector(@inferred(f(m1, vi2))), f.(xi3, xi2))
ERROR: promotion of types VectorizationBase.VecUnroll{3, 4, Int64, MM{4, 1, Int64}} and VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}} failed to change any arguments
Stacktrace:
 [1] error(::String, ::String, ::String)
   @ Base ./error.jl:42
 [2] sametype_error(input::Tuple{VectorizationBase.VecUnroll{3, 4, Int64, MM{4, 1, Int64}}, VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}}})
   @ Base ./promotion.jl:330
 [3] not_sametype(x::Tuple{VectorizationBase.VecUnroll{3, 4, Int64, MM{4, 1, Int64}}, VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}}}, y::Tuple{VectorizationBase.VecUnroll{3, 4, Int64, MM{4, 1, Int64}}, VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}}})
   @ Base ./promotion.jl:324
 [4] promote
   @ ./promotion.jl:307 [inlined]
 [5] fld(x::VectorizationBase.VecUnroll{3, 4, Int64, MM{4, 1, Int64}}, y::VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}})
   @ VectorizationBase ~/.julia/dev/VectorizationBase/src/special/misc.jl:5
 [6] top-level scope
   @ REPL[75]:1

julia> promote_type(typeof(m1), typeof(vi2))
VectorizationBase.VecUnroll{3, 4, Int64, V} where V<:VectorizationBase.AbstractSIMDVector{4, Int64}

julia> promote(m1, vi2)
ERROR: promotion of types VectorizationBase.VecUnroll{3, 4, Int64, MM{4, 1, Int64}} and VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}} failed to change any arguments

@chriselrod (Member)

Thanks, fixed that on master. I made the promotion of VecUnrolls rely on the rules for the underlying types, which should make it more robust.
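
(A rough sketch of what "rely on the rules for the underlying types" could look like; this is my illustration of the approach, not the actual commit:)

julia> using VectorizationBase: VecUnroll, AbstractSIMDVector

julia> Base.promote_rule(
           ::Type{VecUnroll{N,W,T1,V1}}, ::Type{VecUnroll{N,W,T2,V2}},
       ) where {N,W,T1,T2,V1<:AbstractSIMDVector{W,T1},V2<:AbstractSIMDVector{W,T2}} =
           VecUnroll{N,W,promote_type(T1,T2),promote_type(V1,V2)};  # defer to the wrapped types' promote rules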

@mcabbott (Author) commented Dec 23, 2020

Great, thanks. I still get a few errors from mod tests, not sure why:

julia> x, y = m1, m2
(4 x Vec{4, Int64}
Vec{4,Int64}<7, 8, 9, 10>
Vec{4,Int64}<1, 2, 3, 4>
Vec{4,Int64}<13, 14, 15, 16>
Vec{4,Int64}<32, 33, 34, 35>, 4 x Vec{4, Int64}
Vec{4,Int64}<3, 5, 7, 9>
Vec{4,Int64}<8, 10, 12, 14>
Vec{4,Int64}<39, 41, 43, 45>
Vec{4,Int64}<17, 19, 21, 23>)

julia> IfElse.ifelse(y == -1, zero(x), x - fld(x, y) * y)
ERROR: TypeError: non-boolean (VectorizationBase.VecUnroll{3, 4, VectorizationBase.Bit, Mask{4, UInt8}}) used in boolean context
Stacktrace:
 [1] ifelse(::VectorizationBase.VecUnroll{3, 4, VectorizationBase.Bit, Mask{4, UInt8}}, ::VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}}, ::VectorizationBase.VecUnroll{3, 4, Int32, Vec{4, Int32}})
   @ IfElse ~/.julia/packages/IfElse/u1l5W/src/IfElse.jl:3
 [2] top-level scope
   @ REPL[65]:1
   
julia> typeof.((y == -1, zero(x), x - fld(x, y) * y))
(VectorizationBase.VecUnroll{3, 4, VectorizationBase.Bit, Mask{4, UInt8}}, VectorizationBase.VecUnroll{3, 4, Int64, Vec{4, Int64}}, VectorizationBase.VecUnroll{3, 4, Int32, Vec{4, Int32}})
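
(For context, this ifelse expression is the vector analogue of the scalar mod-from-fld identity below; mymod is just an illustrative name, and the scalar ifelse evaluates both branches eagerly, so this sketch would still throw for mymod(typemin(Int), -1):)

julia> mymod(x, y) = ifelse(y == -1, zero(x), x - fld(x, y) * y);

julia> mymod(7, 3), mymod(-7, 3), mymod(7, -3)
(1, 2, -2)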

@chriselrod (Member)

Fixed and tested on master.

Note that x - fld(x, y) * y seems to be promoting (demoting?) the element type to Int32. Some operations try to do that on some arches, because Int64 is a lot slower/doesn't SIMD for some operation/arch combinations.
That could honestly be done a lot better than it is at the moment.
But something to be aware of.

@chriselrod (Member)

Although, even on AVX512, div doesn't actually SIMD for Int32 or Int64.

@chriselrod (Member) commented Dec 23, 2020

Perhaps we should replace the implementation with Integer -> Float -> Integer for the sake of SIMD?

julia> using VectorizationBase, BenchmarkTools

julia> vi1 = VectorizationBase.VecUnroll((
           Vec(ntuple(_ -> rand(Int), VectorizationBase.pick_vector_width_val(Int))...),
           Vec(ntuple(_ -> rand(Int), VectorizationBase.pick_vector_width_val(Int))...),
           Vec(ntuple(_ -> rand(Int), VectorizationBase.pick_vector_width_val(Int))...)
       ));

julia> vi2 = VectorizationBase.VecUnroll((
           Vec(ntuple(_ -> rand(Int), VectorizationBase.pick_vector_width_val(Int))...),
           Vec(ntuple(_ -> rand(Int), VectorizationBase.pick_vector_width_val(Int))...),
           Vec(ntuple(_ -> rand(Int), VectorizationBase.pick_vector_width_val(Int))...)
       ));

julia> mydiv(a,b) = trunc(Int, float(a) / float(b))
mydiv (generic function with 1 method)

julia> mydiv(vi1,vi2)
3 x Vec{8, Int64}
Vec{8,Int64}<0, 0, -1, 0, -5, 0, -15, 0>
Vec{8,Int64}<0, -3, 1, -1, 0, 0, 2, 0>
Vec{8,Int64}<1, 8, -4, 6, 1, 0, 0, 0>

julia> vi1 ÷ vi2
3 x Vec{8, Int64}
Vec{8,Int64}<0, 0, -1, 0, -5, 0, -15, 0>
Vec{8,Int64}<0, -3, 1, -1, 0, 0, 2, 0>
Vec{8,Int64}<1, 8, -4, 6, 1, 0, 0, 0>

julia> @benchmark mydiv($(Ref(vi1))[], $(Ref(vi2))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     13.583 ns (0.00% GC)
  median time:      14.864 ns (0.00% GC)
  mean time:        14.635 ns (0.00% GC)
  maximum time:     27.249 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark ($(Ref(vi1))[] ÷ $(Ref(vi2))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     185.452 ns (0.00% GC)
  median time:      185.821 ns (0.00% GC)
  mean time:        186.234 ns (0.00% GC)
  maximum time:     216.690 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     671

EDIT:
second computer:

julia> @benchmark mydiv($(Ref(vi1))[], $(Ref(vi2))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.359 ns (0.00% GC)
  median time:      11.398 ns (0.00% GC)
  mean time:        11.563 ns (0.00% GC)
  maximum time:     21.709 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark ($(Ref(vi1))[] ÷ $(Ref(vi2))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     154.994 ns (0.00% GC)
  median time:      155.194 ns (0.00% GC)
  mean time:        155.353 ns (0.00% GC)
  maximum time:     172.370 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     803

Assembly

julia> @code_native debuginfo=:none syntax=:intel mydiv(vi1, vi2)
        .text
        vcvtqq2pd       zmm0, zmmword ptr [rsi]
        vcvtqq2pd       zmm1, zmmword ptr [rsi + 64]
        vcvtqq2pd       zmm2, zmmword ptr [rsi + 128]
        vcvtqq2pd       zmm3, zmmword ptr [rdx]
        vdivpd  zmm0, zmm0, zmm3
        vcvtqq2pd       zmm3, zmmword ptr [rdx + 64]
        mov     rax, rdi
        vdivpd  zmm1, zmm1, zmm3
        vcvtqq2pd       zmm3, zmmword ptr [rdx + 128]
        vdivpd  zmm2, zmm2, zmm3
        vcvttpd2qq      zmm0, zmm0
        vcvttpd2qq      zmm1, zmm1
        vcvttpd2qq      zmm2, zmm2
        vmovaps zmmword ptr [rdi], zmm0
        vmovaps zmmword ptr [rdi + 64], zmm1
        vmovaps zmmword ptr [rdi + 128], zmm2
        vzeroupper
        ret
        nop     word ptr [rax + rax]

Note that some of these instructions (the final vmovaps stores) are just repacking the VecUnroll and would be eliminated in actual use whenever this is inlined.

@chriselrod (Member)

Note that Int64 <--> Float64 conversion is slow on AVX2, but Int32 <--> Float64 is fast, if that has any bearing on your code/implementations.

@chriselrod (Member)

Computer with AVX2 but no AVX512:

julia> @benchmark mydiv($(Ref(vi1))[], $(Ref(vi2))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     66.249 ns (0.00% GC)
  median time:      66.337 ns (0.00% GC)
  mean time:        66.851 ns (0.00% GC)
  maximum time:     109.525 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     978

julia> @benchmark ($(Ref(vi1))[] ÷ $(Ref(vi2))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     193.593 ns (0.00% GC)
  median time:      194.165 ns (0.00% GC)
  mean time:        196.425 ns (0.00% GC)
  maximum time:     684.750 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     631

The new assembly for mydiv is:

	.text
	mov	rax, rdi
	vmovapd	xmm0, xmmword ptr [rsi]
	vmovdqa	xmm1, xmmword ptr [rsi + 16]
	vpextrq	rcx, xmm1, 1
	vcvtsi2sd	xmm2, xmm2, rcx
	vmovdqa	xmm3, xmmword ptr [rsi + 32]
	vmovq	rcx, xmm1
	vcvtsi2sd	xmm1, xmm4, rcx
	vpextrq	rcx, xmm0, 1
	vcvtsi2sd	xmm4, xmm4, rcx
	vmovdqa	xmm6, xmmword ptr [rsi + 48]
	vmovq	rcx, xmm0
	vcvtsi2sd	xmm0, xmm5, rcx
	vpextrq	rcx, xmm3, 1
	vcvtsi2sd	xmm7, xmm5, rcx
	vunpcklpd	xmm1, xmm1, xmm2        # xmm1 = xmm1[0],xmm2[0]
	vmovq	rcx, xmm3
	vcvtsi2sd	xmm2, xmm5, rcx
	vpextrq	rcx, xmm6, 1
	vcvtsi2sd	xmm3, xmm5, rcx
	vunpcklpd	xmm5, xmm0, xmm4        # xmm5 = xmm0[0],xmm4[0]
	vmovq	rcx, xmm6
	vcvtsi2sd	xmm0, xmm8, rcx
	vunpcklpd	xmm11, xmm2, xmm7       # xmm11 = xmm2[0],xmm7[0]
	vmovdqa	xmm4, xmmword ptr [rsi + 64]
	vpextrq	rcx, xmm4, 1
	vunpcklpd	xmm10, xmm0, xmm3       # xmm10 = xmm0[0],xmm3[0]
	vcvtsi2sd	xmm0, xmm8, rcx
	vmovq	rcx, xmm4
	vcvtsi2sd	xmm3, xmm8, rcx
	vunpcklpd	xmm8, xmm3, xmm0        # xmm8 = xmm3[0],xmm0[0]
	vmovdqa	xmm3, xmmword ptr [rsi + 80]
	vpextrq	rcx, xmm3, 1
	vcvtsi2sd	xmm4, xmm9, rcx
	vmovq	rcx, xmm3
	vcvtsi2sd	xmm3, xmm9, rcx
	vunpcklpd	xmm9, xmm3, xmm4        # xmm9 = xmm3[0],xmm4[0]
	vmovdqa	xmm3, xmmword ptr [rdx]
	vmovdqa	xmm4, xmmword ptr [rdx + 16]
	vmovapd	xmm0, xmmword ptr [rdx + 32]
	vmovdqa	xmm6, xmmword ptr [rdx + 48]
	vpextrq	rcx, xmm4, 1
	vcvtsi2sd	xmm2, xmm12, rcx
	vmovq	rcx, xmm4
	vcvtsi2sd	xmm4, xmm12, rcx
	vunpcklpd	xmm2, xmm4, xmm2        # xmm2 = xmm4[0],xmm2[0]
	vpextrq	rcx, xmm3, 1
	vcvtsi2sd	xmm7, xmm12, rcx
	vdivpd	xmm4, xmm1, xmm2
	vmovq	rcx, xmm3
	vcvtsi2sd	xmm1, xmm12, rcx
	vpextrq	rcx, xmm0, 1
	vcvtsi2sd	xmm2, xmm12, rcx
	vunpcklpd	xmm1, xmm1, xmm7        # xmm1 = xmm1[0],xmm7[0]
	vmovq	rcx, xmm0
	vcvtsi2sd	xmm0, xmm12, rcx
	vdivpd	xmm1, xmm5, xmm1
	vunpcklpd	xmm0, xmm0, xmm2        # xmm0 = xmm0[0],xmm2[0]
	vpextrq	rcx, xmm6, 1
	vdivpd	xmm3, xmm11, xmm0
	vcvtsi2sd	xmm0, xmm12, rcx
	vmovq	rcx, xmm6
	vcvtsi2sd	xmm2, xmm12, rcx
	vunpcklpd	xmm0, xmm2, xmm0        # xmm0 = xmm2[0],xmm0[0]
	vmovdqa	xmm2, xmmword ptr [rdx + 64]
	vpextrq	rcx, xmm2, 1
	vcvtsi2sd	xmm5, xmm12, rcx
	vdivpd	xmm6, xmm10, xmm0
	vmovq	rcx, xmm2
	vcvtsi2sd	xmm0, xmm12, rcx
	vunpcklpd	xmm0, xmm0, xmm5        # xmm0 = xmm0[0],xmm5[0]
	vmovdqa	xmm2, xmmword ptr [rdx + 80]
	vpextrq	rcx, xmm2, 1
	vdivpd	xmm0, xmm8, xmm0
	vcvtsi2sd	xmm5, xmm12, rcx
	vmovq	rcx, xmm2
	vcvtsi2sd	xmm2, xmm12, rcx
	vunpcklpd	xmm2, xmm2, xmm5        # xmm2 = xmm2[0],xmm5[0]
	vdivpd	xmm2, xmm9, xmm2
	vcvttsd2si	rcx, xmm1
	vmovq	xmm5, rcx
	vpermilpd	xmm1, xmm1, 1           # xmm1 = xmm1[1,0]
	vcvttsd2si	rcx, xmm1
	vmovq	xmm1, rcx
	vpunpcklqdq	xmm1, xmm5, xmm1        # xmm1 = xmm5[0],xmm1[0]
	vcvttsd2si	rcx, xmm4
	vmovq	xmm5, rcx
	vpermilpd	xmm4, xmm4, 1           # xmm4 = xmm4[1,0]
	vcvttsd2si	rcx, xmm4
	vmovq	xmm4, rcx
	vpunpcklqdq	xmm4, xmm5, xmm4        # xmm4 = xmm5[0],xmm4[0]
	vcvttsd2si	rcx, xmm6
	vmovq	xmm5, rcx
	vpermilpd	xmm6, xmm6, 1           # xmm6 = xmm6[1,0]
	vcvttsd2si	rcx, xmm6
	vmovq	xmm6, rcx
	vpunpcklqdq	xmm5, xmm5, xmm6        # xmm5 = xmm5[0],xmm6[0]
	vcvttsd2si	rcx, xmm3
	vmovq	xmm6, rcx
	vpermilpd	xmm3, xmm3, 1           # xmm3 = xmm3[1,0]
	vcvttsd2si	rcx, xmm3
	vmovq	xmm3, rcx
	vpunpcklqdq	xmm3, xmm6, xmm3        # xmm3 = xmm6[0],xmm3[0]
	vcvttsd2si	rcx, xmm2
	vmovq	xmm6, rcx
	vpermilpd	xmm2, xmm2, 1           # xmm2 = xmm2[1,0]
	vcvttsd2si	rcx, xmm2
	vmovq	xmm2, rcx
	vpunpcklqdq	xmm2, xmm6, xmm2        # xmm2 = xmm6[0],xmm2[0]
	vcvttsd2si	rcx, xmm0
	vmovq	xmm6, rcx
	vpermilpd	xmm0, xmm0, 1           # xmm0 = xmm0[1,0]
	vcvttsd2si	rcx, xmm0
	vmovq	xmm0, rcx
	vpunpcklqdq	xmm0, xmm6, xmm0        # xmm0 = xmm6[0],xmm0[0]
	vmovdqa	xmmword ptr [rdi + 16], xmm4
	vmovdqa	xmmword ptr [rdi], xmm1
	vmovdqa	xmmword ptr [rdi + 32], xmm3
	vmovdqa	xmmword ptr [rdi + 48], xmm5
	vmovdqa	xmmword ptr [rdi + 64], xmm0
	vmovdqa	xmmword ptr [rdi + 80], xmm2
	ret
	nop	word ptr cs:[rax + rax]

But if we use Int32 instead...

julia> using VectorizationBase, BenchmarkTools; using VectorizationBase: AbstractSIMD

julia> mydiv(a::AbstractSIMD{W,I}, b::AbstractSIMD{W,I}) where {W,I<:Integer} = trunc(Int32, float(a) / float(b))
mydiv (generic function with 2 methods)

julia> vi1 = VectorizationBase.VecUnroll((
           Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...),
           Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...),
           Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...)
       ));

julia> vi2 = VectorizationBase.VecUnroll((
           Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...),
           Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...),
           Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...)
       ));

julia> mydiv(vi1,vi2)
3 x Vec{4, Int32}
Vec{4,Int32}<0, -14, 0, 0>
Vec{4,Int32}<0, 0, -1, -1>
Vec{4,Int32}<3, 0, -3, 0>

julia> vi1 ÷ vi2
3 x Vec{4, Int32}
Vec{4,Int32}<0, -14, 0, 0>
Vec{4,Int32}<0, 0, -1, -1>
Vec{4,Int32}<3, 0, -3, 0>

julia> @benchmark mydiv($(Ref(vi1))[], $(Ref(vi2))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     49.587 ns (0.00% GC)
  median time:      49.595 ns (0.00% GC)
  mean time:        50.290 ns (0.00% GC)
  maximum time:     98.261 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     987

julia> @benchmark ($(Ref(vi1))[] ÷ $(Ref(vi2))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     69.621 ns (0.00% GC)
  median time:      69.638 ns (0.00% GC)
  mean time:        71.028 ns (0.00% GC)
  maximum time:     133.334 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     975

And we have the clean SIMD asm:

	.text
	vcvtdq2pd	ymm0, xmmword ptr [rsi]
	vcvtdq2pd	ymm1, xmmword ptr [rsi + 16]
	vcvtdq2pd	ymm2, xmmword ptr [rsi + 32]
	vcvtdq2pd	ymm3, xmmword ptr [rdx]
	vdivpd	ymm0, ymm0, ymm3
	vcvtdq2pd	ymm3, xmmword ptr [rdx + 16]
	vdivpd	ymm1, ymm1, ymm3
	vcvtdq2pd	ymm3, xmmword ptr [rdx + 32]
	mov	rax, rdi
	vdivpd	ymm2, ymm2, ymm3
	vcvttpd2dq	xmm0, ymm0
	vcvttpd2dq	xmm1, ymm1
	vcvttpd2dq	xmm2, ymm2
	vmovapd	xmmword ptr [rdi], xmm0
	vmovapd	xmmword ptr [rdi + 16], xmm1
	vmovapd	xmmword ptr [rdi + 32], xmm2
	vzeroupper
	ret
	nop	dword ptr [rax]

@chriselrod (Member)

If you're willing to limit the effective range to +/- 2^31 on non-AVX512DQ platforms and +/- 2^53 given AVX512DQ (Float64 will exactly represent up to 2^53), then you can get a substantial performance boost on the integer div operations.
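
(A sketch of what such a range-limited fast path could look like; checked_fast_div and its guard are illustrative, not the package's API, and exactness inside ±2^53 follows from Float64 representing those integers exactly:)

julia> function checked_fast_div(a::Int64, b::Int64)
           if -2^53 <= a <= 2^53 && -2^53 <= b <= 2^53
               trunc(Int64, Float64(a) / Float64(b))  # SIMD-friendly float path
           else
               a ÷ b  # exact integer fallback
           end
       end
checked_fast_div (generic function with 1 method)

julia> checked_fast_div(-7, 2), -7 ÷ 2
(-3, -3)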

@chriselrod (Member)

@mcabbott All tests pass.

@mcabbott mcabbott marked this pull request as ready for review December 23, 2020 23:55
@chriselrod chriselrod merged commit 3ecbf27 into JuliaSIMD:master Dec 24, 2020
@chriselrod (Member)

Awesome, thanks!

@mcabbott (Author)

Great, all pass locally too, even the ones I had commented out.

I'm surprised by these div results; my laptop gives:

julia> mydiv(vi1,vi2)
3 x Vec{4, Int64}
Vec{4,Int64}<1, -1, 1, 6>
Vec{4,Int64}<1, 3, -8, -1>
Vec{4,Int64}<0, 0, 0, 0>

julia> vi1 ÷ vi2
3 x Vec{4, Int64}
Vec{4,Int64}<1, -1, 1, 6>
Vec{4,Int64}<1, 3, -8, -1>
Vec{4,Int64}<0, 0, 0, 0>

julia> @benchmark mydiv($(Ref(vi1))[], $(Ref(vi2))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.866 ns (0.00% GC)
  median time:      22.157 ns (0.00% GC)
  mean time:        22.309 ns (0.00% GC)
  maximum time:     144.809 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark ($(Ref(vi1))[] ÷ $(Ref(vi2))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     94.274 ns (0.00% GC)
  median time:      94.572 ns (0.00% GC)
  mean time:        97.789 ns (0.00% GC)
  maximum time:     826.215 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     951

julia> versioninfo()
Julia Version 1.7.0-DEV.149
Commit 5391030328* (2020-12-22 01:27 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, skylake)

and for scalars

julia> @btime trunc(Int, a[] / b[])  setup=(a=Ref(rand(Int)); b=Ref(rand(Int)););
5.592 ns (0 allocations: 0 bytes)

julia> @btime div(a[], b[])  setup=(a=Ref(rand(Int)); b=Ref(rand(Int)););
9.473 ns (0 allocations: 0 bytes)

It seems that for many uses of div, 2^53 is more than enough, and this should be used widely. For instance, every linear-index-to-CartesianIndex calculation involves mod, and unless you're indexing from typemin or something crazy, or live far in the future, you can't have numbers this big.
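
(A plain-Base illustration of that linear-to-Cartesian arithmetic; the array size here is made up, and real indices stay far below 2^53:)

julia> sz = (1000, 1000);  # 10^6 elements, nowhere near 2^53

julia> i = 123_456;        # a linear index

julia> mod(i - 1, sz[1]) + 1, fld(i - 1, sz[1]) + 1  # (row, column)
(456, 124)

julia> Tuple(CartesianIndices(sz)[i])
(456, 124)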

@mcabbott mcabbott deleted the modclamp branch December 24, 2020 00:14
@mcabbott (Author)

If you're willing to limit the effective range to +/- 2^31 on non-AVX512DQ platforms and +/- 2^53 given AVX512DQ (Float64 will exactly represent up to 2^53), then you can get a substantial performance boost on the integer div operations.

Wait how do you get 2^31 here?

julia> eps(0.0 + 2^52)
1.0

julia> eps(0f0 + 2^23)
1.0f0

@chriselrod (Member) commented Dec 24, 2020

Okay, I'm surprised by the good performance you got with Vecs of Int64. What do you get with the Int32 vecs?

Wait how do you get 2^31 here?

By using Int32 instead of Int64: SIMD conversion between Int64 and Float64 requires AVX512DQ (AVX2 cannot do it), whereas AVX2 can efficiently convert between <4 x Int32> and <4 x Float64>, and the Int32 then limits the expressible range.

If someone with AVX2 has a <8 x Int32>, it'd probably be worth splitting it into two <4 x Int32>, doing the divisions with Float64, and then recombining, as in the sketch below.
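
(A conceptual sketch with plain tuples rather than the package's vector types, just to show the split / divide-in-Float64 / recombine shape; split_div is an illustrative name:)

julia> function split_div(a::NTuple{8,Int32}, b::NTuple{8,Int32})
           lo = ntuple(i -> trunc(Int32, Float64(a[i]) / Float64(b[i])), 4)          # first <4 x Int32>
           hi = ntuple(i -> trunc(Int32, Float64(a[i + 4]) / Float64(b[i + 4])), 4)  # second <4 x Int32>
           (lo..., hi...)                                                            # recombine
       end
split_div (generic function with 1 method)

julia> split_div(Int32.((8, -9, 10, 11, 12, -13, 14, 15)), Int32.((3, 3, 3, 3, 3, 3, 3, 3)))
(2, -3, 3, 3, 4, -4, 4, 5)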

@chriselrod (Member) commented Dec 24, 2020

Do you think it'd be worth switching to these by default?

That'd probably count as a breaking change.

@mcabbott (Author) commented Dec 24, 2020

Oh of course, 31 is pretty close to 32 :) but not to Float32.

But curious that my non-AVX512 laptop does well on the Float64 division path then? I guess it is spending most of its time converting; here is pure Float64 division:

julia> vf1 = VectorizationBase.VecUnroll((
           Vec(ntuple(_ -> rand(Float64), VectorizationBase.pick_vector_width_val(Float64))...),
           Vec(ntuple(_ -> rand(Float64), VectorizationBase.pick_vector_width_val(Float64))...),
           Vec(ntuple(_ -> rand(Float64), VectorizationBase.pick_vector_width_val(Float64))...)
       ));

julia> vf2 = VectorizationBase.VecUnroll((
           Vec(ntuple(_ -> rand(Float64), VectorizationBase.pick_vector_width_val(Float64))...),
           Vec(ntuple(_ -> rand(Float64), VectorizationBase.pick_vector_width_val(Float64))...),
           Vec(ntuple(_ -> rand(Float64), VectorizationBase.pick_vector_width_val(Float64))...)
       ))
3 x Vec{4, Float64}
Vec{4,Float64}<0.13508764382415084, 0.18823755847820633, 0.29929030095012, 0.43807051017356824>
Vec{4,Float64}<0.6747103894887554, 0.9151196779454043, 0.3903431430572155, 0.330261035139215>
Vec{4,Float64}<0.17875131348038997, 0.43389583433915746, 0.4613364578216672, 0.49480678051459326>

julia> @benchmark $(Ref(vf1))[] / $(Ref(vf2))[]
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.704 ns (0.00% GC)
  median time:      6.719 ns (0.00% GC)
  mean time:        6.831 ns (0.00% GC)
  maximum time:     274.079 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

And for Float32, Int32 [sorry!] your div tests from above:

julia> mydiv(vi1,vi2)
3 x Vec{4, Int32}
Vec{4,Int32}<0, 0, -2, 1>
Vec{4,Int32}<1, -7, -1, 17>
Vec{4,Int32}<-2, 0, 4, 9>

julia> vi1 ÷ vi2
3 x Vec{4, Int32}
Vec{4,Int32}<0, 0, -2, 1>
Vec{4,Int32}<1, -7, -1, 17>
Vec{4,Int32}<-2, 0, 4, 9>

julia> @benchmark mydiv($(Ref(vi1))[], $(Ref(vi2))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.937 ns (0.00% GC)
  median time:      8.955 ns (0.00% GC)
  mean time:        9.028 ns (0.00% GC)
  maximum time:     98.598 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark ($(Ref(vi1))[] ÷ $(Ref(vi2))[])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     32.898 ns (0.00% GC)
  median time:      32.903 ns (0.00% GC)
  mean time:        33.148 ns (0.00% GC)
  maximum time:     138.323 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994

@mcabbott (Author)

Do you think it'd be worth switching to these by default?

That'd probably count as a breaking change.

I'd worry a bit that there might be e.g. cryptography users who care about large integers & overflow. But it would be nice if there were an easy way to opt in. Here and in Base, really... maybe @fastmath div(a,b) should do this?

@chriselrod (Member) commented Dec 24, 2020

But curious that my non-AVX512 laptop does well on the Float64 division path then?

Mind sharing your assembly?
I'm guessing it's just much better at out of order stuff than my old laptop, but the assembly will probably look basically the same as what I shared earlier.

And for Float32, your div tests from above:

It should also be Float64; it was when I benchmarked it.
Can you call float(vi1)?
On your computer, float(vi1::Vec{4,Int32}) should return Vec{4,Float64}.
float(v::Vec{8,Int32}), however, should return Vec{8,Float32}.

You can also see from my assembly from the benchmark:

.text
	vcvtdq2pd	ymm0, xmmword ptr [rsi]
	vcvtdq2pd	ymm1, xmmword ptr [rsi + 16]
	vcvtdq2pd	ymm2, xmmword ptr [rsi + 32]
	vcvtdq2pd	ymm3, xmmword ptr [rdx]
	vdivpd	ymm0, ymm0, ymm3
	vcvtdq2pd	ymm3, xmmword ptr [rdx + 16]
	vdivpd	ymm1, ymm1, ymm3
	vcvtdq2pd	ymm3, xmmword ptr [rdx + 32]
	mov	rax, rdi
	vdivpd	ymm2, ymm2, ymm3
	vcvttpd2dq	xmm0, ymm0
	vcvttpd2dq	xmm1, ymm1
	vcvttpd2dq	xmm2, ymm2
	vmovapd	xmmword ptr [rdi], xmm0
	vmovapd	xmmword ptr [rdi + 16], xmm1
	vmovapd	xmmword ptr [rdi + 32], xmm2
	vzeroupper
	ret
	nop	dword ptr [rax]

It loaded memory 16 bytes (128 bits) apart, but placed the results into full 256-bit ymm registers. It also ended with 128-bit xmm registers that it again stored 128 bits apart.
The conversion instruction vcvtdq2pd converts (cvt) packed doubleword integers (dq, i.e. Int32) to (2) packed doubles (pd).
And the same for going the other way: vcvtpd2dq converts packed doubles to doublewords.

So that's where the 2^31 came from: truncate Int64 into Int32, and then convert Int32 into Float64.
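
(A scalar illustration of why that truncation step caps the usable range at ±2^31; plain integers standing in for vector lanes:)

julia> x = 2^40 + 3;  # outside ±2^31

julia> x % Int32      # the truncate-to-Int32 step wraps it
3

julia> Float64(x % Int32), Float64(x)
(3.0, 1.099511627779e12)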

I'd worry a bit that there might be e.g. cryptography users who care about large integers & overflow. But it would be nice if there was an easy way to opt in. Here and in Base really... maybe @fastmath div(a,b) should do this?

Yeah, I've been thinking I should probably actually use @fastmath. LoopVectorization could also apply it by default, but have a keyword argument not to.
I think @haampie has a fork where he added @fastmath, or maybe he just removed the fast flags that are already there.

@haampie (Contributor) commented Dec 24, 2020

I just dropped all flags, didn't add the fastmath bit yet!

@mcabbott (Author) commented Dec 24, 2020

Mind sharing your assembly?

With the Int64 vi1, as timed above:

julia> @code_native debuginfo=:none syntax=:intel mydiv(vi1, vi2)
	.section	__TEXT,__text,regular,pure_instructions
	mov	rax, rdi
	vmovapd	xmm0, xmmword ptr [rsi]
	vmovdqa	xmm1, xmmword ptr [rsi + 16]
	vpextrq	rcx, xmm1, 1
	vcvtsi2sd	xmm2, xmm2, rcx
	vmovdqa	xmm3, xmmword ptr [rsi + 32]
	vmovq	rcx, xmm1
	vcvtsi2sd	xmm1, xmm4, rcx
	vpextrq	rcx, xmm0, 1
	vmovdqa	xmm6, xmmword ptr [rsi + 48]
	vcvtsi2sd	xmm4, xmm4, rcx
	vmovq	rcx, xmm0
	vcvtsi2sd	xmm0, xmm5, rcx
	vunpcklpd	xmm7, xmm1, xmm2        ## xmm7 = xmm1[0],xmm2[0]
	vunpcklpd	xmm5, xmm0, xmm4        ## xmm5 = xmm0[0],xmm4[0]
	vpextrq	rcx, xmm3, 1
	vcvtsi2sd	xmm0, xmm8, rcx
	vmovq	rcx, xmm3
	vcvtsi2sd	xmm1, xmm8, rcx
	vunpcklpd	xmm4, xmm1, xmm0        ## xmm4 = xmm1[0],xmm0[0]
	vpextrq	rcx, xmm6, 1
	vcvtsi2sd	xmm0, xmm8, rcx
	vmovq	rcx, xmm6
	vcvtsi2sd	xmm1, xmm8, rcx
	vunpcklpd	xmm9, xmm1, xmm0        ## xmm9 = xmm1[0],xmm0[0]
	vmovdqa	xmm0, xmmword ptr [rsi + 64]
	vpextrq	rcx, xmm0, 1
	vcvtsi2sd	xmm2, xmm8, rcx
	vmovq	rcx, xmm0
	vcvtsi2sd	xmm0, xmm8, rcx
	vunpcklpd	xmm8, xmm0, xmm2        ## xmm8 = xmm0[0],xmm2[0]
	vmovdqa	xmm2, xmmword ptr [rsi + 80]
	vpextrq	rcx, xmm2, 1
	vcvtsi2sd	xmm3, xmm10, rcx
	vmovq	rcx, xmm2
	vcvtsi2sd	xmm2, xmm10, rcx
	vunpcklpd	xmm10, xmm2, xmm3       ## xmm10 = xmm2[0],xmm3[0]
	vmovdqa	xmm6, xmmword ptr [rdx]
	vmovdqa	xmm3, xmmword ptr [rdx + 16]
	vmovapd	xmm0, xmmword ptr [rdx + 32]
	vmovdqa	xmm1, xmmword ptr [rdx + 48]
	vpextrq	rcx, xmm3, 1
	vcvtsi2sd	xmm2, xmm11, rcx
	vmovq	rcx, xmm3
	vcvtsi2sd	xmm3, xmm11, rcx
	vunpcklpd	xmm2, xmm3, xmm2        ## xmm2 = xmm3[0],xmm2[0]
	vdivpd	xmm3, xmm7, xmm2
	vpextrq	rcx, xmm6, 1
	vcvtsi2sd	xmm2, xmm11, rcx
	vmovq	rcx, xmm6
	vcvtsi2sd	xmm6, xmm11, rcx
	vunpcklpd	xmm2, xmm6, xmm2        ## xmm2 = xmm6[0],xmm2[0]
	vdivpd	xmm2, xmm5, xmm2
	vpextrq	rcx, xmm0, 1
	vcvtsi2sd	xmm5, xmm11, rcx
	vmovq	rcx, xmm0
	vcvtsi2sd	xmm0, xmm11, rcx
	vunpcklpd	xmm0, xmm0, xmm5        ## xmm0 = xmm0[0],xmm5[0]
	vdivpd	xmm4, xmm4, xmm0
	vpextrq	rcx, xmm1, 1
	vcvtsi2sd	xmm0, xmm11, rcx
	vmovq	rcx, xmm1
	vcvtsi2sd	xmm1, xmm11, rcx
	vunpcklpd	xmm0, xmm1, xmm0        ## xmm0 = xmm1[0],xmm0[0]
	vmovdqa	xmm1, xmmword ptr [rdx + 64]
	vpextrq	rcx, xmm1, 1
	vdivpd	xmm5, xmm9, xmm0
	vcvtsi2sd	xmm0, xmm11, rcx
	vmovq	rcx, xmm1
	vcvtsi2sd	xmm1, xmm11, rcx
	vunpcklpd	xmm0, xmm1, xmm0        ## xmm0 = xmm1[0],xmm0[0]
	vmovdqa	xmm1, xmmword ptr [rdx + 80]
	vpextrq	rcx, xmm1, 1
	vcvtsi2sd	xmm6, xmm11, rcx
	vdivpd	xmm0, xmm8, xmm0
	vmovq	rcx, xmm1
	vcvtsi2sd	xmm1, xmm11, rcx
	vunpcklpd	xmm1, xmm1, xmm6        ## xmm1 = xmm1[0],xmm6[0]
	vdivpd	xmm1, xmm10, xmm1
	vcvttsd2si	rcx, xmm2
	vmovq	xmm6, rcx
	vpermilpd	xmm2, xmm2, 1           ## xmm2 = xmm2[1,0]
	vcvttsd2si	rcx, xmm2
	vmovq	xmm2, rcx
	vpunpcklqdq	xmm2, xmm6, xmm2        ## xmm2 = xmm6[0],xmm2[0]
	vcvttsd2si	rcx, xmm3
	vmovq	xmm6, rcx
	vpermilpd	xmm3, xmm3, 1           ## xmm3 = xmm3[1,0]
	vcvttsd2si	rcx, xmm3
	vmovq	xmm3, rcx
	vpunpcklqdq	xmm3, xmm6, xmm3        ## xmm3 = xmm6[0],xmm3[0]
	vcvttsd2si	rcx, xmm5
	vmovq	xmm6, rcx
	vpermilpd	xmm5, xmm5, 1           ## xmm5 = xmm5[1,0]
	vcvttsd2si	rcx, xmm5
	vmovq	xmm5, rcx
	vpunpcklqdq	xmm5, xmm6, xmm5        ## xmm5 = xmm6[0],xmm5[0]
	vcvttsd2si	rcx, xmm4
	vmovq	xmm6, rcx
	vpermilpd	xmm4, xmm4, 1           ## xmm4 = xmm4[1,0]
	vcvttsd2si	rcx, xmm4
	vmovq	xmm4, rcx
	vpunpcklqdq	xmm4, xmm6, xmm4        ## xmm4 = xmm6[0],xmm4[0]
	vcvttsd2si	rcx, xmm1
	vmovq	xmm6, rcx
	vpermilpd	xmm1, xmm1, 1           ## xmm1 = xmm1[1,0]
	vcvttsd2si	rcx, xmm1
	vmovq	xmm1, rcx
	vpunpcklqdq	xmm1, xmm6, xmm1        ## xmm1 = xmm6[0],xmm1[0]
	vcvttsd2si	rcx, xmm0
	vmovq	xmm6, rcx
	vpermilpd	xmm0, xmm0, 1           ## xmm0 = xmm0[1,0]
	vcvttsd2si	rcx, xmm0
	vmovq	xmm0, rcx
	vpunpcklqdq	xmm0, xmm6, xmm0        ## xmm0 = xmm6[0],xmm0[0]
	vmovdqa	xmmword ptr [rdi + 16], xmm3
	vmovdqa	xmmword ptr [rdi], xmm2
	vmovdqa	xmmword ptr [rdi + 32], xmm4
	vmovdqa	xmmword ptr [rdi + 48], xmm5
	vmovdqa	xmmword ptr [rdi + 64], xmm0
	vmovdqa	xmmword ptr [rdi + 80], xmm1
	ret
	nop	word ptr cs:[rax + rax]

julia> vi1
3 x Vec{4, Int64}

And for Float32, your div tests from above:

It should also be Float64 -- it was when I benchmarked it.
Can you call float(vi1)?

Sorry, this was a typo on my part: it should have said Int32. And Vec{4,Int32} does go to Float64:

julia> vi1 = VectorizationBase.VecUnroll((
           Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...),
           Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...),
           Vec(ntuple(_ -> rand(Int32), VectorizationBase.pick_vector_width_val(Float64))...)
       ));

julia> float(vi1)
3 x Vec{4, Float64}
Vec{4,Float64}<1.26866162e8, 2.047682163e9, -7.49188072e8, 2.7746487e7>
Vec{4,Float64}<9.40454266e8, -1.082813363e9, -6.84906416e8, -2.82213622e8>
Vec{4,Float64}<4.18512731e8, 9.903253e6, -4.7204438e7, -1.216191205e9>

julia> Vec(rand(Int32,8)...)
Vec{8,Int32}<-262135610, -263139069, 1168709318, 973597860, 163087960, 634977000, 1684598960, -39188666>

julia> float(ans)
Vec{8,Float32}<-2.6213562f8, -2.6313907f8, 1.1687094f9, 9.735979f8, 1.6308797f8, 6.34977f8, 1.6845989f9, -3.9188664f7>
