replace Base.tanh with faster tanh #1272
Conversation
SLEEFPirates doesn't seem to do much on the machine I tested with, beyond the performance boost with:

julia> d1 = Dense(128, 50, Base.tanh)
Dense(128, 50, tanh)
julia> d2 = Dense(128, 50, SLEEFPirates.tanh)
Dense(128, 50, tanh)
julia> x2 = rand(Float32, 128, 500);
julia> d1(x2);
julia> d2(x2);
julia> @btime $d1($x2);
1.011 ms (4 allocations: 195.53 KiB)
julia> @btime $d2($x2);
1.325 ms (4 allocations: 195.53 KiB)
julia> struct D3{F,T,S}
σ::F
W::T
b::S
end
julia> (c::D3)(x::AbstractArray) = @avx c.σ.(c.W * x .+ c.b)
julia> d3 = D3(Base.tanh, d2.W, d2.b);
julia> d3(x2);
julia> @btime $d3($x2);
237.393 μs (4 allocations: 195.53 KiB)
julia> d4 = D3(SLEEFPirates.tanh, d2.W, d2.b);
julia> @btime $d4($x2);
236.328 μs (4 allocations: 195.53 KiB)

Running with GPU arrays, however, would need special handling:

julia> gd3 = gpu(d3);
julia> gx2 = gpu(x2);
julia> gd3(gx2);
ERROR: MethodError: no method matching VectorizationBase.PackedStridedPointer(::CUDAdrv.CuPtr{Float32}, ::Tuple{Int64})
Closest candidates are:
VectorizationBase.PackedStridedPointer(::Ptr{T}, ::Tuple{Vararg{Int64,N}}) where {T, N} at /home/dhairyagandhi96/.julia/packages/VectorizationBase/jUlXp/src/vectorizable.jl:313
Stacktrace:
[1] stridedpointer_for_broadcast at /home/dhairyagandhi96/.julia/packages/VectorizationBase/jUlXp/src/vectorizable.jl:794 [inlined]
[2] macro expansion at /home/dhairyagandhi96/.julia/packages/LoopVectorization/anLHu/src/broadcast.jl:247 [inlined]
[3] vmaterialize! at /home/dhairyagandhi96/.julia/packages/LoopVectorization/anLHu/src/broadcast.jl:247 [inlined]
[4] vmaterialize at /home/dhairyagandhi96/.julia/packages/LoopVectorization/anLHu/src/broadcast.jl:317 [inlined]
[5] (::D3{typeof(tanh),CuArrays.CuArray{Float32,2,Nothing},CuArrays.CuArray{Float32,1,Nothing}})(::CuArrays.CuArray{Float32,2,Nothing}) at ./REPL[169]:1
[6] top-level scope at REPL[186]:1
Yeah the dispatch with
Could LV make the GPU passes no-ops? I would rather not lose the generic nature of the code, especially since there are diminishing returns for the majority of models, where the batch size is usually not that large.
You could just do
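A hypothetical sketch of that idea, using the check_args hook that @avx's expansion branches on (see the macroexpand below); the CuArray method here is an assumption, not existing API:

using LoopVectorization, CuArrays

# Opting CuArrays out: @avx then takes its plain-broadcast fallback branch.
LoopVectorization.check_args(::CuArrays.CuArray) = false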
If LV did that (even if as a wrapper macro), it could be used by downstream packages without adding special handling, at least for different accelerators. I tested it with using
I would also be interested in seeing how it would interact with more general use cases like
Just listing things that we should have clarity on in order to understand what the tradeoffs are.
I think it would be much harder for LV to do that, since it's a macro: it runs at parse time, not compile time, and doesn't have type information. It could add type checks for everything that shows up in the expression, but that seems difficult to get right.
@DhairyaLGandhi Is this already implemented in FluxML/NNlib.jl#199?
julia> @macroexpand @avx for i in 1:3
s += x[i]
end
quote
begin
end
if LoopVectorization.check_args(x)
var"##vptr##_x" = LoopVectorization.stridedpointer(x)
local var"##s_0"
begin
$(Expr(:gc_preserve, :(var"##s_0" = begin
begin
var"##Tloopeltype##" = eltype(x)
var"##Wvecwidth##" = LoopVectorization.pick_vector_width_val(var"##Tloopeltype##")
end
LoopVectorization._avx_!(Val{(0, 0, 0, LoopVectorization.unwrap(var"##Wvecwidth##"))}(), Tuple{:LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x02), :LoopVectorization, :vadd, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000001, 0x0000000000000000, 0x0000000000000102, LoopVectorization.compute, 0x00, 0x01)}, Tuple{LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000)}, Tuple{0, Tuple{3}, Tuple{1}, Tuple{}, Tuple{}, Tuple{}, Tuple{}}, Tuple{:i}, (LoopVectorization.StaticUnitRange{1, 3}(),), var"##vptr##_x", s)
end), :x))
end
s = LoopVectorization.reduced_add(var"##s_0", s)
else
$(Expr(:inbounds, true))
local var"#19#val" = for i = 1:3
#= REPL[3]:2 =#
s = Base.FastMath.add_fast(s, x[i])
end
$(Expr(:inbounds, :pop))
var"#19#val"
end
end

The generated function of course has type information, but it also (currently) loses the original representation of the loops, and is instead only given a summary of what the loops did, meaning it wouldn't necessarily be able to reconstruct the original loops. Broadcasting doesn't use that right now, but I've been saying for a while that I was planning to switch it to the same approach. Currently:

julia> @macroexpand @avx y = foo.(x)
quote
var"##469" = Base.broadcasted(foo, x)
var"##470" = LoopVectorization.vmaterialize(var"##469", Val{:Main}())
y = var"##470"

It basically
vmaterialize!(dest, bc, ::Val{mod}) where {mod} = Base.Broadcast.materialize!(dest, bc)
With
Unfortunately, I believe because AbstractDeviceArray{<:Base.HWReal} <: DenseArray{<:Base.HWReal}, that
All the checks within the library assume
Yes, we should really have a trait for whether things opt in or not, and start spreading that around the ecosystem to make this better supported. Guessing can get you pretty far, but at some point libraries need a way to tell you this information.
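A minimal sketch of what such an opt-in trait could look like; all names here are hypothetical:

abstract type SIMDSupport end
struct SupportsSIMD <: SIMDSupport end
struct NoSIMD <: SIMDSupport end

# Libraries would extend this for their own array types; consumers like
# LoopVectorization would branch on it instead of guessing from supertypes.
simd_trait(::Type{<:Array{<:Base.HWReal}}) = SupportsSIMD()
simd_trait(::Type) = NoSIMD()  # conservative default: opt out unless declared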
(Cribbing my comment from the discourse thread) @AStupidBear #199 helps substantially for standalone
Case in point:
Is there a generalizable way to hook into this kind of fused broadcast as well?
If you do a 5-arg mul, then that is pulled into the matmul kernel and doesn't need to fuse with the other one.
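For reference, a sketch of the 5-arg multiply being referred to: mul!(C, A, B, α, β) computes C = α*A*B + β*C in a single call, so the additive update is folded into the matmul kernel rather than requiring a separate fused broadcast. The array sizes below are illustrative:

using LinearAlgebra

W = rand(Float32, 50, 128); x = rand(Float32, 128, 500)
C = repeat(rand(Float32, 50), 1, 500)  # bias already broadcast into C
mul!(C, W, x, 1.0f0, 1.0f0)            # C now holds W*x .+ b, no extra pass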
@chriselrod do I understand this correctly: overloading
@DhairyaLGandhi I'll update broadcasting to use
For example, I'd like LoopVectorization to support

VectorizationBase.maybestaticfirst(::StaticArrays.SOneTo) = VectorizationBase.Static{1}()
VectorizationBase.maybestaticlast(::StaticArrays.SOneTo{N}) where {N} = VectorizationBase.Static{N}()
@generated VectorizationBase.maybestaticsize(::MArray{S}, ::Val{I}) where {S,I} = VectorizationBase.Static{(S.parameters[I])::Int}()
VectorizationBase.maybestaticlength(::MArray{S,T,N,L}) where {S,T,N,L} = VectorizationBase.Static{L}()

But there'd have to be a place to define this. A library like

EDIT:

julia> using LoopVectorization, Test
julia> x = rand(1000); # should be long enough to make zero differences incredibly unlikely
julia> struct FallbackArrayWrapper{T,N} <: DenseArray{T,N} # Subtypes DenseArray
data::Array{T,N}
end
julia> Base.size(A::FallbackArrayWrapper) = size(A.data)
julia> Base.@propagate_inbounds Base.getindex(A::FallbackArrayWrapper, i::Vararg{Int, N}) where {N} = getindex(A.data, i...)
julia> Base.@propagate_inbounds Base.setindex!(A::FallbackArrayWrapper, v, i::Vararg{Int, N}) where {N} = setindex!(A.data, v, i...)
julia> Base.IndexStyle(::Type{<:FallbackArrayWrapper}) = IndexLinear()
julia> Base.pointer(A::FallbackArrayWrapper) = pointer(A.data)
julia> @test exp.(x) == (@avx exp.(FallbackArrayWrapper(x))) # Some elements aren't exactly equal
Test Failed at REPL[11]:1
Expression: exp.(x) == #= REPL[11]:1 =# @avx(exp.(FallbackArrayWrapper(x)))
Evaluated: [1.1221417842854389, 2.244121039137226, 2.4589471125063094, 1.4534952264782455, 2.532038566639118, 2.2091284493072854, 1.725068467675581, 1.6073147438178572, 1.5600736225518292, 1.3611858244420798 … 1.7180695976532183, 1.2980672830526578, 1.5196765354463808, 1.961205825911335, 2.020849417983115, 1.6405069022562084, 2.306307169464506, 1.3112862093793074, 1.2128433365019216, 1.3482334355458843] == [1.1221417842854389, 2.244121039137226, 2.45894711250631, 1.4534952264782453, 2.5320385666391174, 2.2091284493072854, 1.7250684676755808, 1.6073147438178574, 1.560073622551829, 1.3611858244420798 … 1.718069597653218, 1.2980672830526578, 1.519676535446381, 1.9612058259113347, 2.0208494179831153, 1.6405069022562089, 2.306307169464506, 1.3112862093793074, 1.2128433365019216, 1.3482334355458843]
ERROR: There was an error during testing
julia> LoopVectorization.check_args(::FallbackArrayWrapper) = false # Make `check_args` false
julia> @test exp.(x) == (@avx exp.(FallbackArrayWrapper(x))) # exact equality
Test Passed
A couple of points here. (1) SLEEFPirates by itself relies on the autovectorizer for SIMD, which does not always work. For example:

julia> x = rand(Float32, 512); y = similar(x);
julia> @btime $y .= SLEEFPirates.tanh.($x);
9.092 μs (0 allocations: 0 bytes)
julia> function vtanh!(y, x)
@inbounds for i ∈ eachindex(x,y)
y[i] = SLEEFPirates.tanh(x[i])
end
end
vtanh! (generic function with 1 method)
julia> @btime vtanh!($y, $x)
1.146 μs (0 allocations: 0 bytes)
julia> @btime @avx $y .= SLEEFPirates.tanh.($x);
253.366 ns (0 allocations: 0 bytes)
julia> @btime @avx $y .= tanh.($x);
247.944 ns (0 allocations: 0 bytes)

It works with the simple loop, but inspecting the native/LLVM code from the broadcast shows we didn't get SIMD. Starting with Julia >= 1.5, the
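For anyone wanting to reproduce that inspection, one way to check whether a loop or broadcast actually vectorized is to look for LLVM vector types such as <8 x float> in the generated IR:

julia> using InteractiveUtils

julia> @code_llvm debuginfo=:none vtanh!(y, x)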
costs.jl defines costs, as well as an
In the future, once it integrates more with the compiler so we can introspect functions to ensure they're defined for
Recently, Tim Holy suggested something like this for defining costs of unknown functions:

julia> function cost(f, tt)
params = Core.Compiler.Params(typemax(UInt))
mi = Base.method_instances(f, tt)[1]
ci = code_typed(f, tt)[1][1]
opt = Core.Compiler.OptimizationState(mi, params)
cost(stmt::Expr) = Core.Compiler.statement_cost(stmt, -1, ci, opt.sptypes, opt.slottypes, opt.params)
cost(stmt) = 0
sum(cost, ci.code)
end
cost (generic function with 1 method)
julia> isfourthv1(i) = iszero(rem(i, 4))
isfourthv1 (generic function with 1 method)
julia> isfourthv2(i) = iszero(i & 3)
isfourthv2 (generic function with 1 method)
julia> cost(isfourthv1, Tuple{Int})
41
julia> cost(isfourthv2, Tuple{Int})
2

Not perfect -- both
If the cost it uses doesn't represent the actual cost well, it may make bad unrolling decisions. The current assumption is that unknowns are expensive, so that unrolling won't be profitable. I figure it is better to play things safe. For a function

using LoopVectorization, BenchmarkTools, Test
function clenshaw(x, coeff)
len_c = length(coeff)
tmp = zero(x)
ret = zero(x)
for i in len_c:-1:2
ret = muladd(x,2tmp,coeff[i]-ret)
ret,tmp = tmp,ret
end
ret = muladd(x,tmp,coeff[1]-ret)
return ret
end
function clenshaw!(ret,x,coeff)
for j in 1:length(ret)
ret[j] = clenshaw(x[j], coeff)
end
end
function clenshawavx!(ret,x,coeff)
@avx for j in 1:length(ret)
ret[j] = clenshaw(x[j], coeff)
end
end
T = Float32; c = rand(T,100); x = rand(T,10^4); y1 = similar(x); y2 = similar(x);
@btime clenshaw!($y1, $x, $c)
# 1.589 ms (0 allocations: 0 bytes)
@btime clenshawavx!($y2, $x, $c)
# 109.479 μs (0 allocations: 0 bytes)
@test y1 ≈ y2
# Test Passed

About 14.5x faster, roughly what we'd expect from AVX512 + single precision. You can test and benchmark functions via:

julia> using VectorizationBase, SLEEFPirates
julia> W = VectorizationBase.pick_vector_width(Float32) # Chosen SIMD vector width
16
julia> sx32 = SVec(ntuple(_ -> Core.VecElement(randn(Float32)), W))
SVec{16,Float32}<0.1209027f0, 0.26148129f0, 0.81624657f0, -1.1177294f0, -1.8269225f0, -0.542883f0, 2.3699f0, 0.7374754f0, 0.56119925f0, 1.3130764f0, -0.75493294f0, 0.22216304f0, -0.2675562f0, -0.3545847f0, -0.54735f0, 1.676646f0>
julia> @btime tanh($(Ref(sx32))[])
8.462 ns (0 allocations: 0 bytes)
SVec{16,Float32}<0.12031703f0, 0.2556805f0, 0.6730218f0, -0.80677766f0, -0.9495241f0, -0.49516717f0, 0.98267066f0, 0.6276174f0, 0.50886667f0, 0.86505175f0, -0.6380826f0, 0.21857873f0, -0.2613494f0, -0.3404352f0, -0.49853143f0, 0.93242496f0>
julia> @btime tanh($(Ref(sx32[1]))[])
7.561 ns (0 allocations: 0 bytes)
0.120317034f0
julia> @btime SLEEFPirates.tanh_fast($(Ref(sx32))[])
7.083 ns (0 allocations: 0 bytes)
SVec{16,Float32}<0.120317034f0, 0.2556805f0, 0.6730218f0, -0.8067777f0, -0.9495241f0, -0.49516723f0, 0.98267066f0, 0.6276174f0, 0.50886667f0, 0.86505175f0, -0.6380826f0, 0.21857871f0, -0.2613494f0, -0.34043518f0, -0.4985314f0, 0.93242496f0>

Calculating 16
To be clear, in this broadcast:

(c::D3)(x::AbstractArray) = @avx c.σ.(c.W * x .+ c.b)

It'll use
The design goal of this piece, to me, is as follows:
I'm not sure exactly how it would interact with other ADs in the ecosystem just yet, but it is a fairly common trick to replace Zygote with something else to test for performance/correctness, etc. I would appreciate some feedback on whether these represent roughly the right expectations, since this directly impacts the API for user-defined functions.
Here we usually deal with
I think the basic goal is to accelerate known functions and to support LV more generally across the framework, while not dropping support for existing language primitives like control flow.
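As a hedged sketch of that goal (the names below are made up, not Flux or NNlib API): known activations could be mapped to fast equivalents at call time, so user-facing code keeps Base.tanh while known functions get the accelerated path and unknown ones are left untouched.

using Flux, SLEEFPirates

fast_act(::typeof(tanh)) = SLEEFPirates.tanh_fast  # known function: swap in fast version
fast_act(f) = f                                    # unknown function: pass through

# Illustrative only: a Dense-like forward pass using the remapped activation.
(d::Dense)(x::AbstractArray) = fast_act(d.σ).(d.W * x .+ d.b)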
This is a speculative change based on some recent discussions, intended to help speed up common tasks in Flux; the tanh from SLEEFPirates was found to make a significant difference. This PR is meant to discuss the viability of doing this by default in Flux, and the considerations we would have while doing so.
cc @ChrisRackauckas