Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace Base.tanh with faster tanh #1272

Closed
wants to merge 1 commit into from

Conversation

DhairyaLGandhi
Copy link
Member

This is a speculative change based on some recent discussions to help speedup common tasks in Flux, and the tanh from SLEEFPirates was found to make a significant difference. This is to discuss the viability of doing this by default in Flux and any different kinds of considerations we would have while doing so.

cc @ChrisRackauckas

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • Final review from @dhairyagandhi96 (for API changes).

@DhairyaLGandhi
Copy link
Member Author

SLEEFPirates doesn't seem to seem to do much on the machine I tested with, beyond the performance boost with @avx

julia> d1 = Dense(128, 50, Base.tanh)
Dense(128, 50, tanh)

julia> d2 = Dense(128, 50, SLEEFPirates.tanh)
Dense(128, 50, tanh)

julia> x2 = rand(Float32, 128, 500);

julia> d1(x2);

julia> d2(x2);

julia> @btime $d1($x2);
  1.011 ms (4 allocations: 195.53 KiB)

julia> @btime $d2($x2);
  1.325 ms (4 allocations: 195.53 KiB)

julia> struct D3{F,T,S}
         σ::F
         W::T
         b::S
       end

julia> (c::D3)(x::AbstractArray) = @avx c.σ.(c.W * x .+ c.b)

julia> d3 = D3(Base.tanh, d2.W, d2.b);

julia> d3(x2);

julia> @btime $d3($x2);
  237.393 μs (4 allocations: 195.53 KiB)

julia> d4 = D3(SLEEFPirates.tanh, d2.W, d2.b);

julia> @btime $d4($x2);
  236.328 μs (4 allocations: 195.53 KiB)

Running with GPU Arrays, however would need special handling

julia> gd3 = gpu(d3);

julia> gx2 = gpu(x2);

julia> gd3(gx2);
ERROR: MethodError: no method matching VectorizationBase.PackedStridedPointer(::CUDAdrv.CuPtr{Float32}, ::Tuple{Int64})
Closest candidates are:
  VectorizationBase.PackedStridedPointer(::Ptr{T}, ::Tuple{Vararg{Int64,N}}) where {T, N} at /home/dhairyagandhi96/.julia/packages/VectorizationBase/jUlXp/src/vectorizable.jl:313
Stacktrace:
 [1] stridedpointer_for_broadcast at /home/dhairyagandhi96/.julia/packages/VectorizationBase/jUlXp/src/vectorizable.jl:794 [inlined]
 [2] macro expansion at /home/dhairyagandhi96/.julia/packages/LoopVectorization/anLHu/src/broadcast.jl:247 [inlined]
 [3] vmaterialize! at /home/dhairyagandhi96/.julia/packages/LoopVectorization/anLHu/src/broadcast.jl:247 [inlined]
 [4] vmaterialize at /home/dhairyagandhi96/.julia/packages/LoopVectorization/anLHu/src/broadcast.jl:317 [inlined]
 [5] (::D3{typeof(tanh),CuArrays.CuArray{Float32,2,Nothing},CuArrays.CuArray{Float32,1,Nothing}})(::CuArrays.CuArray{Float32,2,Nothing}) at ./REPL[169]:1
 [6] top-level scope at REPL[186]:1

@ChrisRackauckas
Copy link
Member

Yeah the dispatch with @avx would be only for CPU, so GPUs would need a separate dispatch. Yeah, I didn't know that @avx would SIMD through that broadcast well, but it looks like it does. It must just clump the broadcast and do the whole computation of sigma 4 at a time? If so, it probably fails if the activation function isn't a Julia function, so it might need to be careful and check if things are in an approved list?

@chriselrod

@DhairyaLGandhi
Copy link
Member Author

Could LV make the GPU passes no-ops? I would rather not lose the generic nature of code, esp since its diminishing returns for the majority of models where batchsize is usually not that large.

@ChrisRackauckas
Copy link
Member

You could just do typeof(x) <: CuArray ? sigma.(x) : @avx sigma.(x)

@DhairyaLGandhi
Copy link
Member Author

DhairyaLGandhi commented Jul 10, 2020

If LV did that (even if as a wrapper macro), it could be used by downstream packages without adding special handling at least for different accelerators. I tested it with using Colors, like AbstractArray{RGB} (as an example, and it worked), but we would want to maintain supporting that. It errored with more common activations like relu (basically element-wise defined functions in general, I guess).

I would also be interested in seeing how it would interact with more general use cases like Flux.Zeros for switching bias off in any layer etc.

Just listing things that we should have clarity on in order to understand what the tradeoffs are.

@ChrisRackauckas
Copy link
Member

I think it would be much harder for LV to do that since it's a macro, so it is done at parse time and not compile time and doesn't have type information. It could add type checks for everything that shows up in the expression or something, but that seems difficult to get right.

@AStupidBear
Copy link
Contributor

@DhairyaLGandhi It's this already implemented in FluxML/NNlib.jl#199?

@chriselrod
Copy link

I think it would be much harder for LV to do that since it's a macro, so it is done at parse time and not compile time and doesn't have type information. It could add type checks for everything that shows up in the expression or something, but that seems difficult to get right.

LoopVectorization.@avx does normally use type information. When applied to loops, it does use type checking with LoopVectorization.check_args to choose between using the original loop with @fastmath @inbounds vs calling the generated function LoopVectorization._avx_! to evaluate the loop:

julia> @macroexpand @avx for i in 1:3
          s += x[i]
       end
quote
    begin
    end
    if LoopVectorization.check_args(x)
        var"##vptr##_x" = LoopVectorization.stridedpointer(x)
        local var"##s_0"
        begin
            $(Expr(:gc_preserve, :(var"##s_0" = begin
          begin
              var"##Tloopeltype##" = eltype(x)
              var"##Wvecwidth##" = LoopVectorization.pick_vector_width_val(var"##Tloopeltype##")
          end
          LoopVectorization._avx_!(Val{(0, 0, 0, LoopVectorization.unwrap(var"##Wvecwidth##"))}(), Tuple{:LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x02), :LoopVectorization, :vadd, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000001, 0x0000000000000000, 0x0000000000000102, LoopVectorization.compute, 0x00, 0x01)}, Tuple{LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000)}, Tuple{0, Tuple{3}, Tuple{1}, Tuple{}, Tuple{}, Tuple{}, Tuple{}}, Tuple{:i}, (LoopVectorization.StaticUnitRange{1, 3}(),), var"##vptr##_x", s)
      end), :x))
        end
        s = LoopVectorization.reduced_add(var"##s_0", s)
    else
        $(Expr(:inbounds, true))
        local var"#19#val" = for i = 1:3
                    #= REPL[3]:2 =#
                    s = Base.FastMath.add_fast(s, x[i])
                end
        $(Expr(:inbounds, :pop))
        var"#19#val"
    end
end

The generated function of course has type information, but it also (currently) loses the original representation of the loops, and instead is only given a summary of what the loops did, meaning it wouldn't necessarily currently be able to reconstruct the original loops.
Using check_args seemed (a) simpler to implement at the time, and (b) provides a clear interface that others can overload to turn LoopVectorization support on or off.

Broadcasting doesn't use that right now, but I've been saying for a while that I was planning to switch it for the same approach. Currently:

julia> @macroexpand @avx y = foo.(x)
quote
    var"##469" = Base.broadcasted(foo, x)
    var"##470" = LoopVectorization.vmaterialize(var"##469", Val{:Main}())
    y = var"##470"

It basically Meta.lowers the expression, and then swaps out Base.materialize and Base.materialize! for LoopVectorization.vmaterliaze and LoopVectorization.vmaterliaze!.
Currently, there's this fallback definition:

vmaterialize!(dest, bc, ::Val{mod}) where {mod} = Base.Broadcast.materialize!(dest, bc)

With vmaterialize!(::Union{A,Adjoint{T,A},Transpose{T,A}}, args...) where {T <: Union{Bool,Base.HWReal}, A <: StridedArray{T}} being defined to use LoopVectorization.

Unfortunately, I believe because AbstractDeviceArray{<:Base.HWReal} <: DenseArray{<:Base.HWReal}, that LoopVectorization.check_args(x::CuArray) returns true and that vmaterialize! won't hit the fallback definition either.

All the checks within the library assume DenseArrays are safe, and many more broadly support StridedArrays. The assumption having been that these subtypes mean "C/Fortran/BLAS-compatible strided arrays you can get pointers to". It's unfortunate they violate that assumption.
Could ArrayInterface.jl help / features be added and supported across the ecosystem to get more reliable dispatching?

@ChrisRackauckas
Copy link
Member

Could ArrayInterface.jl help / features be added and supported across the ecosystem to get more reliable dispatching?

Yes, we should really have a trait for whether things opt in or not, and start spreading that around the ecosystem to make this better supported. Guessing can get you pretty far, but at some point libraries need a way to tell you this information.

@ToucheSir
Copy link
Member

(Cribbing my comment from the discourse thread)

@AStupidBear #199 helps substantially for standalone tanh, but does not appear to trigger when the broadcast is fused with + in Flux's Dense

Case in point:

julia> Base.broadcasted(::typeof(tanh), x) = typeof(x)

julia> Dense(8, 64, tanh)(xx)
Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(+),Tuple{Array{Float32,2},Array{Float32,1}}}

Is there a generalizable way to hook into this kind of fused broadcast as well?

@ChrisRackauckas
Copy link
Member

If you do a 5-arg mul, then that is pulled into the matmul kernel and doesn't need to fuse with the other one.

@DhairyaLGandhi
Copy link
Member Author

DhairyaLGandhi commented Jul 12, 2020

@chriselrod do I understand this correctly; overloading LoopVectorisation.check_args(::CuArray) = false would allow us to dispatch to CUDA as usual? `ArrayInterface might be the right place for this. I also suppose having that overload in CUDA.jl is fair game too since that would always hold, since CuArrays do not opt in to CPU semantics.

@chriselrod
Copy link

chriselrod commented Jul 13, 2020

@DhairyaLGandhi I'll update broadcasting to use check_args.
It'd be nice to add something to ArrayInterface.jl with a well defined meaning like: "has strides, supports Ptr to local CPU-memory, and should be compatible with general C/Fortran/BLAS/LAPACK routines".

For example, I'd like LoopVectorization to support MArrays from StaticArrays.jl, which it currently does not. For good support, two things would be needed:

  1. Get check_args(::MArray{<:Base.HWReal}) to return true.
  2. A bit off topic, but still on the subject of getting different libraries to communicate properly on types, is communicating static size information to LoopVectorization's optimizer. LoopVectorization special cases VectorizationBase.Static[Upper/Lower]UnitRanges to glean static size info. Defining the following four methods would probably be enough for most uses:
VectorizationBase.maybestaticfirst(::StaticArrays.SOneTo) = VectorizationBase.Static{1}()
VectorizationBase.maybestaticlast(::StaticArrays.SOneTo{N}) where {N} = VectorizationBase.Static{N}()
@generated VectorizationBase.maybestaticsize(::MArray{S}, ::Val{I}) where {S,I} = VectorizationBase.Static{(S.parameters[I])::Int}()
VectorizationBase.maybestaticlength(::MArray{S,T,N,L}) where {S,T,N,L} = VectorizationBase.Static{L}()

But there'd have to be a place to define this. StaticArrays.SizedArray should have similar overloads.

A library like StaticRanges.jl, but with an emphasis on fast load times, would be nice.
StaticRanges.jl itself however takes much longer to load on my computers than LoopVectorization.jl, making it a much heavier dependency than I want to take on (and I imagine libraries like StaticArrays.jl would feel similarly).
I could factor all the relevant code out of VectorizationBase.jl into a new library in a few weeks.

EDIT:
I will release this shortly.

julia> using LoopVectorization, Test

julia> x = rand(1000); # should be long enough to make zero differences incredibly unlikely

julia> struct FallbackArrayWrapper{T,N} <: DenseArray{T,N} # Subtypes DenseArray
           data::Array{T,N}
       end

julia> Base.size(A::FallbackArrayWrapper) = size(A.data)

julia> Base.@propagate_inbounds Base.getindex(A::FallbackArrayWrapper, i::Vararg{Int, N}) where {N} = getindex(A.data, i...)

julia> Base.@propagate_inbounds Base.setindex!(A::FallbackArrayWrapper, v, i::Vararg{Int, N}) where {N} = setindex!(A.data, v, i...)

julia> Base.IndexStyle(::Type{<:FallbackArrayWrapper}) = IndexLinear()

julia> Base.pointer(A::FallbackArrayWrapper) = pointer(A.data)

julia> @test exp.(x) == (@avx exp.(FallbackArrayWrapper(x))) # Some elements aren't exactly equal
Test Failed at REPL[11]:1
  Expression: exp.(x) == #= REPL[11]:1 =# @avx(exp.(FallbackArrayWrapper(x)))
   Evaluated: [1.1221417842854389, 2.244121039137226, 2.4589471125063094, 1.4534952264782455, 2.532038566639118, 2.2091284493072854, 1.725068467675581, 1.6073147438178572, 1.5600736225518292, 1.3611858244420798    1.7180695976532183, 1.2980672830526578, 1.5196765354463808, 1.961205825911335, 2.020849417983115, 1.6405069022562084, 2.306307169464506, 1.3112862093793074, 1.2128433365019216, 1.3482334355458843] == [1.1221417842854389, 2.244121039137226, 2.45894711250631, 1.4534952264782453, 2.5320385666391174, 2.2091284493072854, 1.7250684676755808, 1.6073147438178574, 1.560073622551829, 1.3611858244420798    1.718069597653218, 1.2980672830526578, 1.519676535446381, 1.9612058259113347, 2.0208494179831153, 1.6405069022562089, 2.306307169464506, 1.3112862093793074, 1.2128433365019216, 1.3482334355458843]
ERROR: There was an error during testing

julia> LoopVectorization.check_args(::FallbackArrayWrapper) = false # Make `check_args` false

julia> @test exp.(x) == (@avx exp.(FallbackArrayWrapper(x))) # exact equality
Test Passed

@chriselrod
Copy link

@DhairyaLGandhi

SLEEFPirates doesn't seem to seem to do much on the machine I tested with, beyond the performance boost with @avx

A couple points here, (1) is that SLEEFPirates by itself relies on the autovectorizer for SIMD. This does not always work. For example:

julia> x = rand(Float32, 512); y = similar(x);

julia> @btime $y .= SLEEFPirates.tanh.($x);
  9.092 μs (0 allocations: 0 bytes)

julia> function vtanh!(y, x)
           @inbounds for i  eachindex(x,y)
               y[i] = SLEEFPirates.tanh(x[i])
           end
       end
vtanh! (generic function with 1 method)

julia> @btime vtanh!($y, $x)
  1.146 μs (0 allocations: 0 bytes)

julia> @btime @avx $y .= SLEEFPirates.tanh.($x);
  253.366 ns (0 allocations: 0 bytes)

julia> @btime @avx $y .= tanh.($x);
  247.944 ns (0 allocations: 0 bytes)

It works with the simple loop, but inspecting the native/llvm code from the broadcast shows we didn't get SIMD. Starting with Julia >= 1.5, the @avx version should be much faster than the SLEEFPirates version for tanh.

@ChrisRackauckas

If so, it probably fails if the activation function isn't a Julia function, so it might need to be careful and check if things are in an approved list?

costs.jl defines costs, as well as an IdDict that broadcasting uses to identify functions from types in the broadcasted objects when possible.
Functions that aren't recognized are assumed to be relatively expensive, but are assumed to work.

In the future, once it integrates more with the compiler so we can introspect functions to ensure they're defined for VectorizationBase.SVec inputs (SIMDPirates and SLEEFPirates heavily pirate base functions by adding ::SVec methods), then we could revert to fallback behavior on an unrecognized function.
However, for now, I think it is better to work for unrecognized functions. If a relatively simple function doesn't work, you can file a bug and I can add support. Otherwise, it may be better to not use @avx in that case. @avx SIMDs across loop iterations, for complicated functions, that probably isn't optimal, or is impractical.

Recently, Tim Holy suggested something like this for defining costs of unknown functions

julia> function cost(f, tt)
           params = Core.Compiler.Params(typemax(UInt))
           mi = Base.method_instances(f, tt)[1]
           ci = code_typed(f, tt)[1][1]
           opt = Core.Compiler.OptimizationState(mi, params)
           cost(stmt::Expr) = Core.Compiler.statement_cost(stmt, -1, ci, opt.sptypes, opt.slottypes, opt.params)
           cost(stmt) = 0
           sum(cost, ci.code)
       end
cost (generic function with 1 method)

julia> isfourthv1(i) = iszero(rem(i, 4))
isfourthv1 (generic function with 1 method)

julia> isfourthv2(i) = iszero(i & 3)
isfourthv2 (generic function with 1 method)

julia> cost(isfourthv1, Tuple{Int})
41

julia> cost(isfourthv2, Tuple{Int})
2

Not perfect -- both isfourthv1 and isfourthv2 produce the exact same llvm -- but pretty good.

If the cost it uses doesn't represent the actual cost well, it may make bad unrolling decisions. The current assumption is that unknowns are expensive, so that unrolling wont be profitable. I figure it is better to play things safe.

For a function f(::T) to be compatible with @avx, then f(::SVec{W,T}) must work. This should be true for reasonably simple duck-typed functions without branches (or using SIMDPirates.vifelse if they need select-type statements), and ultimately calling relatively primitive functions. It's probably more restrictive than ForwardDiff.jl, but should cover most functions someone reasonably wants to try SIMD-ing.
This is a non-trivial example taken from LoopVectorization's test suite:

using LoopVectorization, BenchmarkTools, Test
function clenshaw(x, coeff)
    len_c = length(coeff)
    tmp = zero(x)
    ret = zero(x)
    for i in len_c:-1:2
        ret     = muladd(x,2tmp,coeff[i]-ret)
        ret,tmp = tmp,ret
    end
    ret = muladd(x,tmp,coeff[1]-ret)
    return ret
end
function clenshaw!(ret,x,coeff)
    for j in 1:length(ret)
        ret[j] = clenshaw(x[j], coeff)
    end
end
function clenshawavx!(ret,x,coeff)
    @avx for j in 1:length(ret)
        ret[j] = clenshaw(x[j], coeff)
    end
end
T = Float32; c = rand(T,100); x = rand(T,10^4); y1 = similar(x); y2 = similar(x);
@btime clenshaw!($y1, $x, $c)
#  1.589 ms (0 allocations: 0 bytes)
@btime clenshawavx!($y2, $x, $c)
# 109.479 μs (0 allocations: 0 bytes)
@test y1  y2
# Test Passed

About 14.5x faster, roughly what we'd expect from AVX512 + single precision.
It would be good to make sure all the activation functions are supported with efficient implementations.

You can test and benchmark how functions via:

julia> using VectorizationBase, SLEEFPirates

julia> W = VectorizationBase.pick_vector_width(Float32) # Chosen SIMD vector width
16

julia> sx32 = SVec(ntuple(_ -> Core.VecElement(randn(Float32)), W))
SVec{16,Float32}<0.1209027f0, 0.26148129f0, 0.81624657f0, -1.1177294f0, -1.8269225f0, -0.542883f0, 2.3699f0, 0.7374754f0, 0.56119925f0, 1.3130764f0, -0.75493294f0, 0.22216304f0, -0.2675562f0, -0.3545847f0, -0.54735f0, 1.676646f0>

julia> @btime tanh($(Ref(sx32))[])
  8.462 ns (0 allocations: 0 bytes)
SVec{16,Float32}<0.12031703f0, 0.2556805f0, 0.6730218f0, -0.80677766f0, -0.9495241f0, -0.49516717f0, 0.98267066f0, 0.6276174f0, 0.50886667f0, 0.86505175f0, -0.6380826f0, 0.21857873f0, -0.2613494f0, -0.3404352f0, -0.49853143f0, 0.93242496f0>

julia> @btime tanh($(Ref(sx32[1]))[])
  7.561 ns (0 allocations: 0 bytes)
0.120317034f0

julia> @btime SLEEFPirates.tanh_fast($(Ref(sx32))[])
  7.083 ns (0 allocations: 0 bytes)
SVec{16,Float32}<0.120317034f0, 0.2556805f0, 0.6730218f0, -0.8067777f0, -0.9495241f0, -0.49516723f0, 0.98267066f0, 0.6276174f0, 0.50886667f0, 0.86505175f0, -0.6380826f0, 0.21857871f0, -0.2613494f0, -0.34043518f0, -0.4985314f0, 0.93242496f0>

Calculating 16 tanh doesn't take much longer than calculating 1 (with AVX512 on Julia >=1.5). 1 takes longer than 16 computed with SLEEFPirates.tanh_fast (with AVX512 on Linux with recent GLIBC).

To be clear, in this broadcast:

(c::D3)(x::AbstractArray) = @avx c.σ.(c.W * x .+ c.b)

It'll use SVecs of width VectorizationBase.pick_vector_width(Float32) (Float32, because that will be the element type of all the arrays). That'll be 8 with AVX, 16 with AVX512, or 4 without them.

@DhairyaLGandhi
Copy link
Member Author

DhairyaLGandhi commented Jul 14, 2020

The design goal of this piece to me is as follows:

  1. Existing user code breakages should be minimised. And existing test suites should pass without regressions in runtime performance.
  2. Generic code blocks should remain generic across devices. Cpu and gpu targets included. The ArrayInterface proposal probably covers that.
  3. When avx is available, try avx; if a function is defined elementwise (ie no f(::SVec) method exists) it should warn to implement the method (with a simple switch to turn off warnings, but not error) and move to the existing default broadcasting behavior.
  4. We should minimise the number of methods users (as well as Flux) have to define for any single activation function for it to work across GPUs as well (scalar indexing notwithstanding). CUDA can manage to run with f.(::CuArray) and has meant we avoided much of the duplication. LV's goals are not to support arbitrary code afaik, but to accelerate known common cases. In Flux we don't know the f we have to differentiate beforehand. In common cases we would need to accelerate it, in others we shouldn't error.

I'm not sure exactly how it would interact with other ADs in the ecosystem just yet, but it is a fairly common trick to replace Zygote with something else to test for performance/ correctness etc.

I would appreciate some feedback on this if these represent roughly the right expectations since it directly impacts the API for user defined functions.

For a function f(::T) to be compatible with @avx, then f(::SVec{W,T}) must work. This should be true for reasonably simple duck-typed functions without branches (or using SIMDPirates.vifelse if they need select-type statements), and ultimately calling relatively primitive functions

Here we usually deal with T<:Number, maybe it means that we could potentially add any missing activations not already in the autovectoriser that LV recognises. I would personally not expect to have users define simple functions using non-Base blessed if-else statements, which seems to add to the complexity of the API. Tooling around that would take a while to mature, so maybe topic for a different discussion.

@DhairyaLGandhi
Copy link
Member Author

I think the basic goal is to accelerate known functions, and supporting LV more generally across the framework, while not dropping support for existing language primitives like control flow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants