
Performance regression with StaticArray broadcast #30124

Closed
mohamed82008 opened this issue Nov 23, 2018 · 5 comments · Fixed by #30420
Labels: broadcast (Applying a function over a collection), performance (Must go faster), regression (Regression in behavior compared to a previous version)
Milestone: 1.1

mohamed82008 commented Nov 23, 2018

EDIT by @maleadt: see #30124 (comment) for a non-CUDAnative-specific reproducer.

The following code works on Julia 1.0.2 but fails on Julia commit 8a4f20b (2018-11-19 01:50 UTC). The issue was first reported in JuliaGPU/CUDAnative.jl#291, and @maleadt bisected it to commit 9e98386. If you replace the multiplication line with the commented line below it, it works fine.

using CUDAnative, CuArrays, StaticArrays

Ks = [SMatrix{1, 1, Float64}(rand(1,1))] |> CuArrays.CuArray
fs = [SVector{1, Float64}(rand(1))] |> CuArrays.CuArray

function kernel(fs::AbstractVector{TV}, Ks) where {N, T, TV<:SVector{N,T}}
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    m = 2.0
    if i == 1
        fs[i] = m * (Ks[i] * fs[i])
        #fs[i] = SVector{1,T}((m,)) .* (Ks[i] * fs[i])
    end
    return
end

@cuda blocks=1 threads=1 kernel(fs, Ks)

#=
ERROR: InvalidIRError: compiling kernel(CuDeviceArray{SArray{Tuple{3},Float64,1,3},1,CUDAnative.AS.Global}, CuDeviceArray{SArray{Tuple{3,3},Float64,2,9},1,CUDAnative.AS.Global}) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_invoke)
Stacktrace:
 [1] Type at broadcast.jl:141
 [2] result_style at broadcast.jl:397
 [3] combine_styles at broadcast.jl:390
 [4] broadcasted at broadcast.jl:1162
 [5] broadcast at broadcast.jl:702
 [6] * at /home/mohd/.julia/packages/StaticArrays/WmJnA/src/linalg.jl:25
 [7] kernel at REPL[18]:5
Stacktrace:
 [1] check_ir(::CUDAnative.CompilerContext, ::LLVM.Module) at /home/mohd/.julia/packages/CUDAnative/5H3Dk/src/compiler/validation.jl:123
 [2] #compile#80(::Bool, ::Function, ::CUDAnative.CompilerContext) at /home/mohd/.julia/packages/CUDAnative/5H3Dk/src/compiler/driver.jl:74
 [3] compile at /home/mohd/.julia/packages/CUDAnative/5H3Dk/src/compiler/driver.jl:49 [inlined]
 [4] #compile#79(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::CUDAdrv.CuDevice, ::Any, ::Any) at /home/mohd/.julia/packages/CUDAnative/5H3Dk/src/compiler/driver.jl:28
 [5] compile at /home/mohd/.julia/packages/CUDAnative/5H3Dk/src/compiler/driver.jl:16 [inlined]
 [6] macro expansion at /home/mohd/.julia/packages/CUDAnative/5H3Dk/src/execution.jl:255 [inlined]
 [7] #cufunction#91(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(cufunction), ::typeof(kernel), ::Type{Tuple{CuDeviceArray{SArray{Tuple{3},Float64,1,3},1,CUDAnative.AS.Global},CuDeviceArray{SArray{Tuple{3,3},Float64,2,9},1,CUDAnative.AS.Global}}}) at /home/mohd/.julia/packages/CUDAnative/5H3Dk/src/execution.jl:230
 [8] cufunction(::Function, ::Type) at /home/mohd/.julia/packages/CUDAnative/5H3Dk/src/execution.jl:230
 [9] top-level scope at /home/mohd/.julia/packages/CUDAnative/5H3Dk/src/execution.jl:198
 [10] top-level scope at gcutils.jl:87
 [11] top-level scope at /home/mohd/.julia/packages/CUDAnative/5H3Dk/src/execution.jl:195
=#

My versioninfo() is:

Julia Version 1.1.0-DEV.681
Commit 8a4f20b887 (2018-11-19 01:50 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

Good luck!

maleadt commented Nov 23, 2018

Good luck!

...

That's not what I meant with a reduced version. Try inspecting the typed IR (@device_code_warntype) on a working and failing set-up, ideally reproducing the issue without StaticArrays or CUDAnative. There's probably an inference issue at the bottom of this.
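
A minimal sketch of the kind of inspection being suggested, assuming the kernel, fs, and Ks from the reproducer above are in scope:

using CUDAnative, InteractiveUtils, StaticArrays

# GPU side: dump the typed IR that CUDAnative generates for the kernel launch
@device_code_warntype @cuda blocks=1 threads=1 kernel(fs, Ks)

# CPU side: the same scalar-times-SMatrix operation without CUDAnative,
# to check whether the runtime calls already show up in plain host code
K = SMatrix{1,1}(rand(1,1))
@code_warntype 2.0 * K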

@maleadt added the regression (Regression in behavior compared to a previous version) and gpu (Affects running Julia on a GPU) labels Nov 23, 2018
mohamed82008 commented:
That's not what I meant with a reduced version. Try inspecting the typed IR (@device_code_warntype) on a working and failing set-up, ideally reproducing the issue without StaticArrays or CUDAnative. There's probably an inference issue at the bottom of this.

Oops, sorry, that's all I have time for at the moment. Inspecting the typed IR is beyond my comfort zone at this point.

maleadt commented Nov 29, 2018

Further reduced (this had already been bisected to 9e98386, from #29843):

using StaticArrays # 0.10.0

K = SMatrix{1,1}(rand(1,1))

using InteractiveUtils
versioninfo()
@code_warntype 2.0 * K

using BenchmarkTools
@btime 2.0 * $K

Before and after the offending commit:

Julia Version 1.1.0-DEV.590
Commit 817f6fc7ee* (2018-11-01 16:31 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 2990WX 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, znver1)
Environment:
  JULIA_DEPOT_PATH = /tmp/x
Body::SArray{Tuple{1,1},Float64,2,1}
25 1 ─ %1 = (Base.getfield)(b, :data)::Tuple{Float64}                                │╻╷╷╷╷╷╷ broadcast
   │   %2 = (Base.getfield)(%1, 1, false)::Float64                                   ││╻       materialize
   │   %3 = (Base.mul_float)(a, %2)::Float64                                         │││╻       copy
   │   %4 = (StaticArrays.tuple)(%3)::Tuple{Float64}                                 ││││┃│      _broadcast
   │   %5 = %new(SArray{Tuple{1,1},Float64,2,1}, %4)::SArray{Tuple{1,1},Float64,2,1} │││││╻       macro expansion
   └──      return %5                                                                │       
  0.020 ns (0 allocations: 0 bytes)
Julia Version 1.1.0-DEV.591
Commit 9e98386bea* (2018-11-01 16:31 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 2990WX 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, znver1)
Environment:
  JULIA_DEPOT_PATH = /tmp/x
Body::SArray{Tuple{1,1},Float64,2,1}
25 1 ─ %1 = Base.typename::typeof(Base.typename)                                     │╻╷╷╷╷   broadcast
   │        invoke %1(Base.Broadcast.DefaultArrayStyle{0}::DataType)                 ││┃│││    broadcasted
   │   %3 = Base.typename::typeof(Base.typename)                                     │││┃││     combine_styles
   │        invoke %3(StaticArrays.StaticArrayStyle{2}::DataType)                    ││││┃│      result_style
   │   %5 = (Base.getfield)(b, :data)::Tuple{Float64}                                │││╻╷╷╷╷   copy
   │   %6 = (Base.getfield)(%5, 1, false)::Float64                                   ││││╻       _broadcast
   │   %7 = (Base.mul_float)(a, %6)::Float64                                         │││││╻       macro expansion
   │   %8 = (StaticArrays.tuple)(%7)::Tuple{Float64}                                 ││││││  
   │   %9 = %new(SArray{Tuple{1,1},Float64,2,1}, %8)::SArray{Tuple{1,1},Float64,2,1} ││││││╻       Type
   └──      return %9                                                                │       
  6.712 ns (0 allocations: 0 bytes)

So this is not limited to CUDAnative's Julia support; it is a serious broadcast/StaticArrays regression.

@maleadt added the performance (Must go faster) and broadcast (Applying a function over a collection) labels and removed the gpu (Affects running Julia on a GPU) label Nov 29, 2018
@maleadt changed the title from "Scalar multiplying static vector in CUDAnative kernel makes unsupported call to Julia runtime in Julia master but not 1.0.2" to "Performance regression with StaticArray broadcast" Nov 29, 2018
mbauman commented Nov 29, 2018

Nice work reducing and bisecting to the commit. My hunch is that the cause is the removal of the @pure here: https://github.com/JuliaLang/julia/pull/29843/files#diff-8e941cd1878722cc12a54e01537b8e18L137. It looks like Julia is just calling typename for the sake of a possible side-effect? I had hoped that throwing a @pure annotation on just the typename methods would be sufficient, but that doesn't appear to do the trick interactively.
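
For illustration, a rough sketch of the effect such a @pure annotation has (hypothetical helper, not the actual Base.Broadcast definition): when inference can constant-fold the typename comparison, no runtime invoke of typename is left in the typed IR, which matches the fast pre-9e98386 output above.

using InteractiveUtils

# Hypothetical stand-in for the Broadcast style check; Base.@pure lets inference
# evaluate the call at compile time when both arguments are constant types.
Base.@pure same_typename(::Type{A}, ::Type{B}) where {A,B} =
    Base.typename(A) === Base.typename(B)

f() = same_typename(Base.Broadcast.DefaultArrayStyle{0}, Base.Broadcast.DefaultArrayStyle{1})
@code_warntype f()   # with the annotation this should fold to `return true`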

@KristofferC added this to the 1.1 milestone Nov 30, 2018
JeffBezanson commented:
Bump. I assume this is a blocker for 1.1?
