Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inlining-related performance regressions #6981

Closed
simonster opened this issue May 27, 2014 · 11 comments
Closed

Inlining-related performance regressions #6981

simonster opened this issue May 27, 2014 · 11 comments
Labels
performance Must go faster regression Regression in behavior compared to a previous version
Milestone

Comments

@simonster
Copy link
Member

I benchmarked a simple loop that does nothing but sum x*conj(x) for values in a large vector, and the slowdown is very nearly a factor of 2.

This is inlined in Julia 0.2. It looks like the regression happened in e805fd6. I know that there were some tweaks to the inlining threshold in #6605 / 24011fc, but I don't see this getting inlined with that commit or master. cc @vtjnash

@vtjnash
Copy link
Member

vtjnash commented May 27, 2014

basing this on the last time i checked (while working on tweaking that threshold), the issue is that it takes a lot of instructions (and a lot more julia box/unbox intrinsics) to describe the behavior of a complex operator

as such, i'm looking for suggestions (and evidence) on more advanced heuristics.

current thoughts include:

  • inlining next/done more aggressively
  • detecting loops and inlining more aggressively in predicted hotspots
  • inlining arithmetic functions more aggressively
  • inline functions taking lots of arguments more aggressively
  • incorporating information on whether the function uses only immutable types
  • inline functions returning new boxed values less aggressively

altering this should be relatively simple, if you want to propose some smarter heuristics, since it is handled through inline_worthy & it's call-sites

@JeffBezanson JeffBezanson added this to the 0.3 milestone May 27, 2014
@JeffBezanson
Copy link
Member

@simonster it would be good to post the exact code used, and/or add a perf test for this.

@simonster
Copy link
Member Author

Here's the code I tested:

function f(a)
    b = 0.0+0.0im
    for x in a
        b += x*conj(x)
    end
    b
end
a = complex(rand(10^8), rand(10^8))
f(a)
@time f(a)

On my MacBook Pro, this takes ~0.18 seconds on Julia 0.2.1 and ~0.35 seconds for current master.

@simonster
Copy link
Member Author

Another inlining-related regression case:

function f(a)
    x = 0
    for v in a
        x += v
    end
    x
end
a = randbool(1000000000)
f(a)

On my system this takes ~0.44 seconds in Julia 0.2.1 and ~2.6 seconds on current master. The culprit seems to be inlining of next for BitArrays. This could presumably be solved by the first item in @vtjnash's list above.

@timholy
Copy link
Member

timholy commented Jun 1, 2014

Since next also wasn't inlined in Julia 0.2.1, what do you think the difference is? Did randbool by any chance return an Array{Uint8,1} rather than BitArray{1}?

@simonster
Copy link
Member Author

I believe that next was inlined in Julia 0.2.1:

julia> code_typed(f, (BitArray{1},))
1-element Array{Any,1}:
 :($(Expr(:lambda, {:a}, {{:x,:#s9,:#s8,:v,:_var0,:_var1},{{:a,BitArray{1},0},{:x,Int64,2},{:#s9,Int64,2},{:#s8,(Bool,Int64),18},{:v,Bool,18},{:_var0,Bool,2},{:_var1,Int64,2}},{}}, quote  # none, line 2:
        x = 0 # line 3:
        #s9 = 0
        1: 
        unless top(box)(Bool,top(not_int)(top(sle_int)(top(getfield)(a::BitArray{1},:len)::Int64,#s9::Int64)::Bool))::Bool goto 2
        _var0 = top(box)(Bool,top(not_int)(top(box)(Bool,top(and_int)(top(sle_int)(0,0)::Bool,===(top(box)(Uint64,top(and_int)(arrayref(top(getfield)(a::BitArray{1},:chunks)::Array{Uint64,1},top(box)(Int64,top(add_int)(top(box)(Int64,top(lshr_int)(#s9::Int64,top(box)($(Int32),top(trunc_int)($(Int32),6))::Int32))::Int64,1))::Int64)::Uint64,top(box)(Uint64,top(shl_int)(top(box)($(Uint64),1)::Uint64,top(box)($(Int32),top(trunc_int)($(Int32),top(box)(Int64,top(and_int)(#s9::Int64,63))::Int64))::Int32))::Uint64))::Uint64,top(box)($(Uint64),0)::Uint64)::Bool))::Bool))::Bool
        _var1 = top(box)(Int64,top(add_int)(#s9::Int64,1))::Int64
        v = _var0::Bool
        #s9 = _var1::Int64 # line 4:
        x = top(box)(Int64,top(add_int)(x::Int64,top(box)($(Int64),top(zext_int)($(Int64),v::Bool))::Int64))::Int64
        3: 
        goto 1
        2: 
        0:  # line 6:
        return x::Int64
    end)))

@timholy
Copy link
Member

timholy commented Jun 1, 2014

I see. Good catch. It seems small enough to be worthy of inlining even without taking into account that it happens to be called next.

@JeffBezanson
Copy link
Member

Yes; it used to be that any "single expression" function was inlined. While clearly not sufficient, this was actually a surprisingly good heuristic.

@timholy
Copy link
Member

timholy commented Jun 1, 2014

Presumably an || isa_single_expr(ex) would give us the best of all worlds?

@vtjnash
Copy link
Member

vtjnash commented Jun 1, 2014

I had worked on tweaking the performance threshold, but forgot it wasn't on master yet: #6677

no, you don't want a heuristic that only considers SLOC. as part of my changes to inlining, I also significantly improved what is possible to inline into a single line, causing a very significant performance impact if you have it think a single expr is always worth inlining (the example code in the issue would have taken days to compile)

@simonster simonster changed the title Complex multiplication not inlined Inlining-related performance regressions Jun 2, 2014
@JeffBezanson
Copy link
Member

fixed by #7075

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster regression Regression in behavior compared to a previous version
Projects
None yet
Development

No branches or pull requests

4 participants