inlining heuristic is unstable [was: 2x slowdown in randn and randexp] #15596

Closed
andreasnoack opened this issue Mar 23, 2016 · 6 comments · Fixed by #16236
Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)

Comments

@andreasnoack (Member) commented Mar 23, 2016

I've just noticed that randn and randexp are much slower than they used to be. This might be the reason that the tests for Distributions are timing out, but I'm not completely sure. I've run a git bisect and the blame goes to 6382116 79b08ca, so cc: @vtjnash
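A timing comparison along the lines of this bisection can be scripted roughly as follows (a sketch, not the exact commands used: the array length is arbitrary, and on Julia 0.7 and later the in-place methods require "using Random", while on 0.4-era builds they lived in Base.Random):

```julia
# Run the same snippet on builds before and after the suspect commit and
# compare the reported times (absolute numbers are machine-dependent).
A = zeros(Float64, 10^7)
randn!(A); randexp!(A)   # warm up so compilation time is excluded
@time randn!(A)
@time randexp!(A)
```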

@ViralBShah added the "performance (Must go faster)" label Mar 23, 2016
@andreasnoack added the "regression (Regression in behavior compared to a previous version)" label Mar 23, 2016
@andreasnoack added this to the 0.5.0 milestone Apr 5, 2016
@andreasnoack (Member, Author)

The slowdown here is more than 100% on my machine when comparing before and after the commit listed above. It's quite a severe regression in two important functions, so I've marked it for 0.5.0.

@vtjnash (Member) commented Apr 28, 2016

It should be possible to disable that commit's change by commenting out the line at 6382116?w=1#diff-c0d983631c937ee5acfd0137cd7332deR1681

@andreasnoack self-assigned this Apr 28, 2016
@andreasnoack (Member, Author)

It appears that the commit I listed above is slightly wrong. A new bisection showed that it is the nearby commit 79b08ca, and I consistently see a 50% slowdown when comparing before and after that commit. Below, I've listed the full output from @code_warntype and @code_llvm. The LLVM outputs are very similar; I had to use diff to see that the only real difference in the @code_llvm output is

Good:
%28 = call %jl_value_t* @"julia_randn!_24681"(%jl_value_t* inttoptr (i64 139931478441072 to %jl_value_t*), %jl_value_t* %27) #0

Bad:
%28 = call %jl_value_t* @"julia_randn!_24614"(%jl_value_t* %27) #0
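One way to obtain such a diff (a sketch; the file name is made up) is to dump the IR to a file on each build and then compare the two files with diff on the shell:

```julia
# On Julia >= 0.7, code_llvm is in InteractiveUtils; on 0.4-era builds it is
# available from Base. Run once per build, writing to differently named files.
open("randn_llvm_good.txt", "w") do io
    code_llvm(io, randn, Tuple{Int})
end
```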

@code_warntype

Good
julia> @code_warntype randn(10)
Variables:
  #self#::Base.Random.#randn
  dims::Tuple{Int64}

Body:
  begin  # random.jl, line 1138: # boot.jl, line 349: # boot.jl, line 331:
      GenSym(0) = (Core.convert)(Core.Int,(top(getfield))(dims::Tuple{Int64},1)::Int64)::Any
      GenSym(1) = (top(ccall))(:jl_alloc_array_1d,(top(apply_type))(Core.Array,Float64,1)::Type{Array{Float64,1}},(top(svec))(Core.Any,Core.Int)::SimpleVector,Array{Float64,1},0,(top(unsafe_convert))(Core.Int,GenSym(0))::Int64,GenSym(0))::Array{Float64,1} # random.jl, line 1136:
      return (Base.Random.randn!)(Base.Random.GLOBAL_RNG,GenSym(1))::Array{Float64,1}
  end::Array{Float64,1}
Bad
julia> @code_warntype randn(10)
Variables:
  #self#::Base.Random.#randn
  dims::Tuple{Int64}

Body:
  begin  # random.jl, line 1138: # boot.jl, line 349: # boot.jl, line 331:
      GenSym(0) = (Core.convert)(Core.Int,(top(getfield))(dims::Tuple{Int64},1)::Int64)::Any
      return (Base.Random.randn!)((top(ccall))(:jl_alloc_array_1d,(top(apply_type))(Core.Array,Float64,1)::Type{Array{Float64,1}},(top(svec))(Core.Any,Core.Int)::SimpleVector,Array{Float64,1},0,(top(unsafe_convert))(Core.Int,GenSym(0))::Int64,GenSym(0))::Array{Float64,1})::Array{Float64,1}
  end::Array{Float64,1}

@code_llvm

Good
julia> @code_llvm randn(10)

define %jl_value_t* @julia_randn_24680(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
  %3 = call %jl_value_t*** @jl_get_ptls_states() #1
  %4 = alloca [11 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 0
  %5 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 2
  %6 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 6
  %7 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 9
  store %jl_value_t* null, %jl_value_t** %5, align 8
  %8 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %8, align 8
  %9 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %9, align 8
  %10 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %10, align 8
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %11 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 7
  store %jl_value_t* null, %jl_value_t** %11, align 8
  %12 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 8
  store %jl_value_t* null, %jl_value_t** %12, align 8
  store %jl_value_t* null, %jl_value_t** %7, align 8
  %13 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 10
  store %jl_value_t* null, %jl_value_t** %13, align 8
  %14 = bitcast [11 x %jl_value_t*]* %4 to i64*
  store i64 18, i64* %14, align 8
  %15 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 1
  %16 = bitcast %jl_value_t*** %3 to i64*
  %17 = load i64, i64* %16, align 8
  %18 = bitcast %jl_value_t** %15 to i64*
  store i64 %17, i64* %18, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %3, align 8
  %19 = icmp eq i32 %2, 0
  br i1 %19, label %fail, label %pass

fail:                                             ; preds = %top
  call void @jl_bounds_error_tuple_int(%jl_value_t** %1, i64 0, i64 1)
  unreachable

pass:                                             ; preds = %top
  %20 = bitcast %jl_value_t** %1 to i64*
  %21 = load i64, i64* %20, align 8
  store %jl_value_t* inttoptr (i64 139931467105232 to %jl_value_t*), %jl_value_t** %7, align 8
  %22 = bitcast %jl_value_t** %13 to i64*
  store i64 %21, i64* %22, align 8
  %23 = call %jl_value_t* @julia_convert.12(%jl_value_t* inttoptr (i64 139931467142064 to %jl_value_t*), %jl_value_t** %7, i32 2)
  store %jl_value_t* %23, %jl_value_t** %5, align 8
  store %jl_value_t* inttoptr (i64 139931490264856 to %jl_value_t*), %jl_value_t** %6, align 8
  store %jl_value_t* inttoptr (i64 139931467105232 to %jl_value_t*), %jl_value_t** %11, align 8
  store %jl_value_t* %23, %jl_value_t** %12, align 8
  %24 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %6, i32 3)
  store %jl_value_t* %24, %jl_value_t** %8, align 8
  %25 = bitcast %jl_value_t* %24 to i64*
  %26 = load i64, i64* %25, align 16
  %27 = call %jl_value_t* inttoptr (i64 139940288099664 to %jl_value_t* (%jl_value_t*, i64)*)(%jl_value_t* inttoptr (i64 139931467991696 to %jl_value_t*), i64 %26)
  store %jl_value_t* %27, %jl_value_t** %9, align 8
  %28 = call %jl_value_t* @"julia_randn!_24681"(%jl_value_t* inttoptr (i64 139931478441072 to %jl_value_t*), %jl_value_t* %27) #0
  store %jl_value_t* %28, %jl_value_t** %10, align 8
  %29 = load i64, i64* %18, align 8
  store i64 %29, i64* %16, align 8
  ret %jl_value_t* %28
}
Bad
julia> @code_llvm randn(10)

define %jl_value_t* @julia_randn_24613(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
  %3 = call %jl_value_t*** @jl_get_ptls_states() #1
  %4 = alloca [11 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 0
  %5 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 2
  %6 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 6
  %7 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 9
  store %jl_value_t* null, %jl_value_t** %5, align 8
  %8 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %8, align 8
  %9 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %9, align 8
  %10 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %10, align 8
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %11 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 7
  store %jl_value_t* null, %jl_value_t** %11, align 8
  %12 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 8
  store %jl_value_t* null, %jl_value_t** %12, align 8
  store %jl_value_t* null, %jl_value_t** %7, align 8
  %13 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 10
  store %jl_value_t* null, %jl_value_t** %13, align 8
  %14 = bitcast [11 x %jl_value_t*]* %4 to i64*
  store i64 18, i64* %14, align 8
  %15 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 1
  %16 = bitcast %jl_value_t*** %3 to i64*
  %17 = load i64, i64* %16, align 8
  %18 = bitcast %jl_value_t** %15 to i64*
  store i64 %17, i64* %18, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %3, align 8
  %19 = icmp eq i32 %2, 0
  br i1 %19, label %fail, label %pass

fail:                                             ; preds = %top
  call void @jl_bounds_error_tuple_int(%jl_value_t** %1, i64 0, i64 1)
  unreachable

pass:                                             ; preds = %top
  %20 = bitcast %jl_value_t** %1 to i64*
  %21 = load i64, i64* %20, align 8
  store %jl_value_t* inttoptr (i64 140126559208400 to %jl_value_t*), %jl_value_t** %7, align 8
  %22 = bitcast %jl_value_t** %13 to i64*
  store i64 %21, i64* %22, align 8
  %23 = call %jl_value_t* @julia_convert.4(%jl_value_t* inttoptr (i64 140126559240952 to %jl_value_t*), %jl_value_t** %7, i32 2)
  store %jl_value_t* %23, %jl_value_t** %5, align 8
  store %jl_value_t* inttoptr (i64 140126575941512 to %jl_value_t*), %jl_value_t** %6, align 8
  store %jl_value_t* inttoptr (i64 140126559208400 to %jl_value_t*), %jl_value_t** %11, align 8
  store %jl_value_t* %23, %jl_value_t** %12, align 8
  %24 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %6, i32 3)
  store %jl_value_t* %24, %jl_value_t** %8, align 8
  %25 = bitcast %jl_value_t* %24 to i64*
  %26 = load i64, i64* %25, align 16
  %27 = call %jl_value_t* inttoptr (i64 140135380260176 to %jl_value_t* (%jl_value_t*, i64)*)(%jl_value_t* inttoptr (i64 140126560969936 to %jl_value_t*), i64 %26)
  store %jl_value_t* %27, %jl_value_t** %9, align 8
  %28 = call %jl_value_t* @"julia_randn!_24614"(%jl_value_t* %27) #0
  store %jl_value_t* %28, %jl_value_t** %10, align 8
  %29 = load i64, i64* %18, align 8
  store i64 %29, i64* %16, align 8
  ret %jl_value_t* %28
}

@andreasnoack (Member, Author)

@vtjnash I tried what you suggested in #15596 (comment) and it fixed the issue. I had to comment out a few things in sysimg.jl and precompile.jl for the compilation to succeed.

@vtjnash changed the title from "2x slowdown in randn and randexp" to "inlining heuristic is unstable [was: 2x slowdown in randn and randexp]" Apr 29, 2016
@vtjnash (Member) commented Apr 29, 2016

With Andreas's help, I tracked this down to an instability in the inlining heuristic. Since we are resolving a cycle here, we need to split finish into two stages (finalizing the type-information marking, then doing optimization) and pick a predictable edge in the function graph from which to do inlining (e.g. I think we should start with the smallest function and resolve inlining working backwards from there; I'm not certain yet whether that requires computing the shape of a cycle more exactly than we currently do).

For a quick fix, it might be enough to just invert the fixed-point finish loop at 6382116?w=1#diff-c0d983631c937ee5acfd0137cd7332deR1702. Right now it's in provably the wrong order; at least if we did the loop in reverse, it wouldn't always be wrong.
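To illustrate why visit order matters, here is a toy model of a size-threshold inlining pass over a two-function cycle (entirely made up; this is not Julia's actual inference code): whether a callee still looks small enough to inline depends on whether it has already absorbed its own callees by the time its caller is visited.

```julia
# Toy model: a function's post-inlining size grows by the size of every callee
# that is under the threshold at the moment the function is visited.
const THRESHOLD = 10

function finish!(sizes::Dict{Symbol,Int}, callees::Dict{Symbol,Vector{Symbol}}, order)
    for f in order
        for c in callees[f]
            if sizes[c] <= THRESHOLD
                sizes[f] += sizes[c]   # "inline": absorb the callee's body
            end
        end
    end
    return sizes
end

callees = Dict(:a => [:b], :b => [:a])              # a two-function cycle
finish!(Dict(:a => 6, :b => 8), callees, [:a, :b])  # Dict(:a => 14, :b => 8)
finish!(Dict(:a => 6, :b => 8), callees, [:b, :a])  # Dict(:a => 6, :b => 14)
# Same cycle, different visit order, different inlining outcome.
```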

@JeffBezanson (Member)

Ah, very interesting!

@tkelman assigned vtjnash and unassigned andreasnoack May 5, 2016
vtjnash added a commit that referenced this issue May 6, 2016
workaround to fix #15596
@tkelman added the "potential benchmark (Could make a good benchmark in BaseBenchmarks)" label May 6, 2016
@KristofferC removed the "potential benchmark (Could make a good benchmark in BaseBenchmarks)" label Oct 8, 2018