inlining heuristic is unstable [was: 2x slowdown in randn and randexp] #15596

Closed
andreasnoack opened this issue Mar 23, 2016 · 6 comments · Fixed by #16236
Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)

Comments

@andreasnoack (Member) commented Mar 23, 2016

I've just noticed that randn and randexp are much slower than they used to be. This might be the reason that the tests for Distributions are timing out, but I'm not completely sure. I've run a git bisect and the blame goes to 6382116 79b08ca, so cc: @vtjnash
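A timing comparison along the lines of this bisection can be scripted roughly as follows (a sketch, not the exact commands used: the array length is arbitrary, and on Julia 0.7 and later the in-place methods require "using Random", while on 0.4-era builds they lived in Base.Random):

```julia
# Run the same snippet on builds before and after the suspect commit and
# compare the reported times (absolute numbers are machine-dependent).
A = zeros(Float64, 10^7)
randn!(A); randexp!(A)   # warm up so compilation time is excluded
@time randn!(A)
@time randexp!(A)
```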

@ViralBShah added the "performance (Must go faster)" label Mar 23, 2016
@andreasnoack added the "regression (Regression in behavior compared to a previous version)" label Mar 23, 2016
@andreasnoack added this to the 0.5.0 milestone Apr 5, 2016
@andreasnoack (Member, Author)

The slowdown here is more than 100% on my machine when comparing before and after the commit listed above. It's quite a severe regression in two important functions, so I've marked it for 0.5.0.

@vtjnash (Member) commented Apr 28, 2016

It should be possible to disable that commit's change by commenting out the line at 6382116?w=1#diff-c0d983631c937ee5acfd0137cd7332deR1681

@andreasnoack self-assigned this Apr 28, 2016
@andreasnoack (Member, Author)

It appears that the commit I listed above is slightly wrong. A new bisection showed that it is the nearby commit 79b08ca, and I consistently see a 50% slowdown when comparing before and after that commit. Below, I've listed the full output from @code_warntype and @code_llvm. The LLVM outputs are very similar; I had to use diff to see that the only real difference in the @code_llvm output is

Good:
%28 = call %jl_value_t* @"julia_randn!_24681"(%jl_value_t* inttoptr (i64 139931478441072 to %jl_value_t*), %jl_value_t* %27) #0

Bad:
%28 = call %jl_value_t* @"julia_randn!_24614"(%jl_value_t* %27) #0
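One way to obtain such a diff (a sketch; the file name is made up) is to dump the IR to a file on each build and then compare the two files with diff on the shell:

```julia
# On Julia >= 0.7, code_llvm is in InteractiveUtils; on 0.4-era builds it is
# available from Base. Run once per build, writing to differently named files.
open("randn_llvm_good.txt", "w") do io
    code_llvm(io, randn, Tuple{Int})
end
```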

@code_warntype

Good
julia> @code_warntype randn(10)
Variables:
  #self#::Base.Random.#randn
  dims::Tuple{Int64}

Body:
  begin  # random.jl, line 1138: # boot.jl, line 349: # boot.jl, line 331:
      GenSym(0) = (Core.convert)(Core.Int,(top(getfield))(dims::Tuple{Int64},1)::Int64)::Any
      GenSym(1) = (top(ccall))(:jl_alloc_array_1d,(top(apply_type))(Core.Array,Float64,1)::Type{Array{Float64,1}},(top(svec))(Core.Any,Core.Int)::SimpleVector,Array{Float64,1},0,(top(unsafe_convert))(Core.Int,GenSym(0))::Int64,GenSym(0))::Array{Float64,1} # random.jl, line 1136:
      return (Base.Random.randn!)(Base.Random.GLOBAL_RNG,GenSym(1))::Array{Float64,1}
  end::Array{Float64,1}
Bad
julia> @code_warntype randn(10)
Variables:
  #self#::Base.Random.#randn
  dims::Tuple{Int64}

Body:
  begin  # random.jl, line 1138: # boot.jl, line 349: # boot.jl, line 331:
      GenSym(0) = (Core.convert)(Core.Int,(top(getfield))(dims::Tuple{Int64},1)::Int64)::Any
      return (Base.Random.randn!)((top(ccall))(:jl_alloc_array_1d,(top(apply_type))(Core.Array,Float64,1)::Type{Array{Float64,1}},(top(svec))(Core.Any,Core.Int)::SimpleVector,Array{Float64,1},0,(top(unsafe_convert))(Core.Int,GenSym(0))::Int64,GenSym(0))::Array{Float64,1})::Array{Float64,1}
  end::Array{Float64,1}

@code_llvm

Good
julia> @code_llvm randn(10)

define %jl_value_t* @julia_randn_24680(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
  %3 = call %jl_value_t*** @jl_get_ptls_states() #1
  %4 = alloca [11 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 0
  %5 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 2
  %6 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 6
  %7 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 9
  store %jl_value_t* null, %jl_value_t** %5, align 8
  %8 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %8, align 8
  %9 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %9, align 8
  %10 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %10, align 8
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %11 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 7
  store %jl_value_t* null, %jl_value_t** %11, align 8
  %12 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 8
  store %jl_value_t* null, %jl_value_t** %12, align 8
  store %jl_value_t* null, %jl_value_t** %7, align 8
  %13 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 10
  store %jl_value_t* null, %jl_value_t** %13, align 8
  %14 = bitcast [11 x %jl_value_t*]* %4 to i64*
  store i64 18, i64* %14, align 8
  %15 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 1
  %16 = bitcast %jl_value_t*** %3 to i64*
  %17 = load i64, i64* %16, align 8
  %18 = bitcast %jl_value_t** %15 to i64*
  store i64 %17, i64* %18, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %3, align 8
  %19 = icmp eq i32 %2, 0
  br i1 %19, label %fail, label %pass

fail:                                             ; preds = %top
  call void @jl_bounds_error_tuple_int(%jl_value_t** %1, i64 0, i64 1)
  unreachable

pass:                                             ; preds = %top
  %20 = bitcast %jl_value_t** %1 to i64*
  %21 = load i64, i64* %20, align 8
  store %jl_value_t* inttoptr (i64 139931467105232 to %jl_value_t*), %jl_value_t** %7, align 8
  %22 = bitcast %jl_value_t** %13 to i64*
  store i64 %21, i64* %22, align 8
  %23 = call %jl_value_t* @julia_convert.12(%jl_value_t* inttoptr (i64 139931467142064 to %jl_value_t*), %jl_value_t** %7, i32 2)
  store %jl_value_t* %23, %jl_value_t** %5, align 8
  store %jl_value_t* inttoptr (i64 139931490264856 to %jl_value_t*), %jl_value_t** %6, align 8
  store %jl_value_t* inttoptr (i64 139931467105232 to %jl_value_t*), %jl_value_t** %11, align 8
  store %jl_value_t* %23, %jl_value_t** %12, align 8
  %24 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %6, i32 3)
  store %jl_value_t* %24, %jl_value_t** %8, align 8
  %25 = bitcast %jl_value_t* %24 to i64*
  %26 = load i64, i64* %25, align 16
  %27 = call %jl_value_t* inttoptr (i64 139940288099664 to %jl_value_t* (%jl_value_t*, i64)*)(%jl_value_t* inttoptr (i64 139931467991696 to %jl_value_t*), i64 %26)
  store %jl_value_t* %27, %jl_value_t** %9, align 8
  %28 = call %jl_value_t* @"julia_randn!_24681"(%jl_value_t* inttoptr (i64 139931478441072 to %jl_value_t*), %jl_value_t* %27) #0
  store %jl_value_t* %28, %jl_value_t** %10, align 8
  %29 = load i64, i64* %18, align 8
  store i64 %29, i64* %16, align 8
  ret %jl_value_t* %28
}
Bad
julia> @code_llvm randn(10)

define %jl_value_t* @julia_randn_24613(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
  %3 = call %jl_value_t*** @jl_get_ptls_states() #1
  %4 = alloca [11 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 0
  %5 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 2
  %6 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 6
  %7 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 9
  store %jl_value_t* null, %jl_value_t** %5, align 8
  %8 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %8, align 8
  %9 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %9, align 8
  %10 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %10, align 8
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %11 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 7
  store %jl_value_t* null, %jl_value_t** %11, align 8
  %12 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 8
  store %jl_value_t* null, %jl_value_t** %12, align 8
  store %jl_value_t* null, %jl_value_t** %7, align 8
  %13 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 10
  store %jl_value_t* null, %jl_value_t** %13, align 8
  %14 = bitcast [11 x %jl_value_t*]* %4 to i64*
  store i64 18, i64* %14, align 8
  %15 = getelementptr [11 x %jl_value_t*], [11 x %jl_value_t*]* %4, i64 0, i64 1
  %16 = bitcast %jl_value_t*** %3 to i64*
  %17 = load i64, i64* %16, align 8
  %18 = bitcast %jl_value_t** %15 to i64*
  store i64 %17, i64* %18, align 8
  store %jl_value_t** %.sub, %jl_value_t*** %3, align 8
  %19 = icmp eq i32 %2, 0
  br i1 %19, label %fail, label %pass

fail:                                             ; preds = %top
  call void @jl_bounds_error_tuple_int(%jl_value_t** %1, i64 0, i64 1)
  unreachable

pass:                                             ; preds = %top
  %20 = bitcast %jl_value_t** %1 to i64*
  %21 = load i64, i64* %20, align 8
  store %jl_value_t* inttoptr (i64 140126559208400 to %jl_value_t*), %jl_value_t** %7, align 8
  %22 = bitcast %jl_value_t** %13 to i64*
  store i64 %21, i64* %22, align 8
  %23 = call %jl_value_t* @julia_convert.4(%jl_value_t* inttoptr (i64 140126559240952 to %jl_value_t*), %jl_value_t** %7, i32 2)
  store %jl_value_t* %23, %jl_value_t** %5, align 8
  store %jl_value_t* inttoptr (i64 140126575941512 to %jl_value_t*), %jl_value_t** %6, align 8
  store %jl_value_t* inttoptr (i64 140126559208400 to %jl_value_t*), %jl_value_t** %11, align 8
  store %jl_value_t* %23, %jl_value_t** %12, align 8
  %24 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %6, i32 3)
  store %jl_value_t* %24, %jl_value_t** %8, align 8
  %25 = bitcast %jl_value_t* %24 to i64*
  %26 = load i64, i64* %25, align 16
  %27 = call %jl_value_t* inttoptr (i64 140135380260176 to %jl_value_t* (%jl_value_t*, i64)*)(%jl_value_t* inttoptr (i64 140126560969936 to %jl_value_t*), i64 %26)
  store %jl_value_t* %27, %jl_value_t** %9, align 8
  %28 = call %jl_value_t* @"julia_randn!_24614"(%jl_value_t* %27) #0
  store %jl_value_t* %28, %jl_value_t** %10, align 8
  %29 = load i64, i64* %18, align 8
  store i64 %29, i64* %16, align 8
  ret %jl_value_t* %28
}

@andreasnoack (Member, Author)

@vtjnash I tried what you suggested in #15596 (comment) and it fixed the issue. I had to comment out a few things in sysimg.jl and precompile.jl for the compilation to succeed.

@vtjnash changed the title from "2x slowdown in randn and randexp" to "inlining heuristic is unstable [was: 2x slowdown in randn and randexp]" Apr 29, 2016
@vtjnash (Member) commented Apr 29, 2016

With Andreas's help, I tracked this down to an instability in the inlining heuristic. Since we are resolving a cycle here, we need to split finish into two stages (finalizing the type-information marking, then doing optimization) and pick a predictable edge in the function graph from which to do inlining (e.g. I think we should start with the smallest function and resolve inlining working backwards from there; I'm not certain yet whether that requires computing the shape of a cycle more exactly than we currently do).

For a quick fix, it might be enough to just invert the fixed-point finish loop at 6382116?w=1#diff-c0d983631c937ee5acfd0137cd7332deR1702. Right now it's in provably the wrong order; at least if we did the loop in reverse, it wouldn't always be wrong.
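To illustrate why visit order matters, here is a toy model of a size-threshold inlining pass over a two-function cycle (entirely made up; this is not Julia's actual inference code): whether a callee still looks small enough to inline depends on whether it has already absorbed its own callees by the time its caller is visited.

```julia
# Toy model: a function's post-inlining size grows by the size of every callee
# that is under the threshold at the moment the function is visited.
const THRESHOLD = 10

function finish!(sizes::Dict{Symbol,Int}, callees::Dict{Symbol,Vector{Symbol}}, order)
    for f in order
        for c in callees[f]
            if sizes[c] <= THRESHOLD
                sizes[f] += sizes[c]   # "inline": absorb the callee's body
            end
        end
    end
    return sizes
end

callees = Dict(:a => [:b], :b => [:a])              # a two-function cycle
finish!(Dict(:a => 6, :b => 8), callees, [:a, :b])  # Dict(:a => 14, :b => 8)
finish!(Dict(:a => 6, :b => 8), callees, [:b, :a])  # Dict(:a => 6, :b => 14)
# Same cycle, different visit order, different inlining outcome.
```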

@JeffBezanson (Member)

Ah, very interesting!

@tkelman assigned vtjnash and unassigned andreasnoack May 5, 2016
vtjnash added a commit that referenced this issue May 6, 2016
workaround to fix #15596
@tkelman added the "potential benchmark (Could make a good benchmark in BaseBenchmarks)" label May 6, 2016
@KristofferC removed the "potential benchmark (Could make a good benchmark in BaseBenchmarks)" label Oct 8, 2018