Poor performance for linspace #13401
Comments
I'm not sure I understand the concern. I don't think there's anything special about |
It's a bit weirder than that. The comprehension syntax uses the iterator interface, so bounds checks should not be involved here. However, look at this:
Now try to simplify it by just removing the varargs:
now it takes more than twice the time!
By the way I checked that forcing inlining of the |
I had followed the exact same trail, but for me |
@carlobaldassi I get no substantial timing difference. @timholy is this a bounds-checking issue? I don't see any differences in the following:
|
None of those should check bounds. On the mailing list there was discussion of |
@timholy, in evaluating |
julia> @time collect(x);
0.048368 seconds (7 allocations: 76.294 MB, 18.70% gc time)
julia> @time vc(x);
0.125762 seconds (6 allocations: 76.294 MB, 22.66% gc time)
LLVMs: https://gist.github.com/KristofferC/0d915e63097010a9bfa1 |
Shoot, I missed that you were timing Still, for me, those are identical. Running
Possibly fixed by #13355? (Which is included in |
Changed to the same commit, still a difference. Odd. I get the same LLVM code as before.
julia> @time collect(x);
0.049824 seconds (7 allocations: 76.294 MB)
julia> @time vc(x);
0.130104 seconds (6 allocations: 76.294 MB, 29.40% gc time)
|
I just built with the latest 0.5 (0.5.0-dev+7982), and |
I just tested that on the latest master branch (v"0.5.0-dev+564") the difference between
@stevengj I don't observe the slowdown you mention. We seem to have a different "latest" 0.5 version (I just pulled and compiled):
|
Similarly, I cannot see any difference between For SGJ's original
julia> @code_llvm f(x)
define %jl_value_t* @julia_f_22480(%LinSpace*) {
; ...
L1: ; preds = %pass3, %L1.preheader
%"#s237.0" = phi i64 [ %43, %pass3 ], [ 0, %L1.preheader ]
%"#s238.0" = phi i64 [ %44, %pass3 ], [ 1, %L1.preheader ]
; This is the LinSpace next calculation:
%26 = load double* %6, align 8
%27 = sitofp i64 %"#s238.0" to double
%28 = fsub double %26, %27
%29 = load double* %22, align 8
%30 = fmul double %28, %29
%31 = add i64 %"#s238.0", -1
%32 = sitofp i64 %31 to double
%33 = load double* %23, align 8
%34 = fmul double %32, %33
%35 = fadd double %30, %34
%36 = load double* %24, align 8
%37 = fdiv double %35, %36
; Call sin
%38 = call double inttoptr (i64 13391592272 to double (double)*)(double %37)
; … identical after this
The thing that's really remarkable is that there's a bounds check when using the full array, and it's still faster (ranges don't check bounds in
julia> @code_llvm f(collect(x))
define %jl_value_t* @julia_f_22666(%jl_value_t*, %jl_value_t**, i32) {
; ...
L1: ; preds = %pass, %L1.preheader
%"#s237.0" = phi i64 [ %29, %pass ], [ 0, %L1.preheader ]
%"#s238.0" = phi i64 [ %30, %pass ], [ 1, %L1.preheader ]
; Check bounds
%16 = add i64 %"#s238.0", -1
%17 = load i64* %10, align 8
%18 = icmp ult i64 %16, %17
br i1 %18, label %idxend, label %oob
oob: ; preds = %L1
%19 = alloca i64, align 8
store i64 %"#s238.0", i64* %19, align 8
call void @jl_bounds_error_ints(%jl_value_t* %8, i64* %19, i64 1)
unreachable
idxend: ; preds = %L1
; load element
%20 = load i8** %14, align 8
%21 = bitcast i8* %20 to double*
%22 = getelementptr double* %21, i64 %16
%23 = load double* %22, align 8
; Call sin
%24 = call double inttoptr (i64 13391592272 to double (double)*)(double %23)
; … identical after this
The other strange thing about this is that there seems to be some sort of interaction between the LinSpace calculation and the math. If I just "collect" the LinSpace in a comprehension without doing anything it takes about 60ms longer than collecting a regular Array… but
Similarly, if we just compare the cost of computing one element to the cost of the scalar computation that we're doing in the comprehension, it would imply that there shouldn't be more than a 10-20% overhead:
Even stranger: things get worse with |
I wondered if there might be a similar fix as for #13866. That didn't pan out, but I did discover something interesting (
# The following function will be tested with and without the @simd
function foo(X)
s = zero(eltype(X))
@inbounds @simd for x in X
s += x
end
s
end
With the @simd
julia> @time foo(linspace(1,5,10^7))
0.011315 seconds (7 allocations: 240 bytes)
3.000000000000007e7
julia> @code_llvm foo(linspace(1,5,10^7))
define double @julia_foo_21514(%LinSpace*) {
L:
%1 = load %LinSpace* %0, align 8
%2 = extractvalue %LinSpace %1, 0
%3 = extractvalue %LinSpace %1, 1
%4 = extractvalue %LinSpace %1, 2
%5 = extractvalue %LinSpace %1, 3
%6 = fadd double %4, -1.000000e+00
%7 = bitcast double %6 to i64
%8 = icmp sgt i64 %7, -1
%9 = fadd double %4, 1.000000e+00
%10 = select i1 %8, double %4, double %9
%11 = fptosi double %10 to i64
%12 = sitofp i64 %11 to double
%13 = fptosi double %12 to i64
%14 = icmp eq i64 %11, %13
%15 = fcmp oeq double %10, %12
%16 = and i1 %15, %14
%17 = icmp slt i64 %11, 1
%18 = fmul double %2, %6
%19 = fdiv double %18, %5
%20 = fcmp ugt double %4, 0.000000e+00
%21 = fsub double %3, %2
%22 = fdiv double %21, %5
%23 = select i1 %20, double %22, double 0x7FF8000000000000
br i1 %16, label %pass, label %fail
fail: ; preds = %L
%24 = load %jl_value_t** @jl_inexact_exception, align 8
call void @jl_throw_with_superfluous_argument(%jl_value_t* %24, i32 67)
unreachable
pass: ; preds = %L
br i1 %17, label %L9, label %L3
L3: ; preds = %L3, %pass
%"##i#7058.0" = phi i64 [ %29, %L3 ], [ 0, %pass ]
%s.1 = phi double [ %28, %L3 ], [ 0.000000e+00, %pass ]
%25 = sitofp i64 %"##i#7058.0" to double
%26 = fmul double %25, %23
%27 = fadd double %19, %26
%28 = fadd fast double %s.1, %27
%29 = add i64 %"##i#7058.0", 1
%exitcond = icmp eq i64 %29, %11
br i1 %exitcond, label %L9, label %L3
L9: ; preds = %L3, %pass
%s.3 = phi double [ 0.000000e+00, %pass ], [ %28, %L3 ]
ret double %s.3
}
Without the @simd
julia> @time foo(linspace(1,5,10^7))
0.083340 seconds (7 allocations: 240 bytes)
3.000000000000007e7
julia> @code_llvm foo(linspace(1,5,10^7))
define double @julia_foo_21534(%LinSpace*) {
top:
%1 = load %LinSpace* %0, align 8
%2 = extractvalue %LinSpace %1, 2
%3 = fadd double %2, -1.000000e+00
%4 = bitcast double %3 to i64
%5 = icmp sgt i64 %4, -1
%6 = fadd double %2, 1.000000e+00
%7 = select i1 %5, double %2, double %6
%8 = fptosi double %7 to i64
%9 = sitofp i64 %8 to double
%10 = fptosi double %9 to i64
%11 = icmp eq i64 %8, %10
%12 = fcmp oeq double %7, %9
%13 = and i1 %12, %11
br i1 %13, label %pass, label %fail
fail: ; preds = %top
%14 = load %jl_value_t** @jl_inexact_exception, align 8
call void @jl_throw_with_superfluous_argument(%jl_value_t* %14, i32 3)
unreachable
pass: ; preds = %top
%15 = icmp eq i64 %8, 0
br i1 %15, label %L5, label %L.preheader
L.preheader: ; preds = %pass
%16 = extractvalue %LinSpace %1, 0
%17 = extractvalue %LinSpace %1, 1
%18 = extractvalue %LinSpace %1, 3
br label %pass4
pass4: ; preds = %pass4, %L.preheader
%"#s1.0" = phi i64 [ %28, %pass4 ], [ 1, %L.preheader ]
%s.0 = phi double [ %27, %pass4 ], [ 0.000000e+00, %L.preheader ]
%19 = sitofp i64 %"#s1.0" to double
%20 = fsub double %2, %19
%21 = fmul double %16, %20
%22 = add i64 %"#s1.0", -1
%23 = sitofp i64 %22 to double
%24 = fmul double %17, %23
%25 = fadd double %21, %24
%26 = fdiv double %25, %18
%27 = fadd double %s.0, %26
%28 = add i64 %"#s1.0", 1
%29 = icmp eq i64 %"#s1.0", %8
br i1 %29, label %L5, label %pass4
L5: ; preds = %pass4, %pass
%s.1 = phi double [ 0.000000e+00, %pass ], [ %27, %pass4 ]
ret double %s.1
}
Reference: using an Array
function foo2(X)
s = zero(eltype(X))
@inbounds @simd for i = 1:length(X)
s += X[i]
end
s
end
julia> @time foo2(X)
0.013635 seconds (5 allocations: 176 bytes)
3.0000000000000022e7
julia> @code_llvm foo2(X)
define double @julia_foo2_21527(%jl_value_t*) {
L:
%1 = getelementptr inbounds %jl_value_t* %0, i64 1
%2 = bitcast %jl_value_t* %1 to i64*
%3 = load i64* %2, align 8
%4 = icmp sgt i64 %3, 0
%5 = select i1 %4, i64 %3, i64 0
%6 = bitcast %jl_value_t* %0 to i8**
%7 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %5, i64 1)
%8 = extractvalue { i64, i1 } %7, 1
br i1 %8, label %fail, label %pass
fail: ; preds = %L
%9 = load %jl_value_t** @jl_overflow_exception, align 8
call void @jl_throw_with_superfluous_argument(%jl_value_t* %9, i32 67)
unreachable
pass: ; preds = %L
%10 = extractvalue { i64, i1 } %7, 0
%11 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %10, i64 1)
%12 = extractvalue { i64, i1 } %11, 1
br i1 %12, label %fail1, label %pass2
fail1: ; preds = %pass
%13 = load %jl_value_t** @jl_overflow_exception, align 8
call void @jl_throw_with_superfluous_argument(%jl_value_t* %13, i32 67)
unreachable
pass2: ; preds = %pass
%14 = extractvalue { i64, i1 } %11, 0
%15 = icmp slt i64 %14, 1
br i1 %15, label %L11, label %if3
if3: ; preds = %pass2
%16 = load i8** %6, align 8
%17 = bitcast i8* %16 to double*
%n.vec = and i64 %14, -4
%cmp.zero = icmp eq i64 %n.vec, 0
br i1 %cmp.zero, label %middle.block, label %vector.body
vector.body: ; preds = %vector.body, %if3
%index = phi i64 [ %index.next, %vector.body ], [ 0, %if3 ]
%vec.phi = phi <2 x double> [ %22, %vector.body ], [ zeroinitializer, %if3 ]
%vec.phi13 = phi <2 x double> [ %23, %vector.body ], [ zeroinitializer, %if3 ]
%18 = getelementptr double* %17, i64 %index
%19 = bitcast double* %18 to <2 x double>*
%wide.load = load <2 x double>* %19, align 8
%.sum19 = or i64 %index, 2
%20 = getelementptr double* %17, i64 %.sum19
%21 = bitcast double* %20 to <2 x double>*
%wide.load14 = load <2 x double>* %21, align 8
%22 = fadd <2 x double> %vec.phi, %wide.load
%23 = fadd <2 x double> %vec.phi13, %wide.load14
%index.next = add i64 %index, 4
%24 = icmp eq i64 %index.next, %n.vec
br i1 %24, label %middle.block, label %vector.body
middle.block: ; preds = %vector.body, %if3
%resume.val = phi i64 [ 0, %if3 ], [ %n.vec, %vector.body ]
%rdx.vec.exit.phi = phi <2 x double> [ zeroinitializer, %if3 ], [ %22, %vector.body ]
%rdx.vec.exit.phi17 = phi <2 x double> [ zeroinitializer, %if3 ], [ %23, %vector.body ]
%bin.rdx = fadd <2 x double> %rdx.vec.exit.phi17, %rdx.vec.exit.phi
%rdx.shuf = shufflevector <2 x double> %bin.rdx, <2 x double> undef, <2 x i32> <i32 1, i32 undef>
%bin.rdx18 = fadd <2 x double> %bin.rdx, %rdx.shuf
%25 = extractelement <2 x double> %bin.rdx18, i32 0
%cmp.n = icmp eq i64 %14, %resume.val
br i1 %cmp.n, label %L11, label %L5
L5: ; preds = %L5, %middle.block
%"##i#6971.0" = phi i64 [ %29, %L5 ], [ %resume.val, %middle.block ]
%s.1 = phi double [ %28, %L5 ], [ %25, %middle.block ]
%26 = getelementptr double* %17, i64 %"##i#6971.0"
%27 = load double* %26, align 8
%28 = fadd fast double %s.1, %27
%29 = add i64 %"##i#6971.0", 1
%exitcond = icmp eq i64 %29, %14
br i1 %exitcond, label %L11, label %L5
L11: ; preds = %L5, %middle.block, %pass2
%s.3 = phi double [ 0.000000e+00, %pass2 ], [ %28, %L5 ], [ %25, %middle.block ]
ret double %s.3
} |
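Read back into Julia, the two scalar loop bodies above differ mainly in where the division happens. The following is only a sketch under stated assumptions: `SketchLinSpace` is a hypothetical struct whose four fields mirror the ones extracted in the IR (it is not the Base definition), the names `sum_divide` and `sum_hoisted` are made up for illustration, and the `struct` syntax assumes Julia 1.x rather than the 0.4/0.5 builds used in this thread.

```julia
# Hypothetical stand-in for the four-field LinSpace seen in the IR above.
struct SketchLinSpace
    start::Float64
    stop::Float64
    len::Float64
    divisor::Float64
end

# Loop body of the version without @simd: one division per element.
function sum_divide(r::SketchLinSpace)
    s = 0.0
    for i in 1:Int(r.len)
        s += (r.start * (r.len - i) + r.stop * (i - 1)) / r.divisor
    end
    return s
end

# Loop body of the @simd version: both divisions happen before the loop,
# leaving one multiply and one add per element.
function sum_hoisted(r::SketchLinSpace)
    x0 = r.start * (r.len - 1) / r.divisor   # corresponds to %19 in the IR
    dx = (r.stop - r.start) / r.divisor      # corresponds to %22 in the IR
    s = 0.0
    for i in 0:(Int(r.len) - 1)
        s += x0 + i * dx
    end
    return s
end

r = SketchLinSpace(1.0, 5.0, 1e6, 1e6 - 1)
println(sum_divide(r), " ", sum_hoisted(r))
```

Both functions add up the same points and should agree up to rounding; the sketch only makes the per-element work of each loop explicit and does not reproduce the timings reported above.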
On master (and a different machine) things are even more dramatic, because it vectorizes with the |
With |
Ah, I bet it all comes down to that division: one cannot account for a 7-fold effect from a couple of extra adds or multiplies. I don't have time to check now, but perhaps we should be storing, and multiplying by, |
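One reading of this suggestion, assuming the stored quantity is the reciprocal of the divisor, can be sketched as follows. `RecipLinSpace`, `invdivisor`, `elem_mul` and `elem_div` are hypothetical names introduced here for illustration, the divisor is assumed to be `len - 1` with `len >= 2`, and the syntax is Julia 1.x.

```julia
# Hypothetical variant that stores 1/divisor once at construction time.
struct RecipLinSpace
    start::Float64
    stop::Float64
    len::Float64
    invdivisor::Float64   # assumption: precomputed reciprocal of the divisor
end

# Convenience constructor; assumes len >= 2 so that len - 1 is nonzero.
RecipLinSpace(start, stop, len) = RecipLinSpace(start, stop, len, 1 / (len - 1))

# Element i (1-based), using a multiply where the loop in the IR used a divide.
elem_mul(r::RecipLinSpace, i) =
    (r.start * (r.len - i) + r.stop * (i - 1)) * r.invdivisor

# Reference: the division form, for comparison.
elem_div(r::RecipLinSpace, i) =
    (r.start * (r.len - i) + r.stop * (i - 1)) / (r.len - 1)

r = RecipLinSpace(1.0, 5.0, 11)
maxerr = maximum(abs(elem_mul(r, i) - elem_div(r, i)) for i in 1:11)
println(maxerr)
```

Multiplying by a rounded reciprocal is not guaranteed to give the same bits as dividing, so a change like this would trade a small amount of accuracy for a cheaper loop body.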
Over at #14420 I've been reconsidering the However, |
(sorry, finger slipped when scrolling) |
Fix #13401 (disable inbound checking for arrays generated on the fly)
This seems odd to me:
I can understand why e.g.
x.^2
might be slower for a linspace, since the computation is so cheap compared to computing the elements of the linspace on the fly, but I don't understand why a more complex operation like this, in a comprehension where the elements are calculated only once per computation, is so much slower.
See also the mailing list discussion.
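A comparison of this kind can be sketched on a current Julia as follows. This is not the original snippet: `f` is an assumed stand-in for the comprehension being timed, and `LinRange` is used because `linspace` no longer exists in Julia 1.x, so timings from it will not match the 0.4/0.5 numbers quoted in the thread.

```julia
# Assumed stand-in for the comprehension being timed.
f(X) = [sin(t) for t in X]

n = 10^6
x = LinRange(1.0, 5.0, n)   # lazy: elements are computed on the fly
xc = collect(x)             # materialized Vector{Float64}

f(x); f(xc)                 # warm up so compilation is not timed

@time f(x)
@time f(xc)
```

If the two timings differ by much more than the cost of producing one range element, the range's element calculation, rather than `sin`, is the likely bottleneck.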