Add LLVM level allocation optimization pass #22684

yuyichao · 2017-07-05T04:08:21Z

Tests might be added later.

Just to give a taste of it, though:

julia> function test_opt(a::Float64)
           s = Ref{Float64}()
           c = Ref{Float64}()
           ccall((:sincos, Base.libm_name), Void, (Float64, Ptr{Float64}, Ptr{Float64}), a, s, c)
           s[], c[]
       end
test_opt (generic function with 1 method)

julia> for i in 1
           @time test_opt(1.0)
       end
  0.000005 seconds

julia> @code_llvm test_opt(1.0)

define void @julia_test_opt_63022([2 x double] addrspace(11)* noalias nocapture sret, double) #0 !dbg !5 {
top:
  %2 = alloca i64, align 8
  %3 = bitcast i64* %2 to %jl_value_t*
  %4 = alloca [2 x double], align 8
  %5 = bitcast %jl_value_t* %3 to double*
  %6 = getelementptr inbounds [2 x double], [2 x double]* %4, i64 0, i64 0
  call void inttoptr (i64 140684203667904 to void (double, double*, double*)*)(double %1, double* %6, double* %5)
  %7 = load i64, i64* %2, align 16
  %8 = getelementptr inbounds [2 x double], [2 x double]* %4, i64 0, i64 1
  %9 = bitcast double* %8 to i64*
  store i64 %7, i64* %9, align 8
  %10 = bitcast [2 x double]* %4 to i8*
  %11 = bitcast [2 x double] addrspace(11)* %0 to i8 addrspace(11)*
  call void @llvm.memcpy.p11i8.p0i8.i32(i8 addrspace(11)* %11, i8* %10, i32 16, i32 8, i1 false)
  ret void
}

julia> function f(a)
           b = Ref(a)
           b[] += 1
           b[] -= 1
           b[] += 2
           b[]
       end
f (generic function with 1 method)

julia> @code_llvm f(2)

define i64 @julia_f_63068(i64) #0 !dbg !5 {
top:
  %1 = add i64 %0, 2
  ret i64 %1
}

This takes advantage of LLVM optimizations to get more precise escape info, but it doesn't replace allocation elimination in typeinf, which can also split allocations and do fancier transformations.

The generated code is not 100% optimal (an issue with LLVM optimization order), though that might improve when we can run this later in the pipeline. The current placement for 5.0 is arbitrary...

@nanosoldier runbenchmarks(ALL, vs = ":master")

nanosoldier · 2017-07-05T04:27:52Z

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`make -j3`, ProcessExited(2)) [2]

Logs and partial data can be found here
cc @jrevels

yuyichao · 2017-07-05T04:32:59Z

@nanosoldier runbenchmarks(ALL, vs = ":master")

nanosoldier · 2017-07-05T04:55:01Z

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`/home/nanosoldier/workdir/tmpSVWZ30/julia -e Pkg.update()`, ProcessSignaled(11)) [0]

Logs and partial data can be found here
cc @jrevels

yuyichao · 2017-07-05T05:48:59Z

@nanosoldier runbenchmarks(ALL, vs = ":master")

nanosoldier · 2017-07-05T06:11:05Z

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`/home/nanosoldier/workdir/tmpSzyGHq/julia -e Pkg.update()`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @jrevels

yuyichao · 2017-07-05T15:51:12Z

@nanosoldier runbenchmarks(ALL, vs = ":master")

Keno · 2017-07-05T18:25:17Z

Stealing my thunder, eh? That's ok :). I'll try to review this tomorrow.

tkelman · 2017-07-05T18:42:10Z

src/llvm-alloc-opt.cpp

+    if (auto call = dyn_cast<CallInst>(I)) {
+        if (ptr_from_objref && ptr_from_objref == call->getCalledFunction())
+            return true;
+        // Only use in argument counts, uses in operand bundle doesn't since it cannot escape.


doesn't what?

Doesn't count.

ah then "uses ... don't" or "use ... doesn't"

nanosoldier · 2017-07-05T18:42:12Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

tkelman · 2017-07-05T19:02:11Z

src/llvm-alloc-opt.cpp

+        bool ignore_tag = true;
+        auto orig = it.first;
+        if (optimize && checkUses(orig, 0, ignore_tag)) {
+            // The allocation does not escape or be used in a phi node so none of the derived


or get used ?

tkelman · 2017-07-05T19:10:32Z

Does this mean we can finally deprecate & in ccall?

yuyichao · 2017-07-05T20:29:06Z

> Stealing my thunder, eh? That's ok :)

FWIW, the main part of this is just ~4hrs of work after I suddenly got interested in checking whether the gcframe placement pass is late enough for us to do this. So nothing much would be lost if you already have a better version working ;-p (Plus I've been implementing Jameson's ideas all along.)

> Does this mean we can finally deprecate & in ccall?

Yes, that is the case this PR handles best. There are still rare cases where the & in ccall can be used to reduce allocation, but that only happens with a custom ptr_arg_cconvert and ptr_arg_unsafe_convert, and I don't think anyone is doing it. I think master will be open for PRs replacing &s with Refs after this is merged, and we can deprecate & after making sure nothing goes wrong.

maleadt · 2017-07-06T09:15:35Z

src/cgutils.cpp

@@ -2156,25 +2156,10 @@ static Value *emit_allocobj(jl_codectx_t &ctx, size_t static_size, Value *jt)
 {
    JL_FEAT_REQUIRE(ctx, dynamic_alloc);
    JL_FEAT_REQUIRE(ctx, runtime);


Should probably postpone these checks to the actual lowering (but we don't have a cgctx there).

That can be passed as a parameter. What would be the correct behavior if the test fails, though?

abort compilation. currently JL_FEAT_REQUIRE just calls jl_error, and we use the cgctx for current function name & line number, so we'd need to do something else anyway (use DebugLoc, I assume). better leave that to a different PR, for now maybe just:

if (!JL_FEAT_TEST(ctx, dynamic_alloc)) jl_error(...)

I mean, I don't think we can throw an error in the llvm pass.

Hmm I hadn't considered that, as it worked with _dump_function but seems to mess up the JIT indeed. Then again, I only use that feature through _dump_function...
Maybe a "used features mask" to be checked after lowering? Let's just leave it at what it is now and put that in a different PR.
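For illustration only, the "used features mask" idea could look roughly like the sketch below: lowering records which features it actually relied on, and a single check after lowering compares that against what the target allows. All names here (`Feature`, `UsedFeatures`, `disallowed_features`) are hypothetical and are not taken from the Julia source; how the resulting error would be reported is left open, as in the discussion above.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical feature bits. The names echo the JL_FEAT_* checks discussed
// above but are assumptions made for this sketch.
enum Feature : std::uint32_t {
    FEAT_DYNAMIC_ALLOC = 1u << 0,
    FEAT_RUNTIME       = 1u << 1,
};

// Accumulates the features that lowering actually ended up using.
struct UsedFeatures {
    std::uint32_t mask = 0;
    void record(Feature f) { mask |= f; }
};

// Intended to be called once after lowering. Returns the names of features
// that were used but are not in `allowed`, so the caller can decide how to
// abort compilation.
std::vector<std::string> disallowed_features(const UsedFeatures &used,
                                             std::uint32_t allowed)
{
    std::vector<std::string> out;
    const std::uint32_t bad = used.mask & ~allowed;
    if (bad & FEAT_DYNAMIC_ALLOC)
        out.push_back("dynamic_alloc");
    if (bad & FEAT_RUNTIME)
        out.push_back("runtime");
    return out;
}
```

The point of the design is that the pass itself never throws; it only sets bits, and the error is raised from a place where throwing is safe.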

Keno · 2017-07-07T09:28:36Z

src/codegen.cpp

+    gc_alloc_args.push_back(T_prjlvalue);
+    jl_alloc_obj_func = Function::Create(FunctionType::get(T_prjlvalue, gc_alloc_args, false),
+                                         Function::ExternalLinkage,
+                                         "julia.gc_alloc_obj");


Ideally, this function should get a noalias attribute.

Keno · 2017-07-07T09:37:25Z

Couple of suggestions:

Use LLVM's PtrUseVisitor to avoid excessive stack growth in the recursive approach
You can call mem2reg as a cleanup step: http://llvm.org/doxygen/namespacellvm.html#a033a44177ba94b77622aae61ff4fb4b2

yuyichao · 2017-07-08T15:40:51Z

noalias attribute added.
I can't figure out how to use PtrUseVisitor to handle AddrSpaceCastInst, and it also seems hard to use it to handle the mutation recursion, so I decided to use the pattern I'm more familiar with: a manual stack similar to the one in the GC.
mem2reg actually doesn't handle this case (at all). instcombine and sroa do, and the new placement for 5.0 takes advantage of that.
Added LLVM lifetime intrinsics to reuse stack space. Not sure if there's an LLVM helper function that can make doing that easier...
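As a rough illustration of the manual-stack pattern mentioned above, here is a minimal sketch that walks a use graph with an explicit worklist instead of recursion. It does not use LLVM at all: `Node` and `all_uses_local` are hypothetical stand-ins, and the real pass tracks considerably more state per use.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an IR value. `escapes` marks a node that would
// let the pointer escape (for example by storing it somewhere); `users` holds
// the indices of the nodes that use this one.
struct Node {
    bool escapes = false;
    std::vector<std::size_t> users;
};

// Returns true if no transitive user of `root` escapes. An explicit worklist
// replaces recursion, so deep use chains cannot overflow the call stack.
bool all_uses_local(const std::vector<Node> &graph, std::size_t root)
{
    std::vector<bool> visited(graph.size(), false);
    std::vector<std::size_t> stack{root};
    visited[root] = true;
    while (!stack.empty()) {
        const std::size_t cur = stack.back();
        stack.pop_back();
        for (std::size_t user : graph[cur].users) {
            if (graph[user].escapes)
                return false;
            if (!visited[user]) {
                visited[user] = true;
                stack.push_back(user);
            }
        }
    }
    return true;
}
```

The `visited` vector keeps nodes that are reachable along several paths from being pushed more than once.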

yuyichao · 2017-07-08T16:43:24Z

@nanosoldier runbenchmarks(ALL, vs = ":master")

Keno · 2017-07-17T19:08:17Z

Why is the lowering not relatively trivial?

Keno · 2017-07-17T19:09:02Z

Or if it is, GC root lowering looks at all the calls and rewrites most of them anyway (because of the cc convention), so it seems like a fine place to do it.

Keno · 2017-07-17T19:09:38Z

The concern is that this pass might not know whether it's the last one (in IPO pipelines the pass manager will automatically rerun parts of the pipeline).

yuyichao · 2017-07-22T21:07:35Z

Moved the intrinsic lowering to gcframe lowering pass and updated the test.

yuyichao · 2017-07-23T00:35:47Z

Travis failure looks unrelated and is happening everywhere...

yuyichao · 2017-07-23T22:43:32Z

More comments?

Keno · 2017-07-23T22:55:47Z

I'll try to take another pass through this tonight or tomorrow morning.

Keno · 2017-07-24T19:36:41Z

src/llvm-alloc-opt.cpp

+    }
+}
+
+bool AllocOpt::isSafepoint(Instruction *inst)


Pull this out into a helper for both this and the GC lowering code?

Does the lowering pass actually need this function? It has similar logic but does different things for different branches.

It doesn't need this function, but I'd like to at least share the "Known functions emitted in codegen that are not safepoints" part, so we don't have to make changes there in multiple places.

Also note that when the lowering pass recognizes more functions as non-safepoints, we don't necessarily want to update this pass to include them. Here I included the same list just to be safe. In principle, this needs a list of functions that codegen assumes are not safepoints. There can be functions that are never safepoints, but as long as neither codegen nor LLVM can insert a call to one into an unsafe use chain, it doesn't need to be treated as a non-safepoint here.

Keno · 2017-07-24T19:38:32Z

src/llvm-alloc-opt.cpp

+{
+    if (!alloc_obj)
+        return false;
+    std::map<CallInst*,size_t> allocs;


std::map is generally discouraged in LLVM passes because of nondeterministic iteration order. Can you use a data structure with deterministic iteration?

It seems that I don't actually need to look anything up in it, so I'll just use a vector instead...
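To make the concern concrete: a `std::map` keyed on pointers iterates in address order, which can differ from run to run, while a vector iterates in insertion order. The sketch below is only an illustration of the vector-based alternative; `Call` and `AllocList` are hypothetical names standing in for `CallInst` and whatever container the pass actually uses.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for llvm::CallInst.
struct Call {
    int id;
};

// Records allocation calls together with their sizes in the order they were
// found. Unlike a std::map keyed on Call*, the iteration order does not
// depend on where the objects happen to live in memory.
class AllocList {
public:
    void add(Call *call, std::size_t size) { allocs_.emplace_back(call, size); }

    const std::vector<std::pair<Call*, std::size_t>> &items() const
    {
        return allocs_;
    }

private:
    std::vector<std::pair<Call*, std::size_t>> allocs_;
};
```

This only works because no lookup by key is needed; if lookups were required, an insertion-ordered map type would be the closer replacement.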

Keno · 2017-07-24T19:42:50Z

src/llvm-alloc-opt.cpp

+            if (ptr_from_objref && ptr_from_objref == call->getCalledFunction())
+                return true;
+            auto opno = use->getOperandNo();
+            // Uses in `jl_roots` operand bundle are not counted as escaping, everything else do.


Nit: everything else is.

Keno · 2017-07-24T19:45:36Z

src/llvm-alloc-opt.cpp

+    };
+    // Both `orig_i` and `new_i` should be pointer of the same type
+    // but possibly different address spaces. `new_i` is always in addrspace 0.
+    auto replace_inst = [&] (Instruction *user) {


You can technically mutate instructions in place, but I'm fine with this implementation as well.

yuyichao · 2017-07-24T20:11:31Z

Updated comment and switched to a vector instead for recording the allocations.
I did not pull the isSafePoint out since there currently isn't a use in the lowering pass. We can certainly split it out when there is one.

This can obtain escape information with much higher precision than what we can currently do in typeinf. However, it does not replace the alloc_elim_pass! in type inference, since this cannot handle objects with reference fields.

Fix #20452

Fix #21591

Fix #22004

yuyichao · 2017-07-27T19:22:24Z

I'd like to not make any other changes to the organization, especially since I still don't think it's a good idea to move the allocation that late in the pipeline. It disables the LLVM constant folding of the write barrier on 5.0...

There are two FreeBSD timeouts and I have no way to debug them, so I'll just wait for someone else to figure out what's wrong there, fix it, or merge this PR.

yuyichao · 2017-07-27T19:49:03Z

And for anyone who wants to use this on master, the hanging test on FreeBSD is file.

iblislin · 2017-07-28T03:03:05Z

The distributed test suite is also absent from the log.
FreeBSD CI keeps hanging randomly... It seems to have started with 5ea8c7c.
I guess the test case added in #22566 triggers another bug...

yuyichao · 2017-07-28T03:36:44Z

> The distributed test suite is also absent from the log.

JULIA_TEST_MAX_RSS is set so it is moved to worker one and will only run if everything else finishes.

iblislin · 2017-07-28T04:02:51Z

👌
If the file test hangs, feel free to rerun CI.
There is a rebuild button at the top-right corner of buildbot UI.

yuyichao · 2017-07-28T11:23:40Z

The restarted one passed, so I'm merging this now...

iblislin · 2017-07-28T11:47:07Z

Got some compilation warnings from master e1a604e:

/usr/home/iblis/git/julia/src/llvm-alloc-opt.cpp:639:53: warning: braces around scalar initializer [-Wbraced-scalar-init]                         
                                                    {ConstantInt::get(T_size, -1)});                                                              
                                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                                
/usr/home/iblis/git/julia/src/llvm-late-gc-lowering.cpp:1173:50: warning: braces around scalar initializer [-Wbraced-scalar-init]                 
                                                 {ConstantInt::get(T_size, -1)});                                                                 
                                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                                   
1 warning generated.                                                                                                                              
1 warning generated.

with

└─[iblis@abeing]% clang --version
FreeBSD clang version 4.0.0 (branches/release_40 296509) (based on LLVM 4.0.0)
Target: x86_64-unknown-freebsd12.0
Thread model: posix
InstalledDir: /usr/bin
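For readers unfamiliar with this diagnostic, the sketch below shows one common way it can arise: a braced list with a single element is passed where a plain scalar is expected. `twice` is a hypothetical function made up for this example, and the sketch makes no claim about which overloads are involved at the two call sites reported above.

```cpp
// Hypothetical function taking a single scalar argument.
int twice(int value) { return 2 * value; }

// The braces are legal but redundant; clang may report
// -Wbraced-scalar-init for this call.
int with_braces() { return twice({21}); }

// Same value passed without braces, which avoids the warning.
int without_braces() { return twice(21); }
```

Both functions return the same value; the difference is only in the diagnostic.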

JeffBezanson · 2017-07-28T17:52:31Z

This is awesome!

From the nanosoldier run, it looks like we don't have any benchmarks that meaningfully improve with this? We should try to add some.

yuyichao · 2017-07-28T18:25:39Z

Regressions in this will probably show up in future benchmarks as we use Ref in ccalls more and more.

We are in general very good at avoiding performance pitfalls in the benchmarks...

yuyichao added the compiler:codegen (Generation of LLVM IR and native code) and performance (Must go faster) labels Jul 5, 2017

yuyichao force-pushed the yyc/codegen/alloc-elim branch from 57a519f to cbfa09e Compare July 5, 2017 04:27

yuyichao force-pushed the yyc/codegen/alloc-elim branch from cbfa09e to c9c47ce Compare July 5, 2017 05:48

yuyichao force-pushed the yyc/codegen/alloc-elim branch 3 times, most recently from 308a53f to 7a852a4 Compare July 5, 2017 13:47

tkelman reviewed Jul 5, 2017

maleadt reviewed Jul 6, 2017

Keno reviewed Jul 7, 2017

yuyichao mentioned this pull request Jul 8, 2017

Remove ieee754_rem_pio2 in favor of a rem_pio2_kernel written in Julia. #22603

Merged

yuyichao force-pushed the yyc/codegen/alloc-elim branch 2 times, most recently from 16eeea4 to aae4455 Compare July 8, 2017 15:33

yuyichao force-pushed the yyc/codegen/alloc-elim branch 2 times, most recently from 4e3af21 to 70f88e0 Compare July 8, 2017 16:39

yuyichao force-pushed the yyc/codegen/alloc-elim branch from 70f88e0 to 340ae86 Compare July 8, 2017 17:22

yuyichao force-pushed the yyc/codegen/alloc-elim branch 2 times, most recently from 7ad6a9c to f71401a Compare July 22, 2017 21:03

Keno reviewed Jul 24, 2017

yuyichao force-pushed the yyc/codegen/alloc-elim branch from f71401a to cd8f33e Compare July 24, 2017 20:10

yuyichao force-pushed the yyc/codegen/alloc-elim branch from cd8f33e to 05cefe8 Compare July 27, 2017 14:45

yuyichao added 3 commits July 27, 2017 13:30

Use a local buffer in modf

2793748

Fix #21591

Improving ieee754_rem_pio2

b1a188c

Fix #22004

yuyichao force-pushed the yyc/codegen/alloc-elim branch from 05cefe8 to b1a188c Compare July 27, 2017 17:30

yuyichao merged commit e1a604e into master Jul 28, 2017

yuyichao deleted the yyc/codegen/alloc-elim branch July 28, 2017 11:23

musm mentioned this pull request Aug 17, 2017

use Ref, not preallocated arrays, for passing parameters by reference JuliaMath/SpecialFunctions.jl#44

Merged

vtjnash mentioned this pull request Nov 2, 2017

Dissallow pointer_from_objref on immutable values #15857

Closed
