Add LLVM level allocation optimization pass #22684

Merged Jul 28, 2017 (3 commits)

@yuyichao (Contributor) commented Jul 5, 2017

Test might be added later.

Just to give a taste of it though

julia> function test_opt(a::Float64)
           s = Ref{Float64}()
           c = Ref{Float64}()
           ccall((:sincos, Base.libm_name), Void, (Float64, Ptr{Float64}, Ptr{Float64}), a, s, c)
           s[], c[]
       end
test_opt (generic function with 1 method)

julia> for i in 1
           @time test_opt(1.0)
       end
  0.000005 seconds

julia> @code_llvm test_opt(1.0)
define void @julia_test_opt_63022([2 x double] addrspace(11)* noalias nocapture sret, double) #0 !dbg !5 {
top:
  %2 = alloca i64, align 8
  %3 = bitcast i64* %2 to %jl_value_t*
  %4 = alloca [2 x double], align 8
  %5 = bitcast %jl_value_t* %3 to double*
  %6 = getelementptr inbounds [2 x double], [2 x double]* %4, i64 0, i64 0
  call void inttoptr (i64 140684203667904 to void (double, double*, double*)*)(double %1, double* %6, double* %5)
  %7 = load i64, i64* %2, align 16
  %8 = getelementptr inbounds [2 x double], [2 x double]* %4, i64 0, i64 1
  %9 = bitcast double* %8 to i64*
  store i64 %7, i64* %9, align 8
  %10 = bitcast [2 x double]* %4 to i8*
  %11 = bitcast [2 x double] addrspace(11)* %0 to i8 addrspace(11)*
  call void @llvm.memcpy.p11i8.p0i8.i32(i8 addrspace(11)* %11, i8* %10, i32 16, i32 8, i1 false)
  ret void
}

julia> function f(a)
           b = Ref(a)
           b[] += 1
           b[] -= 1
           b[] += 2
           b[]
       end
f (generic function with 1 method)

julia> @code_llvm f(2)
define i64 @julia_f_63068(i64) #0 !dbg !5 {
top:
  %1 = add i64 %0, 2
  ret i64 %1
}

This takes advantage of LLVM optimizations to get more precise escape information, but it doesn't replace allocation elimination in typeinf, which can also split allocations and do fancier transformations.

The generated code is not 100% optimal (an issue with LLVM optimization order), though that might improve when we can run this later in the pipeline. The current placement for 5.0 is arbitrary....

@nanosoldier runbenchmarks(ALL, vs = ":master")

@yuyichao yuyichao added compiler:codegen Generation of LLVM IR and native code performance Must go faster labels Jul 5, 2017
@yuyichao yuyichao force-pushed the yyc/codegen/alloc-elim branch from 57a519f to cbfa09e Compare July 5, 2017 04:27
@nanosoldier
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`make -j3`, ProcessExited(2)) [2]

Logs and partial data can be found here
cc @jrevels

@yuyichao
Contributor Author

yuyichao commented Jul 5, 2017

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`/home/nanosoldier/workdir/tmpSVWZ30/julia -e Pkg.update()`, ProcessSignaled(11)) [0]

Logs and partial data can be found here
cc @jrevels

@yuyichao yuyichao force-pushed the yyc/codegen/alloc-elim branch from cbfa09e to c9c47ce Compare July 5, 2017 05:48
@yuyichao
Contributor Author

yuyichao commented Jul 5, 2017

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`/home/nanosoldier/workdir/tmpSzyGHq/julia -e Pkg.update()`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @jrevels

@yuyichao yuyichao force-pushed the yyc/codegen/alloc-elim branch 3 times, most recently from 308a53f to 7a852a4 Compare July 5, 2017 13:47
@yuyichao
Contributor Author

yuyichao commented Jul 5, 2017

@nanosoldier runbenchmarks(ALL, vs = ":master")

@Keno
Member

Keno commented Jul 5, 2017

Stealing my thunder, eh? That's ok :). I'll try to review this tomorrow.

if (auto call = dyn_cast<CallInst>(I)) {
if (ptr_from_objref && ptr_from_objref == call->getCalledFunction())
return true;
// Only use in argument counts, uses in operand bundle doesn't since it cannot escape.
Contributor

doesn't what?

Contributor Author

Doesn't count.

Contributor

ah then "uses ... don't" or "use ... doesn't"
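
The rule discussed in this thread (a use as a call argument counts as an escape, a use in an operand bundle does not) can be sketched with a toy model. This is not the pass's code: `UseKind`, `useEscapes`, and `allocationEscapes` are hypothetical names, and the treatment of loads, stores, and returns is an assumption added only to round out the example.

```cpp
#include <vector>

// Toy model of one use of an allocation; not LLVM's real `Use` class.
enum class UseKind {
    Load,            // a value is read through the pointer
    StoreInto,       // a value is stored into the allocation
    StoreOf,         // the pointer itself is stored somewhere else
    CallArgument,    // the pointer is passed as an ordinary call argument
    CallRootsBundle, // the pointer appears only in a GC-roots operand bundle
    Return           // the pointer is returned from the function
};

// Returns true if this single use may let the pointer outlive the function.
// Only the CallArgument / CallRootsBundle split comes from the discussion;
// the other cases are assumptions for illustration.
bool useEscapes(UseKind kind)
{
    switch (kind) {
    case UseKind::Load:
    case UseKind::StoreInto:
    case UseKind::CallRootsBundle:
        return false;
    case UseKind::StoreOf:
    case UseKind::CallArgument:
    case UseKind::Return:
        return true;
    }
    return true; // anything unrecognized is treated conservatively
}

// An allocation escapes if any one of its uses escapes.
bool allocationEscapes(const std::vector<UseKind> &uses)
{
    for (UseKind kind : uses)
        if (useEscapes(kind))
            return true;
    return false;
}
```

In this model an allocation that is only loaded from, stored into, and listed in a roots bundle would be a candidate for stack allocation, while a single ordinary call argument use is enough to rule it out.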

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

bool ignore_tag = true;
auto orig = it.first;
if (optimize && checkUses(orig, 0, ignore_tag)) {
// The allocation does not escape or be used in a phi node so none of the derived
Contributor

or get used ?

@tkelman
Contributor

tkelman commented Jul 5, 2017

Does this mean we can finally deprecate & in ccall?

@yuyichao
Contributor Author

yuyichao commented Jul 5, 2017

Stealing my thunder, eh? That's ok :)

FWIW, the main part of this is just ~4hrs of work after I suddenly got interested in checking if the gcframe placement pass is late enough for us to do this. So nothing much would be lost if you already have a better version working ;-p (Plus I've been implementing Jameson's ideas all along :trollface: )

Does this mean we can finally deprecate & in ccall?

Yes, that should be the case this PR is best at handling. There are still rare cases where the & in ccall can be used to reduce allocation, but that only happens with a custom ptr_arg_cconvert and ptr_arg_unsafe_convert, and I don't think anyone is doing it. I think master will be open for PRs replacing &s with Refs after this is merged, and we can deprecate & after making sure nothing goes wrong.

@@ -2156,25 +2156,10 @@ static Value *emit_allocobj(jl_codectx_t &ctx, size_t static_size, Value *jt)
{
JL_FEAT_REQUIRE(ctx, dynamic_alloc);
JL_FEAT_REQUIRE(ctx, runtime);
Member

Should probably postpone these checks to the actual lowering (but we don't have a cgctx there).

Contributor Author

That can be passed as a parameter. What would be the correct behavior if the test fails, though?

Member

abort compilation. currently JL_FEAT_REQUIRE just calls jl_error, and we use the cgctx for current function name & line number, so we'd need to do something else anyway (use DebugLoc, I assume). better leave that to a different PR, for now maybe just:

if (!JL_FEAT_TEST(ctx, dynamic_alloc)) jl_error(...)

Contributor Author

I mean, I don't think we can throw an error in the llvm pass.

Member

Hmm I hadn't considered that, as it worked with _dump_function but seems to mess up the JIT indeed. Then again, I only use that feature through _dump_function...
Maybe a "used features mask" to be checked after lowering? Let's just leave it at what it is now and put that in a different PR.
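
One way to picture the "used features mask" idea: record which features were relied on at the point where code is emitted or lowered, then check the mask once afterwards, at a point where reporting an error is presumably easier than inside an LLVM pass. All names here (`Feature`, `FeatureTracker`, `markUsed`, `violations`) are hypothetical; this is only a sketch of the suggestion, not something this PR implements.

```cpp
#include <cstdint>

// Hypothetical feature bits; the two names mirror the checks quoted above.
enum Feature : std::uint32_t {
    FeatDynamicAlloc = 1u << 0,
    FeatRuntime      = 1u << 1,
};

struct FeatureTracker {
    std::uint32_t allowed;  // features the caller permits
    std::uint32_t used = 0; // features that emitted code relied on

    explicit FeatureTracker(std::uint32_t allowed_mask) : allowed(allowed_mask) {}

    // Called where code is emitted or lowered: only records, never throws.
    void markUsed(Feature f) { used |= f; }

    // Called once after lowering: the bits that were used but not allowed.
    std::uint32_t violations() const { return used & ~allowed; }
};
```

A caller would construct the tracker with the permitted mask, let the passes call `markUsed`, and abort compilation afterwards if `violations()` is nonzero.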

gc_alloc_args.push_back(T_prjlvalue);
jl_alloc_obj_func = Function::Create(FunctionType::get(T_prjlvalue, gc_alloc_args, false),
Function::ExternalLinkage,
"julia.gc_alloc_obj");
Member

Ideally, this function should get a noalias attribute.

@Keno
Member

Keno commented Jul 7, 2017

Couple of suggestions:

  1. Use LLVM's PtrUseVisitor to avoid excessive stack growth in the recursive approach
  2. You can call mem2reg as a cleanup step: http://llvm.org/doxygen/namespacellvm.html#a033a44177ba94b77622aae61ff4fb4b2

@yuyichao
Contributor Author

yuyichao commented Jul 8, 2017

  1. noalias attribute added.
  2. I can't figure out how to use PtrUseVisitor to handle AddrSpaceCastInst, and it also seems hard to use it to handle the mutation recursion, so I decided to use the pattern I'm more familiar with: a manual stack similar to the one in the GC.
  3. mem2reg actually doesn't handle this case (at all). instcombine and sroa do, and the new placement for 5.0 takes advantage of that.
  4. Added llvm lifetime intrinsics to reuse stack space. Not sure if there's an LLVM helper function that can make doing that easier...
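
The manual-stack pattern mentioned in item 2 can be sketched roughly as follows. This uses a toy use graph, not LLVM's `Use`/`User` API and not the code in this PR; `UseGraph` and `anyUserEscapes` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Toy use graph: users[i] lists the nodes that use the value of node i.
// escapes[i] marks a node that lets the pointer escape.
struct UseGraph {
    std::vector<std::vector<std::size_t>> users;
    std::vector<bool> escapes;
};

// Walks every transitive user of `root` with an explicit stack, so the depth
// of the use chain does not translate into native call-stack depth.
// Assumes root < g.users.size() and that escapes has one entry per node.
bool anyUserEscapes(const UseGraph &g, std::size_t root)
{
    std::vector<bool> visited(g.users.size(), false);
    std::vector<std::size_t> stack{root};
    visited[root] = true;
    while (!stack.empty()) {
        std::size_t cur = stack.back();
        stack.pop_back();
        for (std::size_t user : g.users[cur]) {
            if (g.escapes[user])
                return true; // one escaping user is enough
            if (!visited[user]) {
                visited[user] = true;
                stack.push_back(user);
            }
        }
    }
    return false;
}
```

The explicit `std::vector` plays the role of the call stack in the recursive version, so a long chain of derived pointers only grows heap memory.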

@yuyichao yuyichao force-pushed the yyc/codegen/alloc-elim branch 2 times, most recently from 4e3af21 to 70f88e0 Compare July 8, 2017 16:39
@yuyichao
Contributor Author

yuyichao commented Jul 8, 2017

@nanosoldier runbenchmarks(ALL, vs = ":master")

@yuyichao yuyichao force-pushed the yyc/codegen/alloc-elim branch from 70f88e0 to 340ae86 Compare July 8, 2017 17:22
@Keno
Member

Keno commented Jul 17, 2017

Why is the lowering not relatively trivial?

@Keno
Member

Keno commented Jul 17, 2017

Or if it is, GC root lowering looks at all the calls and rewrites most of them anyway (because of the cc convention), so it seems like a fine place to do it.

@Keno
Member

Keno commented Jul 17, 2017

The concern is that this pass might not know whether it's the last one (in IPO pipelines the pass manager will automatically rerun parts of the pipeline).

@yuyichao yuyichao force-pushed the yyc/codegen/alloc-elim branch 2 times, most recently from 7ad6a9c to f71401a Compare July 22, 2017 21:03
@yuyichao
Contributor Author

Moved the intrinsic lowering to gcframe lowering pass and updated the test.

@yuyichao
Contributor Author

Travis failure looks unrelated and is happening everywhere....

@yuyichao
Contributor Author

More comments?

@Keno
Member

Keno commented Jul 23, 2017

I'll try to take another pass through this tonight or tomorrow morning.

}
}

bool AllocOpt::isSafepoint(Instruction *inst)
Member

Pull this out into a helper for both this and the GC lowering code?

Contributor Author

Does the lowering pass actually need this function? It has similar logic but does different things for different branches.

Member

It doesn't need this function, but I'd like to at least share the "Known functions emitted in codegen that are not safepoints" part, so we don't have to make changes there in multiple places.

Contributor Author

Also note that when the lowering pass recognizes more functions as non-safepoints, we don't necessarily want to update this pass to include those. Here I included the same list just to be safe. In principle, this needs a list of functions that codegen assumes are not safepoints. There can be functions that are never safepoints, but as long as neither codegen nor llvm can insert a call to them into an unsafe use chain, they don't need to be treated as non-safepoints here.

{
if (!alloc_obj)
return false;
std::map<CallInst*,size_t> allocs;
Member

std::map is generally discouraged in LLVM passes because of nondeterministic iteration order. Can you use a data structure with deterministic iteration?

Contributor Author

Seems that I actually don't need to lookup anything in it so I'll just use a vector instead...
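
A minimal sketch of the switch to a vector, assuming the pass only needs to visit each recorded allocation once and never looks one up by key. `Call`, `AllocList`, `recordAlloc`, and `visitOrder` are hypothetical stand-ins, not the names used in the pass. A container keyed by pointer is ordered by address, which can differ between runs; a vector is visited in the order entries were recorded.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for an LLVM CallInst; only an id is needed for the illustration.
struct Call {
    int id;
};

// (call, allocation size) pairs, kept in discovery order.
using AllocList = std::vector<std::pair<Call*, std::size_t>>;

void recordAlloc(AllocList &allocs, Call *call, std::size_t size)
{
    allocs.emplace_back(call, size);
}

// Returns the ids in the order the allocations would be processed.
std::vector<int> visitOrder(const AllocList &allocs)
{
    std::vector<int> order;
    for (const auto &entry : allocs)
        order.push_back(entry.first->id);
    return order;
}
```

Whatever addresses the `Call` objects happen to have, the processing order here depends only on the order of the `recordAlloc` calls.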

if (ptr_from_objref && ptr_from_objref == call->getCalledFunction())
return true;
auto opno = use->getOperandNo();
// Uses in `jl_roots` operand bundle are not counted as escaping, everything else do.
Member

Nit: everything else is.

};
// Both `orig_i` and `new_i` should be pointer of the same type
// but possibly different address spaces. `new_i` is always in addrspace 0.
auto replace_inst = [&] (Instruction *user) {
Member

You can technically mutate instructions in place, but I'm fine with this implementation as well.

@yuyichao yuyichao force-pushed the yyc/codegen/alloc-elim branch from f71401a to cd8f33e Compare July 24, 2017 20:10
@yuyichao
Contributor Author

Updated the comments and switched to a vector for recording the allocations.
I did not pull isSafepoint out since there currently isn't a use for it in the lowering pass. We can certainly split it out when there is one.

@yuyichao yuyichao force-pushed the yyc/codegen/alloc-elim branch from cd8f33e to 05cefe8 Compare July 27, 2017 14:45
yuyichao added 3 commits July 27, 2017 13:30
This can obtain escape information with much higher precision than what we can currently do
in typeinf. However, it does not replace the alloc_elim_pass! in type inference either since
this cannot handle objects with reference fields.

Fix #20452
@yuyichao yuyichao force-pushed the yyc/codegen/alloc-elim branch from 05cefe8 to b1a188c Compare July 27, 2017 17:30
@yuyichao
Contributor Author

I'd like to not make any other changes to the organization, especially since I still don't think it's a good idea to move the allocation that late in the pipeline. It disables the llvm constant folding of the write barrier on 5.0....

There are two FreeBSD timeouts and I have no way to debug them, so I'll just wait for someone else to figure out what's wrong there, fix it, or merge this PR.

@yuyichao
Contributor Author

And for anyone who wants to use this on master, the hanging test on FreeBSD is file.

@iblislin
Member

iblislin commented Jul 28, 2017

The distributed test suite is also absent from the log.
FreeBSD CI keeps hanging randomly... It seems to have started with 5ea8c7c.
I guess the test case added in #22566 triggers another bug...

@yuyichao
Contributor Author

yuyichao commented Jul 28, 2017

The distributed test suite is also absent from the log.

JULIA_TEST_MAX_RSS is set so it is moved to worker one and will only run if everything else finishes.

@iblislin
Member

👌
If the file test hangs, feel free to rerun CI.
There is a rebuild button at the top-right corner of the buildbot UI.

@yuyichao
Contributor Author

The restarted one passed, so I'm merging this now....

@yuyichao yuyichao merged commit e1a604e into master Jul 28, 2017
@yuyichao yuyichao deleted the yyc/codegen/alloc-elim branch July 28, 2017 11:23
@iblislin
Member

iblislin commented Jul 28, 2017

Got some compilation warnings from master e1a604e:

/usr/home/iblis/git/julia/src/llvm-alloc-opt.cpp:639:53: warning: braces around scalar initializer [-Wbraced-scalar-init]
                                                    {ConstantInt::get(T_size, -1)});
                                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/home/iblis/git/julia/src/llvm-late-gc-lowering.cpp:1173:50: warning: braces around scalar initializer [-Wbraced-scalar-init]
                                                 {ConstantInt::get(T_size, -1)});
                                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
1 warning generated.

with

└─[iblis@abeing]% clang --version
FreeBSD clang version 4.0.0 (branches/release_40 296509) (based on LLVM 4.0.0)
Target: x86_64-unknown-freebsd12.0
Thread model: posix
InstalledDir: /usr/bin
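
The warnings above flag a braced initializer list wrapped around a single scalar value in argument position. A minimal sketch of the general pattern and the usual remedy follows; `negate` and `callWithoutBraces` are hypothetical stand-ins, not the LLVM calls in question, and whether a given clang version warns on this exact toy is an assumption.

```cpp
// Stand-in for an API whose parameter is a single scalar value.
long negate(long value)
{
    return -value;
}

long callWithoutBraces()
{
    // Writing `negate({-1L})` is still valid C++, but the braces wrap one
    // scalar argument, which is the shape the warning points at.
    // Passing the value directly expresses the same call without braces.
    return negate(-1L);
}
```

Applied to the two reported sites, the analogous change would presumably be to drop the braces around `ConstantInt::get(T_size, -1)`, provided the callee really takes a single value there.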

@JeffBezanson
Member

This is awesome!

From the nanosoldier run, it looks like we don't have any benchmarks that meaningfully improve with this? We should try to add some.

@yuyichao
Contributor Author

Regressions in this will probably show up in future benchmarks as we use Ref in ccalls more and more.

We are in general very good at avoiding performance pitfalls in the benchmarks....
