- Sponsor
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Throw exceptions with arguments without creating a GC frame #11284
Conversation
This will obviously cause more functions to be compiled but IMHO the overall code generated shouldn't be more than when it is inlined. (and the |
This issue also affected me, where I wanted to have improved error messages for Unicode encoding problems... I made my own separate function that took the 3 arguments, which then created and threw a UnicodeError exception... and that apparently didn’t trigger creating a GC frame... |
Yeah, the main point is to avoid creating a wrapper function for each error type. This should have the same effect with declaring the constructor of all error types this way (specialize on arguments but not inlined) but IMHO, it is easier to tell the user to use a special function to do throw error without overhead than to explain what is a better way to construct an error (especially for user defined error types). |
I agree that something is needed here, but I'm a little hesitant to make workarounds like this nicer and more useful. It makes it more likely that they'll be picked up outside of Base as a cargo-cult optimization in places where there's already a GC frame anyways. It really feels like we're close to having all the optimizations in place for the GC frame to be automatically omitted. I assume there are other major pitfalls to making BoundsError parametric so its arguments don't need to be boxed? |
Well, unless we don't need gc frame or at least they don't cause a performance penalty anymore I think this kind of optimization is necessary and I don't see why it is an issue to allow people from outside I think if Also, I think this is not only for So as for parametrized |
My point is simply that I see it as a workaround that ideally wouldn't be necessary. But I am afraid that a general solution is farther away than I had hoped, and it is good to DRY up some of these workarounds. There's also the |
the |
What causes the exception handling to be not very fast? Does it at least incur no penalty if no exceptions are raised? That was a very important point of the way that exceptions were handled in CLU... [CLU even had a signaled exception cost the same as a normal return... there was nothing special about exceptions]. |
@carnaval I think I'm more wondering about what should be the long term solution for the problem. Mainly how far away are they (so that we don't need to waste time on a work around if a better solution is almost there) but also how easy it is to undo a workaround. |
I could imagine a @ScottPJones - this PR is an attempt to remove the only extra overhead (besides code size and potential cache misses) an untaken error branch adds to a function. This is only an issue in very small inner functions that do no allocations normally, but of course needs to allocate an error object in an error branch. The creation of that error object needs a GC frame. If there is no other need for that frame, it'd be nice to limit its creation to only occur within the error branch. I can't speak to exception performance once they're thrown, except to say that I don't think you'd want to write Pythonic duck-typed error catching control flow code in Julia. |
@yuyichao I think I'd prefer to add the special case to codegen under the general rule of pushing the ugliness down to the C code where it can possibly be removed later without bothering the julia side of the fence. I'm not sure how much work would this special case incur. I think the general optimization of pushing the gc frame down in the control flow graph as far as possible is interesting in general. Exceptions are slower than they could be for several reasons. One would be that inference does not reason about the type of thrown things (I'm incidentally working on something that could help, no guarantees). The other is I that the way we lower them makes it impossible for llvm to reason about. So even if you do I don't think it's that bad though since we generally don't use those as a control flow mechanism, only for error reporting for which the speed impact is negligible. Of course getting it so that the exceptional case does not slow down the general one is important, so the gc frame opt is interesting in that regard. |
@carnaval Does Julia at least use the technique (which I believe was first introduced by CLU), of using the IP to determine whether there is a catch that should be taken? |
the throw builtin already doesn't use a gc frame. the problem noted here is that constructing the BoundsError forced the creation of a gc frame in the code leading up the eventual call to throw. thus a more satisfying fix for this particular issue is: diff --git a/base/essentials.jl b/base/essentials.jl
index 45b737b..c694c66 100644
--- a/base/essentials.jl
+++ b/base/essentials.jl
@@ -6,9 +6,14 @@ typealias Callable Union(Function,DataType)
const Bottom = Union()
+macro _noinline()
+ _expr(:meta, :noinline)
+end
+
# constructors for Core types in boot.jl
call(T::Type{BoundsError}) = Core.call(T)
-call(T::Type{BoundsError}, args...) = Core.call(T, args...)
+call(T::Type{BoundsError}, arg1) = (@_noinline; Core.call(T, arg1))
+call(T::Type{BoundsError}, arg1, arg2) = (@_noinline; Core.call(T, arg1, arg2))
call(T::Type{DivideError}) = Core.call(T)
call(T::Type{DomainError}) = Core.call(T)
call(T::Type{OverflowError}) = Core.call(T) a more general fix for this would be to allow VarArgs methods to benefit from specsig during codegen. but it would sufficient to allow merging #11244 now: julia> @code_llvm getindex(1:1:1, 1)
define i64 @julia_getindex_20438(%StepRange.4, i64) {
top:
%2 = icmp slt i64 %1, 1
br i1 %2, label %L4, label %L1
L1: ; preds = %top
%3 = call i64 @julia_length2311(%StepRange.4 %0)
%phitmp = icmp slt i64 %3, %1
br i1 %phitmp, label %L4, label %L5
L4: ; preds = %L1, %top
%4 = load %jl_value_t** inttoptr (i64 4465733512 to %jl_value_t**), align 8
%5 = call %jl_value_t* @julia_call_20439(%jl_value_t* %4, %StepRange.4 %0, i64 %1)
call void @jl_throw_with_superfluous_argument(%jl_value_t* %5, i32 350)
unreachable
L5: ; preds = %L1
%6 = extractvalue %StepRange.4 %0, 0
%7 = add i64 %1, -1
%8 = extractvalue %StepRange.4 %0, 1
%9 = mul i64 %7, %8
%10 = add i64 %9, %6
ret i64 %10
} |
If I understand correctly what @carnaval means is to make The problem that I want to address in this PR is to avoid the extra cost of better error reporting in general (although this should indeed help #11244 to be merged, I hope at least=) ). IMHO, the solutions for this problem are the following,
|
@yuyichao got it right, what I'm proposing was a heuristic of pushing down the gc frame setup inside the same basic block as throw when possible. The more general optimization would be very interesting imo : find a set (or maybe one is a more reasonable starting point) of "root" basic block which 1) strictly dominate every store to the gc frame 2) no single basic block containing such store has more than one "root" BB as an ancestor. Forgetting about this for now, I don't have a strong opinion on any of the two approaches. I'm fine with using one of those in a few chosen places as a stopgap. What I really don't want is people to start using it as a "the general way to have fast error handling" because we definitely could have better codegen for this. @ScottPJones No, we do all of this using the simple setjmp/longjmp stuff. It would be interesting to try something else but it's not high on any one's todo list since we don't use handlers in any perf sensitive places. From what I understand the CLU approach requires close cooperation with the backend. To go this way the simplest would probably to reuse llvm's infrastructure for the zero-overhead cxx exception handling. I'm not familiar with it but it does seem like a pain for no clear gain for now. |
OK, I see your point here. I just hope that we don't need to delay better error reporting until than. Each of those attempt seems to be either causing some performance regression (currently my PR) or introducing exactly this function for a specific type (@mbauman and maybe @ScottPJones?). On the other hand, as you can see in the llvm IR I posted above, the function body is also much smaller. Do you know how much will this affect the performance and does this justify using a special form / function to throw an exception? (Now that I think about this, maybe we can just use type inference to do the transformation?) |
@carnaval Well, |
I thought a big part of the overhead was from collecting the backtrace everytime. Didn't know that |
@yuyichao Think about all the register state that has to be saved on a modern machine... I remember that the Sparc architecture was particularly horrendous with |
Setjmp and longjmp are each about half a dozen assembly instructions. At some point we might switch to "zero-cost" exception frames based on llvm's c++, which makes entering a try block free at, the cost of making a throw much more expensive. The real performance loss for exceptions happens in the backtrace collection. |
Yes, but that's the same state that has to be saved on every function call, and much less than a syscall or other context switch has to save.
We turn it off because it does not play very nice with profiling, backtraces, or debugging, while the benefit to performance is not particularly clear. |
@vtjnash Are we collecting the backtrace of an exception just for debugging? (Note: I'm not saying this is not a good enough reason to do that, just curious). And do we still collect a full backtrace (instead of up to the latest catch) I don't know how the library used to collect backtrace works but would it be more effecient to generate the backtrace ourselves (using a global list for example) or will that have the same issue with creating GC frames? |
I think he was talking about the gc frame not the stack frame. I entirely agree that if the stack frame setup is a performance problem for you, you probably should be inlining the function anyway. LLVM itself cannot help us with the gc frame since we leak it to global state right away. Making this optimization would require either a custom llvm pass or modifying codegen. For the exception cost stuff, we probably all agree that :
I'm not sure the implementation cost vs perf reward is worth it here. Setjmp makes it so simple and we have so many other places where we could improve perf in a much more user visible way. |
@vtjnash I don't know what happens in Julia so much, because it seems that it inlines rather aggressively, but we always did debugging builds with the frames, production builds without, and at least for us, it made a big difference [most of it was because of using a large dispatch table to handle the ops for the VM, we didn't do any JIT compilation, and even many of the ops in C were fairly small... (the truly heavily used ones I implemented in assembly, and directly dispatched to the next instruction)] |
@carnaval I'm not saying that it should be done immediately, but I definitely think it should be done before 1.0 😀 Even the LLVM doc recommends using the zero-overhead approach vs. the SJLJ approach. |
OK. This version uses type inference to replace calls to |
Somehow New benchmark ( @code_llvm checkbounds("", 1)
f(a, b) = chr2ind(a, b)
time_func(f, "a", 1) Before define i1 @julia_checkbounds_44384(%jl_value_t*, i64) {
top:
%2 = alloca [3 x %jl_value_t*], align 8
%.sub = getelementptr inbounds [3 x %jl_value_t*]* %2, i64 0, i64 0
%3 = getelementptr [3 x %jl_value_t*]* %2, i64 0, i64 2
%4 = bitcast [3 x %jl_value_t*]* %2 to i64*
store i64 2, i64* %4, align 8
%5 = getelementptr [3 x %jl_value_t*]* %2, i64 0, i64 1
%6 = bitcast %jl_value_t** %5 to %jl_value_t***
%7 = load %jl_value_t*** @jl_pgcstack, align 8
store %jl_value_t** %7, %jl_value_t*** %6, align 8
store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
store %jl_value_t* null, %jl_value_t** %3, align 8
%8 = icmp slt i64 %1, 1
br i1 %8, label %L3, label %L1
L1: ; preds = %top
%9 = bitcast %jl_value_t* %0 to %jl_array_t**
%10 = load %jl_array_t** %9, align 8
%11 = getelementptr inbounds %jl_array_t* %10, i64 0, i32 1
%12 = load i64* %11, align 8
%13 = icmp slt i64 %12, %1
br i1 %13, label %L3, label %if2
if2: ; preds = %L1
%14 = load %jl_value_t*** %6, align 8
store %jl_value_t** %14, %jl_value_t*** @jl_pgcstack, align 8
ret i1 true
L3: ; preds = %L1, %top
%15 = call %jl_value_t* @alloc_2w()
%16 = getelementptr inbounds %jl_value_t* %15, i64 -1, i32 0
store %jl_value_t* inttoptr (i64 139853858757776 to %jl_value_t*), %jl_value_t** %16, align 8
%17 = getelementptr inbounds %jl_value_t* %15, i64 0, i32 0
store %jl_value_t* %0, %jl_value_t** %17, align 8
%18 = getelementptr inbounds %jl_value_t* %15, i64 1, i32 0
store %jl_value_t* null, %jl_value_t** %18, align 8
store %jl_value_t* %15, %jl_value_t** %3, align 8
%19 = call %jl_value_t* @jl_box_int64(i64 signext %1)
store %jl_value_t* %19, %jl_value_t** %18, align 8
%20 = icmp eq %jl_value_t* %19, null
br i1 %20, label %cont4, label %wb_not_null
wb_not_null: ; preds = %L3
%21 = bitcast %jl_value_t** %16 to i64*
%22 = load i64* %21, align 8
%23 = and i64 %22, 1
%24 = icmp eq i64 %23, 0
br i1 %24, label %cont4, label %wb_may_trigger
wb_may_trigger: ; preds = %wb_not_null
%25 = getelementptr inbounds %jl_value_t* %19, i64 -1, i32 0
%26 = bitcast %jl_value_t** %25 to i64*
%27 = load i64* %26, align 8
%28 = and i64 %27, 1
%29 = icmp eq i64 %28, 0
br i1 %29, label %wb_trigger, label %cont4
wb_trigger: ; preds = %wb_may_trigger
call void @gc_queue_root(%jl_value_t* %15)
br label %cont4
cont4: ; preds = %wb_trigger, %wb_may_trigger, %wb_not_null, %L3
call void @jl_throw_with_superfluous_argument(%jl_value_t* %15, i32 148)
call void @llvm.trap()
unreachable
}
f
1.791 seconds After define i1 @julia_checkbounds_67755(%jl_value_t*, i64) {
top:
%2 = icmp slt i64 %1, 1
br i1 %2, label %L3, label %L1
L1: ; preds = %top
%3 = bitcast %jl_value_t* %0 to %jl_array_t**
%4 = load %jl_array_t** %3, align 8
%5 = getelementptr inbounds %jl_array_t* %4, i64 0, i32 1
%6 = load i64* %5, align 8
%7 = icmp slt i64 %6, %1
br i1 %7, label %L3, label %if2
if2: ; preds = %L1
ret i1 true
L3: ; preds = %L1, %top
%8 = load %jl_value_t** inttoptr (i64 139976283090536 to %jl_value_t**), align 8
%9 = icmp eq %jl_value_t* %8, null
br i1 %9, label %err, label %ok
err: ; preds = %L3
call void @jl_undefined_var_error(%jl_value_t* inttoptr (i64 139984968716432 to %jl_value_t*))
unreachable
ok: ; preds = %L3
call void @julia_throw_with_args4956(%jl_value_t* %8, %jl_value_t* %0, i64 %1)
ret i1 undef
}
f
1.554 seconds
|
The GC frame is generated because of #11304. (And it was generated for the no checking version because I forgot to add |
@yuyichao I hope you liked the new |
@ScottPJones I liked it by looking at the implementation. Unfortunately (or fortunately?) I haven't used it to benchmark anything that gives me any non-zero result other than the time yet :P |
It helped me see a lot of why the string conversions were so slow... and even with my fixes that I'm hoping will get merged in soon (#11004) there are some extra allocations that I have to track down for the next round of speeding up Julia string handling 😀 |
Rebased due to conflict with #11274 |
This PR is probably superseded by #11508 now. See #11508 (comment) for benchmarks |
This pull request is trying to address the issue brought up in #11244 and the issue that throwing an error with argument can bring down the performance of the non-error path because of the creation of gc frames.
As mentioned in the comment of that pull request, the idea here is to NOT inline the constructor of the error but create specialized function for the arguments in order to avoid boxing.
This function can probably be used in many other places as well but I'm just using it for
SimpleVector
for benchmarking before getting more feedback. (Edit: and if this PR is accepted, I will probably replace a number of other places to use this function either in this PR or in a new one)f(args...)
as efficient asf(args)
#11248.Benchmark:
Code
Before
After
Note: the benchmark time fluctuate for both but the new version (1.72-1.82s) is consistently faster than the original one (1.80-1.92s). Also it is probably hard to quantify the impact of a gc frame but at least it should be clear from the llvm ir that no gc frame was created in the function anymore.