Compiled frames: a sketch #204
I should also add that if someone wants to pick this up, I'll do what I can to provide support. |
One problem: JuliaLang/julia#31429 |
With regard to performance, there are two ways forward as I see it.

I think it is important, if we go for the second approach (the one suggested in this PR), that we are sure we will be able to retain enough debug information to keep the excellent experience the current debugger has (when it is not too slow, of course). Sorry for the lack of "meat" in this post; just jotting down some initial thoughts I had. I'm also echoing how nice it would be to have some comment here from the compiler team on what they think is a good way forward. |
For example, the example in https://jrevels.github.io/Cassette.jl/latest/overdub.html looks pretty much exactly like the |
We should definitely apply what micro-optimizations we can. #206 might be a good landing place for ideas.

Initially my goal was to also make this a platform for working on the compiler latency problem. I am a little less optimistic about this now than I was, but obviously that would still be nice to keep. Some of the

W.r.t. Cassette and related ideas, I do think there would be something lost, but because we still have the full interpreter I am not certain it would be limiting. See my point in the OP about copying the state of the caller stack and then re-running in full-interpreter mode: you may not get everything you want when you first execute the frame, but if you can easily do it again in more detail a second time, then everything should be OK.
Not quite. The key point I was trying to make is this: ultimately (once we've done every performance optimization we can think of),

So I think this strategy might easily get us more than an order of magnitude better performance than what you could hope for from Cassette. |
Yes and no. Cassette would create a new frame per call, but that might be inlined and then LLVM optimizations kick in. So the question becomes more: can we reuse the frames for repeated calls in a loop? OTOH, instead of the callee creating the frame, let the caller create the frame and reuse it when looping. But now I am purely speculating; I haven't looked at the Interpreter internals enough. |
Frame creation is really expensive, so

EDIT: I like the idea of having the parent reuse the frame. We do that now a bit via the |
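As a rough sketch of the caller-side reuse idea (all names here are invented; this is not JuliaInterpreter's API), the caller could allocate one scratch frame and re-initialize it on each iteration instead of allocating a fresh one per call:

struct ScratchFrame
    locals::Vector{Any}
end

# wipe the slots so a stale value from the previous call can't leak through
reset!(fr::ScratchFrame) = (fill!(fr.locals, nothing); fr)

function run_loop!(callee!, xs)
    fr = ScratchFrame(Vector{Any}(nothing, 8))   # created once by the caller
    for x in xs
        callee!(reset!(fr), x)                   # reused on every iteration
    end
end

Whether this pays off depends on how much of frame setup is allocation versus initialization, which is what the measurements later in the thread try to tease apart.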
If the "only" difference is the order between instrumentation and optimization, couldn't Cassette be made to instrument optimized code? On the surface, the purpose of Cassette just seems so similar to the stuff we want to do here: rewrite the IR and pass along a context. Tagging @jrevels in case he is interested in the discussion. |
What's the thought about breakpoints? Would we insert a |
The framecode, not the framedata, holds the breakpoints. So I think this is something that would require either recompilation or a specialized design (we could pass in both

There seems to be a tradeoff here: runtime performance or compile-time performance? I'd probably first try passing in

Also, inlining would make it much harder to insert breakpoints. Ouch. It might still be possible using the LineInfoNodes, though statement-by-statement correspondence with the lowered code could be lost. |
Recompiling to insert breakpoints kinda seems like a non-starter. Sure, there will be a performance hit to check the break condition, but it should be predicted correctly, and it is not like we are trying to keep things SIMDing while debugging.

A project I want to try out, just to get a feeling for the performance, is to "rewrite" the interpreter using Cassette, while keeping most of the data structures that we have here. So we would turn something like

1 ─ %1 = (Base.float)(x)
│   %2 = (Base.float)(y)
│   %3 = %1 / %2
└── return %3

into (something very roughly like):

insert_locals!(ctx, x=x, y=y)
ctx.pc = 1
should_break(ctx, 1) && return hook(ctx)
%1 = (Base.float)(x)
insert_ssa!(ctx, 1, %1)
ctx.pc = 2
should_break(ctx, 2) && hook!(ctx)
%2 = Cassette.overdub(ctx, Base.float, y)
insert_ssa!(ctx, 2, %2)
ctx.pc = 3
should_break(ctx, 3) && hook!(ctx)
%3 = Cassette.overdub(ctx, /, %1, %2)
insert_ssa!(ctx, 3, %3)
ctx.pc = 4
should_break(ctx, 4) && hook!(ctx)
return %3

where |
@KristofferC let me know if I can help with the Cassette side of things to try this out! Cassette has a metadata mechanism that one could use to make these things feasible. |
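Purely as a sketch of how that metadata mechanism might carry debugger state (all non-Cassette names below are invented, and this hooks at per-call granularity rather than per lowered statement as in the rewrite sketched above):

using Cassette

Cassette.@context DebugCtx

# hypothetical per-frame state carried as context metadata
struct DebugState
    pc::Base.RefValue{Int}
    breakpoints::Set{Int}
end

should_break(st::DebugState, pc::Int) = pc in st.breakpoints

# the prehook fires before every call made under overdub:
# bump the counter and check the break condition
function Cassette.prehook(ctx::DebugCtx, f, args...)
    st = ctx.metadata
    st.pc[] += 1
    if should_break(st, st.pc[])
        @info "breakpoint hit" f pc = st.pc[]   # stand-in for hook!(ctx)
    end
    return nothing
end

# usage: stop at the second call made while evaluating float(x) / float(y)
# Cassette.overdub(DebugCtx(metadata = DebugState(Ref(0), Set([2]))), (x, y) -> float(x) / float(y), 1, 2)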
Seems like a worthy experiment. It would also be worth seeing if that's essentially what MagneticReadHead does. If so, one possibility would be to merge the two projects? CC @oxinabox. |
Here's an altered version of my top proposal, with the goal of losing nothing in terms of usability:
This gets us the benefits of reduced framedata-creation while not losing anything, I think, in terms of usability. |
(I will return and comment on this later; in general I am down to help with any Cassette-related things you need, but I am yet to look at this plan.) |
Cool cool cool. Some comments, fairly scattered.
Yes, this is indeed what MagneticReadHead.jl does.

On running after the optimiser
For reference, Cassette runs before typing/specialization. I don't know much about the optimizer's output.

On using
|
Doing this in a Cassette-like way (generated function + reflection + return CodeInfo) is the right way to go. Cassette itself doesn't let you work on typed IR, but it's easy to just write out the generator yourself and grab that IR. That's pretty much equivalent to the current

I like the sound of Tim's altered proposal, for which a simple generated function is pretty much all you need, but RE the original proposal of re-using the base optimiser: working on typed IR works fine, with the caveat that you have to return a CodeInfo, which means converting phis back to slots and allowing type inference to run again on the optimised code. Odd, but technically very easy, and it gets the job done. Allowing inlining also compromises your ability to intercept specific calls, though likely you don't need this (it may change as these kinds of optimiser plugins become officially supported).

There is the issue that Base's IR does not currently preserve much debug info; that's probably the only place where you need fixes in Base here. But this is not a research problem, all compilers do it, and it's clearly necessary long-term for any serious debugging effort; it just needs some straightforward hacking on the SSA data structures. |
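To make the "grab the IR yourself" idea concrete, here is a tiny reflection sketch (the function and signature are arbitrary examples, shown outside a generated function just to illustrate what each stage hands back):

inc1(x) = x + 1

# lowered, untyped IR -- the level Cassette passes operate on
ci_lowered = first(code_lowered(inc1, (Int,)))

# typed and optimized IR -- the level the "re-use the base optimiser" proposal
# would instrument; it comes paired with the inferred return type
ci_typed, rettype = first(code_typed(inc1, (Int,); optimize = true))

A generator that returns a CodeInfo would do essentially this, rewrite the statements, and hand the result back to the compiler.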
I don't think there's any advantage in doing it via a generated function;

Anyway, it sounds like we have several plans. I suspect the "altered" version is the way to go and will give excellent results. I think inlining will be a must if we're ever to get anything resembling normal performance. |
I'll have more to chime in here later, but just a note that the yakc mechanism (JuliaLang/julia#31253) covers the case of just wanting to run some CodeInfo as a one-off. |
Just a brief update. With some of the really lovely changes @KristofferC has made recently, we're now competitive with Julia's standard execution mode on tasks like

So, another thought: rather than compiling frames, what about a "loop compiler"? To me it seems likely that the vast majority of slow-to-interpret code will fall in one of two categories, either having loops or using deep recursion (e.g.,
The likely sticking point is that not enough will be constant: for example, in a

if isa(state, TypeWeSawTheFirstTime)
    <inlined code goes here>
else
    <generic call goes here>
end

Given the recent improvements from @KristofferC's elegant contributions, I am beginning to become optimistic that we might get more benefit from doing this than from creating a compiled variant of what we do now. |
How have you determined this? Shouldn't the recycle mechanics handle this quite well? For interpreting |
It's almost all the recycling (maybe I should have said "frame setup" rather than "frame creation"). I just did |
I don't see how manual inlining will have a big effect without reducing debugging information. If we want to provide the same information as we do now to someone debugging, we have to keep track of pretty much the same thing as we do now with the

Also, any time we run something compiled, we are going to have trouble with |
I've put a rough test up here: https://gist.github.com/timholy/369c3fbf5d64ee09c3f9692e2db6c489

So complete inlining might give us something like a factor of 6. Not as much as I was hoping, but not bad either. |
For the record, here are some performance numbers. For these, I simulated complete inlining in the
Inlining yielded a 15x advantage for MagneticReadHead but only a 5x advantage for JuliaInterpreter. For JuliaInterpreter, here's an accounting of the cost:
The remaining 19% is fairly widely scattered. |
Looks good! And as long as we can map everything in the inlined function back to the original one, we should be fine with debug info. This mapping doesn't have to be very fast to retrieve since we only need it when showing information to the user. I guess that is the plan? |
This is incredibly useful information for MagneticReadHead development. The other useful datapoint you have here is that a naive Cassette implementation (which is what MagneticReadHead is) is not going to, in and of itself, give you performance improvements. |
I added a row for "native execution" above. Another useful data point: I created a "minimally instrumented" version of the form used by JuliaInterpreter. This is frankly overly optimistic; normally we'd also instrument the return from

function summer_instrumented!(A, framedata)
    s = zero(eltype(A))
    framedata.locals[2] = Some{Any}(s)
    framedata.last_reference[:s] = 2
    for a in A
        framedata.locals[5] = Some{Any}(a)
        framedata.last_reference[:a] = 5
        s += a
        framedata.locals[2] = Some{Any}(s)
        framedata.last_reference[:s] = 2
    end
    return framedata.locals[2]
end

This has a time per iteration, when running in Julia's native mode, of

So while we have a ways to go, we're closer to optimal than I thought; I'd be really impressed if we can gain more than 10x compared to where we are now, no matter how hard we're willing to work. Really, the only way to recover true compiled-code performance is to interact with native call stacks, aka Gallium. |
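Assuming the snippet above is defined, one could time it against a plain loop with something like the following; the NamedTuple is just a stand-in for JuliaInterpreter's FrameData, modeling only the two fields the snippet touches, and BenchmarkTools is assumed to be available:

using BenchmarkTools

function summer(A)   # uninstrumented baseline with the same loop structure
    s = zero(eltype(A))
    for a in A
        s += a
    end
    return s
end

A = rand(10^4)
framedata = (locals = Vector{Any}(nothing, 8),       # stand-in for FrameData.locals
             last_reference = Dict{Symbol,Int}())    # stand-in for FrameData.last_reference

@btime summer($A)
@btime summer_instrumented!($A, $framedata)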
I'm not certain what the plan is 😄. I'm a bit greedy, and given the effort involved in implementing inlining, 5x seems like less gain than I was hoping for. I think we should continue to think about it. It is possible to build fast interpreters: https://eli.thegreenplace.net/2017/adventures-in-jit-compilation-part-1-an-interpreter/ gets down to about 30 nanoseconds per iteration, though I'm unsure of the relevance for a language as complex as Julia.

I find myself contemplating modes in which we "record" the actual actions taken (using, e.g., integer tokens so everything is inferrable) and then run through that recording when handling loops. That's a little too vague to be called "a plan," but perhaps it conveys the flavor of what I'm currently thinking of. And yes, whatever we do, we need to leave clear breadcrumbs so that one can map back to the original source code. |
For reference, here is a measure of the performance of MRH these days.
|
It is interesting and would be good to have a discussion, but I just want to point out some drawbacks: using e.g.

using SparseArrays   # needed for sprand

g() = sprand(5,5,0.5) + sprand(5,5,0.5)
f() = rand(5,5) + rand(5,5)

I get

julia> using MagneticReadHead

julia> @time @run g()
208.436596 seconds (191.57 M allocations: 7.703 GiB, 2.74% gc time)

julia> @time @run f()
21.894716 seconds (42.04 M allocations: 1.778 GiB, 5.27% gc time)

vs

julia> using Debugger

julia> @time @run g()
5.388586 seconds (10.10 M allocations: 495.592 MiB, 3.60% gc time)

julia> @time @run f()
0.114333 seconds (206.46 k allocations: 9.683 MiB, 4.99% gc time)

so you pay a hefty compilation price even if you are just debugging trivial functions. In order to do any real comparison, I feel we also need stuff like oxinabox/MagneticReadHead.jl#56 fixed to see how things scale to real code. |
Absolutely, I agree. MRH is no magic bullet. Of course, the unchanged library code will still be cached, so it will be faster the second time. |
This is a skeleton illustrating how I think we should implement compiled frames. Like all untested sketches, this could of course run into serious roadblocks.
The code here is heavily commented, so that may be sufficient. (I should say that I started out intending to test this on a method inc1(x) = x + 1, but never quite got that far; that may explain some of the elements of the code.) But let me explain some of the overall strategy here. The idea is that for any method, we create an instrumented variant: in addition to doing its regular duty, every time it computes something, store intermediate results in a FrameData that gets passed in as an extra argument. Basically, the idea is that foo(x, y) becomes foo#instrumented!(x, y, #framedata#). In the instrumented variant, assignments to slots and "used" ssavalues (in the sense of framedata.used, where framedata is the framedata for foo itself) will need an extra statement inserted that performs the same assignment to the #framedata# argument.

Now, if we were to compile this, we'd probably get a fairly respectable result on that particular method. But what to do about all those calls it makes? If we do nothing, it would be like running the frame in Compiled() mode, OK for certain things but very limiting.

Here is where the real fun begins. The idea is to intercept inference, and modify the invokes to call instrumented variants of the normal methods. A potentially huge win here is that if we do this after running normal inference (including the optimizer), then we get inlining for free. I am anticipating that, with compiled frames, framedata allocation will become our single biggest expense; we can presumably avoid most of that by inlining all the "simple stuff."

Of course that means we'll be blind to what goes on inside the inlined methods. Optionally, I suppose we could turn off the optimizer. But I think a better way to handle that would be on the UI side: all we'd need to do is copy the framedata that executes that call, create a normal (slow) interpreted frame with the same data, and then start executing that in normal interpreted mode. This essentially lets us "snip out" the little piece of the computation when we need it, but get the (probably huge) benefit of inlining for 99% of the execution time.
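As a concrete, hand-written illustration of the transformation described above (the locals/ssavalues layout of framedata is assumed here, not taken from the actual FrameData definition), the instrumented variant of inc1 might look roughly like this:

inc1(x) = x + 1

# hypothetical counterpart to the foo#instrumented! form, written by hand rather than generated
function inc1_instrumented!(x, framedata)
    framedata.locals[2] = Some{Any}(x)   # mirror the assignment to the slot holding x
    ssa1 = x + 1
    framedata.ssavalues[1] = ssa1        # record the "used" SSA value
    return ssa1
end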
While I barely know anything about Cassette (something I should remedy some day), I suspect that there are similarities between what I'm proposing here and stuff Cassette is presumably already good at. However, were I to take a guess, I'd say the inlining tricks I'm proposing are something that would be difficult to do via Cassette. Here, by running inference on what is close to the normal method body (with genuinely-normal callees) and then modifying it, we should get something that's very close to the "normal" inlining decisions.
Why am I posting this as a sketch, rather than just doing it? The reality is that because I've prioritized getting the debugger out above most of my other duties, in my "real job" I have many fires burning. Worse (here I'm being a pessimist), I suspect that what serious coding time I can muster over the next few weeks will likely be eaten up by solving problems I've inadvertently created by rewriting the Revise stack. So I'm afraid I'm going to be limited in terms of how much "difficult" development I can do here. Of course I'm happy to offer what help I can, but it may be in an advisory capacity for many things. But I thought I'd throw this out there to see if it helps get things going.
This ventures into some scary territory, so it might be nice to get feedback from folks who know the compiler far better than I do: CC @Keno, @JeffBezanson, @vtjnash.