precompilation #7977
We already have precompilation of the Julia system image. This issue is for discussion of how to approach precompilation of packages.
Comments
There's some amount of processing of individual source files and packages that can be cached: parsing and lowering, and compressed AST generation. Some amount of type inference and code generation could also potentially be done independently, but that's much trickier; full type inference, inlining, and code generation in general cannot be done until later. Currently, system image generation effectively takes a "snapshot" of the current process. What if we just make snapshotting really easy, so that a Julia process can be snapshotted at any point and started again later from that snapshot? This gives a "layering" approach to precompilation that is quite easy to reason about: get your REPL into a state you want to work from, snapshot it, and then start working from the snapshot instead of from an empty REPL.
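Purely as a hypothetical sketch of that workflow (neither save_snapshot nor a --restore flag exists; both names are invented here for illustration):

# hypothetical API, invented for illustration only
using Gadfly                      # get the REPL into a useful state
save_snapshot("session.img")      # freeze the entire process image

# later, resume from the saved state instead of an empty REPL:
#   julia --restore session.img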
Dup of #4373.
The main issue is what a snapshot would even mean. If I opened a file, or initialized a DLL, what does the new process see? If I start a task, presumably this is part of the state too, but what does it mean to snapshot a Task?
Whole-system snapshots don't seem to be that different from our current system image approach. I think its main limitation is that anything I use heavily (i.e., that in practice contributes a lot to my load times) also tends to be something I change frequently. I don't want to have to rebuild all of Julia to test a one-line fix. (Yes, you have to do that if you're modifying base Julia, but we all know that's not ideal, and it gets far worse if you're building Julia and precompiling a dozen packages.) So I find myself choosing whether package A should be precompiled or remain flexible; in practice I tend to precompile one or two packages, tops. We need a granular approach: package A can be precompiled as a standalone object, to be updated if A's source (or any macro in another package it depends on) changes. I know it's not easy, but it's reality.
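A minimal sketch of the granular tracking this implies, assuming we record a package's own sources plus the files of packages whose macros it uses (the function and its arguments are hypothetical):

# rebuild the cached object if any tracked source file is newer than it
function cache_is_stale(cachefile, sources)
    isfile(cachefile) || return true
    any(src -> mtime(src) > mtime(cachefile), sources)
end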
+1 for dependency tracking to determine when compilation needs to happen. I had to stop using
Speeding up the compiler can also be part of this. I don't know what kind of gains are possible, but 2x seems a conservative guess. It's still not well understood why the system does so much work when loading packages. In theory, if a package contains only type and method definitions, loading it should take hardly any time, because compilation is mostly lazy. In practice, many packages run code at load time, but my impression is that they don't run enough to require compiling a large percentage of the package.
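A small demonstration of that laziness (absolute timings are machine-dependent; only the relative costs matter):

@time eval(:(g(x) = x + 1))   # cheap: the definition is only parsed and lowered
@time g(1)                    # expensive: the first call triggers inference and codegen
@time g(1)                    # cheap again: the compiled code is reused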
Agreed, especially given that profiling indicates that most of the time is spent in flisp.c. See #7280 (comment) and the gist posted below that.
Commit 8973ed1 might change that feature of the profile, though it's clearly not enough.
Nope, but if your machine happens to give good C backtraces, it's really easy to test this yourself:
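The snippet itself wasn't preserved in this thread; a minimal reconstruction of the idea, profiling a package load with C frames included (the package name is a stand-in):

@profile require("SomePackage")   # profile loading, keeping C backtraces
using ProfileView
ProfileView.view(C=true)          # flame graph with C frames, so flisp.c shows up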
I just don't happen to be on such a machine at the moment. (If Tk doesn't work for you, the IJulia interface is really nice.)
It's all in flisp.c: more than 90% of the time. I can't easily export the graph for you at the moment because of timholy/ProfileView.jl#16, but if the statements above work for you, you can easily get this (and far more) information yourself.
Maybe you can parse using JuliaParser to see where the possible hot spots are (since it shares a common structure with the flisp parser)?
That makes me think
I have some code around showing that parsing a few hundred simple comprehensions in a function takes a few milliseconds. Lowering those to Julia AST for sticking in the function table then takes several seconds.
Nope. Just created my own super-exciting package (I should register this sucker 😉):
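The package itself wasn't preserved; a sketch of what it presumably contained (the file name and function bodies are guesses, but the scale of 10,000 trivial definitions matches the numbers below):

# write a module containing 10,000 one-line function definitions
open("/tmp/TestPkg.jl", "w") do io
    println(io, "module TestPkg")
    for i = 1:10000
        println(io, "f$i(x) = x + $i")
    end
    println(io, "end")
end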
Load time is 6.8 seconds, and the result is even clearer: it looks like about 97% of the time is in flisp.c. As I believe I once mentioned, I have seen an extremely rare bug in ProfileView, but in this case I've checked.
Man, if we just cache the lowered code, who cares about full precompilation?!?
Wow, that is really shady. The front end must be doing something really stupid. Fortunately, flisp has a profiler: https://github.com/JeffBezanson/femtolisp/blob/master/tests/test.lsp#L219
Well, of course these don't test actually using the code, so we'll want compilation too. To test this, I modified my script to call each of these 10,000 functions once:
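Presumably something like the following, extending the hypothetical generator above (the exact layout is a guess):

open("/tmp/TestPkg.jl", "w") do io
    println(io, "module TestPkg")
    for i = 1:10000
        println(io, "f$i(x) = x + $i")   # definition
    end
    for i = 1:10000
        println(io, "f$i(1)")            # call: forces compilation at load time
    end
    println(io, "end")
end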
The load time went from 6.8 seconds to 18 seconds. However, lowering is still more than half the time (remember that calling the functions added another 10001 lines to the file). Of the remainder,
Glad to hear about the
Parsing performance is definitely not the problem:

julia> @time require("/tmp/testscript.jl")
elapsed time: 85.278999524 seconds (140911932 bytes allocated, 0.18% gc time)

julia> @time parse(open(readall, "/tmp/testscript.jl"));
elapsed time: 1.522348909 seconds (8937864 bytes allocated)

julia> @time Parser.parse(open(readall, "/tmp/testscript.jl"));
elapsed time: 1.121571927 seconds (66361048 bytes allocated, 9.37% gc time)
This discussion has been fruitful. Everybody, I am really sorry: I made an absurd mistake that has been causing us to call lowering twice on many top-level expressions. I was mis-identifying which expressions needed to be expanded. Oh-so-usefully, the lowering pass is idempotent, so I never noticed until I profiled it carefully. 454344f fixes it. For me, Gadfly now loads in 27 seconds instead of 48. Too embarrassed to celebrate :|
It is kind of hard to evaluate parsing performance with only a reduced test case like @timholy's example above; parsing performance varies dramatically between different input files. I guess this is only really testing one code path through the parser, but it looks like a big win. Congrats! It would be nice to have a reduced test case that really stresses the GC in flisp.
That's great news!
Now I am running into bizarre failures and haven't been able to get lowering working yet.
It's too bad speed.julialang.org has been dead for a couple of months...
How long do they take to run? Could I stick it on the end of PkgEval's nightly run? I'm already building master's last good build nightly anyway.
Probably a good idea to integrate the systems. I think it's just the perf tests, hopefully not that time-consuming. There's some value in comparing, say, Accelerate to OpenBLAS on Mac, but even resurrecting the trend lines for the default Linux configuration would be nice.
@tkelman resurrecting speed.julialang.org into something more modern is on my TODO list. It's been dead because the machines the perf tests were running on are no longer doing that, and the process I had for building nightly versions was kind of janky. I've set up a buildbot instance, which is MUCH better suited to the task, and have been getting everything configured for the last week or so. My dream is a setup where continuous testing, builds, and speed tests can all happen in one flexible framework. Right now I have OSX build/test/deploy and Ubuntu build/test/deploy working, along with OSX bottle building for Homebrew.jl.
I'm now able to both parse and lower the 25000-function file. I don't think I know Chicken well enough to come to a conclusion yet, but so far Chicken with default settings takes 19 seconds, and flisp takes 15. Chicken reported 80% GC time, so I tried giving it a 256MB heap, and this reduced its run time to 5.3 seconds. flisp can play the same space-vs-time game, and with the same heap size took 11.7 seconds. It seems Chicken's GC is absurdly sensitive to heap size, but with a big enough heap it can be 2-3x faster than flisp.

A further problem is that, AFAICT, setting a heap size in Chicken also fixes the heap size so it can't grow, and it aborts if it runs out of space. This is not OK. They really need a different heap-growth policy, and with that their default performance could be much better. I tried the

Chicken is different from both flisp and Julia in that all top-level expressions run at run time, not at compile time. I don't know how much this affects performance, but I feel it leads to some strange behaviors. I have some top-level expressions that call
Jeez. 256MB heap? That seems absurdly large.
Lisp generates a lot of garbage :-)
The "port the parser to Julia" option is looking better and better all the time.
@JeffBezanson I guess the other one to try would be Gambit. Have you pushed all your latest changes? It might also make sense to reach out to the Chicken Scheme community. Schemers seem to be just as fanatical as Julians when it comes to optimizing code for their favorite Scheme implementation.
Turned out 128MB was nearly as good. 64MB worked but was way slower. In any case, that kind of thing is a machine's job :)
Ok, all changes pushed to
@jakebolewski yes, we should do that. They might have more GC-tuning tricks, or can perhaps even fix things for us.
Ok, I've been playing with this long enough. Here is a quick guide to testing this out. First compile r5rs-compat.scm to a shared library:

csc <compiler options> -dynamic r5rs-compat.scm

Then, in a test script (e.g. test.scm), load it and profile the front end:

(load "r5rs-compat.so")

(define (profile-frontend fname)
  (let* ((ast (time (jl-parse-file fname))))
    (time (julia-expand ast)))
  #t)

(profile-frontend "/tmp/torture.jl")
julia/src [a865ca2●] » csi test.scm

CHICKEN
(c) 2008-2014, The Chicken Team
(c) 2000-2007, Felix L. Winkelmann
Version 4.9.0.1 (stability/4.9.0) (rev 8b3189b)
linux-unix-gnu-x86-64 [ 64bit manyargs dload ptables ]
bootstrapped 2014-06-07

; loading test.scm ...
; loading r5rs-compat.so ...
; loading /usr/lib/chicken/7/srfi-60.so ...
2.56s CPU time, 0.566s GC time (major), 2058936 mutations, 28/7289 GCs (major/minor)
30.91s CPU time, 25.967s GC time (major), 28380143 mutations, 182/25774 GCs (major/minor)

@JeffBezanson which generated file were you using exactly? (Many versions have been posted to date.)
Are you using the latest from-source build? Out of the box the performance was pretty good, a solid 2x. With a little tuning I'm getting:

julia/src [jcb/scheme●] » csc -O4 -u -lfa2 -inline r5rs-compat.scm
...
[jcb/scheme●] » csi test.scm
CHICKEN
(c) 2008-2014, The Chicken Team
(c) 2000-2007, Felix L. Winkelmann
Version 4.9.0.1 (stability/4.9.0) (rev 8b3189b)
linux-unix-gnu-x86-64 [ 64bit manyargs dload ptables ]
bootstrapped 2014-06-07
; loading test.scm ...
; loading r5rs-compat.so ...
; loading /usr/lib/chicken/7/srfi-60.so ...
1.71s CPU time, 0.34s GC time (major), 1713936 mutations, 19/5826 GCs (major/minor)
8.627s CPU time, 4.606s GC time (major), 23650143 mutations, 37/20337 GCs (major/minor)
# master version
julia/src [jcb/scheme●] » flisp/flisp test2.scm
Elapsed time: 3.215708017349243 seconds
Elapsed time: 24.24796199798584 seconds

This is with the torture test, which I'm finding hits the GC fairly hard.
I hadn't tried building a .so, but with that I sometimes get
I'm using the v4.9.0.1 Arch package. What test file are you parsing there?
I haven't run into that error. You can try enabling different default compilers/linkers with the csc switches.
Yes, quite a few more code changes are needed.
Enough to kick the tires. This seems like it could be a 2x+ speedup. It's also a bit crazy that your 2000-line interpreted flisp is only 2x slower than an optimizing compiler developed over 15 years.
That is pretty crazy. As @stevengj has pointed out, with a decent compiler and a good choice of algorithms, it is possible to get within 2-4x of the performance of many numerical libraries too.
This is a pretty encouraging result. No matter what path gets chosen, it shows that it's possible to do 2-3x better, at least on trivial code. Moreover, if there's an opportunity to combine multiple passes through the AST into a single pass, I imagine even more savings are possible. But it does emphasize the impressive performance-to-size ratio of flisp!
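A toy illustration of fusing AST passes (the passes here are invented; Julia's real lowering is far more involved): instead of one full tree walk per transformation, apply both rewrites in a single traversal.

# one generic bottom-up walk over an expression tree
walk(f, ex) = isa(ex, Expr) ? Expr(ex.head, [walk(f, a) for a in ex.args]...) : f(ex)

pass1(x) = x == :a ? :b : x          # toy pass: rename symbol a to b
pass2(x) = x == :(+) ? :(*) : x      # toy pass: replace + with *

# fused: each leaf is visited once, with both passes applied
fused(ex) = walk(x -> pass2(pass1(x)), ex)

fused(:(a + a))   # returns :(b * b)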
I tried adding
I've been pretty amazed at how often branches are basically free. My favorite example is that our reduce-across-dims code now conceptually works like this (assuming a matrix as input):
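The code block didn't survive extraction; a sketch in the same spirit (a simple sum reduction, with the which-dimension branch deliberately left inside the cache-friendly, column-major loop):

function sumdim!(R, A)
    reduce_rows = size(R, 1) == 1    # collapsing the first dimension?
    for j = 1:size(A, 2), i = 1:size(A, 1)
        if reduce_rows
            R[1, j] += A[i, j]       # column sums
        else
            R[i, 1] += A[i, j]       # row sums
        end
    end
    return R
end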
Two years ago I would have said that the first thing to do was get that branch out of the inner loop. It actually turns out to have a performance impact of at most a few percent. And boy, does it make it easy to write cache-friendly reduction algorithms (which, compared to non-cache-friendly ones, improve performance many-fold).
I've been playing with various levels of caching to see what speed advantages they might give. One thing I noticed is that the compiler seems to be too greedy about what it

I think method lowering is still the most expensive step (behind macro evaluation, in many cases), but it seems that Julia should be able to directly eval more expressions as well.
I guess improving this would help with code that generates code in loops, right? As you know from discussions over at Speed.jl, I've definitely found that to be a major factor in slow load times.
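For concreteness, a sketch of that load-time pattern (the generated functions are invented): each iteration hands the front end a fresh expression to parse, lower, and eval.

# define a family of two-argument wrappers at load time
for (name, op) in [(:add2, :+), (:sub2, :-), (:mul2, :*)]
    @eval $name(a, b) = $op(a, b)    # every @eval is a full lowering round trip
end

add2(1, 2)   # 3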
Hmmm, at some point in the past I think the front end was more economical with simple expressions like