suggestions for faster startup #260

Closed
StefanKarpinski opened this issue Nov 12, 2011 · 37 comments
Labels
performance Must go faster

@StefanKarpinski
Member

Ideas for faster startup:

  • cache LLVM bitcode
  • cache native code generated by LLVM
  • mmap the heap data structures on startup

With all of these we may be able to get instantaneous startup.
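A minimal sketch of the mmap idea, written in Python rather than in Julia's C runtime. The file layout and the names `write_image` and `read_entry` are invented for illustration; this is not how Julia's system image is stored. It only shows that a mapped file lets you decode one entry without first reading the whole file into memory.

```python
import mmap
import os
import struct
import tempfile

def write_image(path, values):
    """Write a toy 'heap image': a 64-bit count followed by 64-bit integers."""
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(values)))
        f.write(struct.pack(f"<{len(values)}q", *values))

def read_entry(path, index):
    """Map the image and decode a single entry in place."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            (count,) = struct.unpack_from("<Q", m, 0)
            if not 0 <= index < count:
                raise IndexError(index)
            # Entries start after the 8-byte header; each is 8 bytes wide.
            (value,) = struct.unpack_from("<q", m, 8 + 8 * index)
            return value

path = os.path.join(tempfile.mkdtemp(), "toy_image.bin")
write_image(path, list(range(1000)))
entry = read_entry(path, 123)
print(entry)
```

The operating system pages in only the parts of the file that are touched, which is the property the suggestion relies on.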

@JeffBezanson
Member

Man, you really love mmap :) Not unjustified, to be sure. But compressing the trees will eliminate most heap objects and shrink the size of the image file, so it will probably be fine just to load it.

@StefanKarpinski
Member Author

mmap is the best system call.

@ViralBShah
Member

Startup seems ok to me now. Jeff has implemented the tree compression, and although startup improved, it is not instantaneous. Do we know where the time is spent in startup, and which of the suggestions above may help?

@JeffBezanson
Member

It's certainly code generation. Part julia->llvm and part llvm->native. The second one is probably bigger since it runs optimization passes. In fact disabling llvm optimization passes cuts 0.3s off startup time for me.

@JeffBezanson
Member

One thing to do is experiment with removing optimization passes (codegen.cpp:1990), and see what can be removed without hurting performance.

@StefanKarpinski
Member Author

While that's helpful, it's never going to get us to really instant startup. Your analysis suggests that what we really need to get there is storing pre-generated machine code in the startup image.


@ViralBShah
Member

As the standard library keeps growing, startup will continue to become slower. It seems that Stefan's suggestion of pre-generation is the right one. Notice how slow building sys.ji has become.

@andychu

andychu commented Feb 19, 2012

FWIW I just tried Julia and this is the first thing I noticed:

$ time julia hello.j
hello

real 0m2.259s
user 0m2.176s
sys 0m0.076s

$ cat /proc/cpuinfo | grep model
model name : Intel(R) Core(TM) i3 CPU M 370 @ 2.40GHz

It seems excessively slow. One thing that gives a good first impression about node.js is that it starts up fast -- faster than Python or Ruby. And you can use it in shell scripts.
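For anyone who wants to repeat this kind of measurement across interpreters, here is a rough sketch in Python. The helper name `time_startup` is made up for this example, and the current Python interpreter is used as a stand-in command so the snippet runs anywhere; substitute something like `["julia", "hello.j"]` to time Julia itself.

```python
import subprocess
import sys
import time

def time_startup(cmd, runs=5):
    """Return the best wall-clock time in seconds over several runs of cmd."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the program's output; only the elapsed time matters here.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in command: start the current Python interpreter and print one line.
elapsed = time_startup([sys.executable, "-c", "print('hello')"], runs=3)
print(f"best of 3: {elapsed:.3f}s")
```

Taking the best of several runs reduces noise from a cold disk cache, which matters when comparing numbers measured on different machines.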

The big mistake in Python and Ruby is that they allow arbitrary code at import time of any module. So big programs often take multiple seconds to even get to main(). Same with C++ -- in a large code base, static initialization before main() often takes multiple seconds. It also leads to all sorts of annoying language specification issues.

I think Dart has some concept of immutable modules (no top-level mutable state?) that might address this, but I haven't found any details (fast application startup is a design goal, though: http://www.dartlang.org/docs/technical-overview/index.html#goals). It is pretty easy to get this wrong.

But I'm excited about Julia, it's amazingly full-featured for an initial release.

@StefanKarpinski
Member Author

Yep. Startup is slow. It's a serious annoyance. Obviously we want it to be lightning fast, but it's hard when you're doing all that JITing. I think the conclusion at this point is that we may want the ability to make compiled binaries, which would allow the REPL itself to be compiled and start up faster. That's essentially equivalent to storing pre-generated machine code (that's basically what a binary is).

@ViralBShah
Member

Once we modularize our libraries, we may not need to load the entire world on startup. We can also try to get to the prompt earlier and let stuff happen in the background for a few seconds.

-viral
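The "get to the prompt earlier" idea can be sketched as follows, in Python and with invented names (`BackgroundLoader`, `slow_library_load`). It assumes the expensive work can safely run on another thread, which may not hold for a real runtime; it is an illustration, not a proposal for how Julia should do it.

```python
import threading
import time

class BackgroundLoader:
    """Run an expensive load function on a worker thread."""

    def __init__(self, load):
        self._result = None
        self._thread = threading.Thread(target=self._run, args=(load,), daemon=True)
        self._thread.start()

    def _run(self, load):
        self._result = load()

    def get(self):
        """Block until loading has finished, then return the result."""
        self._thread.join()
        return self._result

def slow_library_load():
    # Hypothetical stand-in for compiling or loading a large library.
    time.sleep(0.2)
    return {"fft": "loaded"}

start = time.perf_counter()
loader = BackgroundLoader(slow_library_load)
time_to_prompt = time.perf_counter() - start  # a prompt could be shown here
library = loader.get()                         # first real use waits if needed
print(f"prompt after {time_to_prompt:.3f}s, library: {library}")
```

The cost does not disappear; it moves off the path to the first prompt and is only paid by the user if they need the library before loading completes.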


@andychu

andychu commented Feb 20, 2012

Suggestion: move everything into a module, and then provide a .juliastartup file. Here is my .pystartup file:

$ cat ~/.pystartup
import os
import sys
if sys.platform == 'darwin':
    import rlcompleter
    import readline
    readline.parse_and_bind('bind ^I rl_complete')

$ python
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:05:24)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> os.system

^^^^ Now I don't have to "import os" when starting an interactive Python interpreter, at the expense of slightly slower startup. I could also have done "from os import *" to import everything into the main namespace.

When I run a Julia program that does print("hello"), or a unit test, I would like it not to compile any FFT functions (I presume it's compiling everything in http://julialang.org/manual/standard-library-reference/). You could also provide a variable for the .juliastartup to tell whether it's running a program in batch mode or at an interactive prompt. Users can then add a bunch of their own stuff if they want everything loaded.

When there is a module system, I imagine that the code will be compiled on import. So if everything is moved into a module, it will solve the startup time problem mostly without having to speed up compilation itself. I think there is a bit too much in the global namespace now.
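To illustrate the "pay only on import" idea in the Python terms used above, here is a small sketch of a proxy that defers importing a module until one of its attributes is first used. `LazyModule` is an invented name, and `json` is just a convenient standard-library example.

```python
import importlib

class LazyModule:
    """Proxy that imports the named module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only reached for attributes not found on the proxy itself,
        # so _name and _module never recurse into this method.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)

lazy_json = LazyModule("json")
loaded_before = lazy_json._module is not None   # nothing imported yet
text = lazy_json.dumps([1, 2])                  # triggers the real import
loaded_after = lazy_json._module is not None
print(loaded_before, text, loaded_after)
```

A script that never touches the proxy never pays for the import, which is the behaviour being asked for with FFT functions in a hello-world program.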

@JeffBezanson
Member

It's not actually compiling the entire standard library, just methods needed at startup and their dependencies. This does touch a large amount of code since we use regexes, various data structures, etc., but not stuff like FFTs.
Getting the size of the global namespace just right is indeed a delicate and important balance.

@ViralBShah
Member

Yes, that is what I have been thinking as well. Once we have support for modules (soon enough), we should be able to move most of the stuff into modules. For that reason, I am holding off on any major refactoring of the library code.

@StefanKarpinski
Member Author

@andychu: I should point out that "bare" Julia with no imports is very different from "bare" Python with no imports. Bare Julia literally doesn't even have the ability to add or print integers, let alone floats or strings. That's because almost all functionality is implemented in Julia itself instead of in C code that's pre-compiled and always available.

@pao pao mentioned this issue Jul 18, 2012
@salehqt

salehqt commented Apr 12, 2013

I see that most of the time is spent compiling Julia->LLVM and then LLVM->native. With sys.ji most of the LLVM code is cached, but at this point caching the native code is more important. Most of the standard library does not change, so why can't we compile sys.ji into sys.so? In my experience with LLVM, compiling LLVM bitcode into a .so is extremely simple. Loading shared libraries is very fast and does not require jumping through hoops.
Another suggestion was to compile each module of the standard library into a separate .bc or .so file and load them on demand. For an extremely dynamic language like Julia, .bc and .so files should be used only as a cache, but this would still make start-up instant.

Instant start-up is also essential if we want to use Julia to interface with other Unix applications using standard shell scripting. It is also essential when developing and debugging code, since the functions and types in Julia are immutable.

@salehqt

salehqt commented Apr 12, 2013

Again, why can't we compile everything into a .so file and let the system handle mmap (and function look-up) for us?

@ghost ghost assigned JeffBezanson Apr 12, 2013
@JeffBezanson
Member

Nobody said we can't compile to a .so. In fact that's exactly what we're talking about doing; there have been several threads on the topic.

@StefanKarpinski
Member Author

@salehqt: The main impediment here is that we use LLVM's JIT infrastructure, which doesn't generate bitcode. It's unclear what the best way to generate .so files from Julia code is. One option would be to port our JIT over to MCJIT, which seems to basically generate a .so in memory and then use it. If you've got any expertise in generating .so files from jitted code, it would be quite welcome.

@salehqt

salehqt commented Apr 13, 2013

In sum, there is no need to replace the current JIT. The best way to do it is to use .so file as a cache.

In my experience with LLVM (implementing a toy language), JIT-compiling code is a shortcut for writing bitcode, compiling it to a .so, and reading it back into memory using dlopen.

I think .so file generation is only useful when compiling standard libraries or a package and should be restricted to that. The difficult part is using the .so file as a cache: the JIT compiler should check whether a compiled version of a function can be found in the .so and use it before JITing the function.

SO files are also efficient data structures: one could put all kinds of metadata, hash tables, and even the Julia AST inside the SO file, so one SO file would represent a complete Julia package (a similar approach is used in .NET DLLs).

A sample use case would be:

  1. Implement a package in Julia and test it
  2. Compile the package down to .so file
  3. Use the package in an application by loading the .so instead of source code.
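The "check the cache before JITing" logic described above can be sketched in a few lines. This is Python with invented names (`CodeCache`, `toy_jit`, `SOURCES`), and a dictionary stands in for the precompiled .so; it says nothing about how hard it is to make reloaded code work inside the real runtime.

```python
class CodeCache:
    """Look up precompiled code first; JIT-compile only on a miss."""

    def __init__(self, precompiled):
        self._precompiled = dict(precompiled)  # stand-in for symbols in a .so
        self._jitted = {}
        self.jit_count = 0

    def lookup(self, name, compile_fn):
        if name in self._precompiled:
            return self._precompiled[name]
        if name not in self._jitted:
            self.jit_count += 1
            self._jitted[name] = compile_fn(name)
        return self._jitted[name]

# Hypothetical source table and "compiler" used only for this illustration.
SOURCES = {"square": "lambda x: x * x", "cube": "lambda x: x * x * x"}

def toy_jit(name):
    return eval(SOURCES[name])

cache = CodeCache({"square": lambda x: x * x})
square_result = cache.lookup("square", toy_jit)(4)   # served from the cache
cube_result = cache.lookup("cube", toy_jit)(2)       # compiled on first use
cache.lookup("cube", toy_jit)                        # second lookup: no recompile
print(square_result, cube_result, cache.jit_count)
```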

@StefanKarpinski
Member Author

How do you tell the jit to put the code in a .so?

@StefanKarpinski
Member Author

We're all sold on the idea, I just don't think it's nearly as simple as you're making it out to be. But I'd be extremely happy to find out I'm wrong.

@JeffBezanson
Member

All the work is in arranging things in the runtime and startup so that the reloaded code actually works.

@ViralBShah
Member

The challenge is not with the JIT storing the bitcode, as much as the dynamic nature of the language. But as Jeff said, most of the heavy lifting is already done.

@salehqt

salehqt commented Apr 13, 2013

This is the prototype of the function that writes bitcode, in the same page there is one that reads bitcode from file.
int LLVMWriteBitcodeToFile (LLVMModuleRef M, const char * Path )
http://llvm.org/docs/doxygen/html/group__LLVMCBitWriter.html

However, I haven't seen any API that saves the native code after it has been generated by the ExecutionEngine. What I was suggesting was to save the bitcode and then run an offline compiler (llc & gcc) to generate executable code. This is my code that does this (in Ruby).

# create the module mod and add some code to it
mod.write_bitcode("#{@module_name}.bc")
# Now build a shared library
system "llc -relocation-model=pic #{@module_name}.bc"
system "cc -shared #{@module_name}.s -o #{@module_name}.so"
# now load the shared library back into Ruby
require "./#{@module_name}.so"

This is why it is not really useful for the interpreter and should be used for offline compiling of standard libraries and other big packages, e.g. generating sys.so instead of sys.jl.

@StefanKarpinski
Member Author

Unfortunately, that's not helpful in this situation – if we had an offline compiler that could generate LLVM bitcode the problem would already be solved.

@salehqt

salehqt commented Apr 13, 2013

After examining the code, I can see that it is quite possible with minimal changes to the code. Like Jeff said, all the pieces are there, SO can be used just as another level of caching.

The solution would be calling jl_compile on all regular functions and jl_compile_hint for generic functions to generate all the LLVM code. Once the code is generated, it can be saved to a file for offline compilation.

The bitcode can be compiled to a shared library. At Julia start-up, in addition to the system image, the shared library is loaded along with it, jl_compile then should check for existence of a compiled function using dlsym before generating a new one. This would avoid all Julia->LLVM->Native JIT compilation that takes up most of the start-up cost.
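The dlsym check described above can be tried out from Python through `ctypes`, which resolves symbols in much the same way. This is a sketch under several assumptions: a POSIX system, the running process standing in for a precompiled sys.so, and made-up names `find_symbol`, `resolve`, and `toy_jit`. It is not Julia's `jl_compile`.

```python
import ctypes

def find_symbol(library, name):
    """dlsym-style lookup: return the native function, or None if absent."""
    try:
        return getattr(library, name)  # ctypes resolves the symbol here
    except AttributeError:
        return None

def resolve(library, name, jit_compile):
    """Prefer a precompiled symbol; fall back to the (toy) JIT otherwise."""
    native = find_symbol(library, name)
    if native is not None:
        return native, "precompiled"
    return jit_compile(name), "jit"

def toy_jit(name):
    # Hypothetical stand-in for real code generation.
    return lambda x: x

# Assumption: POSIX. CDLL(None) opens the running process itself, which
# exposes C library functions such as abs.
process = ctypes.CDLL(None)
abs_fn, abs_origin = resolve(process, "abs", toy_jit)
missing_fn, missing_origin = resolve(process, "hypothetical_uncached_fn", toy_jit)
print(abs_origin, abs_fn(-3))
print(missing_origin)
```

The lookup itself is cheap. As the replies in this thread point out, the difficult part is making sure that whatever the symbol points to is still valid for the current state of the runtime.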

@JeffBezanson
Member

A fine description of the easy part of this work.

@tshort
Contributor

tshort commented Apr 16, 2013

@salehqt, could you implement your approach for a package to see how it works? That might be an interesting exercise, and there are packages that could use faster start-up times.

@salehqt

salehqt commented Apr 16, 2013

Implementing this requires changing internals of Julia and it can break a lot of things. In the current state, Julia is implemented like C, all of the includes are imported in one big soup of AST and compiled on-demand to LLVM. The resulting LLVM module is also a soup of everything that is compiled. Julia modules and namespaces are merely scoping constructs and do not really separate the code.

I was hoping to just compile a native system image to accompany the current system image (that only contains Julia AST). Later on other developers would modularize Julia and make separate .so modules.

The first step in modularizing would be compiling packages to AST and storing them as binaries so Julia doesn't end up with problems that RubyGems had: loading and parsing too many source files at the start-up of real applications. In my experience with Julia, the front-end is one of the weakest points of Julia and you cannot rely on it for fast start-up.

This is my test implementation (it does not really work):
salehqt@37eba2a
I succeeded in extracting the LLVM code and compiling it to a shared object, but it fails when using the .so because the shared object cannot link with the Julia library. I did not spend much time figuring out why the symbols cannot be resolved.

@JeffBezanson
Member

Of course they are merely scoping constructs. Try disabling inlining of functions in different modules and see how well that goes over.

@StefanKarpinski
Member Author

Unfortunately turning inlining off will absolutely destroy performance.

@diegozea
Contributor

dzea@deepthought:~$ time julia-m -e "println(\"Hello\")"
Hello

real    0m1.907s
user    0m2.064s
sys 0m0.344s
dzea@deepthought:~$ time julia-m hello.j 
Hello

real    0m1.932s
user    0m2.056s
sys 0m0.376s
dzea@deepthought:~$ time julia-m --no-history -f hello.j 
Hello

real    0m1.415s
user    0m1.536s
sys 0m0.376s

Using the --no-history and -f flags, startup takes ~0.5 seconds less. It would be great to have a --script or --batch flag optimized for running scripts this way.

@jiahao
Member

jiahao commented Apr 28, 2013

I see essentially no difference on my Macbook Air (commit 2356fb8):

$ wc -l ~/.julia_history 
    5935 /Users/jiahao/.julia_history
$ time ./julia  --no-history -f -e "println(\"Hello\")"
Hello

real    0m2.191s
user    0m2.224s
sys 0m0.104s
$ time ./julia -e "println(\"Hello\")"
Hello

real    0m2.262s
user    0m2.292s
sys 0m0.110s
$ cat > hello.jl
println("Hello")
$ time ./julia hello.jl 
Hello

real    0m2.232s
user    0m2.261s
sys 0m0.109s
$ time ./julia  --no-history -f hello.jl 
Hello

real    0m2.284s
user    0m2.313s
sys 0m0.108s

@diegozea do you have a very long ~/.julia_history file?

@diegozea
Contributor

Yes, it's because mine is longer...

dzea@deepthought:~$ wc -l ~/.julia_history
66392 /home/dzea/.julia_history
dzea@deepthought:~$ rm .julia_history
dzea@deepthought:~$ time julia-m --no-history -f hello.j 
Hello

real    0m1.593s
user    0m1.716s
sys 0m0.376s
dzea@deepthought:~$ time julia-m hello.j 
Hello

real    0m1.569s
user    0m1.708s
sys 0m0.364s

@vtjnash
Member

vtjnash commented Dec 13, 2013

do i get to close this now? (#4898)

@Keno
Member

Keno commented Dec 13, 2013

Yes

@StefanKarpinski
Member Author

I dare say you've earned it. Please do the honors, @vtjnash.

@vtjnash vtjnash closed this as completed Dec 13, 2013