use multiple threads in vectorized operations #1802
Comments
An approach that might be less counter-Julian than calling out to C for everything would be to rejuvenate Krys' very interesting framework mentioned in this thread: IIUC, he designed it explicitly to support multiple "backends," one of which might be kernel threads. @stevengj, any interest in looking into it? It would be a fantastic contribution.
I have no interest in CUDA, which I see as an architectural dead end. The hard thing about parallel programming is mainly dealing with memory. Shared memory is hard enough (one kind of memory + synchronization), and distributed-memory programming is harder (local + remote memory). The CUDA/OpenCL model requires the programmer to manually manage 4-5 kinds of memory (main memory, GPU memory, thread-block shared memory, thread-local memory, and remote memory in a cluster situation), which is not a reasonable way forward for software engineering except in special cases.
I think it would be totally reasonable to start by adding some shared-memory loop-parallelism with restricted parallel regions to the core language. That way we can start by just slapping locks on everything in the runtime, still allowing for speedups in simple vector loops. A possible alternative is to add a separate shared heap as in Racket's "places".
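As a sketch of what such loop parallelism could look like, here is the style Julia eventually adopted; the `Threads.@threads` macro shown did not exist when this comment was written, so this is illustration only:

```julia
# Illustration only: shared-memory loop parallelism in the style being
# proposed, using the Threads.@threads macro from later Julia versions.
function saxpy!(y, a, x)
    Threads.@threads for i in eachindex(x, y)
        @inbounds y[i] += a * x[i]  # iterations touch disjoint elements
    end
    return y
end
```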
@stevengj, my point was that, while Krys was interested in CUDA, I seem to remember he explicitly designed it to handle multiple backends, of which kernel threads was another candidate. The main point of his framework was a way to implement lazy evaluation in Julia. He got some nice results even with CPU execution from removing temporaries, e.g., from operations like
@JeffBezanson, I like that idea a lot!
I like to think of the simple case of implementing
Something like the OpenMP or Cilk runtime has no problem with parallel functions calling parallel functions, nor with problems that are too small to parallelize. The key thing is that you don't implement shared-memory parallelism by actually spawning threads to do work; you tell the runtime system that work is available and you let it decide what thread to assign (or whether to just do the work serially). This is largely a solved problem (unlike distributed-memory parallelism). You just need to make the existing solutions accessible in Julia. (The experts on this are in Charles Leiserson's group at MIT; I would be happy to help you get in touch with them to bounce around ideas.)
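For concreteness, the spawn/sync model being described looks roughly like this in the syntax Julia eventually gained; `Threads.@spawn` postdates this discussion, so treat this as a sketch of the model, not an available API:

```julia
# Classic Cilk-style fib: expose work to the runtime and let its
# scheduler decide which thread (if any) picks it up.
function fib(n)
    n <= 1 && return n
    t = Threads.@spawn fib(n - 2)  # announce that work is available
    return fib(n - 1) + fetch(t)   # the runtime may run it on any thread
end
```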
Yes, I'm familiar with the concepts in Cilk. My biggest concerns are ABI issues and portability. Different compilers seem to have their own OpenMP implementations. Targeting libgomp is probably enough for non-Darwin unices, but I'd rather not have to do multiple backends.
Yes, this is certainly a solved problem in shared memory.
@JeffBezanson In the short run, it sounds like targeting libgomp will be a good idea (it runs fine on MacOS too in my experience, and is supposedly possible to get running on Windows). That will give you a good platform on which to get something working, shake the thread-safety bugs out, and experiment with syntax, while users will get the benefit of multithreaded operations (and hopefully easier parallelism in their own code too). In the long run (1-2 years at this rate), LLVM will probably support OpenMP directly, and it sounds like they are talking about supporting multiple backends, so that difficulty will be taken care of for you.
Is there a non-GPL alternative? There is the Blocks runtime in LLVM, but I suspect it only works well on OS X, and is unlikely to be as well debugged and understood as OpenMP is.
Not that I know of. The gcc Cilk runtime is also GPL, and there is another OpenMP implementation called OpenUH that also seems to be derived from gcc and hence GPL. But in the long term, once LLVM supports OpenMP, it will probably target proprietary backends like Intel's too, so Julia users wanting a backend under a different license will have options.
http://www.mathworks.in/support/solutions/en/data/1-4PG4AN/ It seems that MathWorks is just using pthreads, and they have hardcoded a bunch of special cases. I am guessing that they disable parallelism if a function is already running in parallel mode. It would be nice if someone who knows more about Matlab's threading model could chime in, just so that we know what's going on. Of course, I believe we will do something systematic and better. I just dread putting all these restrictions on size and shape and whatnot all over our libraries to decide when to use one core or multiple cores.
@ViralBShah: restrictions on size and shape are highly dependent on machine architecture, so why not just have "best guess" values that can be tuned by the end user? E.g. run a dumb optimization search over simple matrix operations, matrix sizes, and number of threads, as sketched below. I know the OpenBLAS people were talking about doing this, because right now they have default thresholds that they acknowledge may not be the optimum on any given machine.
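A minimal sketch of such a dumb search, assuming a hypothetical `threaded_sum(x; nthreads)` kernel whose thread count can be forced; the sizes and thread counts here are arbitrary:

```julia
# Hypothetical tuning sweep: time each (size, nthreads) pair and record
# the best thread count per size.  threaded_sum is assumed, not real.
function tune_thresholds(sizes = (10^3, 10^5, 10^7), threads = (1, 2, 4, 8))
    best = Dict{Int,Int}()
    for n in sizes
        x = rand(n)
        times = map(threads) do t
            (@elapsed(threaded_sum(x; nthreads = t)), t)
        end
        best[n] = last(minimum(times))  # tuples compare lexicographically
        println("n = $n: best nthreads = $(best[n])")
    end
    return best
end
```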
Matlab's threading model seems to have undergone changes over time. I don't really know anything about what's happening under the hood. But I've also seen cases where, when Matlab launches a Julia process via their
A stopgap solution using shmem and mmap: https://groups.google.com/forum/?fromgroups=#!topic/julia-users/Y_ME9uO5tVQ
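A rough sketch of that approach using the Mmap standard library from later Julia versions (the thread linked above used the older mmap_array API; the file name here is arbitrary):

```julia
using Mmap

# Any process that maps the same file sees the same memory, so worker
# processes can fill disjoint slices of A in parallel.
io = open("/tmp/shared_vec.bin", "w+")
A = Mmap.mmap(io, Vector{Float64}, 10^6)
A[1:10] .= 1.0
Mmap.sync!(A)  # flush so other processes observe the writes
```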
@timholy, processor affinity is the devil here. A lot of software uses it in order to make benchmarks look better, but in the real world it is worse than useless except when running on dedicated hardware. It really needs to be disabled by default (e.g. there is an option in OpenBLAS to disable it; FFTW doesn't use it at all unless you manually do taskset, but maybe MathWorks hacked their FFTW version). That, plus you need a way to control the total number of worker threads (e.g. an OMP_NUM_THREADS-style setting).
@stevengj, thanks for the explanation. I bet you're right about the hacked version; it's a behavior that seems relatively specific to FFTs (although for us that's a big part of our load on that server, when we're not doing them via CUDA, so maybe I wouldn't see it otherwise).
Related to #1790 |
Does the recent announcement of Intel open-sourcing its LLVM-OpenMP implementation make a difference for Julia in the short term? http://blog.llvm.org/2013/10/openmp-project.html |
Is this something we can expect when the |
No. That will just give us threading capabilities. Once that stabilises we will start with threading array operations and other things. |
This should be closed in favor of #19777, since dot calls have replaced vectorized operations for the most part. |
Matlab users who employ "properly" vectorized code may notice a dramatic slowdown in going to Julia because most Matlab vector operations are multithreaded while Julia operations are not. There is an intrinsic difficulty here because Julia code cannot be multithreaded (see issue #1790), but as a stopgap measure I suppose that you could call out to C code that uses OpenMP to parallelize important vector operations.
For example, compare the following Matlab (R2012a) code (on 16-core x86_64 3GHz AMD Opteron 4280, Debian GNU/Linux)
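(The original snippets were lost in extraction; what follows is a representative reconstruction, assuming an elementwise exp over a 10^7-element vector — the exact operation in the original is an assumption.)

```matlab
% Reconstruction of a representative benchmark (the exact original
% operation is an assumption): elementwise exp, repeated 10 times.
x = rand(10^7, 1);
tic; for i = 1:10, y = exp(x); end; toc
```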
with the Julia equivalent:
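(Again a reconstruction under the same assumption; a 2012-era version would have written `exp(x)` rather than the later dot syntax.)

```julia
# Reconstructed Julia equivalent of the Matlab benchmark above.
x = rand(10^7)
@time for i in 1:10
    y = exp.(x)
end
```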
Julia is more than 10x slower. The difference is almost entirely due to multithreading, since if I run Matlab with -singleCompThread I get timings comparable to Julia's.
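As a concrete illustration of the stopgap idea above, a C kernel along these lines could be compiled into a shared library and called via ccall; the function name, file names, and build flags here are illustrative assumptions, not anything that exists in Julia today:

```c
/* Sketch of the proposed stopgap: an elementwise exp kernel parallelized
 * with OpenMP.  Compile e.g.:
 *   gcc -O2 -fopenmp -fPIC -shared -o libvecops.so vecops.c
 */
#include <math.h>

void vexp(double *restrict y, const double *restrict x, long n)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        y[i] = exp(x[i]);
}
```

From Julia this could then be invoked with something like `ccall((:vexp, "libvecops"), Cvoid, (Ptr{Float64}, Ptr{Float64}, Clong), y, x, length(x))`.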
Also needed would be a way to specify the number of threads to use (e.g. you could use the standard OMP_NUM_THREADS environment variable by default, while providing some way to change it from within Julia). The nice thing about using OpenMP is that you can also use OpenMP for the OpenBLAS library and for FFTW (and possibly other libraries?), and use the same thread-pool for all three.
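A hedged sketch of how the thread count could be changed from within Julia by calling into the OpenMP runtime; `omp_set_num_threads` is a real OpenMP C function, but the helper below and the assumption that libgomp is loadable on the system are hypothetical:

```julia
# Hypothetical helper, not an existing Julia API: forwards a thread count
# to the OpenMP runtime.  Assumes libgomp is available on this system.
set_omp_threads(n::Integer) =
    ccall((:omp_set_num_threads, "libgomp"), Cvoid, (Cint,), n)

# Default to OMP_NUM_THREADS when set, otherwise fall back to 1.
set_omp_threads(parse(Int, get(ENV, "OMP_NUM_THREADS", "1")))
```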