
use multiple threads in vectorized operations #1802

Closed · stevengj opened this issue Dec 21, 2012 · 23 comments
Labels: multithreading (Base.Threads and related functionality), parallelism (Parallel or distributed computation), performance (Must go faster)

Comments

@stevengj commented Dec 21, 2012

Matlab users who employ "properly" vectorized code may notice a dramatic slowdown when moving to Julia, because most Matlab vector operations are multithreaded while Julia operations are not. There is an intrinsic difficulty here because Julia code cannot be multithreaded (see issue #1790), but as a stopgap measure I suppose you could call out to C code that uses OpenMP to parallelize important vector operations.

For example, compare the following Matlab (R2012a) code (on a 16-core x86_64 3 GHz AMD Opteron 4280, Debian GNU/Linux):

>> tic; x = rand(10000000,1); toc
Elapsed time is 0.211827 seconds.
>> tic; y = exp(x); toc          
Elapsed time is 0.035116 seconds.
>> tic; y = x.^2 + 3*x.^3; toc
Elapsed time is 0.159840 seconds.

with the Julia equivalent:

julia> x = rand(10000000,1);
julia> tic(); y = exp(x); toc()
elapsed time: 0.3979671001434326 seconds
julia> tic(); y = x.^2 + 3*x.^3; toc()
elapsed time: 2.3250949382781982 seconds

Julia is more than 10x slower. The difference is almost entirely due to multithreading, since if I run Matlab with -singleCompThread I get:

>> tic; y = exp(x); toc
Elapsed time is 0.406924 seconds.
>> tic; y = x.^2 + 3*x.^3; toc   
Elapsed time is 2.005534 seconds.

Also needed would be a way to specify the number of threads to use (e.g. you could use the standard OMP_NUM_THREADS environment variable by default, while providing some way to change it from within Julia). The nice thing about OpenMP is that OpenBLAS and FFTW (and possibly other libraries?) can also be built against it, so all three can share the same thread pool.
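
To make the stopgap concrete, here is a minimal sketch, not a tested implementation: a hypothetical C kernel parallelized with OpenMP and called from Julia via ccall. The library name libvexp and the function vexp are made up for illustration.

# Hypothetical C kernel (vexp.c), compiled with something like:
#   gcc -O2 -fopenmp -fPIC -shared -o libvexp.so vexp.c
#
#   #include <math.h>
#   void vexp(double *y, const double *x, long n) {
#       #pragma omp parallel for
#       for (long i = 0; i < n; i++) y[i] = exp(x[i]);
#   }

function vexp(x::Vector{Float64})
    y = similar(x)
    ccall((:vexp, "libvexp"), Cvoid, (Ptr{Float64}, Ptr{Float64}, Clong),
          y, x, length(x))
    return y
end

# The thread count defaults to OMP_NUM_THREADS; it can also be changed
# at runtime through the standard OpenMP API, e.g.:
# ccall((:omp_set_num_threads, "libgomp"), Cvoid, (Cint,), 8)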

@timholy commented Dec 21, 2012

An approach that might be less counter-Julian than calling out to C for everything might be to rejuvenate Krys' very interesting framework mentioned in this thread:
https://groups.google.com/d/topic/julia-dev/PZfOv4h94lk/discussion

IIUC, he designed it explicitly to support multiple "backends," one of which might be kernel threads. @stevengj, any interest in looking into it? It would be a fantastic contribution.

@stevengj

I have no interest in CUDA, which I see as an architectural dead end. The hard thing about parallel programming is mainly dealing with memory. Shared memory is hard enough (one kind of memory + synchronization), and distributed-memory programming is harder (local + remote memory). The CUDA/OpenCL model requires the programmer to manually manage 4-5 kinds of memory (main memory, GPU memory, thread-block shared memory, thread-local memory, and remote memory in a cluster situation), which is not a reasonable way forward for software engineering except in special cases.

@JeffBezanson

I think it would be totally reasonable to start by adding some shared-memory loop-parallelism with restricted parallel regions to the core language. That way we can start by just slapping locks on everything in the runtime, still allowing for speedups in simple vector loops. A possible alternative is to add a separate shared heap as in Racket's "places".
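
(For readers coming to this thread later: this is roughly the shape the feature eventually took. A minimal sketch using Threads.@threads, which landed in Julia well after this discussion; it requires starting Julia with multiple threads, e.g. JULIA_NUM_THREADS=8.)

using Base.Threads

# Static loop parallelism over a restricted region: iterations of the
# loop body are split across the available threads.
function threaded_exp!(y::Vector{Float64}, x::Vector{Float64})
    @threads for i in eachindex(x)
        @inbounds y[i] = exp(x[i])
    end
    return y
end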

@timholy commented Dec 21, 2012

@stevengj, my point was that while Krys was interested in CUDA, I seem to remember he explicitly designed the framework to handle multiple backends, of which kernel threads was another candidate. The main point of his framework was a way to implement lazy evaluation in Julia. He got some nice results even with CPU execution from removing temporaries, e.g. from operations like a = b.*c + d + e, which in standard Julia requires at least two temporaries.
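
(To spell out the temporaries with a sketch, not Krys' actual code: each elementwise op in a = b.*c + d + e allocates a full-size intermediate, while the fused loop a lazy-evaluation framework could generate allocates only the result.)

# Standard evaluation allocates two full-size temporaries:
#   t1 = b .* c     # temporary 1
#   t2 = t1 + d     # temporary 2
#   a  = t2 + e     # result
# A fused loop touches each element once and allocates only the result:
function fused(b, c, d, e)
    a = similar(b)
    for i in eachindex(b)
        @inbounds a[i] = b[i]*c[i] + d[i] + e[i]
    end
    return a
end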

@timholy commented Dec 21, 2012

@JeffBezanson, I like that idea a lot!

@ViralBShah

I like to think of the simple case of implementing sum over the columns of a matrix. Say you have as your primitive a parallel vector sum. Now, the implementation of the matrix column sum will itself be written to be parallel. Thus, we very quickly get instances of parallel functions calling other parallel functions. Then there are cases of fat and skinny matrices, or just matrices too small to benefit from parallelization. We need a design that can, at some level, account for at least some of these things.
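
(A sketch of the sizing problem; parallel_sum and the cutoff below are hypothetical. Short columns fall back to the serial path, long columns go to the parallel primitive, and the outer loop could itself be parallelized, which is exactly the parallel-calling-parallel nesting described above.)

const PARALLEL_CUTOFF = 100_000  # made-up, machine-dependent threshold

# parallel_sum is the assumed parallel vector-sum primitive.
function colsums(A::Matrix{Float64})
    m, n = size(A)
    s = zeros(n)
    for j in 1:n          # this loop could itself be parallel
        col = A[:, j]
        s[j] = m < PARALLEL_CUTOFF ? sum(col) : parallel_sum(col)
    end
    return s
end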

@stevengj

Something like the OpenMP or Cilk runtime has no problem with parallel functions calling parallel functions, nor with problems that are too small to parallelize.

The key thing is that you don't implement shared-memory parallelism by actually spawning threads to do work; you tell the runtime system that work is available and let it decide which thread to assign (or whether to just do the work serially).

This is largely a solved problem (unlike distributed-memory parallelism). You just need to make the existing solutions accessible in Julia.

(The experts on this are in Charles Leiserson's group at MIT; I would be happy to help you get in touch with them to bounce around ideas.)
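
(What is described here is essentially a work-stealing scheduler: tasks are offered to the runtime, which decides where, and whether, to run them in parallel. Julia's much-later Threads.@spawn works this way; a hedged sketch of a composable divide-and-conquer sum:)

using Base.Threads: @spawn

function psum(x::AbstractVector, lo::Int=1, hi::Int=length(x))
    # Small ranges are not worth parallelizing; run serially.
    hi - lo < 10_000 && return sum(view(x, lo:hi))
    mid = (lo + hi) >>> 1
    # The right half is *offered* to the runtime; the scheduler picks
    # a thread for it (or runs it inline if all threads are busy).
    right = @spawn psum(x, mid + 1, hi)
    left = psum(x, lo, mid)
    return left + fetch(right)
end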

@JeffBezanson

Yes, I'm familiar with the concepts in Cilk. My biggest concerns are ABI issues and portability. Different compilers seem to have their own OpenMP implementations. Targeting libgomp is probably enough for non-Darwin Unices, but I'd rather not have to do multiple backends.

@ViralBShah

Yes, this is certainly a solved problem in shared memory.

@stevengj

@JeffBezanson, in the short run it sounds like targeting libgomp is a good idea (it runs fine on MacOS too in my experience, and is supposedly possible to get running on Windows). That will give you a good platform on which to get something working, shake out the thread-safety bugs, and experiment with syntax, while users get the benefit of multithreaded operations (and hopefully easier parallelism in their own code too).

In the long run (1-2 years at this rate), LLVM will probably support OpenMP directly, and it sounds like they are talking about supporting multiple backends so that difficulty will be taken care of for you.

@ViralBShah

Is there a non-GPL alternative? There is the Blocks runtime in LLVM, but I suspect it only works well on OS X, and is unlikely to be as well debugged and understood as OpenMP is.

@stevengj

Not that I know of. The gcc Cilk runtime is also GPL, and there is another OpenMP implementation called OpenUH that also seems to be derived from gcc and hence GPL.

But in the long term, once LLVM supports OpenMP, it will probably target proprietary backends like Intel's too, so Julia users wanting to license a different backend will have options.

@ViralBShah

http://www.mathworks.in/support/solutions/en/data/1-4PG4AN/

It seems that the MathWorks is just using pthreads, and they have hardcoded a bunch of special cases. I am guessing that they disable parallelism if a function is already running in parallel mode. It would be nice if someone who knows more about Matlab's threading model could chime in, just so that we know what's going on.

Of course, I believe we will do something systematic and better. I just dread putting all these restrictions on size and shape and whatnot all over our libraries to decide when to use one core or multiple cores.

@staticfloat

@ViralBShah: restrictions on size and shape are highly dependent on machine architecture; why not just have "best guess" values that can be tuned by the end user? E.g. run a dumb optimization search over simple matrix operations, matrix sizes, and numbers of threads. I know the OpenBLAS people were talking about doing this, because right now they have default thresholds that they acknowledge may not be optimal on any given machine.
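
(A sketch of the kind of dumb search suggested here, reusing the hypothetical threaded_exp! from the sketch earlier in this thread; one would record the size at which the threaded path starts winning and store it as the tunable default.)

function tune_exp_threshold()
    for n in (10^3, 10^4, 10^5, 10^6, 10^7)
        x = rand(n)
        y = similar(x)
        # Timings here are rough; a real tuner would warm up and repeat.
        t_serial   = @elapsed map!(exp, y, x)
        t_threaded = @elapsed threaded_exp!(y, x)
        println("n = $n: serial $(t_serial)s, threaded $(t_threaded)s")
    end
end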

@timholy commented Dec 22, 2012

Matlab's threading model seems to have undergone changes over time. In particular, their having taken away the ability to interactively control maxNumCompThreads suggests they're using a more sophisticated model these days.

I don't really know anything about what's happening under the hood. But recently I had a weird experience with their threads that may provide some hints about what is going on: we were in a situation where one Matlab process was eating all 32 cores on the machine, and the machine was unusable for anyone else. I tried renice, then I tried taskset, but Matlab was like the undead: temporarily I could force the number of cores down to ~10, but within a few minutes it was right back at 32, even though it was the same PID. It was running lots of FFTs, for which Matlab just uses FFTW, so maybe it was really FFTW's "fault." But it was weird.

I've also seen cases where, when Matlab launches a Julia process via their system command, it's always assigned to CPU 0. So you can have a big machine, lots of people using it, and all the Matlab-launched Julia processes competing for the same core (and therefore running slowly). This doesn't seem to be entirely reproducible, so I haven't yet dug in to figure out what is wrong.

--Tim


@ViralBShah

A stopgap solution using shmem and mmap.

https://groups.google.com/forum/?fromgroups=#!topic/julia-users/Y_ME9uO5tVQ
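
(The gist of that thread, sketched with the API that later became Julia's Mmap standard library: processes that map the same file see the same memory, so workers can fill disjoint chunks of one array.)

using Mmap

# A file-backed array; every process that maps /tmp/shared.bin sees
# the same underlying memory.
io = open("/tmp/shared.bin", "w+")
A = Mmap.mmap(io, Vector{Float64}, (10_000_000,))
A[1:10] .= 1.0   # visible to the other processes mapping this file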

@stevengj

@timholy, processor affinity is the devil here. A lot of software uses it to make benchmarks look better, but in the real world it is worse than useless except when running on dedicated hardware. It really needs to be disabled by default (e.g. there is an option in OpenBLAS to disable it; FFTW doesn't use it at all unless you manually do taskset, but maybe the MathWorks hacked their FFTW version).

That, plus you need a way to control the total number of worker threads (e.g. OMP_NUM_THREADS or whatever), and last I checked this was a bit dicey with Matlab.

@timholy commented Apr 29, 2013

@stevengj, thanks for the explanation. I bet you're right about the hacked version; the behavior seems relatively specific to FFTs (although for us those are a big part of our load on that server, when we're not doing them via CUDA, so maybe I wouldn't see it otherwise).

@ViralBShah

Related to #1790

@rened commented Oct 9, 2013

Does the recent announcement of Intel open-sourcing its LLVM-OpenMP implementation make a difference for Julia in the short term? http://blog.llvm.org/2013/10/openmp-project.html

@jongwook commented Oct 1, 2015

Is this something we can expect when the threading branch gets merged? I still see a 4x difference versus Matlab (on a quad-core laptop) when running the OP's example.

@ViralBShah

No. That will just give us threading capabilities. Once that stabilises, we will start threading array operations and other things.

@stevengj

This should be closed in favor of #19777, since dot calls have replaced vectorized operations for the most part.
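
(For reference, the OP's example in dot-call form, where the whole right-hand side fuses into a single in-place loop; that fused loop is what #19777 proposes to multithread.)

julia> x = rand(10_000_000);

julia> y = exp.(x);               # fused elementwise exp

julia> y .= x.^2 .+ 3 .* x.^3;    # entire RHS fuses into one loop, no temporaries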
