use multiple threads in vectorized operations #1802

stevengj · 2012-12-21T01:46:22Z

Matlab users who employ "properly" vectorized code may notice a dramatic slowdown in going to Julia because most Matlab vector operations are multithreaded while Julia operations are not. There is an intrinsic difficulty here because Julia code cannot be multithreaded (see issue #1790), but as a stopgap measure I suppose that you could call out to C code that uses OpenMP to parallelize important vector operations.

For example, compare the following Matlab (R2012a) code (on 16-core x86_64 3GHz AMD Opteron 4280, Debian GNU/Linux)

>> x = rand(10000000,1);
Elapsed time is 0.211827 seconds.
>> tic; y = exp(x); toc          
Elapsed time is 0.035116 seconds.
>> tic; y = x.^2 + 3*x.^3; toc
Elapsed time is 0.159840 seconds.

with the Julia equivalent:

julia> x = rand(10000000,1);
julia> tic(); y = exp(x); toc()
elapsed time: 0.3979671001434326 seconds
julia> tic(); y = x.^2 + 3*x.^3; toc()
elapsed time: 2.3250949382781982 seconds

Julia is more than 10x slower. The difference is almost entirely due to multithreading, since if I run Matlab with -singleCompThread I get:

>> tic; y = exp(x); toc
Elapsed time is 0.406924 seconds.
>> tic; y = x.^2 + 3*x.^3; toc   
Elapsed time is 2.005534 seconds.

Also needed would be a way to specify the number of threads to use (e.g. you could use the standard OMP_NUM_THREADS environment variable by default, while providing some way to change it from within Julia). The nice thing about using OpenMP is that you can also use OpenMP for the OpenBLAS library and for FFTW (and possibly other libraries?), and use the same thread-pool for all three.

The text was updated successfully, but these errors were encountered:

timholy · 2012-12-21T02:27:32Z

An approach that might be less counter-Julian than calling out to C for everything might be to rejuvenate Krys' very interesting framework mentioned in this thread:
https://groups.google.com/d/topic/julia-dev/PZfOv4h94lk/discussion

IIUC, he designed it explicitly to support multiple "backends," one of which might be kernel threads. @stevengj, any interest in looking into it? It would be a fantastic contribution.

stevengj · 2012-12-21T02:35:01Z

I have no interest in CUDA, which I see as an architectural dead end. The hard thing about parallel programming is mainly dealing with memory. Shared memory is hard enough (one kind of memory + synchronization), and distributed-memory programming is harder (local + remote memory). The CUDA/OpenCL model requires the programmer to manually manage 4-5 kinds of memory (main memory, GPU memory, thread-block shared memory, thread-local memory, and remote memory in a cluster situation), which is not a reasonable way forward for software engineering except in special cases.

JeffBezanson · 2012-12-21T02:38:06Z

I think it would be totally reasonable to start by adding some shared-memory loop-parallelism with restricted parallel regions to the core language. That way we can start by just slapping locks on everything in the runtime, still allowing for speedups in simple vector loops. A possible alternative is to add a separate shared heap as in Racket's "places".

timholy · 2012-12-21T02:44:46Z

@stevengj , my point was, while Krys was interested in CUDA, I seem to remember he explicitly designed it to handle multiple backends, of which kernel threads was another candidate. The main point of his framework was a way to implement lazy evaluation in Julia. He got some nice results even with CPU execution from removing temporaries, e.g., from operations like a = b.*c + d + e, which in standard Julia requires at least two temporaries.

timholy · 2012-12-21T02:45:34Z

@JeffBezanson , I like that idea a lot!

ViralBShah · 2012-12-21T23:01:07Z

I like to think of the simple case of implementing sum over the columns of a matrix. Say you have as your primitive a parallel vector sum. Now, the implementation of the matrix column sum itself will be written to be parallel. Thus, we very quickly get instances of parallel functions calling other parallel functions. Then you have cases of fat and skinny matrices, or just small matrices which will not benefit from parallelization. We need a design that can at some level incorporate at least some of these things.

stevengj · 2012-12-21T23:06:22Z

Something like the OpenMP or Cilk runtime has no problem with parallel functions calling parallel functions, nor with problems that are too small to parallelize.

The key thing is that you don't implement shared-memory parallelism by actually spawning threads to do work, you tell the runtime system that work is available and you let it decide what thread to assign (or whether to just do the work serially).

This is largely a solved problem (unlike distributed-memory parallelism). You just need to make the existing solutions accessible in Julia.

(The experts on this are in Charles Leiserson's group at MIT; I would be happy to help you get in touch with them to bounce around ideas.)

JeffBezanson · 2012-12-21T23:44:29Z

Yes, I'm familiar with the concepts in Cilk. My biggest concerns are ABI issues and portability. Different compilers seem to have their own openmp implementations. Targeting libgomp is probably enough for non-darwin unices, but I'd rather not have to do multiple backends.

ViralBShah · 2012-12-21T23:58:46Z

Yes this is certainly a solved problem in shared memory.

stevengj · 2012-12-22T01:29:21Z

@JeffBezanson In the short run, it sounds like targeting libgomp will be a good idea (it runs fine on MacOS too in my experience, and is supposedly possible to get running on Windows). That will give you a good platform on which to get something working, shake the thread-safety bugs out, and experiment with syntax, while users will get the benefit of multithreaded operations (and hopefully easier parallelism in their own code too).

In the long run (1-2 years at this rate), LLVM will probably support OpenMP directly, and it sounds like they are talking about supporting multiple backends so that difficulty will be taken care of for you.

ViralBShah · 2012-12-22T08:38:57Z

Is there a non-GPL alternative? There is the Blocks runtime in LLVM, but I suspect it only works well on OS X, and is unlikely to be as well debugged and understood as OpenMP is.

stevengj · 2012-12-22T13:56:35Z

Not that I know of. The gcc Cilk runtime is also GPL, and there is another OpenMP implementation called OpenUH that also seems to be derived from gcc and hence GPL.

But in the long term, once LLVM supports OpenMP, probably it will target proprietary backends like Intel's too, so Julia users wanting to license a different backend will have options.

ViralBShah · 2012-12-22T18:40:41Z

http://www.mathworks.in/support/solutions/en/data/1-4PG4AN/

It seems that the mathworks is just using pthreads, and they have hardcoded a bunch of special cases. I am guessing that they disable parallelism if a function is already running in parallel mode. Would be nice if someone knows more about matlab's threading model can chime in, just so that we know what's going on.

Of course, I believe we will do something systematic and better. I just dread putting all these restrictions on size and shape and what not all over our libraries to decide when to use 1 core or multiple cores.

staticfloat · 2012-12-22T19:25:16Z

@ViralBShah; restrictions on size and shape are highly dependent on machine architecture; why not just have "best guess" values that can be tuned by the end user? E.g. run a dumb optimization search over simple matrix operations, matrix sizes, and number of threads. I know the OpenBLAS people were talking about doing this, because right now they have default thresholds that they use that they acknowledge may not be the optimum on any given machine.

timholy · 2012-12-22T19:44:07Z

Matlab's threading model seems to have undergone changes over time. In
particular, their having taken away the ability to interactively control
maxNumCompThreads suggests they're using a more sophisticated model these
days.

I don't really know anything about what's happening under the hood. But
recently I had a weird experience with their threads, that may provide some
hints about what is going on: we were in a situation where one MATLAB process
was eating all 32 cores on the machine, and the machine was unusable for
anyone else. I tried renice, then I tried taskset, but Matlab was like the
undead: temporarily I could force the number of cores down to ~10, but within
a few minutes it was right back at 32, even though it was the same PID. It was
running lots of FFTs, for which Matlab just uses FFTW, so maybe it was really
FFTW's "fault." But it was weird.

I've also seen cases where, when Matlab launches a Julia process via their
system command, it's always assigned to CPU 0. So you can have a big machine,
lots of people using it, and all the Matlab-launched Julia processes are
competing for the same core (and therefore running slowly). This doesn't seem
to be entirely reproducible, so I haven't yet dug in to figure out what is
wrong.

--Tim

On Saturday, December 22, 2012 10:40:42 AM Viral B. Shah wrote:

http://www.mathworks.in/support/solutions/en/data/1-4PG4AN/

It seems that the mathworks is just using pthreads, and they have hardcoded
a bunch of special cases. I am guessing that they disable parallelism if a
function is already running in parallel mode. Would be nice if someone
knows more about matlab's threading model can chime in, just so that we
know what's going on.

Of course, I believe we will do something systematic and better. I just
dread putting all these restrictions on size and shape and what not all
over our libraries to decide when to use 1 core or multiple cores.

Reply to this email directly or view it on GitHub:
#1802 (comment)

ViralBShah · 2013-04-29T06:56:04Z

A stopgap solution using shmem and mmap.

https://groups.google.com/forum/?fromgroups=#!topic/julia-users/Y_ME9uO5tVQ

stevengj · 2013-04-29T15:59:40Z

@timholy, processor affinity is the devil here. A lot of software uses it in order to make benchmarks look better, but in the real world it is worse than useless except when running on dedicated hardware. It really needs to be disabled by default (e.g. there is an option in OpenBLAS to disable it; FFTW doesn't use it at all unless you manually do taskset, but maybe Mathworks hacked their FFTW version).

That, plus you need a way to control the total number of worker threads (e.g. OMP_NUM_THREADS or whatever), and last I checked this was a bit dicey with Matlab.

timholy · 2013-04-29T17:56:58Z

@stevengj, thanks for the explanation. I bet you're right about the hacked version, it's a behavior that seems relatively specific to FFTs (although for us that's a big part of our load on that server, when we're not doing them via CUDA, so maybe I wouldn't see it otherwise).

ViralBShah · 2013-06-25T07:46:48Z

Related to #1790

rened · 2013-10-09T22:02:04Z

Does the recent announcement of Intel open-sourcing its LLVM-OpenMP implementation make a difference for Julia in the short term? http://blog.llvm.org/2013/10/openmp-project.html

jongwook · 2015-10-01T01:30:46Z

Is this something we can expect when the threading branch gets merged? I still see the 4 times difference (on a quad-core laptop) with MATLAB running OP's example.

ViralBShah · 2015-10-04T08:20:43Z

No. That will just give us threading capabilities. Once that stabilises we will start with threading array operations and other things.

stevengj · 2021-03-15T16:30:37Z

This should be closed in favor of #19777, since dot calls have replaced vectorized operations for the most part.

diegozea mentioned this issue Mar 18, 2013

laplace equation benchmark performance #1168

Closed

timholy mentioned this issue Oct 19, 2013

RFC: Shared-memory parallel computing #4580

Closed

alanedelman mentioned this issue Nov 26, 2014

Parallel Computing Planning Issue #9167

Closed

12 tasks

stevengj mentioned this issue Jan 6, 2015

do not set CPU affinity on Linux (see #3097) #9639

Merged

ViralBShah added the multithreading Base.Threads and related functionality label Mar 6, 2015

JeffBezanson mentioned this issue Jan 6, 2016

Plain old multiplication much slower than Matlab 2013a #14576

Closed

ChrisRackauckas mentioned this issue Aug 8, 2016

Consistent API for embarrassingly parallel routines between levels of parallelism #17887

Open

pabloferz mentioned this issue Feb 20, 2017

multi-threaded (@threads) dotcall/broadcast? #19777

Open

stevengj closed this as completed Mar 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

use multiple threads in vectorized operations #1802

use multiple threads in vectorized operations #1802

stevengj commented Dec 21, 2012

timholy commented Dec 21, 2012

stevengj commented Dec 21, 2012

JeffBezanson commented Dec 21, 2012

timholy commented Dec 21, 2012

timholy commented Dec 21, 2012

ViralBShah commented Dec 21, 2012

stevengj commented Dec 21, 2012

JeffBezanson commented Dec 21, 2012

ViralBShah commented Dec 21, 2012

stevengj commented Dec 22, 2012

ViralBShah commented Dec 22, 2012

stevengj commented Dec 22, 2012

ViralBShah commented Dec 22, 2012

staticfloat commented Dec 22, 2012

timholy commented Dec 22, 2012

ViralBShah commented Apr 29, 2013

stevengj commented Apr 29, 2013

timholy commented Apr 29, 2013

ViralBShah commented Jun 25, 2013

rened commented Oct 9, 2013

jongwook commented Oct 1, 2015

ViralBShah commented Oct 4, 2015

stevengj commented Mar 15, 2021

use multiple threads in vectorized operations #1802

use multiple threads in vectorized operations #1802

Comments

stevengj commented Dec 21, 2012

timholy commented Dec 21, 2012

stevengj commented Dec 21, 2012

JeffBezanson commented Dec 21, 2012

timholy commented Dec 21, 2012

timholy commented Dec 21, 2012

ViralBShah commented Dec 21, 2012

stevengj commented Dec 21, 2012

JeffBezanson commented Dec 21, 2012

ViralBShah commented Dec 21, 2012

stevengj commented Dec 22, 2012

ViralBShah commented Dec 22, 2012

stevengj commented Dec 22, 2012

ViralBShah commented Dec 22, 2012

staticfloat commented Dec 22, 2012

timholy commented Dec 22, 2012

ViralBShah commented Apr 29, 2013

stevengj commented Apr 29, 2013

timholy commented Apr 29, 2013

ViralBShah commented Jun 25, 2013

rened commented Oct 9, 2013

jongwook commented Oct 1, 2015

ViralBShah commented Oct 4, 2015

stevengj commented Mar 15, 2021