support shared-memory parallelism (multithreading) #1790
Comments
You certainly have a point here. Composability with parallel libraries is especially interesting. Work is underway to provide OpenMP support in LLVM, and if that happens it will be easier to interface with the OpenMP runtime. Otherwise, would it be sufficient to target a particular popular implementation like libgomp? At a higher level, it is true that programming distributed memory is much harder, but our thinking has been that once you have a distributed memory implementation it is relatively easy to make it take advantage of shared memory for performance without changing the programming model. Then if you can somehow make distributed memory programming easier, everything will be fine. It would be very valuable to have shared memory parallelism available, but it's not 100% satisfying just to tack on another parallel programming model. We could take the approach of adding runtime support, then punting the rest to the user or future API designs, though.
It is fairly easy to support the equivalent of
Is the LLVM blocks runtime (used by Apple in Grand Central Dispatch) worth considering as an alternative to OpenMP? http://libdispatch.macosforge.org
Yes, the requirement to use distributed arrays and multiple processes is the whole difficulty -- the key advantage of shared memory is that you have only a single kind of memory to work with. This becomes even more important when you go beyond simple data-parallel circumstances: e.g., parallelizing a sparse-matrix multiply, or a finite-difference time-domain simulation, or an FFT, are all fairly easy in shared memory [at least, before you get into heavy optimization], but are a lot more complicated in distributed memory. And the history of computer science is littered with the corpses of languages that tried to make distributed-memory programming as transparent as shared-memory. OpenMP seems to be by far the most popular technique for parallelizing compiled C/C++/Fortran code on shared-memory systems, so there is a lot to be said for playing nicely with the OpenMP runtime. But the runtime system may be something of an implementation detail that in principle could even be switched at compile-time (or even at runtime). [This may be easier said than done; if you have to pick one, I would go with OpenMP.] As I understand it, the main issue right now is that Julia's garbage collection is not multithreaded. As a first step, it wouldn't be so terrible if garbage collection were serialized in a multithreaded program, with the understanding that people writing multithreaded code should try to optimize things so that garbage collection can be deferred to happen infrequently. As Jeff says, I think the key thing is to support shared memory in the runtime with some kind of OpenMP-like semantics via low-level primitives. Julia's macro system is flexible enough that people can then experiment with different syntaxes (although you will probably eventually want to standardize on one syntax to build into Base).
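For a concrete sense of what such a macro-based syntax can look like, here is a sketch using `Threads.@threads`, the loop annotation that eventually landed in much later Julia versions (nothing like it existed when this comment was written). It illustrates why the finite-difference case is easy in shared memory: every thread simply indexes the same arrays.

```julia
# A shared-memory parallel loop with the macro syntax Julia later adopted.
# No data distribution or explicit communication: all threads see the same arrays.
function diffuse!(unew::Vector{Float64}, u::Vector{Float64}, dt::Float64)
    n = length(u)
    Threads.@threads for i in 2:n-1
        # 1-D finite-difference update; safe because each iteration writes
        # only unew[i], and no thread writes to u
        unew[i] = u[i] + dt * (u[i-1] - 2u[i] + u[i+1])
    end
    return unew
end

u = rand(10^6)
unew = similar(u)
diffuse!(unew, u, 0.1)
```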
Here's an old issue on a related topic: For the record, I'd love this too, but not if it creates disasters.
Intel recently open-sourced its OpenMP library under the BSD license.
OpenMP support is forthcoming in clang -- although the approach taken is to do it in the clang frontend rather than in the LLVM IR. See the presentation linked to in this article.
Since Julia is unlikely to adopt the OpenMP
Right, Julia is unlikely to adopt the OpenMP syntax, but clang having OpenMP capability makes it possible for us to compile FFTW and OpenBLAS with OpenMP on Mac, where we use clang as the default compiler. This is nice to have, when it happens.
Last I checked, it was a bad idea to compile FFTW with clang; gcc gave significantly better performance. I don't know about OpenBLAS. For your precompiled Mac binaries, you might want to look into this. (Fortunately, the Intel OpenMP library is supposedly ABI-compatible with libgomp, so you can hopefully compile FFTW with gcc and
It seems that perhaps there are really three distinct (but related) issues being raised here:
Regarding issue 1, I don't see why Julia would need to change its programming model to support shared memory communication. (Perhaps I've just revealed my own ignorance regarding Julia.) For example, MPI was designed with distributed computing in mind; however, many MPI implementations (e.g. OpenMPI) are still able to pass messages directly in memory when the sender and receiver are on the same machine. Hopefully, Julia will be able to (if it doesn't already) optimize its communication strategy in a similar manner. Regarding issue 2, I don't think it's unanimous that thread-based programming models are always the best way to write high-performance parallel algorithms. In fact, I believe the designers of the D programming language decided to implement a message passing model to avoid many of the pitfalls of thread-based programming (see http://www.informit.com/articles/article.aspx?p=1609144 for details). Furthermore, with the exception of very simple programs, getting good performance out of OpenMP usually requires more effort than adding a few pragmas. As far as the simple cases are concerned, I think implementing parallel loop macros might be sufficient to make Julia as convenient as OpenMP. Unfortunately, I don't have much to offer regarding issue 3. I hope this helps.
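For readers unfamiliar with the one-sided communication API being referenced, a minimal sketch follows. It uses the modern `Distributed` standard-library spellings (`addprocs`, `remotecall`, `fetch`, `pmap`); the Julia of this era used slightly different names (e.g. `remote_call`), so treat the identifiers as illustrative rather than contemporary.

```julia
# Minimal sketch of Julia's one-sided, distributed-memory model (modern names).
using Distributed
addprocs(4)                        # start 4 worker processes (possibly on one machine)

@everywhere heavy(x) = sum(sin, x .* (1:10^6))

r = remotecall(heavy, 2, 0.5)      # ask worker 2 to run heavy(0.5); returns a Future
val = fetch(r)                     # block until the result arrives

# When sender and receiver share a machine, the transport underneath these calls
# could use shared memory instead of sockets without changing this API.
results = pmap(heavy, 0.1:0.1:1.0)
```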
It seems that there has been progress made with OpenMP and clang. See the articles below: http://www.phoronix.com/scan.php?page=news_item&px=MTQ0NjQ and http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-August/065169.html I have actually been watching the progress of Julia and this issue in particular before jumping into learning a whole new toolset. I genuinely believe this is a must-have for a language/tool like Julia, and I second the reasoning of stevengj above that this is just so much easier to code for than thinking in terms of distributed memory.
@bsilbaugh, no one is arguing that "thread-based programming models are always the best way to write high-performance parallel algorithms," just that they are significantly simpler in many common cases. The whole point of this is that in the simple cases where one wants to parallelize an expensive loop ( And "implementing parallel loop macros" is not sufficient, precisely because it doesn't deal with the issue of memory. The problem is not parallelizing the loop, the problem is that in distributed-memory systems you need to decide in advance where things are stored and deal with explicit communication. This is what shared memory avoids, and this is where automated language tools have been an abject failure in distributed-memory systems for 20 years. This is not a simple problem.
@stevengj Breathe. Maybe have some camomile tea. Look, a cute puppy. First, I think we agree on a few things:
Now, let's consider the main point:
For simple cases (loops), being able to use OpenMP is nice. No argument. But is it worth exposing another parallel programming model to the user? Another model that the user needs to think about when composing parallel libraries? (Developer A may decide he will only use threads for his library, and developer B will decide that he will only use one-sided comm for his library. But someday, developer C is going to need to put library A together with B, and will have to try to reason about their interaction.) I think this is perhaps where we disagree. We disagree about the idea of bolting on yet another parallel programming model to the language itself. And that's okay. Out of curiosity, have you tried automatic parallelism for some of the use cases you're talking about? Maybe compare a few test cases between OpenMP and Intel's automatic parallelism just to get a sense of what is possible with automatic parallelism. (I would be willing to do some of the legwork if you would be willing to dig up some good test cases.) If the present state of the art in automatic parallelism is good enough for the kind of problems you're talking about, then this might be a way to get shared memory support into Julia without actually requiring Julia users to learn multiple programming models. Obviously, automatic parallelism will take some effort to implement, but then again, the parallel programming model of the future is probably no model; i.e. I suspect multi-threading and message passing will eventually go the way of assembly programming.
@bsilbaugh, Intel's auto-parallelization still requires the language to support shared-memory parallelism (and in fact, Intel's parallelizer is built on top of OpenMP). They probably didn't build it on top of distributed-memory parallelism precisely because it is so difficult to automate memory distribution.
@stevengj Automatic parallelization requires compiler support (which may use OpenMP libraries internally), but it does not require language support. (Otherwise, it wouldn't be automatic.) You're right that Intel does not support automatic parallelism for distributed memory architectures (nor does OpenMP). But, for the simple cases you alluded to, perhaps it's enough. The point of my earlier post (which I think was in line with @ViralBShah's and @JeffBezanson's original comments) was that you may not need to change the current parallel model (i.e. the one-sided communication API supported by the standard library) simply for performance reasons. For example, calling fetch could just as easily dereference a pointer to a block of shared memory instead of pulling data over the network. Depending on Julia's JIT capabilities, perhaps even some of these calls (fetch, remote_call, etc.) can get optimized out.
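Julia later grew a facility very much in this spirit: `SharedArrays`, where worker processes on the same machine map the same block of memory, so "fetching" is just reading it. A sketch with modern names (SharedArrays did not exist when this comment was written):

```julia
# Same-machine workers sharing one block of memory instead of copying data.
using Distributed, SharedArrays
addprocs(4)

n = 10^6
s = SharedArray{Float64}(n)        # backed by OS shared memory, visible to local workers

@sync @distributed for i in 1:n    # each worker fills part of s in place
    s[i] = sqrt(i)
end

sum(s)
```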
@bsilbaugh, by "language support", I mean support in the Julia runtime, which is separate from the compiler. The main obstacle in this whole discussion is that the Julia runtime is not thread-safe. (And the simple cases I alluded to are no longer so simple in distributed-memory situations.) Yes, you can certainly implement distributed-memory primitives on top of shared memory, but that doesn't address the difference in ease of programming between shared and distributed programming models (regardless of implementation).
For shared memory parallelism, it would be worth looking at the Cilk model too. There is ongoing work to add it to LLVM (http://cilkplus.github.io/). Cilk avoids some of the composition problems that OpenMP has (notably nested parallelism). Though there's no free lunch -- OpenMP also has certain advantages. Another candidate worth understanding is Deterministic Parallel Java (http://dpj.cs.uiuc.edu/DPJ/Home.html). Maybe some of its techniques can be applied in Julia. I think the important thing is to understand the tradeoffs.
@ArchRobison, the semantics of OpenMP have been converging towards those of Cilk for some time now; OpenMP now has tasking. So, while I agree that Cilk has a good conceptual model for shared-memory parallelism, that question is somewhat orthogonal to what runtime threading library we use. (LLVM support is apparently somewhat irrelevant here since our syntax is not implemented in LLVM; we just need the runtime library.) But again, the stumbling block is thread safety in the Julia runtime library.
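For what it's worth, the task model Julia eventually exposed (`Threads.@spawn`, added much later in Julia 1.3) is close in spirit to the Cilk spawn/sync style discussed here: parallelism is expressed as tasks, the scheduler decides how they map to threads, and nested calls compose without fixing a thread count up front. A hedged sketch:

```julia
# Cilk-style nested task parallelism in the form Julia later adopted.
# Threads.@spawn creates a task the scheduler may run on any thread;
# fetch acts as the corresponding sync.
function psum(v::AbstractVector, lo::Int=1, hi::Int=length(v))
    hi - lo < 10_000 && return sum(view(v, lo:hi))   # serial cutoff
    mid = (lo + hi) >>> 1
    left = Threads.@spawn psum(v, lo, mid)            # spawn one half
    right = psum(v, mid + 1, hi)                      # keep working on the other
    return fetch(left) + right
end

psum(rand(10^7))
```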
That is true, but I'm just as worried about thread safety in every other Julia library that might be out there.
I'm not sure I understand the latter concern fully. For example, are there really that many functions in base that make use of non-constant global variables? I'm not saying there aren't any---I've written some of them myself---but I don't tend to think of it as a major feature of our code base. Of course with packages there are additional possibilities for conflict, but at least in my own packages I think it's pretty typical of what's in base---a small percentage might need some redesign for thread-safety.
Though OpenMP has adopted tasking, there are fundamental semantic differences with Cilk that impact composability and performance. Tasking in OpenMP is tied to parallel regions. The big advantage and big disadvantage (depending on context) is that the number of threads executing a parallel region must be bound when the region is entered, before the amount of work or potential parallelism is known. (I work with the Cilk/OpenMP/TBB groups at Intel. We've considered lazy schemes to try to circumvent the issue in OpenMP, but the OpenMP standard has pesky features that get in the way.) I agree that the big stumbling block is the run-time library and existing Julia libraries. Maybe a lint-like tool could inspect Julia libraries for "okay", "definitely not okay", or "a human should take a look"? From my beginner's experience, Julia seems to have much less of the alias analysis misery that stymies such tools for C/C++.
I am indeed worried about packages, and the various native libraries they might call. Thread safety is a pretty significant demand to put on all those things, especially since the failure mode is not an error message (or poor performance) but random horrible behavior. Julia code is designed to compose together quite promiscuously, so it is hard to say "my code is threaded, so I need to make sure I only use thread-safe libraries" --- one could easily pass a function from one package to an implicitly parallel higher-order function from another package. @ArchRobison great to have somebody from the Cilk group here.
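A hypothetical illustration of that composability hazard (all names invented): package A's function quietly mutates a global cache, which is fine serially; package B's threaded map has no way to know that, and the combination races. A lock over the shared state is the conventional fix.

```julia
# "Package A" caches results in a global Dict -- harmless in serial code:
const CACHE = Dict{Int,Float64}()
function cached_work(x::Int)
    get!(CACHE, x) do          # mutates shared global state
        sum(sin, 1:x)
    end
end

# "Package B" offers a threaded map; it cannot know cached_work is unsafe:
function tmap(f, xs)
    out = Vector{Any}(undef, length(xs))
    Threads.@threads for i in eachindex(xs)
        out[i] = f(xs[i])      # concurrent calls may corrupt CACHE
    end
    return out
end

# One conventional fix: guard the shared state with a lock.
const CACHE_LOCK = ReentrantLock()
function cached_work_safe(x::Int)
    lock(CACHE_LOCK) do
        get!(() -> sum(sin, 1:x), CACHE, x)
    end
end
```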
@ArchRobison, thanks for the clarification, that's very helpful.
Another issue to consider is the generality of the threading model versus ability to detect races. Automatic race detectors can reduce the pain of dealing with threading bugs. Examples are Helgrind, Intel Inspector, and Cilk screen. (See http://supertech.csail.mit.edu/papers/tushara-meng-thesis.pdf for the theory behind one for Cilk.) The efficiency of a race detector is somewhat inverse to the generality of the parallelism model, so it's something to consider in choosing the parallelism model. JIT compilation may be able to reduce the cost somewhat since it can instrument only the memory accesses that might constitute races. (In the jungle of C/C++ binaries that Inspector deals with, we have to instrument just about every access since binary stack frames don't give us much info to do thread escape analysis.)
@timholy what has been done so far?
Most of the Halide syntax, specifically loop reordering, hoisting constants, and tiling. Performance on a couple of test examples actually beats (single-threaded) Halide by a bit, surprisingly.
That sounds fantastic. Adding the "threading bit" is precisely the topic of this issue. What are the showstoppers?
Looks like I forgot to commit the The main showstopper is splitting the code for each tile into an independently-callable function, so that threads can be dispatched. However, I planned ahead, and at least the code to allocate temporary memory for each tile is annotated, which might make this easier.
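As a generic illustration (not the actual code from that work) of what "splitting each tile into an independently-callable function" can look like once a task API is available: each tile body becomes a function, tile-local temporaries are allocated inside it, and the tiles are handed to the scheduler.

```julia
# Each tile is processed by an independently-callable function, so the tiles
# can be dispatched as tasks; temporaries are allocated per tile, not shared.
function process_tile!(out, in, rows, cols)
    tmp = similar(in, length(rows), length(cols))   # tile-local temporary
    for (jj, j) in enumerate(cols), (ii, i) in enumerate(rows)
        tmp[ii, jj] = 2 * in[i, j]
        out[i, j] = tmp[ii, jj]
    end
end

function tiled!(out, in; tile=64)
    m, n = size(in)
    @sync for j in 1:tile:n, i in 1:tile:m
        rows = i:min(i + tile - 1, m)                # fresh bindings each iteration
        cols = j:min(j + tile - 1, n)
        Threads.@spawn process_tile!(out, in, rows, cols)
    end
    return out
end

A = rand(512, 512)
tiled!(similar(A), A)
```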
So where are we now? Have we decided on a model yet?
See the threads branch. I think we can close this issue and have more specific issues related to the implementation.
@ViralBShah, I think you mean the threading branch. I don't think this issue should be closed until threading is merged in some form. You can track an issue much more easily than you can track a branch.
There used to be a threads branch, and a newer threading branch, and eventually the threads branch was removed, IIRC. I believe @JeffBezanson and @kpamnany are getting this up to speed this week. Perhaps worth connecting with them.
We got a fair amount accomplished this week. Here's the status.
The plan is to merge into master soon after 0.4 branches, with |
I'm sensing an impending Travis build matrix addition. :) I can hear the server fans spinning up from here. Good thing that Travis is roflscale.
We will soon get dedicated hardware for you to play with. Let the build matrix grow with impunity.
Should we close this in favour of more specific issues dealing with multi-threading at this point?
You said that last October. Judging by
I think the answer there is "not until something gets merged."
Very exciting to hear about the progress here. I can tell you're just trying to keep me following master rather than focusing on 0.4. Your devious strategy just might work.
The
and several other errors. @vtjnash Know anything about this? Am I doing something wrong?
It sounds like someone needs to rebase the patches on top of the LLVM 3.7.0 release so that it can remain stable and buildable.
I'm using the |
Something got merged, so close? It's not enabled by default yet but seems to work at the moment. #9336 covers the prereq LLVM version, and other separate issues can be filed for the occasional bug or segfault that will get found in using the threading code.
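For anyone trying this out later: in released Julia the experimental threading is opt-in via an environment variable. The exact knobs on the early experimental builds may have differed, so treat this as the eventual interface rather than the contemporary one.

```julia
# Start Julia with threads enabled, e.g.:  JULIA_NUM_THREADS=4 julia
Threads.nthreads()     # how many threads the runtime was started with
Threads.threadid()     # id of the current thread (1 on the main thread)

ids = zeros(Int, 8)
Threads.@threads for i in 1:8
    ids[i] = Threads.threadid()   # each iteration writes its own slot, so no race
end
ids                    # shows which thread ran each iteration
```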
It seems to me that this will be really fun to close when we do finally have a non-experimental threading interface.
We'll continue making progress on thread-safety with PRs like #17250, but I think the original issue is resolved, and I don't see any sub-issues mentioned later (which should be opened as separate issues anyway).
The programming model of what you can do with
I realize that this would require major effort, but I'm hoping that you will keep this on your radar screen: the lack of native support for shared-memory parallelism is a huge shortcoming of Julia.
The reason to support shared-memory parallelism is that it is usually far, far easier to program than distributed-memory parallelism, especially if you adopt the same infrastructure (already in gcc) as OpenMP or Cilk+, in which the programmer indicates which function calls and loops can be executed in parallel and the runtime system decides how many threads to spawn and how to distribute the load (e.g. by work stealing to parallelize irregular problems, as in Cilk).
Julia is attractive because it combines ease of programming with performance near that of C, but the latter advantage largely disappears when you can simply put "#pragma omp parallel for" in your C loop and get a 10x speedup on your 12-core machine. Distributed arrays cannot compete with this in ease of use, especially for problems in which extensive inter-processor communication is required.
(Using the OpenMP runtime system is also nice in that the runtime system will then share the same worker threads with C libraries like OpenBLAS and FFTW, so that Julia and external C code won't fight over the CPUs.)