
RFC: Shared-memory parallel computing #4580

Closed

Conversation

@timholy (Member) commented Oct 19, 2013

This is 0.3 material or beyond.

A recent mailing-list discussion, and particularly a comment by Jason Riedy, pointed out a potential strategy for shared-memory parallelism that circumvents some of the problems with other strategies. I've fleshed that basic idea out here, hopefully in a way that provides a fairly nice Julian interface.

This is an RFC for several important reasons: at the moment it's Linux-only, there are API issues to consider, I haven't integrated it into the build process of Julia (it's easier to test and tweak this way, if someone else wants to do so), there are no docs, etc.

Noteworthy features:

  • This presents both a natural "array" interface and something suitable for parallelism, which I find to be the most attractive thing about this implementation. The main trick here is subverting the serializer so that we can have the full array version available in process id 1 without a big transport hit (the other processes just see the remote-ref version).
  • Compared to PTools' pfork, this has a more modest overhead. On my machine pfork has approximately 500ms minimum latency even for a trivial computation. Here, the pcall form is limited by the event queue, and on my machine seems to have a minimum latency of about 40ms. The busy-wait version pcall_bw gets that down to less than 1ms per process.
  • It uses shared memory but cleans up after itself nicely, so is a little like Amit's shared memory/server tasks without the risk of leaving resources dangling.

In my view the main API issue to consider is how to handle the return value(s). As an example of the conceptual challenge, many Julia functions have a return like fill!(), where the filled array is returned---this is nice because then you can chain together function calls very naturally. However, serializing that output will result in a huge performance hit. So the convention here is to return two things: the full output of the function call on process id 1 (which requires no serialization), and the output of each other process as an array of remote references. In cases where the output is already being stored in a pre-allocated SharedArray, none of these outputs matter anyway, because such results are immediately available in the array that exists in process id 1.

CC @amitmurthy.

@JeffBezanson (Member)

Great to see this. My main observation is that this has a lot of overlap with DArrays, and should maybe be a feature of that type.

@timholy (Member, Author) commented Oct 19, 2013

Yep, but with one important exception: you can guarantee that S[i,j,k] is fast for a SharedArray (just as fast as an Array), but that's not true for a general DArray. For that reason I wonder if they need to be different types.

@timholy (Member, Author) commented Oct 19, 2013

That said, I'm supportive of the idea of merging (fewer types is better for everyone), if a good way forward exists.

@JeffBezanson (Member)

It is probably good enough to have common interfaces and a common ArrayDist type (I almost called it Distribution, but...)

@timholy (Member, Author) commented Oct 19, 2013

Sounds good.

For the benefit of others who might want to review this, perhaps some additional explanation is in order. This is set up to make it basically trivial to handle embarrassingly-parallel problems (although it's more flexible than that). You create a SharedArray similarly to a regular one:

S = SharedArray(Int, (5, 8))

and then you can manipulate it just as an ordinary array (with what should be the same performance):

fill!(S, 15)
X = tan(S)   # X will be an Array, not a SharedArray
y = S[4,7]

etc. There's no special implementation of tan() here; it's the standard one. For most purposes, in process id 1 (the main process) S is just a regular array. (For code that doesn't allow AbstractArray inputs, you may have to say something like randn!(S.data); we might want to create a wrapper for such functions.) That's convenient when only some part of your code can/should run in parallel, and it's nice that there is no performance hit.

When you want to execute a parallel operation, you say

pcall(func, arg1, arg2, ...)

The arguments can be anything, and in particular you can have as many (or as few) shared arrays in that list as you want. There are major performance advantages to making sure that the args are small, as far as the serializer is concerned; the advantage of a SharedArray, like a DArray, is that no matter how big the actual data, it's effectively small as far as the serializer is concerned.

To take advantage of parallel processing, the function you pass to pcall needs just 1 or 2 lines for each argument that might be a SharedArray:

A = myarray(S)
indexes = myindexes(S)  # only necessary if this process should handle just part of S

Assuming S is a SharedArray, A is a regular Array, and it's the whole thing (not just a "local" chunk, because we're talking shared memory here). If you're setting up S to be a pre-allocated output, the second line tells your code which indexes in S the current process is responsible for setting. Just iterate over those indexes, setting the corresponding values of A, and trust that other processes are handling the rest. That's the full extent of work you need to do to enable your algorithm to run in parallel. Moreover, myarray and myindexes do the obvious thing for any other AbstractArray type, so you can write your code just once for either single-threaded or parallel execution.

For quickly-running operations, you get some performance benefit by bypassing the ordinary synchronization methods, like so:

doneflag = sharedsync()
pcall_bw(func, doneflag, arg1, arg2, ...)

where bw stands for busy-wait. For operations longer than a few hundred milliseconds, you won't notice any real advantage from using the busy-wait syntax.

The bottom line: my hope is that for many purposes this should be a reasonable substitute for SMP multithreading, without most of the dangers. See also #1802.

@ViralBShah (Member)

This is certainly the way to go until we do real multi-threading. It would be great to work in some of pfork style capabilities, since that has the potential to save quite a bit of memory for parallel usage.

@JeffBezanson (Member)

I guess what some people call "real" multithreading I call "incredibly unreliable and difficult" multithreading.

@amitmurthy (Contributor)

IMHO, we should support "incredibly unreliable and difficult" multithreading at some point, only because "performance" is one of the main drivers of Julia. And we should provide all the tools we can to people who can deal with the said complexity of multithreading, and want to squeeze out every bit of performance.

If I were to design a large scale system for reliability, I am fully with you that a message passing model is the way to go. But since a) we benchmark ourselves with folks building high performance stuff in C and b) CPUs are cramming more and more cores, we should provide a means for developers to leverage those cores without copying copious amounts of data between processes.

An alternative, though, may be a model wherein the API is still message passing but

a) the runtime itself is multi-threaded and addprocs creates threads instead of external processes. The abstraction is still one of "julia processes", but they run as OS threads internally
b) the runtime internally implements a copy-on-write model for data sent/received between threads (julia processes)

@ViralBShah (Member)

bump

@timholy (Member, Author) commented Nov 20, 2013

To make it work across platforms, there are two issues:

  • We need an implementation of ramdisk() on other platforms. Do Macs and Windows machines have such a filesystem set up by default? (On Linux, creating one requires superuser privileges, so if /dev/shm weren't already available this might be a problem.)
  • For proper cleanup on Windows, we'd need something like unlink().

Perhaps both of those functions should be moved to file.jl.

Those are the only problems that I know of (aside from writing documentation, etc). But perhaps others might have objections that run more deeply? If so, this might be a good time to voice them.

@timholy (Member, Author) commented Nov 20, 2013

I should add that some googling has not turned up an OSX variant of /dev/shm. It's possible to create one but I bet it requires elevated privileges (I have no idea, though). Thoughts?

@ViralBShah (Member)

That would be difficult on a mac. I am sure it requires superuser, and more so, those are pretty intrusive changes for julia to make on someone's system.

@timholy (Member, Author) commented Nov 20, 2013

Right. It would have to be something we'd do once, when Julia is installed (presumably by making the changes to /etc/fstab). If that's not something we're willing/able to do, then this approach is in trouble on OSX (Windows actually seems less problematic, see the section titled "Native"). The rub is that other approaches have major disadvantages too---so far this seems to be the only approach anyone has proposed that allows you to share memory dynamically across processes without worrisome cleanup issues. So we may have a difficult decision to make.

@simonster (Member)

Have you looked into shm_open?

@RauliRuohonen

I haven't actually tried this, but wouldn't simply using shm_open and shm_unlink work the same way without requiring /dev/shm? As Unix domain sockets can be used to pass file descriptors around, one might try the following strategy:

  • Create a Unix domain socket on the first Julia process to be born, and inherit it on all other Julia processes when fork()ed. Or if that doesn't work, use a path in /tmp/wherever/ to store the socket. If Julia crashes, leaving these behind in the filesystem is not a problem, as the resources wasted by the "disk space leak" are insignificant and disappear at boot when /tmp is cleaned. That way nothing breaks when the user uses addprocs(n).
  • When allocating shared memory, do shm_open() and shm_unlink() atomically, i.e. make sure there is no crash between them, and return the file descriptor, which can be mmap():ed, ftruncate():ed and passed around at will.
  • Rather than only having SharedArray with prearranged way to "cut" it across processes, allow the user to decide how to do it. I'm thinking of something like Python's multiprocessing module in generality/scope. The current simple SharedArray can easily be implemented on top of something like that.
  • Ideally (this extra is not simple by any stretch): mmap() the shared memory to the same addresses (on 64-bit machines there's plenty of address space to waste) automatically across Julia processes, allowing for intra-SHM pointers to be stored in Julia-GC-managed SHM. You could copy pointer-riddled stuff there by e.g. "data = shared_deepcopy(data)". This would allow for all the benefits of threads with a well-defined sandbox for the shared parts. Not much different from using threads with everything defaulting to thread-local unless explicitly declared shared, except less likely to break ("thread locality" is enforced by the MMU, and C libs see single thread only).

Essentially, I'm trying to say that a nicer equivalent to Python's multiprocessing with some convenience tools on top looks doable without mucking with Julia's internals, and that's already a 80% solution. With proper internals-mucking, you might later be able to do "better threads than 'real threads'" with the same basic approach, which should be a 100% solution (there's always ccall if one really absolutely must have 'real' threads).
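The create-then-unlink pattern described above can be sketched in C. This is a hedged illustration, not code from any Julia branch (the helper name shm_create_unlinked is made up): the segment is unlinked immediately after creation, so a crash cannot leave a named segment behind, yet the file descriptor keeps the memory alive and can be ftruncate()d, mmap()ed, and passed around.

```c
/* Hedged sketch of the shm_open()/shm_unlink() pattern (illustrative;
 * shm_create_unlinked is a made-up helper name, not the branch's code). */
#include <fcntl.h>     /* O_CREAT, O_EXCL, O_RDWR */
#include <sys/mman.h>  /* shm_open, shm_unlink, mmap */
#include <unistd.h>    /* ftruncate, close */

/* Returns an fd backing `size` bytes of shared memory, or -1 on error.
 * After shm_unlink() the segment has no name: it lives exactly as long
 * as some process holds the fd or a mapping of it. May require -lrt on
 * older Linux systems. */
int shm_create_unlinked(const char *name, size_t size) {
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd == -1) return -1;
    shm_unlink(name);              /* no filesystem residue from here on */
    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would then mmap() the descriptor with MAP_SHARED and share the fd with other processes, either through fork() inheritance or over a Unix domain socket.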

@ViralBShah (Member)

I believe that is what @amitmurthy is doing in the PTools.jl package.

@timholy (Member, Author) commented Nov 21, 2013

This is great feedback. I didn't even know about shm_unlink, it sounds like it might indeed solve the cleanup problem.

I've been mulling this over. I like playing soccer, and can often dribble around several opponents, but sometimes you realize it's time to pass the ball to a teammate. This is such a time. There are quite a few people who know much more about this particular topic than I do, and in any event I have weeks worth of work (which will stretch to months when coupled with real life) to do in HDF5, Images, and ImageView. Anyone who wants to see this feature in 0.3 is invited to have a go. If this code is a useful starting point, it's already in the teh/sharedarrays branch; if the teh prefix concerns anyone, I'm happy to rename it to just sharedarrays so that there's no concern that I might feel any degree of ownership.

@ViralBShah (Member)

The renaming of the branch is certainly not necessary!

@amitmurthy (Contributor)

@timholy , my modified branch using shm_open and shm_unlink is at https://github.com/amitmurthy/julia/tree/amitm/sharedarrays .

  • I removed your checks for mapping the segment read-only if it existed, since the segment's lifetime is expected to be the current run. The name of the segment also includes the OS pid of process 1. As per the man pages, once a segment has been unlinked, the next time the same name is used (with O_CREAT) it is essentially a new, different segment, so we should not have problems having multiple SharedArrays in the same julia session.
  • test/sharedarrays passes fine
  • I guess this approach is now good enough for both Linux and OSX

Please do check it out and merge into teh/sharedarrays if you are OK with it.

@RauliRuohonen,
I liked your suggestion of passing shmem fds via unix domain sockets and tried it out too - it works. But I think doing that is overkill for SharedArrays. With Tim's model, the only time we may be left with dangling segments is if julia crashes during creation, while the segment is being mapped across all nodes. This would not be expected to be a common occurrence.

@JeffBezanson ,
Does it make sense for DArrays to automatically use shmem when all the specified workers are on the same host? At least on OSX/Linux. Where shmem is not possible (e.g. if one of the processes is on a different host, or on Windows) we use the current implementation.
We can also have a "safe" mode in DArray (the default) where all reads/writes to indexes outside of myindexes are necessarily done by talking to the process owning that index, and a non-safe mode where we allow direct updates to the segment.
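The fd-passing mechanism @RauliRuohonen suggested and @amitmurthy tried out can be sketched with SCM_RIGHTS ancillary data. This is an illustration under standard POSIX assumptions, not code from any branch (send_fd/recv_fd are made-up helper names): the kernel duplicates the descriptor into the receiving process, so an unlinked shm segment can be shared even with processes that are not fork() children.

```c
/* Hedged sketch of passing a shared-memory fd between processes over a
 * Unix domain socket with SCM_RIGHTS (send_fd/recv_fd are illustrative
 * names, not code from the branch). */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one fd over `sock`; returns 0 on success, -1 on error. */
int send_fd(int sock, int fd) {
    char data = 'F';                          /* must send at least one byte */
    struct iovec io = { .iov_base = &data, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    memset(&u, 0, sizeof u);
    struct msghdr msg = { .msg_iov = &io, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof u.buf };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type  = SCM_RIGHTS;               /* "this message carries fds" */
    c->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from `sock`; returns the new fd, or -1 on error. */
int recv_fd(int sock) {
    char data;
    struct iovec io = { .iov_base = &data, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    memset(&u, 0, sizeof u);
    struct msghdr msg = { .msg_iov = &io, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof u.buf };
    if (recvmsg(sock, &msg, 0) != 1) return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_type != SCM_RIGHTS) return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));   /* kernel-duplicated descriptor */
    return fd;
}
```

The received descriptor is a fresh fd in the receiver referring to the same open file description, which is exactly what makes the create-then-unlink shm segment usable across unrelated processes.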

@timholy (Member, Author) commented Nov 22, 2013

Amit, thanks so much for picking this ball up! As far as I can tell your changes look fine, so I merged it.

@timholy (Member, Author) commented Nov 25, 2013

@RauliRuohonen: since you know Python's multiprocessing, and also the things about it you don't like and would like to change, you're probably the best one to tackle these issues. (If you are new to Julia/github development, see the CONTRIBUTING.md file.)

Otherwise, from my perspective the only thing missing is Windows support. In the absence of a restructuring from @RauliRuohonen, and given Amit's "seal of approval" of the overall design, I may be able to at least take a stab at implementing that over the next couple of weeks.

@amitmurthy (Contributor)

@timholy , I am implementing this idea of yours in DArray itself. Give me a couple of days...

@timholy (Member, Author) commented Nov 25, 2013

@amitmurthy, of course!

@timholy mentioned this pull request Nov 30, 2013
@ViralBShah (Member)

@timholy Should we close this in light of @amitmurthy 's new PR on the topic?
