
WIP: Multithreading Experiment #6741

Closed
wants to merge 35 commits into from
Closed

Conversation

tknopp
Copy link
Contributor

@tknopp tknopp commented May 4, 2014

This is an experiment on running Julia code in multiple threads (see #1790). While the code is certainly pretty broken, I would still like to get feedback on whether the approach might be feasible.

The idea is to first precompile the function that is supposed to be run in the thread. To this end I use jl_get_specialization, which seems to do what I want. The hope is that the interaction with the Julia environment is minimal after precompilation, so that race conditions are less likely to happen.

I have implemented two functions, parapply and parapply_jl, that take a function that is supposed to be run several times, with the last argument being an index that covers a predefined range. While parapply implements the scheduling in C, parapply_jl implements it in Julia and uses a direct thread API that is exposed to Julia. For the examples I have tried, parapply works while parapply_jl crashes sometimes.

While this is a pretty naive solution, the real work is now to determine the places where race conditions can occur. Any help on this would be really appreciated.

This was developed on OSX with clang. For gcc one needs different compiler flags to enable C++11 threads.

@JeffBezanson
Copy link
Member

Interesting. jl_get_specialization does move compilation earlier, but it doesn't really guarantee much. You'd probably still need a lock around code generation. Disabling GC also doesn't do much since there will be race conditions in allocation. In general I don't think it will be realistic to try to eliminate interaction with the runtime system; there can always be allocation or code generation. Though of course you can get lucky and have some functions that do pure computing with no allocation etc.

@tknopp
Copy link
Contributor Author

tknopp commented May 4, 2014

Thanks Jeff. Could you give me a hint as to which function performs the actual code generation? Allocation happens in the functions in gc.c (e.g. jl_gc_managed_malloc), right?

You are right that it does not seem feasible to reduce the interaction with the runtime to zero. But my impression is that in a "hot loop" a lot less interaction happens than in, for instance, Python.

And please note that I am absolutely aware that this work is pretty naive. On the other hand, one can learn a lot about the Julia internals when debugging race conditions :-)

@timholy
Copy link
Member

timholy commented May 4, 2014

@tknopp, ages ago I started treading down a similar path, implementing direct wrapping of pthread. I was able to dig up an old commit, in case it's at all useful to you. I don't remember the status I left this in, but "segfaults" comes to mind 😄.

------------------------------- src/codegen.cpp -------------------------------
index 6e606bb..a648f8f 100644
@@ -74,6 +74,7 @@
 #include <set>
 #include <cstdio>
 #include <cassert>
+#include <pthread.h>
 using namespace llvm;

 extern "C" {
@@ -199,6 +200,12 @@ static Function *jlputs_func;
 static Function *resetstkoflw_func;
 #endif

+// locking (thread-safety)
+pthread_mutex_t codegen_lock;
+static inline void init_lock(pthread_mutex_t &mutex) { pthread_mutex_init(&mutex, 0); }
+static inline void lock(pthread_mutex_t &mutex) { pthread_mutex_lock(&mutex); }
+static inline void unlock(pthread_mutex_t &mutex) { pthread_mutex_unlock(&mutex); }
+
 static void jl_rethrow_with_add(const char *fmt, ...)
 {
     if (jl_typeis(jl_exception_in_transit, jl_errorexception_type)) {
@@ -222,6 +229,8 @@ static Function *emit_function(jl_lambda_info_t *lam, bool cstyle);
 //static int n_compile=0;
 static Function *to_function(jl_lambda_info_t *li, bool cstyle)
 {
+    if (!nested_compile)
+        lock(codegen_lock);
     JL_SIGATOMIC_BEGIN();
     assert(!li->inInference);
     BasicBlock *old = nested_compile ? builder.GetInsertBlock() : NULL;
@@ -243,6 +252,8 @@ static Function *to_function(jl_lambda_info_t *li, bool cstyle)
             builder.SetCurrentDebugLocation(olddl);
         }
         JL_SIGATOMIC_END();
+   if (!last_n_c)
+            unlock(codegen_lock);
         jl_rethrow_with_add("error compiling %s", li->name->name);
     }
     assert(f != NULL);
@@ -261,6 +272,8 @@ static Function *to_function(jl_lambda_info_t *li, bool cstyle)
         builder.SetCurrentDebugLocation(olddl);
     }
     JL_SIGATOMIC_END();
+    if (!last_n_c)
+   unlock(codegen_lock);
     return f;
 }

@@ -272,7 +285,7 @@ extern "C" void jl_generate_fptr(jl_function_t *f)
     jl_lambda_info_t *li = f->linfo;
     assert(li->functionObject);
     Function *llvmf = (Function*)li->functionObject;
-    if (li->fptr == &jl_trampoline) {
+    if (li->fptr == &jl_trampoline) {  // FIXME? lock
         JL_SIGATOMIC_BEGIN();
         li->fptr = (jl_fptr_t)jl_ExecutionEngine->getPointerToFunction(llvmf);
         if (li->cFunctionObject != NULL)
@@ -336,7 +349,7 @@ void *jl_function_ptr(jl_function_t *f, jl_value_t *rt, jl_value_t *argt)
                                   li->name->name);
                     }
                 }
-                return jl_ExecutionEngine->getPointerToFunction((Function*)ff->linfo->cFunctionObject);
+                return jl_ExecutionEngine->getPointerToFunction((Function*)ff->linfo->cFunctionObject); // FIXME? lock
             }
         }
     }
@@ -3374,7 +3387,7 @@ static void init_julia_llvm_env(Module *m)

 extern "C" void jl_init_codegen(void)
 {
-    
+    init_lock(codegen_lock);
     InitializeNativeTarget();
     jl_Module = new Module("julia", jl_LLVMContext);

@tknopp
Copy link
Contributor Author

tknopp commented May 4, 2014

Thanks Tim. to_function seems to be the compiling function. And the fact that you have a nested_compile variable seems to indicate that this function can call itself. With a normal lock around the function I get a deadlock ;-)

@tknopp
Copy link
Contributor Author

tknopp commented May 4, 2014

Ok, locking to_function in the way Tim suggests helps, and the Julia implementation parapply_jl now no longer seems to crash for the test functions I have implemented.

@tknopp
Copy link
Contributor Author

tknopp commented May 5, 2014

The reason to use C++11 threads was that it should be cross-platform. But I have just seen that libuv provides cross-platform threads and mutexes, so that seems to be a better solution.

@StefanKarpinski
Copy link
Member

Vaguely related: https://groups.google.com/d/msg/julia-dev/GTzZlo4Rhrk/ezZhx2wlLnAJ. The approach here is to use a work thread that has no interaction with the Julia runtime to do C work. The tricky part is from Julia, generating a function that can get arguments from a buffer and call a C function to do the actual work, without invoking any GC or code compilation – these generated wrapper functions only use unsafe_load, pointer arithmetic, ccall, and unsafe_store!. Not really relevant to running arbitrary code in another thread, but it was an approach to doing work in a thread.

@tknopp
Copy link
Contributor Author

tknopp commented May 5, 2014

Thanks Stefan, I seem to have missed that post. Maybe I should use the uv_queue_work function to exploit a thread pool here.

I was kind of surprised that my naive approach worked, and although it is still certainly broken I think that we should proceed here. The tricky part is to determine which functions need locks and which do not. I currently don't see any crashes in my test program when using parapply, whereas parapply_jl sometimes gives wrong results.

It would also be great if someone has a function in mind in Base that would greatly benefit from multithreading. In my test I used a matrix-vector multiplication, which is however not so CPU-bound. I cheated a little by using Int128, though :-)

@tknopp
Copy link
Contributor Author

tknopp commented May 5, 2014

Ported this to libuv threads/mutex so that it does not rely on C++11. Should hopefully satisfy Travis now.

@timholy
Copy link
Member

timholy commented May 6, 2014

It would also be great if someone has a function in mind in Base that would greatly benefit from multithreading.

If nothing in Base comes immediately to mind, Images would be a great test case. imfilter, imfilter_gaussian, or restrict would all benefit a lot from multithreading.

@tknopp
Copy link
Contributor Author

tknopp commented May 6, 2014

Yes, imfilter seems great. IMHO n-d filtering is something that belongs in Base, but that is another issue (#5155, hint hint) ;-)

I am actually looking for three test cases:
a) One that is perfectly parallelizable (i.e. CPU-bound), in order to check that the overhead of the internal locks is not too high. I could take something senseless, but it would of course be nice to have something meaningful.
b) Something that makes the crowd say "yeah" (e.g. imfilter)
c) Something that still crashes...

I have made some progress with c) and have found a setting where parapply does not crash and parapply_jl still crashes. In this way I have fixed several race conditions, but I am not through yet. Locking GC_PUSH/POP was the most painful, as I had to lock the outermost pair. Still not sure if that could produce a deadlock. Once I better understand how the jl_pgcstack pointer is used, a better solution might be to have one pointer per thread.

PS: c) is of course just meant as an intermediate test case :-)

@ViralBShah
Copy link
Member

Cc: @stevengj

@tknopp
Copy link
Contributor Author

tknopp commented May 7, 2014

The nested locks I had for to_function in a former version were useless. I now use the thread id so that one thread can proceed with all its nested calls. GC_PUSH/POP is handled similarly now, but I have to leave out the main thread as it would deadlock otherwise. If someone has an idea how I can implement multiple per-thread jl_pgcstack pointers, that would be of interest.

@carnaval
Copy link
Contributor

carnaval commented May 7, 2014

I believe a thread-local gcstack would be ok. You would have to modify the GC to walk every thread's stack instead of only the current one (see the use of gcstack in gc.c).
I don't have time to look at the code unfortunately, but you would have to make sure you are not running the gc from threads other than the main thread.
Making the allocation thread-safe would be ok I guess, but the gc collection is highly non-atomic for several reasons. First, the runtime isn't prepared to see jl_values with the mark bit set. Even if it was, the collection must see a snapshot of the heap, so the "points to" relation must not change while it is running.
Pausing every other thread would be a first step, but I'm not sure how to ensure you pause them in a consistent state wrt the GC, e.g., not in the middle of building the next gcstack or something like that.
I'll be reading about your journey, please keep us posted :-)

@tknopp
Copy link
Contributor Author

tknopp commented May 7, 2014

@carnaval: The code is quite messy in its current state because I have tried something here and there. I don't have a lot of experience with garbage collectors, so I am not entirely sure I can get this done alone.

My idea for the thread-local gcstack would be to use a map of pointers:

std::map<long, jl_gcframe_t *> jl_pgcstacks;

void register_thread(long id)
{
     jl_pgcstacks[id] = NULL;
}

#define JL_GC_PUSH1(arg1)                                                  \
void *__gc_stkf[] = {(void*)3, jl_pgcstacks[uv_thread_self()], arg1};      \
jl_pgcstacks[uv_thread_self()] = (jl_gcframe_t*)__gc_stkf;

#define JL_GC_POP() (jl_pgcstacks[uv_thread_self()] = jl_pgcstacks[uv_thread_self()]->prev)

and in gc_mark_task one would then run over all pointers in the map.

How to handle garbage collection is indeed an interesting issue. In the first round I would simply disable it.

@JeffBezanson
Copy link
Member

Using a map for that is not going to work. Aside from this being very performance-sensitive, the code generator needs to be able to access this in generated code. There needs to be a pointer in the TLS to a per-thread struct that has the gc stack pointer at a known offset.

@tknopp
Copy link
Contributor Author

tknopp commented May 7, 2014

Indeed, thanks. I was not aware of TLS. It seems that libuv has support for it (https://github.com/joyent/libuv/blob/master/include/uv.h#L2202).
But I am not entirely sure how to access all the TLS pointers in the gc_mark_task routine if they are thread-local. Will think about that.

@tknopp
Copy link
Contributor Author

tknopp commented May 7, 2014

I just realized that since I disable the gc while the threads are running, I could also just let JL_GC_PUSH/POP be no-ops.

@tknopp
Copy link
Contributor Author

tknopp commented May 8, 2014

Some more findings/improvements: I added some missing locks around the alloc* functions in gc.c and made JL_GC_PUSH/POP no-ops in the threads. With this I don't get race conditions in the GC anymore.

Locking code generation seems to be quite tricky. I had a situation where one thread was in to_function while the other was in cache_method. One thing that seems to help, and made parapply_jl run without crashes, is to not only trigger code generation via jl_get_specialization before starting the threads but to actually call the function once before going into the parallel phase.

An alternative that is currently committed is to lock jl_apply_generic. Not entirely sure if the code really runs in parallel then, though. Travis seems to run into a deadlock.

@tknopp
Copy link
Contributor Author

tknopp commented May 24, 2014

I have been looking into the print issue but did not get that far. It seems there is some task switching going on when printing. I have seen a crash where the thread calls julia_write and then, a little deeper, jl_f_yieldto, and then alloc_big(sz=140730322856156).

So I really need to understand what all this task switching is about.

@cbecker
Copy link
Contributor

cbecker commented May 24, 2014

Regardless, it is good that you mention it in the 'docs'.

Wouldn't it be good to have a critical-section macro? For example, within a thread one could write

@critical println("hello")

to be converted to

lock(someMutex)
println("hello")
unlock(someMutex)

though there is an issue: where to initialize the mutex. Does Julia have something like static variables?

Something probably equally or more important would be to have something like @atomic:

@atomic a += 2

I assume LLVM provides a way to indicate that.

I am thinking of the OpenMP constructs I frequently use that make a difference when working with multithreading.

@timholy
Copy link
Member

timholy commented May 24, 2014

Another approach I've used in C is to implement some kind of message queue, and do all printing from the main thread.

@tknopp
Copy link
Contributor Author

tknopp commented May 24, 2014

I have already tried locking println, but this is not the issue here. It's a task switch that is absolutely not thread-safe.

I am quite new to coroutines so while I understand the concept it is not entirely clear when task switches can happen in Julia code.

So I am unsure which approach to take:

  1. Prevent task switches in threaded code (is this feasible?)
  2. Make the complete task machinery thread-aware using TLS. This would mean that each thread has its own tasks. But I wonder if option 2 is not too much overhead.

Maybe someone with more experience with tasks (@vtjnash) has an opinion about that.

@Keno
Copy link
Member

Keno commented May 24, 2014

You're also not allowed to do I/O from anything but the main thread because the uv interface is not thread safe.

@stevengj
Copy link
Member

Note that uv_async_send is already thread-safe, and is basically the only thread-safe function in libuv as I understand it. This is the documented way to wake up tasks in the main thread from other threads.

@tknopp
Copy link
Contributor Author

tknopp commented May 24, 2014

Ah, thanks @loladiro. I had not yet tried to do I/O in a thread, so I have not seen that. This kind of restricts us to computational things, which is probably ok. My dream was to also use threads when doing GUI programming, where one usually does I/O in threads to keep the GUI responsive. @async kind of helps here, but I still had problems with a freezing UI when programming with Gtk.jl.

@vtjnash
Copy link
Member

vtjnash commented May 24, 2014

You might try turning off COPY_STACK in task.c

Since libuv gained thread pools, I thought it had become (more) thread-aware.

@tknopp
Copy link
Contributor Author

tknopp commented May 24, 2014

task.c does not compile when I undef COPY_STACK

@tknopp
Copy link
Contributor Author

tknopp commented May 24, 2014

Ok, I got println in threads working, although I am not 100% sure this is the best fix. What I have done is to prevent a task switch when yieldto is called from anything but the main thread.

I have further fixed some locks that should now make it possible to run (raw) threads interactively from the REPL (gc has to be disabled!). With that I was able to spawn a thread that periodically incremented an array value and watch this in action from the REPL.

@ViralBShah ViralBShah added this to the 1.0 milestone May 25, 2014
@tknopp
Copy link
Contributor Author

tknopp commented May 27, 2014

I improved the exception handling a little. Before, only an unspecific exception was rethrown in the main thread (i.e. a flag: there was an exception in a thread). Now the actual exception is thrown. TLS is very useful for this stuff.

@cbecker
Copy link
Contributor

cbecker commented May 28, 2014

@tknopp Sounds great, I will give it a try later this week or next week, will keep you posted.

@tknopp
Copy link
Contributor Author

tknopp commented Aug 7, 2014

I am closing this, as the threads branch is several steps ahead of this one. @JeffBezanson: Is threads the main threading branch now? I have seen two others and am not entirely sure in which one the development takes place.

@tknopp
Copy link
Contributor Author

tknopp commented Aug 7, 2014

ah wrong button

@tknopp tknopp closed this Aug 7, 2014
@StefanKarpinski
Copy link
Member

Yes, threads is the branch now.

Labels
performance Must go faster