Give each thread its own mt_state #395
Conversation
@ssim, @orbitfold, @wkerzendorf I vaguely remember that we had a related discussion about RNGs in shared-memory parallel MCRT calculations back when @mklauser first implemented an OMP Tardis version. Unfortunately, I can't quite remember the conclusions drawn from this discussion. @ssim, do you remember?
This is correct - I think the substance of the discussion we had before was in general about the need for thread-private quantities in any shared-memory parallelisation. At that stage I think no one had yet checked through to see which variables needed to be thread private - and this - the parameters for the random number generator - is clearly one example where this is needed. All the estimators are certainly another.
Here, each thread will get a separate seed and store its own RNG state. That is, if I did everything right.
@orbitfold any ideas why Travis does not like it?
No idea, but it seems to fail way before it gets to anything related to this. |
I think the idea in the patch is correct, but I suggest a slightly different implementation. Using an array of mt_states means that two threads are writing to neighboring entries in this array, and this might lead to cache thrashing if they lie on the same cache line. |
Apart from mt_state I think the only global data structure that's written by the C routines is storage, and there I saw only one place in the code which is problematic in OpenMP mode. This is near the end of calculate_chi_bf(). Otherwise things should be safe. |
@orbitfold we are just debugging the 245 error problem: tardis/montecarlo/tests/test_cmontecarlo.py .......
Program received signal SIGSEGV, Segmentation fault.
rk_double (state=0x0) at tardis/montecarlo/src/randomkit/rk_mt.c:163
163 if (state->pos == RK_STATE_LEN)
(gdb) bt
#0 rk_double (state=0x0) at tardis/montecarlo/src/randomkit/rk_mt.c:163
#1 0x000000010cbc4d5b in move_packet_across_shell_boundary () from /Users/wkerzend/python/tardis/tardis/montecarlo/montecarlo.so
#2 0x000000010cbc67e5 in test_move_packet_across_shell_boundary () at tardis/montecarlo/src/test_cmontecarlo.c:254
#3 0x000000010429a677 in ffi_call_unix64 ()
I think the problem occurs because during the tests, the mt_state pointer is not allocated and initialized (since the tests don't call montecarlo_main_loop() but call the inner routines directly instead). Whenever one of these routines tries to generate a random number, there is a dereference of an uninitialized pointer, and most likely a segfault.
@wkerzendorf I'll look into that this evening.
@orbitfold @mreineck this 245 error crops up repeatedly in other runs as well. I suspect the cmontecarlo testing facility, but am not sure.
On the net I found a theory that the 245 indicates a segmentation fault. Typically, segfaults are signalled by returning -11, and if that is converted into an unsigned 8-bit value (which happens somewhere between the problem and the return code at the shell prompt), you get 256-11=245. |
Without some localization I'm pretty much at a loss ... can you produce a stack trace like the one some comments above for the #399 PR as well? |
it only happens rarely:
Hmm, this trace is completely within the Python libraries themselves and gives no clue to where the problem actually occurs in user code. Sorry, I don't have an idea at the moment ... |
This should fix issue #393; comments are very welcome.