
Give each thread its own mt_state #395

Closed
wants to merge 1 commit into from

Conversation

orbitfold
Contributor

This should fix issue #393; comments are very welcome.

@unoebauer
Contributor

@ssim, @orbitfold, @wkerzendorf I vaguely remember that we had a related discussion about RNGs in shared-memory parallel MCRT calculations back when @mklauser first implemented an OMP Tardis version. Unfortunately, I can't quite remember the conclusions drawn from this discussion. @ssim, do you remember?

@ssim
Contributor

ssim commented Aug 24, 2015

This is correct. I think the substance of the discussion we had before was, in general, about the need for thread-private quantities in any shared-memory parallelisation. At that stage I think no one had yet checked through the code to see which variables needed to be thread-private, and this one, the parameters of the random number generator, is clearly an example where it is needed. All the estimators are certainly another.
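For concreteness, here is a minimal OpenMP sketch of the estimator point; shell_of, energy_of, and the estimator layout are invented for illustration and are not the TARDIS code. Several packets (and hence threads) can contribute to the same shell, so the shared update must either be atomic or go into thread-private copies that are merged after the loop.

```c
#include <omp.h>

/* Hypothetical stand-ins, not the TARDIS API: */
static long shell_of(long packet)    { return packet % 20; }
static double energy_of(long packet) { return 1.0 / (double)(packet + 1); }

void accumulate_estimators(double *j_estimator, long n_packets)
{
  #pragma omp parallel for
  for (long p = 0; p < n_packets; ++p) {
    /* Different iterations can hit the same shell, so the increment
     * of the shared estimator entry must be atomic. */
    #pragma omp atomic
    j_estimator[shell_of(p)] += energy_of(p);
  }
}
```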

@orbitfold
Contributor Author

Here, each thread gets a separate seed and stores its own RNG state. That is, if I did everything right.
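A minimal sketch of that approach; the helper name seed_thread_states and the exact array layout are illustrative rather than taken from the patch, though rk_seed and rk_double are randomkit's actual API:

```c
#include <omp.h>
#include <stdlib.h>
#include "randomkit.h"   /* header name assumed; TARDIS vendors randomkit */

static rk_state *mt_states;   /* one RNG state per thread */

void seed_thread_states(unsigned long base_seed)
{
  int n_threads = omp_get_max_threads();
  mt_states = malloc(n_threads * sizeof(rk_state));
  for (int i = 0; i < n_threads; ++i)
    rk_seed(base_seed + i, &mt_states[i]);   /* distinct seed per thread */
}

/* Inside the parallel region each thread then draws only from its own
 * entry:  rk_double(&mt_states[omp_get_thread_num()]);  */
```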

@wkerzendorf
Member

@orbitfold any ideas why Travis does not like it?

@orbitfold
Contributor Author

No idea, but it seems to fail way before it gets to anything related to this.

@mreineck
Contributor

I think the idea in the patch is correct, but I suggest a slightly different implementation. Using an array of mt_states means that two threads write to neighbouring entries in this array, and this might lead to false sharing ("cache thrashing") if the entries lie on the same cache line.
How about removing the global variable mt_state completely, creating a thread-local mt_state in montecarlo_main_loop() in the scope opened by #pragma omp parallel, and passing a pointer to it to the functions that need it? It makes the interfaces look less nice, but getting rid of globals might be worth it.
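A minimal sketch of that alternative, assuming randomkit's rk_seed/rk_double API; montecarlo_one_packet stands in for the real inner routines, and the real montecarlo_main_loop() signature differs:

```c
#include <omp.h>
#include "randomkit.h"   /* header name assumed */

/* Inner routines receive the RNG state explicitly instead of reading
 * a global (illustrative name, placeholder body): */
static double montecarlo_one_packet(rk_state *mt_state)
{
  return rk_double(mt_state);
}

void montecarlo_main_loop(long n_packets, unsigned long base_seed)
{
  #pragma omp parallel
  {
    /* One state per thread, on that thread's own stack: no shared
     * array, so no two states can share a cache line. */
    rk_state mt_state;
    rk_seed(base_seed + omp_get_thread_num(), &mt_state);

    #pragma omp for
    for (long p = 0; p < n_packets; ++p)
      (void)montecarlo_one_packet(&mt_state);
  }
}
```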

@mreineck
Contributor

Apart from mt_state I think the only global data structure that's written by the C routines is storage, and there I saw only one place in the code which is problematic in OpenMP mode. This is near the end of calculate_chi_bf(). Otherwise things should be safe.
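One common way to guard such a write is an OpenMP critical section. This is a hedged sketch of the pattern only; the struct field and helper below are invented, since the right fix depends on what calculate_chi_bf() actually writes into storage:

```c
#include <omp.h>

/* Invented for illustration; not the real TARDIS storage layout: */
typedef struct { double last_chi_bf; } storage_model_t;

void cache_chi_bf(storage_model_t *storage, double chi_bf)
{
  /* A named critical section serialises the one problematic write. */
  #pragma omp critical (storage_chi_bf)
  {
    storage->last_chi_bf = chi_bf;
  }
}
```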

@wkerzendorf
Member

@orbitfold we are just debugging the 245 error problem:

tardis/montecarlo/tests/test_cmontecarlo.py .......
Program received signal SIGSEGV, Segmentation fault.
rk_double (state=0x0) at tardis/montecarlo/src/randomkit/rk_mt.c:163
163   if (state->pos == RK_STATE_LEN)
(gdb) bt
#0  rk_double (state=0x0) at tardis/montecarlo/src/randomkit/rk_mt.c:163
#1  0x000000010cbc4d5b in move_packet_across_shell_boundary () from /Users/wkerzend/python/tardis/tardis/montecarlo/montecarlo.so
#2  0x000000010cbc67e5 in test_move_packet_across_shell_boundary () at tardis/montecarlo/src/test_cmontecarlo.c:254
#3  0x000000010429a677 in ffi_call_unix64 ()

@mreineck
Contributor

I think the problem occurs because during the tests the mt_state pointer is never allocated and initialized (the tests don't call montecarlo_main_loop(), but call the inner routines directly). Whenever one of these routines tries to generate a random number, it dereferences an uninitialized pointer, most likely causing the segfault.
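If that is the cause, one fix is to allocate and seed the state in a test setup hook before any inner routine runs. A sketch under the assumption that this PR exposes a global mt_states pointer (names illustrative; rk_seed is randomkit's API):

```c
#include <stdlib.h>
#include "randomkit.h"   /* header name assumed */

extern rk_state *mt_states;  /* as introduced by this PR (shape may differ) */

/* Call once before the C-level tests exercise the inner routines, so
 * rk_double() never dereferences a NULL state. */
void init_rng_for_tests(void)
{
  mt_states = malloc(sizeof(rk_state));
  rk_seed(1963, mt_states);  /* fixed seed keeps the tests deterministic */
}
```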

@orbitfold
Contributor Author

@wkerzendorf I'll look into that this evening.

@wkerzendorf
Member

@orbitfold @mreineck this 245 error crops up repeatedly in other tests as well. I suspect the cmontecarlo testing facility, but am not sure.

@mreineck
Contributor

On the net I found a plausible explanation that the 245 indicates a segmentation fault: SIGSEGV is signal 11, segfaults are typically signalled by a return value of -11, and if that is converted to an unsigned 8-bit value (which happens somewhere between the crash and the return code at the shell prompt, since exit statuses keep only their low byte), you get 256 - 11 = 245.
In the present case I'm quite sure the problem is caused by the undefined mt_state pointer; in other situations it's most likely that something else goes wrong in a C component of the code.
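The 8-bit truncation is easy to reproduce with a standalone demo:

```c
#include <stdlib.h>

/* The shell keeps only the low 8 bits of an exit status, so returning
 * -11 (the negated SIGSEGV number) shows up as 256 - 11 = 245. */
int main(void)
{
  exit(-11);   /* afterwards, `echo $?` prints 245 */
}
```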

@wkerzendorf
Member

@mreineck I can reproduce this error (currently sporadically with #399). I'm not enough of an expert to track it down; let me know if you want to work on this together.

@mreineck
Contributor

Without some localization I'm pretty much at a loss ... can you produce a stack trace, like the one a few comments above, for the #399 PR as well?

@wkerzendorf
Member

It only happens rarely:

tardis/montecarlo/tests/test_cmontecarlo.py ...........
Program received signal SIGSEGV, Segmentation fault.
0x00000001001026cc in visit_decref () from /Users/wkerzend/anaconda3/envs/tardis-devel/lib/libpython2.7.dylib
(gdb) bt
#0  0x00000001001026cc in visit_decref () from /Users/wkerzend/anaconda3/envs/tardis-devel/lib/libpython2.7.dylib
#1  0x0000000100042bc3 in list_traverse () from /Users/wkerzend/anaconda3/envs/tardis-devel/lib/libpython2.7.dylib
#2  0x0000000100103227 in collect () from /Users/wkerzend/anaconda3/envs/tardis-devel/lib/libpython2.7.dylib
#3  0x0000000100103e5d in _PyObject_GC_Malloc () from /Users/wkerzend/anaconda3/envs/tardis-devel/lib/libpython2.7.dylib
#4  0x0000000100103ed2 in _PyObject_GC_NewVar () from /Users/wkerzend/anaconda3/envs/tardis-devel/lib/libpython2.7.dylib

@mreineck
Contributor

Hmm, this trace is entirely within the Python libraries themselves and gives no clue as to where the problem actually occurs in user code. Sorry, I don't have an idea at the moment ...
