
Give each thread its own mt_state #395

Closed
wants to merge 1 commit into from

Conversation

orbitfold
Contributor

This should fix issue #393; comments are very welcome.

@unoebauer
Contributor

@ssim, @orbitfold, @wkerzendorf I vaguely remember that we had a related discussion about RNGs in shared-memory parallel MCRT calculations back when @mklauser first implemented an OMP Tardis version. Unfortunately, I can't quite remember the conclusions drawn from this discussion. @ssim, do you remember?

@ssim
Contributor

ssim commented Aug 24, 2015

This is correct. I think the substance of the discussion we had before was, in general, about the need for thread-private quantities in any shared-memory parallelisation. At that stage I think no one had yet checked through the code to see which variables needed to be thread-private, and this one, the parameters of the random number generator, is clearly an example where it is needed. All the estimators are certainly another.
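For concreteness, here is a minimal OpenMP sketch of the estimator point; shell_of, energy_of, and the estimator layout are invented for illustration and are not the TARDIS code. Several packets (and hence threads) can contribute to the same shell, so the shared update must either be atomic or go into thread-private copies that are merged after the loop.

```c
#include <omp.h>

/* Hypothetical stand-ins, not the TARDIS API: */
static long shell_of(long packet)    { return packet % 20; }
static double energy_of(long packet) { return 1.0 / (double)(packet + 1); }

void accumulate_estimators(double *j_estimator, long n_packets)
{
  #pragma omp parallel for
  for (long p = 0; p < n_packets; ++p) {
    /* Different iterations can hit the same shell, so the increment
     * of the shared estimator entry must be atomic. */
    #pragma omp atomic
    j_estimator[shell_of(p)] += energy_of(p);
  }
}
```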

@orbitfold
Contributor Author

Here, each thread gets a separate seed and stores its own RNG state. That is, if I did everything right.
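A minimal sketch of that approach; the helper name seed_thread_states and the exact array layout are illustrative rather than taken from the patch, though rk_seed and rk_double are randomkit's actual API:

```c
#include <omp.h>
#include <stdlib.h>
#include "randomkit.h"   /* header name assumed; TARDIS vendors randomkit */

static rk_state *mt_states;   /* one RNG state per thread */

void seed_thread_states(unsigned long base_seed)
{
  int n_threads = omp_get_max_threads();
  mt_states = malloc(n_threads * sizeof(rk_state));
  for (int i = 0; i < n_threads; ++i)
    rk_seed(base_seed + i, &mt_states[i]);   /* distinct seed per thread */
}

/* Inside the parallel region each thread then draws only from its own
 * entry:  rk_double(&mt_states[omp_get_thread_num()]);  */
```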

@wkerzendorf
Member

@orbitfold any ideas why Travis does not like it?

@orbitfold
Contributor Author

No idea, but it seems to fail way before it gets to anything related to this.

@mreineck
Contributor

I think the idea in the patch is correct, but I suggest a slightly different implementation. Using an array of mt_states means that two threads write to neighbouring entries in this array, and this might lead to false sharing ("cache thrashing") if the entries lie on the same cache line.
How about removing the global variable mt_state completely, creating a thread-local mt_state in montecarlo_main_loop() in the scope opened by #pragma omp parallel, and passing a pointer to it to the functions that need it? It makes the interfaces look less nice, but getting rid of globals might be worth it.
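A minimal sketch of that alternative, assuming randomkit's rk_seed/rk_double API; montecarlo_one_packet stands in for the real inner routines, and the real montecarlo_main_loop() signature differs:

```c
#include <omp.h>
#include "randomkit.h"   /* header name assumed */

/* Inner routines receive the RNG state explicitly instead of reading
 * a global (illustrative name, placeholder body): */
static double montecarlo_one_packet(rk_state *mt_state)
{
  return rk_double(mt_state);
}

void montecarlo_main_loop(long n_packets, unsigned long base_seed)
{
  #pragma omp parallel
  {
    /* One state per thread, on that thread's own stack: no shared
     * array, so no two states can share a cache line. */
    rk_state mt_state;
    rk_seed(base_seed + omp_get_thread_num(), &mt_state);

    #pragma omp for
    for (long p = 0; p < n_packets; ++p)
      (void)montecarlo_one_packet(&mt_state);
  }
}
```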

@mreineck
Contributor

Apart from mt_state I think the only global data structure that's written by the C routines is storage, and there I saw only one place in the code which is problematic in OpenMP mode. This is near the end of calculate_chi_bf(). Otherwise things should be safe.
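One common way to guard such a write is an OpenMP critical section. This is a hedged sketch of the pattern only; the struct field and helper below are invented, since the right fix depends on what calculate_chi_bf() actually writes into storage:

```c
#include <omp.h>

/* Invented for illustration; not the real TARDIS storage layout: */
typedef struct { double last_chi_bf; } storage_model_t;

void cache_chi_bf(storage_model_t *storage, double chi_bf)
{
  /* A named critical section serialises the one problematic write. */
  #pragma omp critical (storage_chi_bf)
  {
    storage->last_chi_bf = chi_bf;
  }
}
```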

@wkerzendorf
Member

@orbitfold we are just debugging the 245 error problem:

tardis/montecarlo/tests/test_cmontecarlo.py .......
Program received signal SIGSEGV, Segmentation fault.
rk_double (state=0x0) at tardis/montecarlo/src/randomkit/rk_mt.c:163
163   if (state->pos == RK_STATE_LEN)
(gdb) bt
#0  rk_double (state=0x0) at tardis/montecarlo/src/randomkit/rk_mt.c:163
#1  0x000000010cbc4d5b in move_packet_across_shell_boundary () from /Users/wkerzend/python/tardis/tardis/montecarlo/montecarlo.so
#2  0x000000010cbc67e5 in test_move_packet_across_shell_boundary () at tardis/montecarlo/src/test_cmontecarlo.c:254
#3  0x000000010429a677 in ffi_call_unix64 ()

@mreineck
Contributor

I think the problem occurs because during the tests the mt_state pointer is never allocated and initialized (the tests don't call montecarlo_main_loop(), but call the inner routines directly). Whenever one of these routines tries to generate a random number, it dereferences an uninitialized pointer, most likely causing the segfault.
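If that is the cause, one fix is to allocate and seed the state in a test setup hook before any inner routine runs. A sketch under the assumption that this PR exposes a global mt_states pointer (names illustrative; rk_seed is randomkit's API):

```c
#include <stdlib.h>
#include "randomkit.h"   /* header name assumed */

extern rk_state *mt_states;  /* as introduced by this PR (shape may differ) */

/* Call once before the C-level tests exercise the inner routines, so
 * rk_double() never dereferences a NULL state. */
void init_rng_for_tests(void)
{
  mt_states = malloc(sizeof(rk_state));
  rk_seed(1963, mt_states);  /* fixed seed keeps the tests deterministic */
}
```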

@orbitfold
Contributor Author

@wkerzendorf I'll look into that this evening.

@wkerzendorf
Member

@orbitfold @mreineck this 245 error crops up repeatedly in other tests as well. I suspect the cmontecarlo testing facility, but am not sure.

@mreineck
Contributor

On the net I found a plausible explanation that the 245 indicates a segmentation fault: SIGSEGV is signal 11, segfaults are typically signalled by a return value of -11, and if that is converted to an unsigned 8-bit value (which happens somewhere between the crash and the return code at the shell prompt, since exit statuses keep only their low byte), you get 256 - 11 = 245.
In the present case I'm quite sure the problem is caused by the undefined mt_state pointer; in other situations it's most likely that something else goes wrong in a C component of the code.
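The 8-bit truncation is easy to reproduce with a standalone demo:

```c
#include <stdlib.h>

/* The shell keeps only the low 8 bits of an exit status, so returning
 * -11 (the negated SIGSEGV number) shows up as 256 - 11 = 245. */
int main(void)
{
  exit(-11);   /* afterwards, `echo $?` prints 245 */
}
```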

@wkerzendorf
Member

@mreineck I can reproduce this error (currently sporadically with #399). I'm not enough of an expert to track it down; let me know if you want to work on this together.

@mreineck
Contributor

Without some localization I'm pretty much at a loss ... can you produce a stack trace, like the one a few comments above, for the #399 PR as well?

@wkerzendorf
Member

It only happens rarely:

tardis/montecarlo/tests/test_cmontecarlo.py ...........
Program received signal SIGSEGV, Segmentation fault.
0x00000001001026cc in visit_decref () from /Users/wkerzend/anaconda3/envs/tardis-devel/lib/libpython2.7.dylib
(gdb) bt
#0  0x00000001001026cc in visit_decref () from /Users/wkerzend/anaconda3/envs/tardis-devel/lib/libpython2.7.dylib
#1  0x0000000100042bc3 in list_traverse () from /Users/wkerzend/anaconda3/envs/tardis-devel/lib/libpython2.7.dylib
#2  0x0000000100103227 in collect () from /Users/wkerzend/anaconda3/envs/tardis-devel/lib/libpython2.7.dylib
#3  0x0000000100103e5d in _PyObject_GC_Malloc () from /Users/wkerzend/anaconda3/envs/tardis-devel/lib/libpython2.7.dylib
#4  0x0000000100103ed2 in _PyObject_GC_NewVar () from /Users/wkerzend/anaconda3/envs/tardis-devel/lib/libpython2.7.dylib

@mreineck
Contributor

Hmm, this trace is entirely within the Python libraries themselves and gives no clue as to where the problem actually occurs in user code. Sorry, I don't have an idea at the moment ...
