segmentation fault in susceptibility object for large parallel simulations #1082

Closed
oskooi opened this issue Dec 21, 2019 · 0 comments · Fixed by #1083

oskooi commented Dec 21, 2019

There seems to be a segmentation fault in the susceptibility object that occurs only for large parallel simulations. The crash occurs even when the machine still has unused memory, which may be related to what has already been reported in #1001.

For example, running the OLED simulation in debug mode (Meep compiled with --enable-debug) at large resolutions (>100) causes a segmentation fault. The run used 8 cores with hyperthreading disabled and OpenMPI on a single c5.4xlarge instance on AWS EC2, and produced the following error message:

[ip-172-31-50-172:18968] *** Process received signal ***
[ip-172-31-50-172:18969] *** Process received signal ***
[ip-172-31-50-172:18969] Signal: Segmentation fault (11)
[ip-172-31-50-172:18969] Signal code: Address not mapped (1)
[ip-172-31-50-172:18969] Failing at address: (nil)
[ip-172-31-50-172:18968] Signal: Segmentation fault (11)
[ip-172-31-50-172:18968] Signal code: Address not mapped (1)
[ip-172-31-50-172:18968] Failing at address: (nil)
[ip-172-31-50-172:18970] *** Process received signal ***
[ip-172-31-50-172:18970] Signal: Segmentation fault (11)
[ip-172-31-50-172:18970] Signal code: Address not mapped (1)
[ip-172-31-50-172:18970] Failing at address: (nil)
[ip-172-31-50-172:18970] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f00e2e88390]
[ip-172-31-50-172:18970] [ 1] /usr/local/lib/libmeep.so.17(_ZNK4meep25lorentzian_susceptibility17new_internal_dataEPA2_PdRKNS_11grid_volumeE+0xe7)[0x7f00e141bccb]
[ip-172-31-50-172:18970] [ 2] /usr/local/lib/libmeep.so.17(_ZN4meep12fields_chunk11update_polsENS_10field_typeE+0x1ce)[0x7f00e142afae]
[ip-172-31-50-172:18970] [ 3] /usr/local/lib/libmeep.so.17(_ZN4meep6fields11update_polsENS_10field_typeE+0x6f)[0x7f00e142adc7]
[ip-172-31-50-172:18970] [ 4] /usr/local/lib/libmeep.so.17(_ZN4meep6fields4stepEv+0x3b3)[0x7f00e1401465]
[ip-172-31-50-172:18970] [ 5] /usr/local/lib/python3.5/site-packages/meep/_meep.so(+0x1a3b4e)[0x7f00e1844b4e]

Note frame [ 1], whose mangled symbol names lorentzian_susceptibility::new_internal_data.
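The mangled symbol in frame [ 1] can be decoded to confirm which function is involved. The following is a minimal sketch, assuming a GCC or Clang toolchain (both provide `abi::__cxa_demangle` in `<cxxabi.h>`). The helper name `demangle` is hypothetical and not part of Meep.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Hypothetical helper (not part of Meep): demangle an Itanium C++ ABI symbol.
// Returns the input unchanged if demangling fails.
std::string demangle(const char *mangled) {
  assert(mangled != nullptr);
  int status = 0;
  char *out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status != 0 || out == nullptr) return std::string(mangled);
  std::string result(out);
  std::free(out); // the buffer returned by __cxa_demangle is malloc-allocated
  return result;
}
```

Passing the symbol from frame [ 1] (the text between the parentheses, up to but not including `+0xe7`) should give a name containing `meep::lorentzian_susceptibility::new_internal_data`, the same function reported by gdb.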

To investigate further, running under gdb gives a stack trace showing that the crash occurs at src/susceptibility.cpp:116, shown below.

run command

$ mpirun -n 8 xterm -hold -e gdb -ex run --args python3.5 oled.py

stack trace

Thread 1 "python3.5" received signal SIGSEGV, Segmentation fault.
0x00007ffff615eccb in meep::lorentzian_susceptibility::new_internal_data (
    this=0x133eb30, W=0x1319d10, gv=...) at susceptibility.cpp:116
116       d->sz_data = sz;

meep/src/susceptibility.cpp

Lines 106 to 118 in a3b92bb

// for Lorentzian susc. the internal data is just a backup of P from
// the previous timestep.
void *lorentzian_susceptibility::new_internal_data(realnum *W[NUM_FIELD_COMPONENTS][2],
                                                   const grid_volume &gv) const {
  int num = 0;
  FOR_COMPONENTS(c) DOCMP2 {
    if (needs_P(c, cmp, W)) num += 2 * gv.ntot();
  }
  size_t sz = sizeof(lorentzian_data) + sizeof(realnum) * (num - 1);
  lorentzian_data *d = (lorentzian_data *)malloc(sz);
  d->sz_data = sz;
  return (void *)d;
}

No segmentation fault occurs when the resolution is small (i.e., <50).
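The failing address `(nil)` and the faulting store `d->sz_data = sz` are consistent with `malloc` returning a null pointer that is then dereferenced without a check. Because `num` is an `int` accumulated from `2 * gv.ntot()`, one possible cause at high resolution is that the count exceeds `INT_MAX`, which would make the requested size invalid. This is a hypothesis and is not confirmed in this report. The following is a minimal sketch of a more defensive size computation, assuming `realnum` is `double`. The struct `lorentzian_data_sketch` and the functions `count_realnums`, `fits_in_int` and `alloc_checked` are hypothetical stand-ins, not Meep's actual definitions.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

typedef double realnum; // assumption: double-precision build

// Hypothetical stand-in for Meep's lorentzian_data (the real layout may differ).
struct lorentzian_data_sketch {
  size_t sz_data;  // total allocation size in bytes
  realnum data[1]; // first element of the trailing storage
};

// Count the realnums to store using size_t instead of int.
// num_needed is the number of (component, cmp) pairs for which
// needs_P would be true in the original loop.
size_t count_realnums(size_t ntot, size_t num_needed) {
  // Guard against the product itself wrapping around in size_t.
  assert(num_needed == 0 || ntot <= SIZE_MAX / 2 / num_needed);
  return 2 * ntot * num_needed;
}

// True if the count would still fit in the original `int num`.
bool fits_in_int(size_t count) {
  return count <= static_cast<size_t>(INT_MAX);
}

// Allocate the block, handling num == 0 and a failed malloc explicitly.
lorentzian_data_sketch *alloc_checked(size_t num) {
  size_t extra = (num > 0) ? num - 1 : 0; // data[1] already holds one element
  size_t sz = sizeof(lorentzian_data_sketch) + sizeof(realnum) * extra;
  lorentzian_data_sketch *d =
      static_cast<lorentzian_data_sketch *>(std::malloc(sz));
  if (d == NULL) {
    std::fprintf(stderr, "out of memory: requested %zu bytes\n", sz);
    return NULL;
  }
  d->sz_data = sz;
  return d;
}
```

With these helpers, a chunk with `ntot = 200000000` and six stored components gives a count of 2.4e9, which `fits_in_int` rejects. A small case such as `ntot = 1000` passes, which would match the observation that low resolutions do not crash.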
