DMO Zoom on-the-fly with SWIFT segfault #116

Closed
stuartmcalpine opened this issue Nov 8, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@stuartmcalpine

I am doing some zoom runs on-the-fly with SWIFT, without MPI. These tests run on COSMA8 with 128 threads. The DMO version segfaults on the 4th invocation, and the hydro version later, around the 10th. The DMO run consistently fails at the same place.

module load intel_comp/2018 intel_mpi/2018 fftw/3.3.7
module load gsl/2.5 hdf5/1.10.3

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-fPIC" -DVR_ZOOM_SIM=ON -DVR_MPI=OFF -DVR_MPI_REDUCE=OFF -DVR_USE_SWIFT_INTERFACE=ON ..

Config file:

vrconfig_3dfof_subhalos_SO_dmo.txt

Last bit of log:

[1726.081] [debug] search.cxx:3982 Getting Hierarchy 23
[1726.081] [debug] search.cxx:4015 Done
[1726.726] [ info] substructureproperties.cxx:5047 Sort particles and compute properties of 23 objects
[1726.726] [debug] substructureproperties.cxx:5059 Calculate properties using minimum potential particle as reference
[1726.726] [debug] substructureproperties.cxx:5062 Sort particles by binding energy
[1795.700] [debug] substructureproperties.cxx:5087 Memory report at substructureproperties.cxx:5087@long long **SortAccordingtoBindingEnergy(Options &, long long, NBody::Particle *, long long, long long *&, long long *, PropData *, long long): Average: 70.075 [GiB] Data: 72.269 [GiB] Dirty: 0 [B] Library: 0 [B] Peak: 81.744 [GiB] Resident: 68.478 [GiB] Shared: 8.734 [MiB] Size: 72.370 [GiB] Text: 4.180 [MiB]
[1795.701] [debug] substructureproperties.cxx:42 Getting CM
[1795.702] [debug] substructureproperties.cxx:320 Done getting CM in 1 [ms]
[1795.702] [debug] substructureproperties.cxx:4621 Getting energy
[1795.703] [debug] substructureproperties.cxx:4733 Have calculated potentials in 744 [us]
[1795.704] [debug] substructureproperties.cxx:5034 Done getting energy in 1 [ms]
[1795.704] [debug] substructureproperties.cxx:338 Getting bulk properties
[1795.706] [debug] substructureproperties.cxx:2194 Done getting properties in 1 [ms]
[1795.706] [debug] substructureproperties.cxx:3219 Done FOF masses in 4 [us]
[1795.706] [debug] substructureproperties.cxx:3236 Get inclusive masses
[1795.706] [debug] substructureproperties.cxx:3237 with masses based on full SO search (slower) for halos only

Line where it segfaults:

[Two screenshots showing the faulting line; not preserved in this text.]

Originally posted by @stuartmcalpine in #53 (comment)

rtobar added the "bug" label Nov 9, 2021
@stuartmcalpine
Author

Note that these also segfault when run standalone on the snapshots with the same setup.

They don't run at all with MPI, so I'm running them without MPI.

rtobar added a commit that referenced this issue Nov 10, 2021
Just like in 53c0289, the SO_angularmomentum vector only has a certain
size in some situations, and that check wasn't performed in this code
that was setting its values to zero.

Moreover, the values don't need to be initialised to zero as they
already are set to zero when the vector is sized in PropData.Allocate().

This addresses the problem reported in #116.

Signed-off-by: Rodrigo Tobar
@rtobar

rtobar commented Nov 10, 2021

I just reproduced this locally with gcc, in a gdb session, using a smaller input file:

[...]
[  11.297] [debug] substructureproperties.cxx:3221 Done FOF masses in 100 [us]
[  11.378] [debug] substructureproperties.cxx:3238 Get inclusive masses
[  11.378] [debug] substructureproperties.cxx:3239  with masses based on full SO search (slower) for halos only
[Thread 0x7ffff1d49640 (LWP 137975) exited]
[Thread 0x7ffff1548640 (LWP 137974) exited]
[New Thread 0x7ffff1548640 (LWP 137978)]
[New Thread 0x7ffff1d49640 (LWP 137979)]
[Thread 0x7ffff1d49640 (LWP 137979) exited]
[Thread 0x7ffff1548640 (LWP 137978) exited]
[New Thread 0x7ffff1548640 (LWP 137980)]
[New Thread 0x7ffff1d49640 (LWP 137981)]

Thread 2 "stf" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff254a640 (LWP 137969)]
GetSOMasses () at /home/rtobar/scm/git/VELOCIraptor-STF/src/substructureproperties.cxx:3557
3557                 pdata[i].SO_angularmomentum[iso] = zero;

I had a closer look and, as I mentioned previously, this looks very much like what happened in #78, which was fixed by 53c0289, so I prepared a similar fix.

@stuartmcalpine I just pushed the commit to the new issue-116 branch. It fixes things for me locally (I can run stf to completion). Please let me know if this fixes things for you too, and after that I'll merge back to the master branch.
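
For readers following along, the failure mode described in the commit message can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the VELOCIraptor source: `PropDataSketch`, `Vec3`, `ResetSO` and the `with_angular_momentum` flag are hypothetical names made up for this example. It only shows the general pattern of a vector that is sized in some configurations and must not be indexed when it was left empty.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 3-vector; not the VELOCIraptor Coordinate type.
struct Vec3 {
    double x = 0.0, y = 0.0, z = 0.0;
};

// Hypothetical, heavily reduced stand-in for PropData.
struct PropDataSketch {
    std::vector<double> SO_mass;
    std::vector<Vec3> SO_angularmomentum;

    // Size the vectors. The angular momentum vector is only sized when the
    // (assumed) flag is set. resize() value-initialises new elements, so
    // they already start at zero and need no further zeroing.
    void Allocate(std::size_t nso, bool with_angular_momentum) {
        SO_mass.resize(nso);
        if (with_angular_momentum) SO_angularmomentum.resize(nso);
    }
};

// Reset per-overdensity values, touching the optional vector only if it
// was actually sized.
void ResetSO(PropDataSketch &p) {
    for (std::size_t iso = 0; iso < p.SO_mass.size(); iso++) {
        p.SO_mass[iso] = 0.0;
        // Without this guard the write is out of bounds when the vector
        // is empty, which is undefined behaviour and may segfault.
        if (iso < p.SO_angularmomentum.size()) p.SO_angularmomentum[iso] = Vec3{};
    }
}
```

In this sketch the guarded assignment is redundant right after `Allocate()`, since the elements are already zero; that matches the commit message's point that the explicit zeroing was not needed in the first place.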

@stuartmcalpine
Author

Yes that seems to fix it for me, thanks!

@rtobar

rtobar commented Nov 11, 2021

The fix is now merged into the master branch; closing.

@rtobar rtobar closed this as completed Nov 11, 2021