DMO breaks on substructure search #37

Closed
edoaltamura opened this issue Oct 20, 2020 · 35 comments
Labels: bug, workaround available

Comments

@edoaltamura

Describe the bug
I am trying to run stf stand-alone on a DMO snapshot, using the zoom configuration. The process fails with the error

0 Beginning substructure search
Error, net size 0 with row,col=0,0
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted

The same behavior also appears when running stf on the fly with SWIFT. Note: SWIFT itself completed the run successfully and the snapshots do not seem to be corrupted in any apparent way (checked for existing datasets and dataset shapes).

To Reproduce
Steps to reproduce the behavior:

  1. Go to /cosma/home/dp004/dc-alta2/data7/xl-zooms/dmo/L0300N0564_VR93 on Cosma 7.
  2. Load the following modules:
module purge
module load intel_comp/2020-update2
module load intel_mpi/2020-update2
module load ucx/1.8.1
module load parmetis/4.0.3-64bit
module load parallel_hdf5/1.10.6
module load fftw/3.3.8cosma7
module load gsl/2.5
  3. Run VR as
../VELOCIraptor-STF/stf -I 2 -i snapshots/L0300N0564_VR93_0199 -o L0300N0564_VR93_0199 -C config/vr_config_zoom_dmo.cfg
  4. See error
0 Beginning substructure search
Error, net size 0 with row,col=0,0
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted

Expected behavior
Given the parsed arguments, I expected the usual output files to be generated in the working directory (e.g. the L0300N0564_VR93_0199.properties file).

Log files
Logs can be displayed to console, but they are also available in the $(pwd)/stf directory.

Environment (please complete the following information):

  • VR version: fresh installation (yesterday) from the master branch, compiled and run with the following modules:
  • Libraries:
module load intel_comp/2020-update2
module load intel_mpi/2020-update2
module load ucx/1.8.1
module load parmetis/4.0.3-64bit
module load parallel_hdf5/1.10.6
module load fftw/3.3.8cosma7
module load gsl/2.5

Additional context
I also tried running with higher verbosity in the `.cfg` file, but no further info is shown.

Thanks in advance for your help!
edoaltamura changed the title from "DMO break on substructure search" to "DMO breaks on substructure search" Oct 20, 2020

rtobar commented Oct 20, 2020

@edoaltamura do you happen to have a backtrace of the crash? I don't have access to cosma, and therefore cannot easily identify the problem only with the information given above.

The message Error, net size 0 with row,col=0,0 comes from the construction of a GMatrix where row * col <= 0. This could of course happen if either is 0 or negative, but given their data types (int) it could also happen if row * col >= 2^31. The std::length_error exception that finally crashes the program is raised by vector.reserve, and is probably just another symptom of the same problem.

The only place where GMatrix is created with non-hardcoded row/column numbers is in localbgcomp.cpp in the DetermineDenVRatioDistribution function. Here nbins is used both as the row and column count used to construct GMatrix, so its value is either 0 or something bigger than 2^30. There are two places where nbins is calculated:

nbins = max((int)ceil(log10((Double_t)nbodies)/log10(2.0)+1)*4, MINBIN);

This seems innocuous. The other is:

int npeak=0;
for (i=0;i<nbodies;i++) if (Part[i].GetPotential()>=rmin&&Part[i].GetPotential()<rmax) npeak++;
//once have initial estimates of variance bin using Scott's formula
//deltar=3.5*sdlow/pow(nbodies,1./3.);
deltar=3.5*sqrt(sdlow*sdlow+sdhigh*sdhigh)/pow(npeak,1./3.);
//nbins=ceil((rmax-rmin)/deltar+1);
nbins=round((rmax-rmin)/deltar+MINBIN);

Here there is a possibility for npeak to go beyond 2^31, which would overflow it into a negative value, making deltar negative, and finally nbins negative. This would explain the error in the GMatrix construction. This negative value is then given to vector.resize, which will turn it back into a (probably very large) positive 64-bit value, causing a std::length_error.
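
To illustrate that last step, here is a minimal standalone sketch (not VR code; the bin count is made up) of how a negative int handed to vector.resize is converted to a huge unsigned size and triggers exactly the vector::_M_default_append length error from the report:

#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    int nbins = -68;              // stand-in for a bin count that has gone negative
    std::vector<double> rbin;
    try {
        // -68 is implicitly converted to std::size_t (18446744073709551548),
        // which exceeds max_size() and makes libstdc++ throw std::length_error.
        rbin.resize(nbins);
    } catch (const std::length_error &e) {
        std::cerr << e.what() << std::endl;   // prints "vector::_M_default_append"
    }
    return 0;
}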

Now, all this theory makes sense only if the number of particles you are dealing with is greater than 2^31 to begin with, could you confirm that's the case?

@edoaltamura (Author)

Hi @rtobar, thanks for your reply! I am using fewer particles than the bit limit (~2^16 particles in this case, but I plan on increasing that eightfold soon). I have attached the stdout file - I hope this helps.
log.txt


rtobar commented Oct 21, 2020

@edoaltamura the logfile was useful to double-check that the error is indeed most likely happening within DetermineDenVRatioDistribution, and the information on the number of particles is also useful: it basically means that what I thought above is not the cause of the error. Other than that, there's not much more information I can extract from the logfile. A full backtrace, hopefully with local variable information, would definitely help in figuring out what's going on.

Another guess though: maybe sdlow and/or sdhigh and/or meanr are incorrectly initialised (or not initialised at all). Unlikely, but still possible. You could try printing the values of sdlow, sdhigh, meanr, sl, rmin, rmax, deltar and nbins in localbgcomp.cxx line 241, just before W=GMatrix(nbins, nbins).
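
Something along these lines would do (a hypothetical drop-in print, not committed code; it assumes <iostream> is available in that translation unit and uses the local variable names of DetermineDenVRatioDistribution):

std::cerr << "sdlow=" << sdlow << " sdhigh=" << sdhigh << " meanr=" << meanr
          << " sl=" << sl << " rmin=" << rmin << " rmax=" << rmax
          << " deltar=" << deltar << " nbins=" << nbins << std::endl;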


jchelly commented Oct 21, 2020

@edoaltamura - you can get a backtrace if you configure velociraptor with -DCMAKE_BUILD_TYPE=Debug and run it in DDT. E.g.

module load allinea
ddt ../VELOCIraptor-STF/stf -I 2 -i snapshots/L0300N0564_VR93_0199 -o L0300N0564_VR93_0199 -C config/vr_config_zoom_dmo.cfg

DDT might mistakenly think you're using MPI because velociraptor is linked against parallel hdf5, so you'll probably have to make sure the MPI box is not checked when it starts up, and ignore it when it complains that you're running an MPI program.

@edoaltamura (Author)

Thanks @jchelly and @rtobar. I don't have much experience with DDT, but from what I can see one of the issues is with the hdf5 library. The problem persists when compiling/running with different versions, parallel_hdf5/1.10.3 and parallel_hdf5/1.10.6. Surprisingly enough, the run is successful when using the VR executable compiled with hydro, a hydro .cfg file and a dmo snapshot as input (-i). The results this gave are actually quite believable for basic quantities (M200c, M500c, R200c, etc.).


jchelly commented Oct 22, 2020

I can reproduce this. Here's the stack trace when it throws the length error exception:

#12 _INTERNAL8aaf6219::__kmp_launch_worker (thr=0x2aab484e2760) at /nfs/site/proj/openmp/promo/20200504/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/z_Linux_util.cpp:593 (at 0x00002aaaac6508cc)
#11 __kmp_launch_thread (this_thr=0x2aab484e2760) at /nfs/site/proj/openmp/promo/20200504/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:6109 (at 0x00002aaaac5d321e)
#10 __kmp_invoke_task_func (gtid=1213081440) at /nfs/site/proj/openmp/promo/20200504/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:7515 (at 0x00002aaaac5d4273)
#9 __kmp_invoke_microtask () from /cosma/local/intel/Parallel_Studio_XE_2020/compilers_and_libraries_2020.2.254/linux/compiler/lib/intel64_lin/libiomp5.so (at 0x00002aaaac6503f3)
#8 L__Z12SearchSubSubR7OptionsxRSt6vectorIN5NBody8ParticleESaIS3_EERPxRxS9_P8PropData_3035__par_region0_2_130 () at /cosma7/data/dp004/jch/Swift/edo/VELOCIraptor-STF/src/search.cxx:3065 (at 0x00000000005bd61d)
#7 PreCalcSearchSubSet (opt=..., subnumingroup=42392, subPart=@0x2aab38c002c0: 0x2aab48100ba8, sublevel=1) at /cosma7/data/dp004/jch/Swift/edo/VELOCIraptor-STF/src/search.cxx:2770 (at 0x00000000005c6c87)
#6 GetOutliersValues (opt=..., nbodies=42392, Part=0x2aab48100ba8, sublevel=1) at /cosma7/data/dp004/jch/Swift/edo/VELOCIraptor-STF/src/localbgcomp.cxx:385 (at 0x00000000005824d9)
#5 DetermineDenVRatioDistribution (opt=..., nbodies=42392, Part=0x2aab48100ba8, meanr=@0x2aab38bfe5e0: -nan(0x8000000000000), sdlow=@0x2aab38bfe5e8: -nan(0x8000000000000), sdhigh=@0x2aab38bfe5f0: -nan(0x8000000000000), sublevel=1) at /cosma7/data/dp004/jch/Swift/edo/VELOCIraptor-STF/src/localbgcomp.cxx:242 (at 0x000000000057f710)
#4 std::vector<double, std::allocator<double> >::resize (this=0x2aab38bfdc90, __new_size=9223372036854775808) at /cosma/local/gcc/7.3.0/include/c++/7.3.0/bits/stl_vector.h:692 (at 0x00000000004e6a1b)
#3 std::vector<double, std::allocator<double> >::_M_default_append (this=0x2aab38bfdc90, __n=9223372036854775740) at /cosma/local/gcc/7.3.0/include/c++/7.3.0/bits/vector.tcc:569 (at 0x00000000004e6bea)
#2 std::vector<double, std::allocator<double> >::_M_check_len (this=0x2aab38bfdc90, __n=9223372036854775740, __s=0x6c4760 "vector::_M_default_append") at /cosma/local/gcc/7.3.0/include/c++/7.3.0/bits/stl_vector.h:1500 (at 0x00000000004e6e7f)
#1 std::__throw_length_error (__s=0x6c4760 "vector::_M_default_append") at /cosma/local/software/compiler/gcc-7.3.0/build/x86_64-pc-linux-gnu/libstdc++-v3/src/c++11/../../../../../libstdc++-v3/src/c++11/functexcept.cc:78 (at 0x00002aaaac25589f)
#0 __cxxabiv1::__cxa_throw (obj=obj@entry=0x2aab484e2760, tinfo=0x2aaaac512a60, dest=0x2aaaac2423b0 <std::length_error::~length_error()>) at /cosma/local/software/compiler/gcc-7.3.0/build/x86_64-pc-linux-gnu/libstdc++-v3/libsupc++/../../../../libstdc++-v3/libsupc++/eh_throw.cc:75 (at 0x00002aaaac22d220)

This is on 16 threads and no MPI. With 1 thread it doesn't crash. The VR cmake config is

  cmake . -DVR_USE_HYDRO=OFF \
    -DVR_USE_SWIFT_INTERFACE=OFF \
    -DCMAKE_CXX_FLAGS="-fPIC" \
    -DCMAKE_BUILD_TYPE=Debug \
    -DVR_ZOOM_SIM=ON \
    -DVR_MPI=OFF

and the input file is vr_config_zoom_dmo.cfg.gz.


jchelly commented Oct 22, 2020

At localbgcomp.cxx:230

        rmin=(meanr-sl*sdlow);
        rmax=(meanr+sl*sdhigh);
        int npeak=0;
        for (i=0;i<nbodies;i++) if (Part[i].GetPotential()>=rmin&&Part[i].GetPotential()<rmax) npeak++;
        //once have initial estimates of variance bin using Scott's formula
        //deltar=3.5*sdlow/pow(nbodies,1./3.);
        deltar=3.5*sqrt(sdlow*sdlow+sdhigh*sdhigh)/pow(npeak,1./3.);
        //nbins=ceil((rmax-rmin)/deltar+1);
        nbins=round((rmax-rmin)/deltar+MINBIN);
        //recalculate deltar
        deltar = (rmax-rmin)/(double)nbins;
        W=GMatrix(nbins,nbins);
        rbin.resize(nbins);

we get the crash because nbins in rbin.resize(nbins) has a huge negative value. rmin, rmax, meanr, sdhigh and sdlow are all NaN.
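
For illustration, here is a standalone sketch (not VR code; MINBIN here is a placeholder value) of how NaN inputs end up as a huge negative nbins. Converting a NaN double to an integer is formally undefined behaviour, but on x86 it typically yields the most negative representable value, which is consistent with the __new_size of roughly 2^63 in the backtrace:

#include <cmath>
#include <iostream>
#include <limits>

int main() {
    // rmin, rmax and deltar are all NaN, as observed in the debugger.
    const double qnan = std::numeric_limits<double>::quiet_NaN();
    double rmin = qnan, rmax = qnan, deltar = qnan;
    const int MINBIN = 10;                                     // placeholder, not the real constant

    double raw = std::round((rmax - rmin) / deltar + MINBIN);  // NaN propagates through
    long long nbins = static_cast<long long>(raw);             // undefined; typically LLONG_MIN on x86
    std::cout << "raw=" << raw << " nbins=" << nbins << std::endl;
    return 0;
}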


rtobar commented Oct 22, 2020

Thanks @jchelly for the backtrace, that was indeed the direction I was aiming at with my last comment too. The fact that meanr is nan means that before that loop rmin and/or deltar were nan, which ultimately can only happen if one of the particles has a nan potential:

rmin=rmax=Part[0].GetPotential();
#ifdef USEOPENMP
#pragma omp parallel for default(shared) \
private(i,tid) schedule(static) \
reduction(min:rmin) reduction(max:rmax) if (nbodies > ompperiodnum)
#endif
for (i=1;i<nbodies;i++) {
if (rmin>Part[i].GetPotential())rmin=Part[i].GetPotential();
if (rmax<Part[i].GetPotential())rmax=Part[i].GetPotential();
}

These potentials seem to be set just before DetermineDenVRatioDistribution is called, in GetDenVRatio. There is an OpenMP parallel for there setting those potentials, but there are a few things that could contribute to a nan.

Part[i].SetPotential(log(tempdenv)-log(norm)-fbg);

And within the same loop:

tempdenv=Part[i].GetDensity()/opt.Nsearch;

This would push the problem even further back. Sadly, the problem is not obvious, and without a proper debugging session it will be difficult to investigate. If I could get hold of the data, or access to cosma, I could give it a try though. Can either of those be arranged?

@edoaltamura (Author)

Having access to cosma would probably be optimal, since it would perfectly replicate the issue. The snap file isn't very large (< 1 GB) though, so I'm happy to send it to you via WeTransfer or a similar service.


jchelly commented Oct 22, 2020

@rtobar - you can get a copy of the snapshot here: http://icc.dur.ac.uk/~jch/L0300N0564_VR93_0199.hdf5

There are instructions to sign up for a Cosma account at https://www.dur.ac.uk/icc/cosma/support/account/ but someone here needs to approve adding you to a project. I'll ask about that.


jchelly commented Oct 22, 2020

@rtobar - if you apply for an account through DiRAC SAFE (see the link above) it should get approved now. You'll need to create an account on the SAFE web site and then use that to request a login account on Cosma as part of the dp004/virgo project.


rtobar commented Oct 23, 2020

@jchelly Thanks for the instructions. I got an account created and set up, but there is still one missing step I need to overcome to be able to apply for cosma access (the UK Access Management Federation system does not currently play well with the University of Western Australia's authentication system). I'll try to get that solved, though I suspect it will take a while. In the meantime I'll be using the files you uploaded to try to reproduce the error locally.


rtobar commented Oct 23, 2020

With the files locally I have now been able to reproduce the error and dig a bit more into what's causing it.

Indeed, nbins gets out of control because a variety of other quantities go out of range. In particular, at the beginning of the function, when calculating rmin and rmax, rmin goes to -inf:

Thread 8 "stf" hit Breakpoint 2, DetermineDenVRatioDistribution (opt=..., nbodies=345410, Part=0x7fff76060018, meanr=@0x7fff8e5b6698: 0, sdlow=@0x7fff8e5b66a0: 6.2746337021838311e-322, sdhigh=@0x7fff8e5b66a8: 108905792, sublevel=1) at /home/rtobar/scm/git/VELOCIraptor-STF/src/localbgcomp.cxx:166
166         deltar=(4.0*fabs(rmin))/(Double_t)nbins;
(gdb) p rmin
$1 = -inf
(gdb) p rmax
$2 = 50.755579976876106

This is because Part[1]'s potential is -inf:

(gdb) p Part[0].GetPotential()
$3 = 0.82819776835306769
(gdb) p Part[1].GetPotential()
$4 = -inf

The potentials are set in GetDenVRatio. When stepping through this for i == 1 I see:

Thread 8 "stf" hit Breakpoint 2, GetDenVRatio () at /home/rtobar/scm/git/VELOCIraptor-STF/src/localbgcomp.cxx:89
89              tempdenv=Part[i].GetDensity()/opt.Nsearch;
(gdb) p i
$6 = 1
(gdb) p Part[i].GetDensity()
$7 = 0
(gdb) p tempdenv
$8 = 0
...
121             Part[i].SetPotential(log(tempdenv)-log(norm)-fbg);
(gdb) p fbg
$36 = -18.780740249149968
(gdb) p norm
$37 = 0.063493635934240969
(gdb) p log(tempdenv)
$40 = -inf
(gdb) n
(gdb) p Part[1].GetPotential()
$41 = -inf

So ultimately this is caused by particles having density = 0. All of the above is with 4 threads.
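
Put together, here is a standalone sketch of the chain (not VR code; Nsearch is a placeholder value, while norm, fbg and the potentials are taken from the gdb session above). A single zero density gives a -inf potential, which then drives rmin to -inf, and expressions mixing the resulting infinities are where the NaNs seen earlier come from:

#include <algorithm>
#include <cmath>
#include <iostream>

int main() {
    double density = 0.0;                       // as seen for Part[1] above
    double Nsearch = 32.0;                      // placeholder for opt.Nsearch
    double tempdenv = density / Nsearch;        // 0
    double norm = 0.063493635934240969;         // from the gdb output
    double fbg = -18.780740249149968;           // from the gdb output
    double potential = std::log(tempdenv) - std::log(norm) - fbg;  // log(0) = -inf

    double rmin = std::min(0.82819776835306769, potential);        // rmin becomes -inf
    double rmax = 50.755579976876106;
    std::cout << "potential=" << potential                         // -inf
              << " rmin=" << rmin                                  // -inf
              << " rmax-rmin=" << (rmax - rmin)                    // +inf
              << " inf-inf=" << ((rmax - rmin) - (rmax - rmin))    // nan
              << std::endl;
    return 0;
}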

I tried the same with 1 thread, and the density is correctly calculated (or at least has a value):

Breakpoint 1, GetDenVRatio () at /home/rtobar/scm/git/VELOCIraptor-STF/src/localbgcomp.cxx:89
89              tempdenv=Part[i].GetDensity()/opt.Nsearch;
(gdb) p Part[i].GetDensity()
$4 = 7.33894316062687e-09
(gdb) p i
$5 = 1
(gdb) 

These densities seem to be set via SetDensity at:

Part[j].SetDensity(tree->CalcSmoothLocalValue(opt.Nvel, pqv, weight));

This is where I'm concentrating now; I'll post more updates when I learn more.

@edoaltamura (Author)

As an update, the exact same issue now also appears in the hydro version of the code (and still does in dark-matter-only). Could it depend on the Intel 2020 compiler causing some unexpected behavior?

@MatthieuSchaller

I doubt the compiler is the issue. If anything it would be the compiler revealing incorrect behaviour in the VR code. But you can try with 2018 or gcc if you want to confirm this.


rtobar commented Nov 12, 2020

I don't think it's the compiler either. It also doesn't seem to be a recent bug in VR; as commented on slack earlier today, I was able to reproduce this bug with the same input data using VR versions from 8 weeks and 3 months ago (compiled with gcc). If this bug wasn't found earlier, it is probably because the inputs didn't trigger it.


rtobar commented Nov 16, 2020

@edoaltamura mentioned this on slack:

Compiled from master with intel 2018: c1283dd (this works)
Compiled from hotgas with intel 2020: b58e5ac (breaks as in #37)
Compiled from hotgas with intel 2018: b58e5ac (breaks in a slightly different way)

In my previous comment I mentioned how a 3-month-old version didn't work for me, but I was wrong -- I tried again and it did work, so we are aligned on what we see. I'm bisecting through the history now to find the culprit commit, which will hopefully make things easier to fix. This is still the case: I still get the error on c1283dd locally when running with the input data and the configuration file provided by John.


rtobar commented Nov 16, 2020

Update: while it's not clear what exactly is wrong, the problem seems to stem from the difference in how the initial trees are built in the parallel and serial cases. And again, while not the solution, an apparently "good enough" workaround is to force the code to build the trees serially, even when OpenMP support is enabled and there are multiple cores to use. This can be done by setting OMP_run_fof=0 in the configuration file, as shown below.
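
For reference, that amounts to adding the following line to the VR .cfg file (the # comment style follows the sample configuration files):

# Workaround for this issue: build the initial FOF trees serially
OMP_run_fof=0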

rtobar added a commit that referenced this issue Nov 18, 2020
This is but a small set of unused variables visible when compiling the
code with -Wall -Wextra.

I have chosen to fix this file and functions simply to clean up the path
to further debugging the problem described in #37, which stems from the
assignment of zeros to particle densities, which in turn happens in
GetVelocityDensityApproximative.

Signed-off-by: Rodrigo Tobar <[email protected]>
@edoaltamura (Author)

I can confirm that introducing OMP_run_fof=0 allows the code to run smoothly. It would be interesting to find out the precise cause of this behavior. In the meantime, would it be possible to rebase the hot_gas_properties branch with the latest updates in master? That way I would be able to test Claudia's implementation with SO properties further. Thanks again for looking into this.


rtobar commented Nov 26, 2020

@edoaltamura thanks for confirming that the workaround does indeed work for you too. There is still work needed to identify the real underlying cause, which is why this issue will remain open until it is found.

Regarding hot_gas_properties, I just merged the latest master onto it, while @cdplagos has also made a few further changes. Let's try to keep that discussion separate though (maybe in slack?) to avoid crossing wires here.


rtobar commented Dec 6, 2020

As a way to constrain this problem a bit further, I've recently added to the master branch a more explicit error about the condition that triggers this issue ("Particle density not positive, cannot continue"), which halts the program earlier than before and with a less cryptic message. This is the error message found by @JBorrow in #53. After confirmation we can close #53 as a duplicate of this issue.

In addition, two things have come to my attention these last days:

  • The issue seems to happen only when the code is compiled with VR_MPI=OFF. I hadn't realised this earlier, but I will double-check to be absolutely certain this is the case.
  • The issue also seems not to occur with VR_STRUCTURE_DEN=OFF. I will again double-check this.

With these two points in mind, my attention is now gravitating to localfields.cxx's GetVelocityDensityApproximative as the potential core of the issue. Again, I'll double-check that these two findings are correct before continuing.

@MatthieuSchaller

Would it help to use GCC's thread sanitizer and/or address sanitizer? These are pretty good at finding mistakes in OMP sections.
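
For reference, that would look roughly like the following on top of the existing cmake configuration (these are standard GCC/Clang sanitizer flags, nothing VR-specific, and the two sanitizers cannot be combined in a single build):

# ThreadSanitizer build, for data races in the OpenMP sections
cmake . -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="-fsanitize=thread -g"

# AddressSanitizer build, for out-of-bounds accesses and use-after-free
cmake . -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="-fsanitize=address -g"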


rtobar commented Dec 22, 2020

Thanks to @pelahi we now have a fix that addresses the issue. The problem was apparently introduced by 8b0cf42, but a fix for this is now present on the issue-37 branch in 08595f8. The fix indeed makes the new "Particle density not positive, cannot continue" error disappear, confirming not only that the original problem is gone, but also that the additional diagnostics added after the analysis of the problem are not triggering errors either.

@edoaltamura @MatthieuSchaller when you have a chance please double-check on your end that the fix is working as intended. After confirmation I'll merge this to the master branch.


pelahi commented Dec 22, 2020

My apologies for introducing this bug!

@MatthieuSchaller

Is that still an issue or are we confident it's now all working? I can't see the branch in the PR list or in the merged list.


rtobar commented Feb 3, 2021

I haven't created a PR yet for these changes, as I was waiting for confirmation on your end (@edoaltamura @MatthieuSchaller) that this wasn't an issue anymore. As per my previous comment, the fix is in the issue-37 branch, and my local tests at least showed that the original problem is gone.

rtobar added the bug label Feb 16, 2021
@edoaltamura (Author)

Hi @rtobar, there have been some more commits since the last time I checked... I'm going to check the issue-37 branch later this week using both dmo and hydro zooms and give you an update. Thanks for your patience!


rtobar commented Feb 16, 2021

Thanks @edoaltamura, that would be great :). If your tests come back with positive results then I'll merge the branch back to the master.

@edoaltamura (Author)

I'm still detecting some errors when compiling without MPI (I am compiling the up-to-date source in the issue-37 branch).

The hydro version on zooms exits with this message:

[ 199.534] [trace] bgfield.cxx:54 Grid system using leaf nodes with maximum size of 176
[ 199.534] [trace] bgfield.cxx:60 Building physical Tree using minimum shannon entropy as splitting criterion
terminate called after throwing an instance of 'std::runtime_error'
[ 199.534] [trace] search.cxx:1781 Number of cores: 2
[ 199.534] [trace] bgfield.cxx:140 Done
  what():  Particle density not positive, cannot continue[ 199.534] [trace] bgfield.cxx:155 Calculating Grid Mean Velocity
[[ 199.534] [trace] bgfield.cxx:173 199. 534   Done
[[ 199.534] [trace] bgfield.cxx:173 199. 534   Done
[ 199.534] [trace] bgfield.cxx:184 Calculating Grid Velocity Dispersion
[ 199.534] [trace[] bgfield.cxx 199.[534] [trace] :search.cxx:199901965.  Filling KD-Tree Grid
Aborted

which seems related to issue #53. From monitoring the stdout, the OpenMP part seems to launch correctly, but I'm not sure what happens once it gets to the parallel FOF part, since it breaks on non-positive densities.

A similar setup for dark-matter-only instead returns

[   0.437] [ info] hdfio.cxx:941 Expecting:
[   0.437] [ info] hdfio.cxx:943   Header
[   0.437] [ info] hdfio.cxx:943   PartType0
[   0.437] [ info] hdfio.cxx:943   PartType1
[   0.437] [ info] hdfio.cxx:943   PartType2
[   0.437] [ info] hdfio.cxx:943   PartType3
[   0.437] [ info] hdfio.cxx:943   PartType4
[   0.437] [ info] hdfio.cxx:943   PartType5
[   0.438] [ info] hdfio.cxx:1109 File contains 2312540 particles and is at time 1.000
[   0.438] [ info] hdfio.cxx:1110 Particle system contains 1922386 particles and is at time 1.000 in a box of size 300.000
[   0.438] [ info] hdfio.cxx:1133 Reading file 0
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5G.c line 463 in H5Gopen2(): unable to open group
    major: Symbol table
    minor: Can't open object
  #001: H5Gint.c line 281 in H5G__open_name(): group not found
    major: Symbol table
    minor: Object not found
  #002: H5Gloc.c line 422 in H5G_loc_find(): can't find object
    major: Symbol table
    minor: Object not found
  #003: H5Gtraverse.c line 851 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #004: H5Gtraverse.c line 627 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #005: H5Gloc.c line 378 in H5G__loc_find_cb(): object 'PartType0' doesn't exist
    major: Symbol table
    minor: Object not found
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 288 in H5Dopen2(): not a location
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5Gloc.c line 246 in H5G_loc(): invalid object ID
    major: Invalid arguments to routine
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 378 in H5Dget_space(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 288 in H5Dopen2(): not a location
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5Gloc.c line 246 in H5G_loc(): invalid object ID
    major: Invalid arguments to routine
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 378 in H5Dget_space(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 288 in H5Dopen2(): not a location
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5Gloc.c line 246 in H5G_loc(): invalid object ID
    major: Invalid arguments to routine
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 378 in H5Dget_space(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 288 in H5Dopen2(): not a location
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5Gloc.c line 246 in H5G_loc(): invalid object ID
    major: Invalid arguments to routine
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 378 in H5Dget_space(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
[   2.505] [ info] io.cxx:127 Done loading input data

which could point to a mismatch in the I/O.

Please let me know if there are other useful tests I could do to help investigate these. Has the issue-37 branch been rebased with the latest improvements to master?


rtobar commented Feb 19, 2021

@edoaltamura thanks a lot for the further testing. I just merged the latest master branch into issue-37 to make sure the latter contains all the latest fixes and improvements, although I wouldn't expect it to change the outcome of your experiments.

Can I ask you to point me to the input configuration and files you were using for those last two experiments? I just tried the code against the original input files used to report this issue (i.e., snapshots/L0300N0564_VR93_0199 and config/vr_config_zoom_dmo.cfg under /cosma/home/dp004/dc-alta2/data7/xl-zooms/dmo/L0300N0564_VR93) and that ran to completion without errors. It would also be great if you could share the compilation options you used for these experiments; I'm using what seems to be representative of the original problem (-DVR_USE_HYDRO=OFF -DVR_USE_SWIFT_INTERFACE=OFF -DCMAKE_BUILD_TYPE=Debug -DVR_ZOOM_SIM=ON -DVR_MPI=OFF), which is also what John described in #37 (comment).

The hydro version on zooms exits with this message [...]

This issue was originally reported for DMO runs, while you are reporting a failure for a hydro run. This might or might not be connected to the same underlying issue, but to keep the problem-solving focused I'd suggest we stick with DMO runs only. #53 is indeed a problem with hydro runs, although without zoom, so again we can't really tell.

A similar setup for dark-matter-only instead returns [...] which could be a mismatch in the i/o.

Yes, it looks like that: the log says that object 'PartType0' doesn't exist, but it is required. Again, let's try to untangle this bit by bit. If you think this is a genuine problem, please open a separate issue to keep track of it.

@edoaltamura (Author)

I have run one more test replicating your cmake configuration and using the same input snapshot and cfg file, and I am getting a segmentation fault during the substructure search.

Before segfaulting, the stdout printed

[  75.173] [ info] search.cxx:155 Finished OpenMP local FOF search of 8 containing total of 878268 groups in 49.452 [s]

which may suggest that the OpenMP FOF search is completing without errors, or that errors from that section only get caught during the substructure search. This is with the new master updates merged into the issue-37 branch.

I have tried two combinations of modules on cosma7:

module load cmake
module load intel_comp/2018
module load intel_mpi/2018
module load parmetis
module load parallel_hdf5
module load fftw
module load gsl

and

module load cmake/3.18.1
module load intel_comp/2020-update2
module load intel_mpi/2020-update2
module load ucx/1.8.1
module load parmetis/4.0.3-64bit
module load parallel_hdf5/1.10.6
module load fftw/3.3.8cosma7
module load gsl/2.5

but they both gave the same result.


rtobar commented Feb 24, 2021

@edoaltamura I'm really puzzled by the fact that you got a segfault and I didn't for the same code and inputs. I would really like to clarify this; otherwise we are shooting in the dark.

I went ahead and built the latest issue-37 branch from scratch with the given cmake configuration under Debug mode, then ran it against the original input/config files. I then rebuilt in Release mode and ran it again. I repeated these steps for the two sets of modules you outlined above. However, in all these tries I couldn't replicate the segfault you reported.

I captured all the commands and outputs in the following files: edo-modules1.txt and edo-modules2.txt (in each you will find two compilations and executions, for Debug and Release mode).

Could you please have a look at these logs and see what is different between our runs? I did notice that in your segfault report the log says there are 878268 groups, while in all my runs I saw 1054896 groups reported. This makes me think you are using a different set of inputs or configuration. If that's the case then we can open a new issue to track that down, with the understanding that the original issue reported here is gone.

@edoaltamura (Author)

Hi @rtobar, I've tried your configuration and I can reproduce your outputs - I can confirm that the original issue is solved. I'm not sure why it didn't work with my previous inputs - I will double-check later. Thanks for your help and patience!


rtobar commented Feb 24, 2021

Thanks @edoaltamura, that's great news :). And thanks for your patience as well!

I will then merge issue-37 back to the master and declare this issue as fixed. But by all means, if you get new crashes or error reports, please open new issues; my main concern was not to mix different problems and instead treat them separately.


rtobar commented Feb 24, 2021

Fix merged to the master branch, closing this issue now.

rtobar closed this as completed Feb 24, 2021