DMO breaks on substructure search #37

Closed
edoaltamura opened this issue Oct 20, 2020 · 35 comments
Labels: bug, workaround available

Comments

@edoaltamura

Describe the bug
I am trying to run stf stand-alone on a DMO snapshot, using the zoom configuration. The process fails with the error

0 Beginning substructure search
Error, net size 0 with row,col=0,0
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted

The same behavior also appears when running stf on the fly with SWIFT. Note: SWIFT itself completed the run successfully and the snapshots do not seem to be corrupted in any apparent way (checked for existing datasets and dataset shapes).

To Reproduce
Steps to reproduce the behavior:

  1. Go to /cosma/home/dp004/dc-alta2/data7/xl-zooms/dmo/L0300N0564_VR93 on Cosma 7.
  2. Load the following modules:
module purge
module load intel_comp/2020-update2
module load intel_mpi/2020-update2
module load ucx/1.8.1
module load parmetis/4.0.3-64bit
module load parallel_hdf5/1.10.6
module load fftw/3.3.8cosma7
module load gsl/2.5
  3. Run VR as
../VELOCIraptor-STF/stf -I 2 -i snapshots/L0300N0564_VR93_0199 -o L0300N0564_VR93_0199 -C config/vr_config_zoom_dmo.cfg
  4. See error
0 Beginning substructure search
Error, net size 0 with row,col=0,0
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted

Expected behavior
Given the parsed arguments, I expected the usual output files to be generated in the working directory (e.g. the L0300N0564_VR93_0199.properties file).

Log files
Logs can be displayed to console, but they are also available in the $(pwd)/stf directory.

Environment (please complete the following information):

  • VR version: fresh installation (yesterday) from the master branch, compiled and run with the following modules:
  • Libraries:
module load intel_comp/2020-update2
module load intel_mpi/2020-update2
module load ucx/1.8.1
module load parmetis/4.0.3-64bit
module load parallel_hdf5/1.10.6
module load fftw/3.3.8cosma7
module load gsl/2.5

Additional context
I also tried running with higher verbosity in the `.cfg` file, but no further info is shown.

Thanks in advance for your help!
edoaltamura changed the title from "DMO break on substructure search" to "DMO breaks on substructure search" Oct 20, 2020

rtobar commented Oct 20, 2020

@edoaltamura do you happen to have a backtrace of the crash? I don't have access to cosma, and therefore cannot easily identify the problem only with the information given above.

The message Error, net size 0 with row,col=0,0 comes from the construction of a GMatrix where row * col <= 0. This could of course happen if either is 0 or negative, but given their data types (int) it could also happen if row * col >= 2^31. The std::length_error exception that finally crashes the program is raised by vector.reserve, and is probably just another symptom of the same problem.

The only place where GMatrix is created with non-hardcoded row/column numbers is in localbgcomp.cpp in the DetermineDenVRatioDistribution function. Here nbins is used both as the row and column count used to construct GMatrix, so its value is either 0 or something bigger than 2^30. There are two places where nbins is calculated:

nbins = max((int)ceil(log10((Double_t)nbodies)/log10(2.0)+1)*4, MINBIN);

This seems innocuous. The other is:

int npeak=0;
for (i=0;i<nbodies;i++) if (Part[i].GetPotential()>=rmin&&Part[i].GetPotential()<rmax) npeak++;
//once have initial estimates of variance bin using Scott's formula
//deltar=3.5*sdlow/pow(nbodies,1./3.);
deltar=3.5*sqrt(sdlow*sdlow+sdhigh*sdhigh)/pow(npeak,1./3.);
//nbins=ceil((rmax-rmin)/deltar+1);
nbins=round((rmax-rmin)/deltar+MINBIN);

Here there is a possibility for npeak to go beyond 2^31, which would overflow it into a negative value, making deltar negative, and finally nbins negative. This would explain the error in the GMatrix construction. This negative value is then given to vector.resize, which will turn it back into a (probably very large) positive 64-bit value, causing a std::length_error.
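
To illustrate that last step, here is a minimal standalone sketch (not VR code; the bin count is made up) of how a negative int handed to vector.resize is converted to a huge unsigned size and triggers exactly the vector::_M_default_append length error from the report:

#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    int nbins = -68;              // stand-in for a bin count that has gone negative
    std::vector<double> rbin;
    try {
        // -68 is implicitly converted to std::size_t (18446744073709551548),
        // which exceeds max_size() and makes libstdc++ throw std::length_error.
        rbin.resize(nbins);
    } catch (const std::length_error &e) {
        std::cerr << e.what() << std::endl;   // prints "vector::_M_default_append"
    }
    return 0;
}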

Now, all this theory makes sense only if the number of particles you are dealing with is greater than 2^31 to begin with, could you confirm that's the case?

@edoaltamura (Author)

Hi @rtobar, thanks for your reply! I am using fewer particles than the bit limit (~2^16 particles in this case, but I plan on increasing that eightfold soon). I have attached the stdout file - I hope this helps.
log.txt


rtobar commented Oct 21, 2020

@edoaltamura the logfile was useful to double-check that the error is indeed most likely happening within DetermineDenVRatioDistribution, and the information on the number of particles is also useful: it basically means that what I thought above is not the cause of the error. Other than that, there's not much more information I can extract from the logfile. A full backtrace, hopefully with local variable information, would definitely help in figuring out what's going on.

Another guess though: maybe sdlow and/or sdhigh and/or meanr are incorrectly initialised (or not initialised at all). Unlikely, but still possible. You could try printing the values of sdlow, sdhigh, meanr, sl, rmin, rmax, deltar and nbins in localbgcomp.cxx line 241, just before W=GMatrix(nbins, nbins).
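
Something along these lines would do (a hypothetical drop-in print, not committed code; it assumes <iostream> is available in that translation unit and uses the local variable names of DetermineDenVRatioDistribution):

std::cerr << "sdlow=" << sdlow << " sdhigh=" << sdhigh << " meanr=" << meanr
          << " sl=" << sl << " rmin=" << rmin << " rmax=" << rmax
          << " deltar=" << deltar << " nbins=" << nbins << std::endl;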


jchelly commented Oct 21, 2020

@edoaltamura - you can get a backtrace if you configure velociraptor with -DCMAKE_BUILD_TYPE=Debug and run it in DDT. E.g.

module load allinea
ddt ../VELOCIraptor-STF/stf -I 2 -i snapshots/L0300N0564_VR93_0199 -o L0300N0564_VR93_0199 -C config/vr_config_zoom_dmo.cfg

DDT might mistakenly think you're using MPI because velociraptor is linked against parallel hdf5, so you'll probably have to make sure the MPI box is not checked when it starts up, and ignore it when it complains that you're running an MPI program.

@edoaltamura (Author)

Thanks @jchelly and @rtobar. I don't have much experience with DDT, but from what I can see one of the issues is with the hdf5 library. The problem persists when compiling/running with different versions, parallel_hdf5/1.10.3 and parallel_hdf5/1.10.6. Surprisingly enough, the run is successful when using the VR executable compiled with hydro, a hydro .cfg file and a dmo snapshot as input (-i). The results this gave are actually quite believable for basic quantities (M200c, M500c, R200c, etc.).


jchelly commented Oct 22, 2020

I can reproduce this. Here's the stack trace when it throws the length error exception:

#12 _INTERNAL8aaf6219::__kmp_launch_worker (thr=0x2aab484e2760) at /nfs/site/proj/openmp/promo/20200504/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/z_Linux_util.cpp:593 (at 0x00002aaaac6508cc)
#11 __kmp_launch_thread (this_thr=0x2aab484e2760) at /nfs/site/proj/openmp/promo/20200504/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:6109 (at 0x00002aaaac5d321e)
#10 __kmp_invoke_task_func (gtid=1213081440) at /nfs/site/proj/openmp/promo/20200504/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxe16lin03/../../src/kmp_runtime.cpp:7515 (at 0x00002aaaac5d4273)
#9 __kmp_invoke_microtask () from /cosma/local/intel/Parallel_Studio_XE_2020/compilers_and_libraries_2020.2.254/linux/compiler/lib/intel64_lin/libiomp5.so (at 0x00002aaaac6503f3)
#8 L__Z12SearchSubSubR7OptionsxRSt6vectorIN5NBody8ParticleESaIS3_EERPxRxS9_P8PropData_3035__par_region0_2_130 () at /cosma7/data/dp004/jch/Swift/edo/VELOCIraptor-STF/src/search.cxx:3065 (at 0x00000000005bd61d)
#7 PreCalcSearchSubSet (opt=..., subnumingroup=42392, subPart=@0x2aab38c002c0: 0x2aab48100ba8, sublevel=1) at /cosma7/data/dp004/jch/Swift/edo/VELOCIraptor-STF/src/search.cxx:2770 (at 0x00000000005c6c87)
#6 GetOutliersValues (opt=..., nbodies=42392, Part=0x2aab48100ba8, sublevel=1) at /cosma7/data/dp004/jch/Swift/edo/VELOCIraptor-STF/src/localbgcomp.cxx:385 (at 0x00000000005824d9)
#5 DetermineDenVRatioDistribution (opt=..., nbodies=42392, Part=0x2aab48100ba8, meanr=@0x2aab38bfe5e0: -nan(0x8000000000000), sdlow=@0x2aab38bfe5e8: -nan(0x8000000000000), sdhigh=@0x2aab38bfe5f0: -nan(0x8000000000000), sublevel=1) at /cosma7/data/dp004/jch/Swift/edo/VELOCIraptor-STF/src/localbgcomp.cxx:242 (at 0x000000000057f710)
#4 std::vector<double, std::allocator<double> >::resize (this=0x2aab38bfdc90, __new_size=9223372036854775808) at /cosma/local/gcc/7.3.0/include/c++/7.3.0/bits/stl_vector.h:692 (at 0x00000000004e6a1b)
#3 std::vector<double, std::allocator<double> >::_M_default_append (this=0x2aab38bfdc90, __n=9223372036854775740) at /cosma/local/gcc/7.3.0/include/c++/7.3.0/bits/vector.tcc:569 (at 0x00000000004e6bea)
#2 std::vector<double, std::allocator<double> >::_M_check_len (this=0x2aab38bfdc90, __n=9223372036854775740, __s=0x6c4760 "vector::_M_default_append") at /cosma/local/gcc/7.3.0/include/c++/7.3.0/bits/stl_vector.h:1500 (at 0x00000000004e6e7f)
#1 std::__throw_length_error (__s=0x6c4760 "vector::_M_default_append") at /cosma/local/software/compiler/gcc-7.3.0/build/x86_64-pc-linux-gnu/libstdc++-v3/src/c++11/../../../../../libstdc++-v3/src/c++11/functexcept.cc:78 (at 0x00002aaaac25589f)
#0 __cxxabiv1::__cxa_throw (obj=obj@entry=0x2aab484e2760, tinfo=0x2aaaac512a60, dest=0x2aaaac2423b0 <std::length_error::~length_error()>) at /cosma/local/software/compiler/gcc-7.3.0/build/x86_64-pc-linux-gnu/libstdc++-v3/libsupc++/../../../../libstdc++-v3/libsupc++/eh_throw.cc:75 (at 0x00002aaaac22d220)

This is on 16 threads and no MPI. With 1 thread it doesn't crash. The VR cmake config is

  cmake . -DVR_USE_HYDRO=OFF \
    -DVR_USE_SWIFT_INTERFACE=OFF \
    -DCMAKE_CXX_FLAGS="-fPIC" \
    -DCMAKE_BUILD_TYPE=Debug \
    -DVR_ZOOM_SIM=ON \
    -DVR_MPI=OFF

and the input file is vr_config_zoom_dmo.cfg.gz.


jchelly commented Oct 22, 2020

At localbgcomp.cxx:230

        rmin=(meanr-sl*sdlow);
        rmax=(meanr+sl*sdhigh);
        int npeak=0;
        for (i=0;i<nbodies;i++) if (Part[i].GetPotential()>=rmin&&Part[i].GetPotential()<rmax) npeak++;
        //once have initial estimates of variance bin using Scott's formula
        //deltar=3.5*sdlow/pow(nbodies,1./3.);
        deltar=3.5*sqrt(sdlow*sdlow+sdhigh*sdhigh)/pow(npeak,1./3.);
        //nbins=ceil((rmax-rmin)/deltar+1);
        nbins=round((rmax-rmin)/deltar+MINBIN);
        //recalculate deltar
        deltar = (rmax-rmin)/(double)nbins;
        W=GMatrix(nbins,nbins);
        rbin.resize(nbins);

we get the crash because nbins in rbin.resize(nbins) has a huge negative value. rmin, rmax, meanr, sdhigh and sdlow are all NaN.
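
For illustration, here is a standalone sketch (not VR code; MINBIN here is a placeholder value) of how NaN inputs end up as a huge negative nbins. Converting a NaN double to an integer is formally undefined behaviour, but on x86 it typically yields the most negative representable value, which is consistent with the __new_size of roughly 2^63 in the backtrace:

#include <cmath>
#include <iostream>
#include <limits>

int main() {
    // rmin, rmax and deltar are all NaN, as observed in the debugger.
    const double qnan = std::numeric_limits<double>::quiet_NaN();
    double rmin = qnan, rmax = qnan, deltar = qnan;
    const int MINBIN = 10;                                     // placeholder, not the real constant

    double raw = std::round((rmax - rmin) / deltar + MINBIN);  // NaN propagates through
    long long nbins = static_cast<long long>(raw);             // undefined; typically LLONG_MIN on x86
    std::cout << "raw=" << raw << " nbins=" << nbins << std::endl;
    return 0;
}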


rtobar commented Oct 22, 2020

Thanks @jchelly for the backtrace, that was indeed the direction I was aiming at with my last comment too. The fact that meanr is nan means that before that loop rmin and/or deltar were nan, which ultimately can only happen if one of the particles has a nan potential:

rmin=rmax=Part[0].GetPotential();
#ifdef USEOPENMP
#pragma omp parallel for default(shared) \
private(i,tid) schedule(static) \
reduction(min:rmin) reduction(max:rmax) if (nbodies > ompperiodnum)
#endif
for (i=1;i<nbodies;i++) {
if (rmin>Part[i].GetPotential())rmin=Part[i].GetPotential();
if (rmax<Part[i].GetPotential())rmax=Part[i].GetPotential();
}

These potentials seem to be set just before DetermineDenVRatioDistribution is called, in GetDenVRatio. There is an OpenMP parallel for there setting those potentials, but there are a few things that could contribute to a nan.

Part[i].SetPotential(log(tempdenv)-log(norm)-fbg);

And within the same loop:

tempdenv=Part[i].GetDensity()/opt.Nsearch;

This would push the problem even further back. Sadly, the problem is not obvious, and without a proper debugging session it will be difficult to investigate. If I could get hold of the data, or access to cosma, I could give it a try though. Can either of those be arranged?

@edoaltamura (Author)

Having access to cosma would probably be optimal, since it would perfectly replicate the issue. The snap file isn't very large (< 1 GB) though, so I'm happy to send it to you via WeTransfer or a similar service.


jchelly commented Oct 22, 2020

@rtobar - you can get a copy of the snapshot here: http://icc.dur.ac.uk/~jch/L0300N0564_VR93_0199.hdf5

There are instructions to sign up for a Cosma account at https://www.dur.ac.uk/icc/cosma/support/account/ but someone here needs to approve adding you to a project. I'll ask about that.


jchelly commented Oct 22, 2020

@rtobar - if you apply for an account through DiRAC SAFE (see the link above) it should get approved now. You'll need to create an account on the SAFE web site and then use that to request a login account on Cosma as part of the dp004/virgo project.


rtobar commented Oct 23, 2020

@jchelly Thanks for the instructions. I got an account created and set up, but there is still one missing step I need to overcome to be able to apply for cosma access (the UK Access Management Federation system does not currently play well with the University of Western Australia's authentication system). I'll try to get that solved, though I suspect it will take a while. In the meantime I'll be using the files you uploaded to try to reproduce the error locally.


rtobar commented Oct 23, 2020

With the files locally I have now been able to reproduce the error and dig a bit more into what's causing it.

Indeed, nbins gets out of control because a variety of other quantities go out of range. In particular, at the beginning of the function, when calculating rmin and rmax, rmin goes to -inf:

Thread 8 "stf" hit Breakpoint 2, DetermineDenVRatioDistribution (opt=..., nbodies=345410, Part=0x7fff76060018, meanr=@0x7fff8e5b6698: 0, sdlow=@0x7fff8e5b66a0: 6.2746337021838311e-322, sdhigh=@0x7fff8e5b66a8: 108905792, sublevel=1) at /home/rtobar/scm/git/VELOCIraptor-STF/src/localbgcomp.cxx:166
166         deltar=(4.0*fabs(rmin))/(Double_t)nbins;
(gdb) p rmin
$1 = -inf
(gdb) p rmax
$2 = 50.755579976876106

This is because Part[1]'s potential is -inf:

(gdb) p Part[0].GetPotential()
$3 = 0.82819776835306769
(gdb) p Part[1].GetPotential()
$4 = -inf

The potentials are set in GetDenVRatio. When stepping through this for i == 1 I see:

Thread 8 "stf" hit Breakpoint 2, GetDenVRatio () at /home/rtobar/scm/git/VELOCIraptor-STF/src/localbgcomp.cxx:89
89              tempdenv=Part[i].GetDensity()/opt.Nsearch;
(gdb) p i
$6 = 1
(gdb) p Part[i].GetDensity()
$7 = 0
(gdb) p tempdenv
$8 = 0
...
121             Part[i].SetPotential(log(tempdenv)-log(norm)-fbg);
(gdb) p fbg
$36 = -18.780740249149968
(gdb) p norm
$37 = 0.063493635934240969
(gdb) p log(tempdenv)
$40 = -inf
(gdb) n
(gdb) p Part[1].GetPotential()
$41 = -inf

So ultimately this is caused by particles having density = 0. All of the above is with 4 threads.
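
Put together, here is a standalone sketch of the chain (not VR code; Nsearch is a placeholder value, while norm, fbg and the potentials are taken from the gdb session above). A single zero density gives a -inf potential, which then drives rmin to -inf, and expressions mixing the resulting infinities are where the NaNs seen earlier come from:

#include <algorithm>
#include <cmath>
#include <iostream>

int main() {
    double density = 0.0;                       // as seen for Part[1] above
    double Nsearch = 32.0;                      // placeholder for opt.Nsearch
    double tempdenv = density / Nsearch;        // 0
    double norm = 0.063493635934240969;         // from the gdb output
    double fbg = -18.780740249149968;           // from the gdb output
    double potential = std::log(tempdenv) - std::log(norm) - fbg;  // log(0) = -inf

    double rmin = std::min(0.82819776835306769, potential);        // rmin becomes -inf
    double rmax = 50.755579976876106;
    std::cout << "potential=" << potential                         // -inf
              << " rmin=" << rmin                                  // -inf
              << " rmax-rmin=" << (rmax - rmin)                    // +inf
              << " inf-inf=" << ((rmax - rmin) - (rmax - rmin))    // nan
              << std::endl;
    return 0;
}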

I tried the same with 1 thread, and the density is correctly calculated (or at least has a value):

Breakpoint 1, GetDenVRatio () at /home/rtobar/scm/git/VELOCIraptor-STF/src/localbgcomp.cxx:89
89              tempdenv=Part[i].GetDensity()/opt.Nsearch;
(gdb) p Part[i].GetDensity()
$4 = 7.33894316062687e-09
(gdb) p i
$5 = 1
(gdb) 

These densities seem to be set via SetDensity at:

Part[j].SetDensity(tree->CalcSmoothLocalValue(opt.Nvel, pqv, weight));

This is where I'm concentrating now; I'll post more updates when I learn more.

@edoaltamura (Author)

As an update, the exact same issue now also appears in the hydro version of the code (and still does in dark-matter-only). Could it depend on the Intel 2020 compiler causing some unexpected behavior?

@MatthieuSchaller

I doubt the compiler is the issue. If anything it would be the compiler revealing incorrect behaviour in the VR code. But you can try with 2018 or gcc if you want to confirm this.


rtobar commented Nov 12, 2020

I don't think it's the compiler either. It also doesn't seem to be a recent bug in VR; as commented on slack earlier today, I was able to reproduce this bug with the same input data using VR versions from 8 weeks and 3 months ago (compiled with gcc). If this bug wasn't found earlier, it is probably because the inputs didn't trigger it.


rtobar commented Nov 16, 2020

@edoaltamura mentioned this on slack:

Compiled from master with intel 2018: c1283dd (this works)
Compiled from hotgas with intel 2020: b58e5ac (breaks as in #37)
Compiled from hotgas with intel 2018: b58e5ac (breaks in a slightly different way)

In my previous comment I mentioned how a 3-month-old version didn't work for me, but I was wrong -- I tried again and it did work, so we are aligned on what we see. I'm bisecting through the history now to find the culprit commit, which will hopefully make things easier to fix. This is still the case: I still get the error on c1283dd locally when running with the input data and the configuration file provided by John.


rtobar commented Nov 16, 2020

Update: while it's not clear what exactly is wrong, the problem seems to stem from the difference in how the initial trees are built in the parallel and serial cases. And again, while not the solution, an apparently "good enough" workaround is to force the code to build the trees serially, even when OpenMP support is enabled and there are multiple cores to use. This can be done by setting OMP_run_fof=0 in the configuration file, as shown below.
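
For reference, that amounts to adding the following line to the VR .cfg file (the # comment style follows the sample configuration files):

# Workaround for this issue: build the initial FOF trees serially
OMP_run_fof=0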

rtobar added a commit that referenced this issue Nov 18, 2020
This is but a small set of unused variables visible when compiling the
code with -Wall -Wextra.

I have chosen to fix this file and functions simply to clean up the path
to further debugging the problem described in #37, which stems from the
assignment of zeros to particle densities, which in turn happens in
GetVelocityDensityApproximative.

Signed-off-by: Rodrigo Tobar <[email protected]>
@edoaltamura (Author)

I can confirm that introducing OMP_run_fof=0 allows the code to run smoothly. It would be interesting to find out the precise cause of this behavior. In the meantime, would it be possible to rebase the hot_gas_properties branch with the latest updates in master? That way I would be able to test Claudia's implementation with SO properties further. Thanks again for looking into this.


rtobar commented Nov 26, 2020

@edoaltamura thanks for confirming that the workaround does indeed work for you too. There is still work needed to identify the real underlying cause, which is why this issue will remain open until it is found.

Regarding hot_gas_properties, I just merged the latest master onto it, while @cdplagos has also made a few further changes. Let's try to keep that discussion separate though (maybe in slack?) to avoid crossing wires here.


rtobar commented Dec 6, 2020

As a way to constrain this problem a bit further, I've recently added to the master branch a more explicit error about the condition that triggers this issue ("Particle density not positive, cannot continue"), which halts the program earlier than before and with a less cryptic message. This is the error message found by @JBorrow in #53. After confirmation we can close #53 as a duplicate of this issue.

In addition, two things have come to my attention these last days:

  • The issue seems to happen only when the code is compiled with VR_MPI=OFF. I hadn't realised this earlier, but I will double-check to be absolutely certain this is the case.
  • The issue also seems not to occur with VR_STRUCTURE_DEN=OFF. I will again double-check this.

With these two points in mind, my attention is now gravitating to localfields.cxx's GetVelocityDensityApproximative as the potential core of the issue. Again, I'll double-check that these two findings are correct before continuing.

@MatthieuSchaller

Would it help to use GCC's thread sanitizer and/or address sanitizer? These are pretty good at finding mistakes in OMP sections.
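
For reference, that would look roughly like the following on top of the existing cmake configuration (these are standard GCC/Clang sanitizer flags, nothing VR-specific, and the two sanitizers cannot be combined in a single build):

# ThreadSanitizer build, for data races in the OpenMP sections
cmake . -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="-fsanitize=thread -g"

# AddressSanitizer build, for out-of-bounds accesses and use-after-free
cmake . -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="-fsanitize=address -g"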


rtobar commented Dec 22, 2020

Thanks to @pelahi we now have a fix that addresses the issue. The problem was apparently introduced by 8b0cf42, but a fix for this is now present on the issue-37 branch in 08595f8. The fix indeed makes the new "Particle density not positive, cannot continue" error disappear, confirming not only that the original problem is gone, but also that the additional diagnostics added after the analysis of the problem are not triggering errors either.

@edoaltamura @MatthieuSchaller when you have a chance please double-check on your end that the fix is working as intended. After confirmation I'll merge this to the master branch.


pelahi commented Dec 22, 2020

My apologies for introducing this bug!

@MatthieuSchaller

Is that still an issue or are we confident it's now all working? I can't see the branch in the PR list or in the merged list.


rtobar commented Feb 3, 2021

I haven't created a PR yet for these changes, as I was waiting for confirmation on your end (@edoaltamura @MatthieuSchaller) that this wasn't an issue anymore. As per my previous comment, the fix is in the issue-37 branch, and my local tests at least showed that the original problem is gone.

rtobar added the bug label Feb 16, 2021
@edoaltamura (Author)

Hi @rtobar, there have been some more commits since the last time I checked... I'm going to check the issue-37 branch later this week using both dmo and hydro zooms and give you an update. Thanks for your patience!


rtobar commented Feb 16, 2021

Thanks @edoaltamura, that would be great :). If your tests come back with positive results then I'll merge the branch back to the master.

@edoaltamura (Author)

I'm still detecting some errors when compiling without MPI (I am compiling the up-to-date source in the issue-37 branch).

The hydro version on zooms exits with this message:

[ 199.534] [trace] bgfield.cxx:54 Grid system using leaf nodes with maximum size of 176
[ 199.534] [trace] bgfield.cxx:60 Building physical Tree using minimum shannon entropy as splitting criterion
terminate called after throwing an instance of 'std::runtime_error'
[ 199.534] [trace] search.cxx:1781 Number of cores: 2
[ 199.534] [trace] bgfield.cxx:140 Done
  what():  Particle density not positive, cannot continue[ 199.534] [trace] bgfield.cxx:155 Calculating Grid Mean Velocity
[[ 199.534] [trace] bgfield.cxx:173 199. 534   Done
[[ 199.534] [trace] bgfield.cxx:173 199. 534   Done
[ 199.534] [trace] bgfield.cxx:184 Calculating Grid Velocity Dispersion
[ 199.534] [trace[] bgfield.cxx 199.[534] [trace] :search.cxx:199901965.  Filling KD-Tree Grid
Aborted

which seems related to issue #53. From monitoring the stdout, the OpenMP part seems to launch correctly, but I'm not sure what happens once it gets to the parallel FOF part, since it breaks on non-positive densities.

A similar setup for dark-matter-only instead returns

[   0.437] [ info] hdfio.cxx:941 Expecting:
[   0.437] [ info] hdfio.cxx:943   Header
[   0.437] [ info] hdfio.cxx:943   PartType0
[   0.437] [ info] hdfio.cxx:943   PartType1
[   0.437] [ info] hdfio.cxx:943   PartType2
[   0.437] [ info] hdfio.cxx:943   PartType3
[   0.437] [ info] hdfio.cxx:943   PartType4
[   0.437] [ info] hdfio.cxx:943   PartType5
[   0.438] [ info] hdfio.cxx:1109 File contains 2312540 particles and is at time 1.000
[   0.438] [ info] hdfio.cxx:1110 Particle system contains 1922386 particles and is at time 1.000 in a box of size 300.000
[   0.438] [ info] hdfio.cxx:1133 Reading file 0
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5G.c line 463 in H5Gopen2(): unable to open group
    major: Symbol table
    minor: Can't open object
  #001: H5Gint.c line 281 in H5G__open_name(): group not found
    major: Symbol table
    minor: Object not found
  #002: H5Gloc.c line 422 in H5G_loc_find(): can't find object
    major: Symbol table
    minor: Object not found
  #003: H5Gtraverse.c line 851 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #004: H5Gtraverse.c line 627 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #005: H5Gloc.c line 378 in H5G__loc_find_cb(): object 'PartType0' doesn't exist
    major: Symbol table
    minor: Object not found
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 288 in H5Dopen2(): not a location
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5Gloc.c line 246 in H5G_loc(): invalid object ID
    major: Invalid arguments to routine
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 378 in H5Dget_space(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 288 in H5Dopen2(): not a location
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5Gloc.c line 246 in H5G_loc(): invalid object ID
    major: Invalid arguments to routine
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 378 in H5Dget_space(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 288 in H5Dopen2(): not a location
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5Gloc.c line 246 in H5G_loc(): invalid object ID
    major: Invalid arguments to routine
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 378 in H5Dget_space(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 288 in H5Dopen2(): not a location
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5Gloc.c line 246 in H5G_loc(): invalid object ID
    major: Invalid arguments to routine
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: H5D.c line 378 in H5Dget_space(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
[   2.505] [ info] io.cxx:127 Done loading input data

which could point to a mismatch in the I/O.

Please let me know if there are other useful tests I could do to help investigate these. Has the issue-37 branch been rebased with the latest improvements to master?


rtobar commented Feb 19, 2021

@edoaltamura thanks a lot for the further testing. I just merged the latest master branch into issue-37 to make sure the latter contains all the latest fixes and improvements, although I wouldn't expect it to change the outcome of your experiments.

Can I ask you to point me to the input configuration and files you were using for those last two experiments? I just tried the code against the original input files used to report this issue (i.e., snapshots/L0300N0564_VR93_0199 and config/vr_config_zoom_dmo.cfg under /cosma/home/dp004/dc-alta2/data7/xl-zooms/dmo/L0300N0564_VR93) and that ran to completion without errors. It would also be great if you could share the compilation options you used for these experiments; I'm using what seems to be representative of the original problem (-DVR_USE_HYDRO=OFF -DVR_USE_SWIFT_INTERFACE=OFF -DCMAKE_BUILD_TYPE=Debug -DVR_ZOOM_SIM=ON -DVR_MPI=OFF), which is also what John described in #37 (comment).

The hydro version on zooms exits with this message [...]

This issue was originally reported for DMO runs, while you are reporting a failure for a hydro run. This might or might not be connected to the same underlying issue, but to keep the problem-solving focused I'd suggest we stick with DMO runs only. #53 is indeed a problem with hydro runs, although without zoom, so again we can't really tell.

A similar setup for dark-matter-only instead returns [...] which could be a mismatch in the i/o.

Yes, it looks like that: the log says that object 'PartType0' doesn't exist, but it is required. Again, let's try to untangle this bit by bit. If you think this is a genuine problem, please open a separate issue to keep track of it.

@edoaltamura (Author)

I have run one more test replicating your cmake configuration and using the same input snapshot and cfg file, and I am getting a segmentation fault during the substructure search.

Before segfaulting, the stdout printed

[  75.173] [ info] search.cxx:155 Finished OpenMP local FOF search of 8 containing total of 878268 groups in 49.452 [s]

which may suggest that the OpenMP FOF search is completing without errors, or that errors from that section only get caught during the substructure search. This is with the new master updates merged into the issue-37 branch.

I have tried two combinations of modules on cosma7:

module load cmake
module load intel_comp/2018
module load intel_mpi/2018
module load parmetis
module load parallel_hdf5
module load fftw
module load gsl

and

module load cmake/3.18.1
module load intel_comp/2020-update2
module load intel_mpi/2020-update2
module load ucx/1.8.1
module load parmetis/4.0.3-64bit
module load parallel_hdf5/1.10.6
module load fftw/3.3.8cosma7
module load gsl/2.5

but they both gave the same result.


rtobar commented Feb 24, 2021

@edoaltamura I'm really puzzled by the fact that you got a segfault and I didn't for the same code and inputs. I would really like to clarify this; otherwise we are shooting in the dark.

I went ahead and built the latest issue-37 branch from scratch with the given cmake configuration under Debug mode, then ran it against the original input/config files. I then rebuilt in Release mode and ran it again. I repeated these steps for the two sets of modules you outlined above. However, in all these tries I couldn't replicate the segfault you reported.

I captured all the commands and outputs in the following files: edo-modules1.txt and edo-modules2.txt (in each you will find two compilations and executions, for Debug and Release mode).

Could you please have a look at these logs and see what is different between our runs? I did notice that in your segfault report the log says there are 878268 groups, while in all my runs I saw 1054896 groups reported. This makes me think you are using a different set of inputs or configuration. If that's the case then we can open a new issue to track that down, with the understanding that the original issue reported here is gone.

@edoaltamura (Author)

Hi @rtobar, I've tried your configuration and I can reproduce your outputs - I can confirm that the original issue is solved. I'm not sure why it didn't work with my previous inputs - I will double-check later. Thanks for your help and patience!


rtobar commented Feb 24, 2021

Thanks @edoaltamura, that's great news :). And thanks for your patience as well!

I will then merge issue-37 back to the master and declare this issue as fixed. But by all means, if you get new crashes or error reports, please open new issues; my main concern was not to mix different problems and instead treat them separately.


rtobar commented Feb 24, 2021

Fix merged to the master branch, closing this issue now.

rtobar closed this as completed Feb 24, 2021