DMO breaks on substructure search #37
@edoaltamura do you happen to have a backtrace of the crash? I don't have access to cosma, and therefore cannot easily identify the problem with only the information given above. The message […] The only place where […] is VELOCIraptor-STF/src/localbgcomp.cxx, line 150 in cb88fb3. This seems innocuous. The other is VELOCIraptor-STF/src/localbgcomp.cxx, lines 232 to 238 in cb88fb3. Here there is a possibility for […]. Now, all this theory makes sense only if the number of particles you are dealing with is greater than […].
@edoaltamura the logfile was useful to double-check that the error is indeed most likely happening within […]. Another guess though: maybe […].
@edoaltamura - you can get a backtrace if you configure velociraptor with -DCMAKE_BUILD_TYPE=Debug and run it in DDT, e.g. […]
DDT might mistakenly think you're using MPI because velociraptor is linked against parallel hdf5, so you'll probably have to make sure the MPI box is not checked when it starts up and ignore it when it complains you're running an MPI program.
Thanks @jchelly and @rtobar. I don't have much experience with DDT, but from what I can see one of the issues is with the […]
I can reproduce this. Here's the stack trace when it throws the length error exception:
This is on 16 threads and no MPI. With 1 thread it doesn't crash. The VR cmake config is
and the input file is vr_config_zoom_dmo.cfg.gz.
At localbgcomp.cxx:230 we get the crash because nbins in rbin.resize(nbins) has a huge negative value. rmin, rmax, meannr, sdhigh, sdlow are all NaN.
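As a minimal illustration of that failure mode (a sketch of the mechanism only, not VR's actual code): once the bin count ends up as a large negative int, the implicit conversion to the unsigned size_t expected by std::vector::resize turns it into an enormous request, and resize throws the length_error seen in the stack trace above.

```cpp
// Hedged sketch, not VR's code: how a garbage negative bin count becomes
// a std::length_error once it reaches std::vector::resize.
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    int nbins = -2147483633;    // stand-in for the "huge negative value" observed in rbin.resize(nbins)
    std::vector<double> rbin;
    try {
        rbin.resize(nbins);     // implicit int -> size_t conversion requests ~1.8e19 elements
    } catch (const std::length_error &e) {
        std::cerr << "length_error: " << e.what() << '\n';
    }
    return 0;
}
```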
Thanks @jchelly for the backtrace, that was indeed the direction I was aiming at too with my last comment. The fact […] (see VELOCIraptor-STF/src/localbgcomp.cxx, lines 153 to 162 in cb88fb3).
These potentials seem to be set just before VELOCIraptor-STF/src/localbgcomp.cxx, line 121 in cb88fb3. And within the same loop: VELOCIraptor-STF/src/localbgcomp.cxx, line 89 in cb88fb3, which would push the problem further back.
Sadly, the problem is not obvious, and without a proper debugging session it will be difficult to investigate. If I could get hold of the data, or get access to cosma, I could give it a try though. Can either of those be arranged?
Having access to cosma would probably be optimal, since it would perfectly replicate the issue. The snap file isn't very large (< 1 GB) though, so I'm happy to send it to you via WeTransfer or a similar service.
@rtobar - you can get a copy of the snapshot here: http://icc.dur.ac.uk/~jch/L0300N0564_VR93_0199.hdf5
There are instructions to sign up for a Cosma account at https://www.dur.ac.uk/icc/cosma/support/account/ but someone here needs to approve adding you to a project. I'll ask about that.
@rtobar - if you apply for an account through DiRAC SAFE (see the link above) it should get approved now. You'll need to create an account on the SAFE web site and then use that to request a login account on Cosma as part of the dp004/virgo project.
@jchelly Thanks for the instructions, I got an account created and set up, but there is still one missing step I need to overcome to be able to apply for access to cosma (the UK Access Management Federation system does not currently play well with the University of Western Australia's authentication system). I'll try to get that solved, though I suspect it will take a while. In the meantime I'll be using the files you uploaded to try to reproduce the error locally.
With the files locally I was now able to reproduce the error and dig a bit more into what's causing it. Indeed:

Thread 8 "stf" hit Breakpoint 2, DetermineDenVRatioDistribution (opt=..., nbodies=345410, Part=0x7fff76060018, meanr=@0x7fff8e5b6698: 0, sdlow=@0x7fff8e5b66a0: 6.2746337021838311e-322, sdhigh=@0x7fff8e5b66a8: 108905792, sublevel=1) at /home/rtobar/scm/git/VELOCIraptor-STF/src/localbgcomp.cxx:166
166 deltar=(4.0*fabs(rmin))/(Double_t)nbins;
(gdb) p rmin
$1 = -inf
(gdb) p rmax
$2 = 50.755579976876106

This is because:

(gdb) p Part[0].GetPotential()
$3 = 0.82819776835306769
(gdb) p Part[1].GetPotential()
$4 = -inf

The potentials are set in […]. So ultimately this is caused by particles having density = 0. All of the above is with 4 threads. I tried the same with 1 thread, and the density is correctly calculated (or at least has a value):

Breakpoint 1, GetDenVRatio () at /home/rtobar/scm/git/VELOCIraptor-STF/src/localbgcomp.cxx:89
89 tempdenv=Part[i].GetDensity()/opt.Nsearch;
(gdb) p Part[i].GetDensity()
$4 = 7.33894316062687e-09
(gdb) p i
$5 = 1
(gdb)

These densities seem to be set via VELOCIraptor-STF/src/localfield.cxx, line 918 in 7d185b0. This is where I'm concentrating now; I'll post more updates when I learn more.
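To make the chain of events above concrete, here is a hedged sketch (not VR's actual code; it only assumes, for illustration, that a logarithm of the density enters these statistics at some point): a single particle with density = 0 yields -inf, which then dominates rmin and poisons the derived bin width.

```cpp
// Hedged illustration of the arithmetic suggested by the gdb session above
// (not VR's code): density = 0 -> log gives -inf -> rmin becomes -inf ->
// the bin width and anything derived from it become inf/NaN.
#include <algorithm>
#include <cmath>
#include <iostream>

int main() {
    double densities[] = {7.3e-09, 0.0, 5.1e-08};    // one particle ended up with density = 0
    double rmin = INFINITY, rmax = -INFINITY;
    for (double d : densities) {
        double r = std::log(d);                      // std::log(0.0) == -inf
        rmin = std::min(rmin, r);
        rmax = std::max(rmax, r);
    }
    const int nbins = 100;
    double deltar = (4.0 * std::fabs(rmin)) / nbins; // mirrors localbgcomp.cxx:166; fabs(-inf) == inf
    std::cout << "rmin=" << rmin << " rmax=" << rmax
              << " deltar=" << deltar
              << " rmin+deltar=" << (rmin + deltar) << '\n'; // -inf + inf == NaN
    return 0;
}
```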
As an update, the exact same issue now also appears in the hydro version of the code (and still does in the dark-matter-only version). Could it depend on the Intel 2020 compiler causing some unexpected behavior?
I doubt the compiler is the issue. If anything it would be the compiler revealing incorrect behaviour in the VR code. But you can try with 2018 or gcc if you want to confirm this.
I don't think it's the compiler either. It also doesn't seem to be a recent bug in VR; as commented on slack earlier today, I was able to reproduce this bug with the same input data using VR versions from 8 weeks and 3 months ago (compiled with […]).
@edoaltamura mentioned this on slack:
In my previous comment I mentioned how a 3-month-old version didn't work for me.
Update: while it's not clear what exactly is wrong, the problem seems to stem from the difference in how the initial trees are built in the parallel and serial cases. And again, while not the solution, an apparently "good enough" workaround is to force the code to build the trees serially, even when OpenMP support is enabled and there are multiple cores to use. This can be done by setting […]
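Purely as an illustration of what such a workaround amounts to (the option name below is made up for this sketch and is not an actual VR parameter), the idea is a runtime switch that falls back to a single-threaded tree build even when the binary was compiled with OpenMP:

```cpp
// Illustrative sketch only; "force_serial_tree_build" is a hypothetical option name,
// not VR's actual configuration parameter.
#include <omp.h>

struct Options {
    bool force_serial_tree_build = true;   // the workaround: always build the initial trees serially
};

void BuildTreeForDomain(int /*domain*/) { /* real tree construction would happen here */ }

void BuildInitialTrees(const Options &opt, int ndomains) {
    // Even when OpenMP is enabled at compile time, honour the runtime request
    // to build the trees with a single thread.
    const int nthreads = opt.force_serial_tree_build ? 1 : omp_get_max_threads();
    #pragma omp parallel for num_threads(nthreads)
    for (int i = 0; i < ndomains; i++) {
        BuildTreeForDomain(i);
    }
}

int main() { BuildInitialTrees(Options{}, 8); }
```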
This is but a small set of unused variables visible when compiling the code with -Wall -Wextra. I have chosen to fix this file and functions simply to clean up the path to further debugging the problem described in #37, which stems from the assignment of zeros to particle densities, which in turn happens in GetVelocityDensityApproximative. Signed-off-by: Rodrigo Tobar <[email protected]>
I can confirm that introducing […]
@edoaltamura thanks for confirming that the workaround does indeed work for you too. Indeed there is still work needed to identify the real underlying cause of this, which is why this issue will remain open until the real cause is found. Regarding […]
As a way to constrain this problem a bit further, I've recently added on the […]. In addition, two things have come to my attention over the last few days: […]
With these two points in mind, my attention is now gravitating to […]
Would it help to use GCC's thread sanitizer and/or address sanitizer? These are pretty good at finding mistakes in OMP sections.
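For reference, a minimal example (unrelated to VR's actual code) of the kind of OpenMP mistake the thread sanitizer catches; the build flags in the comment are the usual GCC ones:

```cpp
// Illustrative only: a classic OpenMP data race that the thread sanitizer reports.
// Build with something like: g++ -fopenmp -fsanitize=thread -g race.cpp && ./a.out
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> dens(1000, 1.0);
    double sum = 0.0;
    #pragma omp parallel for              // bug: missing reduction(+:sum)
    for (std::size_t i = 0; i < dens.size(); i++) {
        sum += dens[i];                   // unsynchronised read-modify-write across threads
    }
    std::cout << sum << '\n';
    return 0;
}
```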
Thanks to @pelahi we now have a fix that addresses the issue. The problem was apparently introduced by 8b0cf42, but a fix for this is now present on the […]
@edoaltamura @MatthieuSchaller when you have a chance please double-check on your end that the fix is working as intended. After confirmation I'll merge this to the […]
My apologies for introducing this bug!
Is that still an issue or are we confident it's now all working? I can't see the branch in the PR list or in the merged list.
I haven't created a PR yet for these changes, as I was waiting for confirmation on your end (@edoaltamura @MatthieuSchaller) that this wasn't an issue anymore. As per my previous comment, the fix is in the […]
Hi @rtobar, there have been some more commits since last time I checked... I'm going to check the […]
Thanks @edoaltamura, that would be great :). If your tests come back with positive results then I'll merge the branch back to the […]
I'm still detecting some errors when compiling without MPI (I am compiling the up-to-date source in the […]). The hydro version on zooms exits with this message: […]
which seems related to issue #53. From monitoring the stdout, the OpenMP part seems to launch correctly, but I'm not sure what happens once it gets to the parallel FOF part, since it breaks on negative densities. A similar setup for dark-matter-only instead returns […]
which could be a mismatch in the i/o. Please let me know if there are other useful tests I could do to help investigate these. Has the […]
@edoaltamura thanks a lot for the further testing. I just merged the latest […] Can I ask you to point me to the input configuration and files you were using for those last two experiments? I just tried the code against the original input files used to report this issue (i.e., […])
This issue was originally reported for DMO runs, while you are reporting a failure for a hydro run. This might or might not be connected to the same underlying issue, but to keep the problem solving focused I'd suggest we stick with DMO runs only. #53 is indeed a problem with hydro runs, although without zoom, so again we can't really tell.
Yes, it looks like that; the log says that […]
I have run one more test replicating your […]. Before segfaulting, the stdout printed: […]
which may suggest that the FOF search on OpenMP is completing without errors, or that errors from that section only get caught during the substructure search. This is with the new master updates merged into the […]
I have tried 2 combinations of modules on cosma7: […]
and […]
but they both gave the same result.
@edoaltamura I'm really puzzled by the fact that you got a segfault and I didn't for the same code and inputs. I would really like to clarify this, otherwise we are shooting in the dark. I went ahead and built the latest […]
I captured all the commands and outputs in the following files: edo-modules1.txt edo-modules2.txt (so you will find two compilations and executions in each, for Debug and Release mode). Could you please have a look at these logs and see what is different between our runs?
I did notice that in your segfault report the log says there are 878268 groups, while in all my runs I saw 1054896 groups reported. This makes me think you are using a different set of inputs or configuration. If that's the case then we can open a new issue to track that down, with the understanding that the original issue reported here is gone.
Hi @rtobar, I've tried your configuration and I can reproduce your outputs - can confirm that the original issue is solved. I'm not sure why that didn't work with my previous inputs - I will double-check later. Thanks for your help and patience!
Thanks @edoaltamura, that's great news :). And thanks for your patience as well! I will then merge […]
Fix merged to the […]
Describe the bug
I am trying to run stf stand-alone on a DMO snapshot, using the zoom configuration. The process fails with the error […] The same behavior also appears when running stf on the fly with SWIFT. Note: SWIFT itself completed the run successfully and the snapshots do not seem to be corrupted in any apparent way (checked for existing datasets and dataset shapes).

To Reproduce
Steps to reproduce the behavior: /cosma/home/dp004/dc-alta2/data7/xl-zooms/dmo/L0300N0564_VR93 on Cosma 7.

Expected behavior
Given the arguments parsed, expected to generate the usual output files in the pwd (e.g. the L0300N0564_VR93_0199.properties file).

Log files
Logs can be displayed to console, but they are also available in the $(pwd)/stf directory.

Environment (please complete the following information):