Memory usage blowing up in large DMO runs #118

Open
MatthieuSchaller opened this issue Nov 9, 2021 · 22 comments

@MatthieuSchaller

Describe the bug
Memory footprint blows up in (or just after) the calculation of the local fields

This is running on a 5760^3-particle DMO setup in a 3.2 Gpc box at z=4.
The z=5 output works fine.

I am running using 480 MPI ranks on 120 nodes. Each node offers 1TB of RAM.
The code advertises at the start needing 16.68TB.
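
For scale, the figures above can be turned into a rough per-node and per-rank budget. This is only a back-of-the-envelope sketch: it assumes the advertised total is spread evenly over ranks (which the domain decomposition does not guarantee) and uses 1 TB = 1000 GB.

```python
# Figures quoted in the report above.
advertised_tb = 16.68     # memory the code says it needs at start-up
nodes = 120
ranks = 480
ram_per_node_tb = 1.0

# Assumption: the advertised total is shared evenly between ranks.
total_ram_tb = nodes * ram_per_node_tb
per_node_gb = advertised_tb / nodes * 1000
per_rank_gb = advertised_tb / ranks * 1000
fraction_of_total = advertised_tb / total_ram_tb

print(f"total RAM available  : {total_ram_tb:.0f} TB")
print(f"advertised per node  : {per_node_gb:.0f} GB")
print(f"advertised per rank  : {per_rank_gb:.2f} GB")
print(f"fraction of total RAM: {fraction_of_total:.1%}")
```

Under that assumption the advertised requirement is well below what the allocation offers, so the bad_alloc points at memory that is not part of the start-up estimate.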

Crash:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

(Difficult to know where that comes from as exceptions are not caught anywhere.)

Compilation:

g++7.3 with Intel-MPI 2018

[0000] [   1.406] [ info] main.cxx:46 VELOCIraptor git version: 44c4960f73393927ec76d7cfc0a66ae94f257560
[0000] [   1.406] [ info] main.cxx:47 VELOCIraptor compiled with:   LONGINT;MPIREDUCEMEM;STRUCDEN;VR_LOG_SOURCE_LOCATION;USEMPI;USEHDF;USEHDFCOMPRESSION;LONGINT;HAVE_GSL22
[0000] [   1.406] [ info] main.cxx:48 VELOCIraptor was built on: Nov  8 2021 17:07:53

Command line:

mpirun -np 480 stf -C ./vrconfig_3dfof_subhalos_SO_hydro.cfg -i flamingo_0008 -o vr_catalogue_0008 -I 2 -s 1200 -Z 160

Config:
vrconfig_3dfof_subhalos_SO_hydro.txt

If it helps, run location on cosma: /cosma8/data/dp004/jlvc76/FLAMINGO/ScienceRuns/DMO/L3200N5760/VR/catalogue_0008
There are core dumps as well.

@MatthieuSchaller
Author

(as a side note, reading in the data (4.1TB) takes 4.6 hrs, which is less than 1% of the theoretical read speed of the system, and makes things quite hard to debug)
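
The read rate implied by those two numbers can be worked out directly. A small sketch, assuming decimal units (1 TB = 1e12 bytes) and that the 4.1 TB figure is the volume actually read:

```python
# Numbers from the comment above.
data_bytes = 4.1e12          # 4.1 TB read in
read_seconds = 4.6 * 3600    # 4.6 hours

rate_mb_s = data_bytes / read_seconds / 1e6

# "Less than 1% of the theoretical read speed" implies the system's
# theoretical rate is more than 100x the observed one.
implied_theoretical_gb_s = rate_mb_s * 100 / 1000

print(f"observed read rate: {rate_mb_s:.0f} MB/s aggregate")
print(f"implied theoretical rate: > {implied_theoretical_gb_s:.1f} GB/s")
```

That is roughly 250 MB/s summed over all 480 ranks.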

@MatthieuSchaller
Author

Any advice on things to tweak in the config or setup is welcome. I have already tried more nodes, fewer ranks per node and other similar things, but the memory always seems to blow up.

@MatthieuSchaller
Author

@stuartmcalpine and @bwvdnbro will be interested as well.

@rtobar

rtobar commented Nov 9, 2021

> (as a side note, reading in the data (4.1TB) takes 4.6 hrs, which is less than 1% of the theoretical read speed of the system, and makes things quite hard to debug)

This has been mentioned a couple of times so I decided to study a bit more what it could be. It seems the configuration value Input_chunk_size (set to 1e7 in your config) governs how much data is read each time from the input file, with float/double datasets reading 3x that many values each time (so for floats that's 1e7 * 3 * 4 = 120 MB chunks). I guess you should be able to easily take this up by at least 10x? Bringing it further up would cross the 2 GB limit, and I'm not sure if we'll run into issues there (as you might remember from #88 we had problems writing > 2 GB in parallel; I don't know what the situation is for reading...). Anyhow, worth giving it a try!
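
The chunk-size arithmetic can be tabulated for a few candidate values. This is a sketch only: `chunk_bytes` is a hypothetical helper, not a VELOCIraptor function, and the "2 GB limit" is taken here to mean 2^31 bytes.

```python
def chunk_bytes(input_chunk_size, components=3, bytes_per_value=4):
    """Bytes read per chunk for a 3-component dataset (4 = float, 8 = double)."""
    return int(input_chunk_size) * components * bytes_per_value

LIMIT = 2**31  # assumption: the 2 GB limit means 2 GiB

for n in (1e6, 1e7, 1e8):
    for width, name in ((4, "float"), (8, "double")):
        size = chunk_bytes(n, bytes_per_value=width)
        status = "over limit" if size >= LIMIT else "ok"
        print(f"Input_chunk_size={n:.0e} {name:6s}: {size / 1e6:8.0f} MB ({status})")

# Largest chunk size that stays under the limit for float data.
max_float_chunk = (LIMIT - 1) // (3 * 4)
print("largest float-safe Input_chunk_size:", max_float_chunk)
```

With these assumptions, 1e8 stays under the limit for float datasets but would exceed it for double datasets.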

@rtobar

rtobar commented Nov 9, 2021

@MatthieuSchaller I tried opening the core files but I don't have enough permissions to do so; e.g.:

$> ll /cosma8/data/dp004/jlvc76/FLAMINGO/ScienceRuns/DMO/L3200N5760/VR/catalogue_0008/core.222618
-rw------- 1 jlvc76 dphlss 32862208 Nov  9 05:03 /cosma8/data/dp004/jlvc76/FLAMINGO/ScienceRuns/DMO/L3200N5760/VR/catalogue_0008/core.222618
$> whoami
dc-toba1
$> groups
dp004 cosma6 clusterusers

@MatthieuSchaller
Author

Permissions fixed.

The read is not done using parallel hdf5 so I don't think the limit applies. I'll try raising that variable by 10x and see whether it helps.

@rtobar

rtobar commented Nov 10, 2021

Unfortunately the core files are truncated, so I couldn't even get a stacktrace out of them. This is somewhat expected: full core files would be on the order of 200 GB, but SLURM would have SIGKILL'd the processes after waiting for a bit while they were writing them.

Based on the log messages this memory blowup seems to be happening roughly in the same place where our latest fix for #53 is located -- that is, when densities are being computed for structures that span across MPI ranks. To be clear: I don't think the error happens because of the fix -- and if the logs are to be trusted, the memory spike comes even before that, while particles are being exchanged between ranks in order to perform the calculation. Some of these particle exchanges are based on the MPI*NNImport* functions (and the associated KDTree::SearchBallPos functions) we looked at recently in #73 (comment). As that comment said, some of the SearchBallPos functions return lists which might contain duplicate particle IDs, so it is possible that these lists are blowing up the memory.

While other avenues of investigation make sense too, it could be worth a shot to try this one out and see if by using a different data structure we actually solve the problem (or not).
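
To illustrate why duplicates matter, here is a toy sketch of the effect. It does not use the real KDTree::SearchBallPos code: `search_ball` is a brute-force stand-in written for this example, and the particle layout is made up. Overlapping ball searches return the same particle index many times, so a list that accumulates every hit grows with the number of searches, while a set is bounded by the number of particles.

```python
import math

def search_ball(points, centre, radius):
    """Brute-force stand-in for a ball search: indices of points within radius."""
    return [i for i, p in enumerate(points) if math.dist(p, centre) <= radius]

# Toy data: 100 particles on a line, 20 overlapping search balls.
points = [(float(x), 0.0, 0.0) for x in range(100)]
centres = [(float(c), 0.0, 0.0) for c in range(0, 100, 5)]
radius = 10.0

as_list = []     # accumulates every hit, duplicates included
as_set = set()   # keeps each particle index once

for centre in centres:
    hits = search_ball(points, centre, radius)
    as_list.extend(hits)
    as_set.update(hits)

print("entries kept in list:", len(as_list))
print("entries kept in set :", len(as_set))
```

The set trades some per-element overhead for a hard upper bound; whether that trade pays off in the real code depends on how much the search balls overlap.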

@MatthieuSchaller
Author

Thanks for looking into it.

Based on this analysis, is there anything I should try? Note that I am already trying not to use the SO lists for the same reason of exploding memory footprint (#112), even though we do crash earlier here.

@rtobar

rtobar commented Nov 10, 2021

> Thanks for looking into it.
>
> Based on this analysis, anything I should try? Note that I am already trying to not use the SOlists for the same reason of exploding memory footprint(#112) even though we do crash earlier here.

Unfortunately I'm not sure there's much you can do without code modifications, if the problem is indeed where I think it is. When I fixed the code for #73 I thought that it would be a good idea to modify SearchBallPos and friends to use memory more wisely, but I wanted to keep the change to a minimum. If this issue is related to the same underlying problem of repeated results being accumulated during SearchBallPos then I think it's definitely worth me trying to fix the more fundamental issue.

I'll see if I can come up with something you can test more or less quickly and will post any updates here.

@MatthieuSchaller
Author

Thanks.

FYI, I tried running an even more trimmed-down version where I request no over-density calculation, no aperture calculation, and no radial profiles, and it also ran out of memory.
Not surprising, since the crash happens before any of the property calculations have even started.

I'll see whether changing MPI_particle_total_buf_size or MPI_use_zcurve_mesh_decomposition helps, if only because it may give us a luckier domain decomposition.

Would it make sense to add a maximum distance that SearchBallPos is allowed to search out to? Might be a bit dirty but there are clear limits we can use from the physics of the model.
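
As a sketch of what such a cap could look like (purely illustrative: `capped_search_radius` and both radii are hypothetical, and nothing like this is claimed to exist in the code):

```python
def capped_search_radius(requested_radius, max_radius):
    """Clamp a ball-search radius to a physically motivated upper limit.

    Hypothetical helper: max_radius would come from the physics of the
    model, for example the largest object size one expects to find.
    """
    if max_radius <= 0:
        raise ValueError("max_radius must be positive")
    return min(requested_radius, max_radius)

# Made-up numbers in arbitrary length units.
print(capped_search_radius(2.5, max_radius=10.0))   # below the cap, unchanged
print(capped_search_radius(250.0, max_radius=10.0)) # runaway search, clamped
```

The cost of such a cap is that any search legitimately needing a larger radius is silently truncated, which is why it would be a workaround rather than a fix.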

Also, would it be possible to try the config you guys used on Gadi for your big run? If I am not mistaken it was not far from our run in terms of number of particles. It might be a baseline for us to try here.

@rtobar

rtobar commented Nov 10, 2021

@doctorcbpower could you point @MatthieuSchaller to the config he's asking above? Thanks!

@doctorcbpower

Hi @rtobar and @MatthieuSchaller, sorry, only just saw this. I have been using this for similar particle numbers in smaller boxes, so it should work.
vr_config.L210N5088.cfg.txt

@MatthieuSchaller
Author

Thanks! I'll give this a go. Do you remember how much memory was needed for VR to succeed and how many MPI ranks were used?

@MatthieuSchaller
Author

Good news: That last configuration worked out of the box on my run.

Time to fire up a diff.

@MatthieuSchaller
Author

One quick note related to the i/o comment above. This config took 6hr30 to read in the data. Then 1hr30 for the rest.

@rtobar

rtobar commented Nov 12, 2021

> One quick note related to the i/o comment above. This config took 6hr30 to read in the data. Then 1hr30 for the rest.

It doesn't have a value defined for Input_chunk_size, and the default is 1e6 -- that's 10x lower than the value you had, so it's reading in 12 MB chunks. This seems to show that the configuration value has a noticeable effect. Did you end up trying with 1e8?

Having said that, I've never done an exhaustive profiling of the reading code. Even with those read sizes it still sounds like things could be better. If the inputs are compressed there will also be some overhead associated with that, I guess, but I don't know if that's the case.

@doctorcbpower

Hi @MatthieuSchaller, sorry for the delay. That's the config file I use when running VR inline. To be comfortable, I find you basically need to double the memory for the particular runs we do, though it may not be as severe given the box sizes you are running. Also, I have had no issue with reading in particle data, so defining Input_chunk_size isn't such an issue.

@MatthieuSchaller
Author

MatthieuSchaller commented Nov 12, 2021

Here are bits of the configs that are different and plausibly related to what we see as a problem:

| Parameter name | Mine | Yours |
| --- | --- | --- |
| Cosmological_input | N/A | 1 |
| MPI_use_zcurve_mesh_decomposition | 0 | 1 |
| Particle_search_type | 1 | 2 |
| Baryon_searchflag | 2 | 0 |
| FoF_Field_search_type | 5 | 3 |
| Local_velocity_density_approximate_calculation | 1 | 0 |
| Bound_halos | 0 | 1 |
| Virial_density | N/A | 500 |
| Particle_type_for_reference_frames | 1 | N/A |
| Halo_core_phase_merge_dist | 0.25 | N/A |
| Structure_phase_merge_dist | N/A | 0.25 |
| Overdensity_output_maximum_radius_in_critical_density | 100 | N/A |
| Spherical_overdenisty_calculation_limited_to_structure_types | 4 | N/A |
| Extensive_gas_properties_output | 1 | N/A |
| Extensive_star_properties_output | 1 | N/A |
| MPI_particle_total_buf_size | 10000000000 | 100000000 |

(mine is also used for baryon runs)
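
A comparison like this can be produced mechanically. The following is a minimal sketch, assuming the config files are plain `Name=value` lines with `#` comments; `parse_config` and `diff_configs` are hypothetical helpers written for this example, and the two inline excerpts reuse a few values from the comparison above rather than the full files.

```python
def parse_config(text):
    """Parse 'Name=value' lines, ignoring blank lines and '#' comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params

def diff_configs(mine, yours):
    """Return (name, mine, yours) for every parameter that differs."""
    rows = []
    for key in sorted(set(mine) | set(yours)):
        a, b = mine.get(key, "N/A"), yours.get(key, "N/A")
        if a != b:
            rows.append((key, a, b))
    return rows

mine_text = """
# excerpt only
MPI_use_zcurve_mesh_decomposition=0
Particle_search_type=1
Local_velocity_density_approximate_calculation=1
MPI_particle_total_buf_size=10000000000
Halo_core_phase_merge_dist=0.25
"""
yours_text = """
Cosmological_input=1
MPI_use_zcurve_mesh_decomposition=1
Particle_search_type=2
Local_velocity_density_approximate_calculation=0
MPI_particle_total_buf_size=100000000
"""

rows = diff_configs(parse_config(mine_text), parse_config(yours_text))
for name, a, b in rows:
    print(f"{name:50s} {a:>12s} {b:>12s}")
```

Values are compared as strings, so `1` and `1.0` would be reported as different; that is deliberate here, to avoid guessing at each parameter's type.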

I am not quite sure what all of these do.

Looking at this list, I am getting suspicious about Local_velocity_density_approximate_calculation as we seem to crash when computing this in the code. Maybe I should try my configuration but instead use the more accurate calculation.

Do any of the other parameter values look suspicious to you?

@rtobar the data is compressed and using some hdf5 filters so that will play a role. And indeed, the lower default Input_chunk_size will have played a role here. I'll try to increase it to 1e8 for the next attempt. SWIFT took 400s to write that compressed data set (using 1200 ranks writing 1 file each however, so that's a factor), which is also one reason to believe some i/o configuration choices here might help make this phase a lot faster.

rtobar added a commit that referenced this issue Nov 16, 2021
This might help us understand what exactly is causing #118 (if logs are
correctly flushed and don't get too scrambled).

Signed-off-by: Rodrigo Tobar <[email protected]>
@rtobar

rtobar commented Nov 16, 2021

@MatthieuSchaller not a solution, but the latest master now contains some more logging to find out what's going on and how much is expected to travel through MPI, which seems to be the problem.

Changing Local_velocity_density_approximate_calculation indeed makes the code take a different path, so you won't run into this exact issue (but it will still exist, I guess, and you might run into something else?). I'm not familiar with all the configuration options, so I can't really tell whether any of the differences above are suspicious or not.

@MatthieuSchaller
Author

Thanks, I'll update my copy of the code to use this extra output.

My job testing Chris' config with the approximate velocity density calculation switched on is still in the queue.

If the code performs better with a precise calculation then it's a double-win for us actually. :)

@MatthieuSchaller
Author

Eventually got my next test to run.

Taking Chris' config from above and applying the following changes:

  • Added Input_chunk_size=100000000
  • Removed Cosmological_input=1
  • Removed Virial_density=500
  • Changed MPI_particle_total_buf_size from 100000000 to 10000000000
  • Changed Local_velocity_density_approximate_calculation from 0 to 1

then the code crashed in the old way. (Note: this run did not include the new memory-related outputs.)

I would think it's the approximate calculation of the local velocity density which is the problem here.
Happy to use the more accurate calculation since it works and is likely better if I am to believe the parameter name.

To be extra sure, I am now trying it again with everything set back to Chris' values apart from Local_velocity_density_approximate_calculation.

@MatthieuSchaller
Author

And now, changing just Local_velocity_density_approximate_calculation from 0 to 1 in Chris' configuration, I get the problem.
