Memory usage blowing up in large DMO runs #118
Comments
(As a side note, reading in the data (4.1 TB) takes 4.6 hrs, which is less than 1% of the theoretical read speed of the system, and makes things quite hard to debug.)
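For reference, a rough back-of-the-envelope conversion of those numbers (assuming decimal TB):

$$ \frac{4.1\,\mathrm{TB}}{4.6\,\mathrm{h}} \approx \frac{4.1\times10^{12}\,\mathrm{B}}{1.66\times10^{4}\,\mathrm{s}} \approx 0.25\,\mathrm{GB/s}, $$

so a filesystem rated at a few tens of GB/s would indeed be sitting at around the 1% mark.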
Any advice on things to tweak in the config or setup is welcome. I have already tried more nodes, fewer ranks per node and other similar things, but the memory seems to always blow up.
@stuartmcalpine and @bwvdnbro will be interested as well.
This has been mentioned a couple of times, so I decided to study a bit more what it could be. It seems the configuration value
@MatthieuSchaller I tried opening the core files but I don't have enough permissions to do so; e.g.:
Permissions fixed. The read is not done using parallel HDF5 so I don't think the limit applies. I'll try raising that variable by 10x and see whether it helps.
Unfortunately the core files are truncated, so I couldn't even get a stacktrace out of them. This is somewhat expected: the expected core sizes are on the order of ~200 GB, but SLURM would have SIGKILL'd the processes after waiting for a bit while they were writing their core files. Based on the log messages, this memory blowup seems to be happening roughly in the same place where our latest fix for #53 is located -- that is, when densities are being computed for structures that span across MPI ranks. To be clear: I don't think the error happens because of the fix -- and if the logs are to be trusted, the memory spike comes even before that, while particles are being exchanged between ranks in order to perform the calculation. Some of these particle exchanges are based on the … While other avenues of investigation make sense too, it could be worth a shot to try this one out and see if, by using a different data structure, we actually solve the problem (or not).
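To make the memory argument concrete, here is a minimal, hypothetical C++/MPI sketch (this is not VELOCIraptor's code, and all names are made up) of a chunked pairwise particle exchange: the transient buffers stay bounded at a fixed chunk size, whereas packing everything destined for a remote rank into one buffer scales with the exchange volume.

```cpp
// Hypothetical sketch (not VELOCIraptor code): exchange particles with a
// partner rank in fixed-size chunks so the transient buffers stay bounded,
// instead of packing the whole exchange into one send/receive buffer.
#include <mpi.h>
#include <algorithm>
#include <cstddef>
#include <vector>

struct Particle { double pos[3], vel[3]; long long id; };

std::vector<Particle> exchange_chunked(const std::vector<Particle>& to_send,
                                       int partner, MPI_Comm comm,
                                       std::size_t chunk = 1 << 20) {
    // Agree on how many particles travel in each direction.
    unsigned long long nsend = to_send.size(), nrecv = 0;
    MPI_Sendrecv(&nsend, 1, MPI_UNSIGNED_LONG_LONG, partner, 0,
                 &nrecv, 1, MPI_UNSIGNED_LONG_LONG, partner, 0,
                 comm, MPI_STATUS_IGNORE);

    std::vector<Particle> received;
    received.reserve(nrecv);
    std::vector<Particle> recv_buf(chunk);  // bounded scratch space

    std::size_t sent = 0, got = 0;
    while (sent < nsend || got < nrecv) {
        const std::size_t s = std::min<std::size_t>(chunk, nsend - sent);
        const std::size_t r = std::min<std::size_t>(chunk, nrecv - got);
        // Ship one chunk each way; counts are in bytes so no derived MPI
        // datatype is needed for this illustration.
        MPI_Sendrecv(to_send.data() + sent, int(s * sizeof(Particle)), MPI_BYTE,
                     partner, 1,
                     recv_buf.data(), int(r * sizeof(Particle)), MPI_BYTE,
                     partner, 1, comm, MPI_STATUS_IGNORE);
        received.insert(received.end(), recv_buf.begin(),
                        recv_buf.begin() + static_cast<std::ptrdiff_t>(r));
        sent += s;
        got += r;
    }
    return received;
}
```

The trade-off is more MPI calls, but the peak extra allocation per partner stays at roughly two chunks instead of the full exchange volume.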
Thanks for looking into it. Based on this analysis, is there anything I should try? Note that I am already trying not to use the SOlists for the same reason of exploding memory footprint (#112), even though we crash earlier here.
Unfortunately I'm not sure there's much you can do without code modifications, if the problem is indeed where I think it is. When I fixed the code for #73 I thought that it would be a good idea to modify … I'll see if I can come up with something you can test more or less quickly and will post any updates here.
Thanks. FYI, I tried running an even more trimmed-down version where I request no over-density calculation, no aperture calculation, and no radial profiles, and it also ran out of memory. I'll see whether changing … Would it make sense to add a maximal distance …? Also, would it be possible to try the config you guys used on Gadi for your big run? If I am not mistaken it was not far from our run in terms of number of particles. Might be a baseline for us to try here.
@doctorcbpower could you point @MatthieuSchaller to the config he's asking about above? Thanks!
Hi @rtobar and @MatthieuSchaller, sorry, only just saw this. I have been using this for similar particle numbers in smaller boxes, so it should work.
Thanks! I'll give this a go. Do you remember how much memory was needed for VR to succeed and how many MPI ranks were used?
Good news: that last configuration worked out of the box on my run. Time to fire up a …
One quick note related to the I/O comment above: this config took 6hr30 to read in the data, then 1hr30 for the rest.
It doesn't have a value defined for … Having said that, I've never done an exhaustive profiling of the reading code. Even with those reading sizes it still sounds like things could be better. If the inputs are compressed there will also be some overhead associated with that, I guess, but I don't know if that's the case.
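On the reading side, here is a generic illustration (not the actual VR reader; the file name, dataset path and slice size below are placeholders) of reading a 2D dataset in large hyperslab slices with the serial HDF5 C API. With filtered/compressed inputs, every read additionally pays for decompressing each HDF5 chunk it touches.

```cpp
// Generic sketch (not the VELOCIraptor reader): read a 2D dataset in large
// hyperslab slices with the serial HDF5 C API. File name, dataset path and
// slice size are placeholders.
#include <hdf5.h>
#include <algorithm>
#include <vector>

int main() {
    hid_t file = H5Fopen("snapshot.hdf5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/PartType1/Coordinates", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    hsize_t dims[2];
    H5Sget_simple_extent_dims(fspace, dims, nullptr);

    const hsize_t slice = 1 << 24;  // rows per read; tune this
    std::vector<double> buf(slice * dims[1]);

    for (hsize_t start = 0; start < dims[0]; start += slice) {
        hsize_t offset[2] = {start, 0};
        hsize_t count[2] = {std::min(slice, dims[0] - start), dims[1]};
        // Select one slice in the file and a matching shape in memory.
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, offset, nullptr, count, nullptr);
        hid_t mspace = H5Screate_simple(2, count, nullptr);
        H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf.data());
        H5Sclose(mspace);
        // ... hand the slice over to the particle arrays here ...
    }

    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```

Timing a loop like this with different slice sizes, with and without the filters applied to the inputs, would give a quick idea of how much of the several hours is raw I/O versus decompression and in-memory shuffling.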
Hi @MatthieuSchaller, sorry for the delay. That's the config file I use when running VR inline. To be comfortable, I find you basically need to double the memory for the particular runs we do; it may not be as severe given the box sizes you are running. Also no issue with reading in particle data, so defining …
Here are bits of the configs that are different and plausibly related to what we see as a problem:
(Mine is also used for baryon runs.) I am not quite sure what all of these do. Looking at this list, I am getting suspicious about … Do any of the other parameter values look suspicious to you? @rtobar the data is compressed and uses some HDF5 filters, so that will play a role. And indeed, the lower default …
This might help us understand what exactly is causing #118 (if logs are correctly flushed and don't get too scrambled). Signed-off-by: Rodrigo Tobar <[email protected]>
@MatthieuSchaller not a solution, but the latest master now contains some more logging to find out what's going on and how much is expected to travel through MPI, which seems to be the problem. Changing …
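By way of illustration, the kind of pre-exchange accounting such logging can provide looks roughly like the hypothetical sketch below (this is not the code that was added to master; the function and type names are made up):

```cpp
// Hypothetical sketch (not the logging actually added to master): before an
// exchange, report the total and the worst-case per-rank volume about to
// travel through MPI, which is usually enough to spot a blow-up early.
#include <mpi.h>
#include <cstdio>
#include <vector>

struct Particle { double pos[3], vel[3]; long long id; };

void report_exchange_volume(const std::vector<std::vector<Particle>>& send_buffers,
                            MPI_Comm comm) {
    int rank = 0;
    MPI_Comm_rank(comm, &rank);

    // Bytes this rank intends to send, summed over all destination buffers.
    unsigned long long local = 0;
    for (const auto& buf : send_buffers) local += buf.size() * sizeof(Particle);

    unsigned long long total = 0, worst = 0;
    MPI_Reduce(&local, &total, 1, MPI_UNSIGNED_LONG_LONG, MPI_SUM, 0, comm);
    MPI_Reduce(&local, &worst, 1, MPI_UNSIGNED_LONG_LONG, MPI_MAX, 0, comm);

    if (rank == 0)
        std::printf("[exchange] total %.2f GB, worst rank %.2f GB\n",
                    total / 1e9, worst / 1e9);
}
```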
Thanks, I'll update my copy of the code to use this extra output. My job checking Chris' config but with the approximate velocity density calculation is still in the queue. If the code performs better with a precise calculation, then it's a double win for us actually. :)
Eventually got my next test to run. Taking Chris' config from above and applying the following changes:
then the code crashed in the old way (note: without the new memory-related outputs). I would think it's the approximate calculation of the local velocity density which is the problem here. To be extra sure, I am now trying it again but with everything set back to Chris' values apart from that local velocity dispersion parameter.
And now changing just …
Describe the bug
Memory footprint blows up in (or just after) the calculation of the local fields
This is running on a 5670^3 particle DMO setup in a 3.2Gpc box at z=4.
The z=5 output works fine.
I am running using 480 MPI ranks on 120 nodes. Each node offers 1TB of RAM.
The code advertises at the start needing 16.68TB.
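For scale, some rough arithmetic on those numbers (decimal units, and assuming the 480 ranks are spread evenly, i.e. 4 ranks per node):

$$ 5670^3 \approx 1.8\times10^{11}\ \text{particles}, \qquad \frac{16.68\ \mathrm{TB}}{480\ \text{ranks}} \approx 35\ \mathrm{GB/rank}, \qquad \frac{1\ \mathrm{TB/node}}{4\ \text{ranks/node}} = 250\ \mathrm{GB/rank}, $$

so the advertised requirement alone leaves roughly a factor of 7 of headroom per rank, and the OOM presumably comes from transient or strongly imbalanced allocations on top of it.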
Crash:
(Difficult to know where that comes from as exceptions are not caught anywhere.)
Compilation:
g++ 7.3 with Intel MPI 2018
Command line:
Config:
vrconfig_3dfof_subhalos_SO_hydro.txt
If it helps, run location on cosma: /cosma8/data/dp004/jlvc76/FLAMINGO/ScienceRuns/DMO/L3200N5760/VR/catalogue_0008
There are core dumps as well.