
Memory problem in ReReco-Run2024C-JetMET1 pilot #46901

Open · makortel opened this issue Dec 9, 2024 · 26 comments

@makortel (Contributor) commented Dec 9, 2024

A pilot job of the Run2024C ReReco was killed because it used too much memory:
https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2024C_JetMET1_pilot_241122_102431_7689/50660/DataProcessing/
The job was run in CMSSW_14_0_19.

This issue is about studying the memory behavior of the job above.

This problem is also reported in https://its.cern.ch/jira/browse/CMSCOMPPR-56784 and https://gitlab.cern.ch/groups/cms-ppd/-/epics/12.

@makortel (Contributor, Author) commented Dec 9, 2024

assign core

@cmsbuild (Contributor) commented Dec 9, 2024

New categories assigned: core

@Dr15Jones, @makortel, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign. Thanks

@cmsbuild (Contributor) commented Dec 9, 2024

cms-bot internal usage

@cmsbuild (Contributor) commented Dec 9, 2024

A new Issue was created by @makortel.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel (Contributor, Author) commented Dec 9, 2024

Plot from the SimpleMemoryCheck printouts:
[image]
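A plot like this is made by scraping the per-event SimpleMemoryCheck printouts out of the job log. A minimal sketch of such a scraper (illustrative only, not the actual script used here; it assumes the printout lines contain a "MemoryCheck" tag and an "RSS" token followed by the value in MB):

    #include <iostream>
    #include <sstream>
    #include <string>

    // Read a cmsRun log on stdin and emit "sample-index RSS" pairs, one per
    // SimpleMemoryCheck printout, suitable for plotting with any tool.
    int main() {
      std::string line;
      unsigned long index = 0;
      while (std::getline(std::cin, line)) {
        // Assumption: per-event SimpleMemoryCheck lines contain "MemoryCheck"
        // and an "RSS" token followed by the value in MB; adjust if the format
        // differs in your release.
        if (line.find("MemoryCheck") == std::string::npos)
          continue;
        std::istringstream tokens(line);
        std::string token;
        while (tokens >> token) {
          if (token == "RSS") {
            double rss = 0;
            if (tokens >> rss)
              std::cout << index++ << " " << rss << "\n";
            break;
          }
        }
      }
      return 0;
    }

Compiled to, say, scrape_rss (a hypothetical name), it can be run as scrape_rss < job.log > rss.txt to get index/RSS pairs for plotting.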

@Dr15Jones (Contributor)

Here are measurements using the PeriodicAllocMonitor as well as sampling of the RSS during the same job:

[image]
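For reference, the RSS-sampling part amounts to periodically reading the resident-set size from /proc/<pid>/statm on Linux. A small standalone sketch of such a sampler (not the actual tool used for these measurements; the target PID is given on the command line):

    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <sys/types.h>
    #include <unistd.h>

    // Resident-set size (in MB) of a process, read from /proc/<pid>/statm.
    // The second field of statm is the number of resident pages.
    double rssMB(pid_t pid) {
      std::ifstream statm("/proc/" + std::to_string(pid) + "/statm");
      long sizePages = 0, residentPages = 0;
      statm >> sizePages >> residentPages;
      return residentPages * sysconf(_SC_PAGESIZE) / (1024.0 * 1024.0);
    }

    int main(int argc, char** argv) {
      if (argc < 2) {
        std::cerr << "usage: rss_sampler <pid>\n";
        return 1;
      }
      const pid_t pid = std::stoi(argv[1]);
      // One timestamped sample per second until the target process goes away
      // (statm becomes unreadable and rssMB returns 0).
      for (unsigned long t = 0;; ++t) {
        const double rss = rssMB(pid);
        if (rss <= 0.0)
          break;
        std::cout << t << " " << rss << "\n";
        std::this_thread::sleep_for(std::chrono::seconds(1));
      }
      return 0;
    }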

@makortel (Contributor, Author) commented Dec 9, 2024

Some overall properties of the job:

  • 2638 modules are constructed, of which 469 are destructed soon after because nothing in the job consumes their data products, leaving 2169 modules in total
  • 12 instances of PoolOutputModule + 1 instance of DQMRootOutputModule
    • AOD, MiniAOD, NanoEDMAOD and 9 skims

@davidlange6 (Contributor)

Perhaps it's wiser to use 8 cores like the Tier0 does? That would gain quite a lot of memory headroom per core.

@makortel (Contributor, Author) commented Dec 9, 2024

> Perhaps it's wiser to use 8 cores like the Tier0 does?

The failed job was configured to use 8 threads.

@davidlange6 (Contributor)

Whoops, I was fooled by some of the plots. Sorry for the noise.

@makortel (Contributor, Author) commented Dec 9, 2024

The first observation with @Dr15Jones was a rediscovery of #46526 (comment), which was fixed in #46543 (14_2_X) / #46567 (14_1_X). The fix is being backported to 14_0_X as part of #46903.

@srimanob (Contributor) commented Dec 10, 2024

[image: Mem_Plot]
Please ignore the x-axis for the moment; I just plot in the order that the RSS info is printed out. The blue curve (data.txt) covers 15000 events in total, while the green one (data_fix) is still ongoing (at the ~6700th event in the plot).

I just made a quick plot from the SimpleMemoryCheck printout. This PR (#46903) should help with the memory reduction, but not with the leak.

@Dr15Jones (Contributor)

So I applied the backport to my build area and re-ran the PeriodicAllocMonitor job. Unfortunately, after nearly 6000 events processed overnight, the machine I was running on was rebooted for scheduled maintenance, so the comparison isn't complete. (Note that I ran the job single-threaded in order to avoid exceeding the VSize limit on the machine.)

[image]

The backport does show a much smaller outstanding allocation size than the original code. There might still be some upward trend even after the backport, but it is hard to tell from the sample I was able to make.

@srimanob (Contributor)

> So I applied the backport to my build area and re-ran the PeriodicAllocMonitor job. […]

Hi @Dr15Jones

Thanks.

Which backport are you testing? The PR I made, or also the other PRs that @makortel proposed to backport?

@Dr15Jones (Contributor)

> Which backport are you testing? The PR I made, or also the other PRs that @makortel proposed to backport?

The one you (@srimanob) made.

@srimanob (Contributor) commented Dec 10, 2024

Thanks @makortel for the script to make a plot.

I tried running with the backport PR; it shows that the PR helps to reduce the memory.

Before the fix:
[image: memory-standard]

With the PR to fix the memory:
[image: memory-fix-check]

@Dr15Jones (Contributor)

I was able to uncover a ~1k/event memory leak here:

    regionalMuonShowerCollection =
        new RegionalMuonShowerBxCollection();  // To avoid warning re uninitialised collection

This was found using the prototype ModuleEventAllocMonitor.
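The leak follows a familiar C++ pattern: an owning raw pointer is reassigned with new on every event, so the object from the previous event is never freed. Below is a minimal, self-contained illustration of the pattern and of one standard way to avoid it with std::unique_ptr; the LeakyProducer/FixedProducer classes and the processEvent loop are hypothetical, and this is neither the actual CMSSW code nor necessarily how #46918 fixes it.

    #include <memory>

    // Stand-in type, used only for this illustration.
    struct RegionalMuonShowerBxCollection {};

    class LeakyProducer {
      RegionalMuonShowerBxCollection* regionalMuonShowerCollection = nullptr;

    public:
      void processEvent() {
        // Leaks: the collection allocated on the previous event is never freed.
        regionalMuonShowerCollection = new RegionalMuonShowerBxCollection();
        // ... fill the collection and hand it to the event ...
      }
    };

    class FixedProducer {
      std::unique_ptr<RegionalMuonShowerBxCollection> regionalMuonShowerCollection;

    public:
      void processEvent() {
        // Reassigning a unique_ptr frees the previous object automatically.
        regionalMuonShowerCollection = std::make_unique<RegionalMuonShowerBxCollection>();
        // ... fill the collection and hand it to the event ...
      }
    };

    int main() {
      LeakyProducer leaky;
      FixedProducer fixed;
      for (int i = 0; i < 1000; ++i) {  // pretend event loop
        leaky.processEvent();           // grows a little on every "event"
        fixed.processEvent();           // stays flat
      }
      return 0;
    }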

@Dr15Jones (Contributor)

#46918 fixes the problem in master.

@Dr15Jones (Contributor)

@srimanob would you like #46918 backported to CMSSW_14_0?

@Dr15Jones (Contributor)

So after applying the backport #46903 and the memory leak fix #46918 (the latter having a much smaller effect), I see that the allocations (using the AllocMonitor system to record new/delete calls) show much more stable behavior:

[image]

and comparing the final results for RSS and allocations gives:
[image]

Here I'm just processing the first file in the job, and I'm reading that file locally.
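For context, recording new/delete calls boils down to intercepting the global allocation functions and keeping running counters, from which an outstanding-allocation curve like the one above can be derived. A bare-bones standalone illustration of the idea (not the CMSSW AllocMonitor implementation; note that malloc_usable_size is glibc-specific):

    #include <atomic>
    #include <cstdlib>
    #include <iostream>
    #include <malloc.h>  // malloc_usable_size (glibc-specific)
    #include <new>

    // Running counters; "outstanding" = allocated but not yet freed, which is
    // what the curves above track.
    static std::atomic<unsigned long long> nNew{0};
    static std::atomic<unsigned long long> nDelete{0};
    static std::atomic<long long> outstandingBytes{0};

    void* operator new(std::size_t size) {
      void* p = std::malloc(size);
      if (!p)
        throw std::bad_alloc();
      nNew.fetch_add(1, std::memory_order_relaxed);
      // malloc_usable_size lets operator delete subtract the right amount
      // without storing a per-allocation header.
      outstandingBytes.fetch_add(malloc_usable_size(p), std::memory_order_relaxed);
      return p;
    }

    void operator delete(void* p) noexcept {
      if (!p)
        return;
      nDelete.fetch_add(1, std::memory_order_relaxed);
      outstandingBytes.fetch_sub(malloc_usable_size(p), std::memory_order_relaxed);
      std::free(p);
    }

    int main() {
      int* kept = new int[256];       // stays outstanding
      double* tmp = new double[512];  // allocated ...
      delete[] tmp;                   // ... and freed again
      std::cout << "news: " << nNew << "  deletes: " << nDelete
                << "  outstanding bytes: " << outstandingBytes << "\n";
      delete[] kept;
      return 0;
    }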

@srimanob (Contributor)

Hi @Dr15Jones
Thanks very much. As we still need to converge on the release, I would propose to backport #46918.

@srimanob (Contributor)

Here is another backport I made, #46942. It includes:

  1. #46591 ([14.1.X] use ReadPrescalesFromFile=False in the GenericTriggerEventFlag of a bunch of DQM modules)
  2. #46574 (Rework Muon candidate selection in few tracker ALCARECO)
  3. #45357 (Add GEN-SIM information to resonant di-muon TkAl ALCARECO producers)

@srimanob (Contributor)

@Dr15Jones (Contributor)

The relevant backport is #46950

@srimanob (Contributor)

Here are the plots from my last test:

Without any fix:
[image: memory_try7_std]

With the fix of #46903:
[image: memory_try7_fix1]

With the fixes of #46903 and #46942 (currently replaced by #46954):
[image: memory_try7_fix2]

It seems we can be in good shape together with #46950.
