Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Igprof analysis of 11_1_0_p3_ROOT618 : trklet::TrackletEngineDisplaced::readTables() #30742

Closed
tommasoboccali opened this issue Jul 16, 2020 · 32 comments

Comments

@tommasoboccali
Copy link

Dear all,
an igprof analysis of L1+RECO here shows a large allocation (~ 1 GB) in

void TrackletEngineDisplaced::readTables() {

(direct link HERE)

please have a look at whether this is meaningful.

cheers

@cmsbuild
Copy link
Contributor

A new Issue was created by @tommasoboccali Tommaso Boccali.

@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@tommasoboccali
Copy link
Author

assign l1,upgrade

1 similar comment
@silviodonato
Copy link
Contributor

assign l1,upgrade

@cmsbuild
Copy link
Contributor

New categories assigned: upgrade,l1

@benkrikler,@rekovic,@kpedro88 you have been requested to review this Pull request/Issue and eventually sign? Thanks

@davidlange6
Copy link
Contributor

davidlange6 commented Jul 16, 2020 via email

@tommasoboccali
Copy link
Author

this works for me:

CMSSW_11_1_0_p3_ROOT618 + PR 30684

generate pset with

cmsDriver.py step1 --conditions auto:phase2_realistic_T15 -n 100 --era Phase2C9 --eventcontent FEVTDEBUGHLT --runUnscheduled --filein file:/eos/cms/store/relval/CMSSW_11_0_0/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU25ns_110X_mcRun4_realistic_v3_2026D49PU200-v2/10000/01054EE2-1B51-C449-91A2-5202A60D16A3.root -s RAW2DIGI,L1TrackTrigger,L1 --datatier FEVTDEBUGHLT --customise SLHCUpgradeSimulations/Configuration/aging.customise_aging_1000,L1Trigger/Configuration/customisePhase2TTNoMC.customisePhase2TTNoMC,Configuration/DataProcessing/Utils.addMonitoring --geometry Extended2026D49 --fileout file:step1.root --customise_command "process.SimpleMemoryCheck = cms.Service('SimpleMemoryCheck',ignoreTotal = cms.untracked.int32(1))" --no_exec --nThreads 8 --python step1_L1_ProdLike.py

running on 1 event is enough

PS: I was also looking here ...indeed in most of the lines

"line" is an empry string.

a

if (line =="") continue;

might suffice

@tommasoboccali
Copy link
Author

without the continue

++++++ finished: global begin run for module: label = 'TTTracksFromExtendedTrackletEmulation' id = 23
memory usage: 1739.8 MB allocated, 871.6 MB retained, 1739.8 MB peak

with
if (line =="") continue;

before the table resize:

++++++ finished: global begin run for module: label = 'TTTracksFromExtendedTrackletEmulation' id = 23
memory usage: 673.3 MB allocated, 185.1 MB retained, 673.4 MB peak

I did not check for physics results, though

@tommasoboccali
Copy link
Author

PS: even if it generates a smaller Memory consumption, so not on the radar here, I suspect thesa can be done for

void TripletEngine::readTables() {

@davidlange6
Copy link
Contributor

davidlange6 commented Jul 16, 2020 via email

@davidlange6
Copy link
Contributor

One idea is to clean something like this up
https://github.com/cms-sw/cmssw/compare/CMSSW_11_1_X...davidlange6:dl_tracklet_200716?expand=1

the printouts show
New: VSIZE 8157.83 0 RSS 5402.94 15.2266
vs
Old: VSIZE 10437.2 0 RSS 6860.28 2.57812

@davidlange6
Copy link
Contributor

of course the strings are really just decoded event ids of sorts

@Dr15Jones
Copy link
Contributor

@davidlange6 there is no need to call clear in the destructor. The member data is going to be deleted anyway.

@Dr15Jones
Copy link
Contributor

@davidlange6 the result returned by lower_bound is not guaranteed to be an iterator to the value. If the value is not present the returned iterator is not the same as end, it is the iterator just before where the value should be. Therefore you still need to test that the value pointed to by the iterator matches index.

@davidlange6
Copy link
Contributor

davidlange6 commented Jul 16, 2020 via email

@davidlange6
Copy link
Contributor

@davidlange6 the result returned by lower_bound is not guaranteed to be an iterator to the value. If the value is not present the returned iterator is not the same as end, it is the iterator just before where the value should be. Therefore you still need to test that the value pointed to by the iterator matches index.

ah - then thats a bug! let me fix

@davidlange6
Copy link
Contributor

I updated the branch with Chris's comments.. (it seems there should be a better way)

@davidlange6
Copy link
Contributor

new VSIZE 7800.44 56 RSS 4879.73 27.8438
vs
old VSIZE 10437.2 0 RSS 6860.28 2.57812

(modulo new bugs introduced)

@kpedro88
Copy link
Contributor

@skinnari FYI

@davidlange6
Copy link
Contributor

I added some printouts that suggest the changes do not change the output.

Single threaded RSS savings seems to be 1GB - if anything the prototyped version is faster (5.0 seconds instead of 5.1 seconds for 20 events)

@tommasoboccali
Copy link
Author

tommasoboccali commented Jul 16, 2020 via email

@davidlange6
Copy link
Contributor

davidlange6 commented Jul 16, 2020 via email

@Dr15Jones
Copy link
Contributor

If the table_ is now only setup in the constructor, then it could be effectively 'const'. If that is the case, then the code could be switched back to a ::stream module using a GlobalCache to hold the table which is then shared across all the stream copies.

@tommasoboccali
Copy link
Author

I did a fast test about the

"
As Tommaso and I just discussed - that should destroy the CPU efficiency in a L1 only job.
"

running only L1 with 8 threads (100 events) -- with the module in "one" configuration.

It is certainly true we do not get 100% cu eff, but it not really "destroyed":
I get during event loop

  • WALL: Total loop: 1541.18
  • CPU: Total loop: 10055.5

so eff = 10055/(8*1541) = 82%

@tommasoboccali
Copy link
Author

I did a test @ 4 threads:

WALL: Total loop: 2625
CPU: Total loop: 9694

eff = 92%.

RES max is 7400 ... so it would be still a possibility if we want to increase cpu eff

@davidlange6
Copy link
Contributor

davidlange6 commented Jul 17, 2020 via email

@davidlange6
Copy link
Contributor

davidlange6 commented Jul 17, 2020 via email

@aehart
Copy link
Contributor

aehart commented Jul 18, 2020

@skinnari, @tomalin, and I discussed this issue and #30744, and decided that the best solution for now was to disable the tables in the TrackletEngineDisplaced and TripletEngine. I have now implemented this in PR #30818, and it seems to have the desired effect on memory usage.

Please let me know if there's anything else I can do to help resolve these issues.

@srimanob
Copy link
Contributor

@skinnari @tomalin @aehart @davidlange6
Can this ticket be closed? I understand that currently the tables are currently disabled by default. Or you would like to keep this open together with #30744

@tomalin
Copy link
Contributor

tomalin commented Apr 11, 2021

@srimanob Yes, it can be closed. The tables are indeed disabled by default.

@srimanob
Copy link
Contributor

+Upgrade

@cecilecaillol
Copy link
Contributor

+l1

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

@qliphy qliphy closed this as completed Apr 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests