Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disabled the TED and TRE tables by default. #30818

Closed
wants to merge 4 commits into from

Conversation

aehart
Copy link
Contributor

@aehart aehart commented Jul 18, 2020

PR description:

Addresses issues #30742 and #30744. After discussion with @skinnari and @tomalin, we decided that the tables in the TrackletEngineDisplaced and TripletEngine can be disabled by default. These tables are meant to reduce the occupancy for downstream modules while minimally impacting the ultimate tracking efficiency. This will be necessary for the eventual FPGA design, but given the problems with these tables in terms of memory usage and the effect on tracking efficiency, we think disabling them in the emulation is the best solution for now.

I also increased the maximum number of tracklets in the case of the extended algorithm. This is a temporary change that is necessary to handle the additional tracklets that will eventually be removed by the now-disabled tables.

Finally, there are a few other minor changes meant to reduce the memory usage of these tables. Several other changes are foreseen in the near future. But since the tables are disabled by default now, this is somewhat moot for the time being.

PR validation:

I ran the L1TrackNtupleMaker analyzer over 100 ttbar events with 200 PU. With the baseline, non-extended algorithm, the results are exactly identical, as expected. With the extended algorithm, I see approximately half the memory usage in top, and the memory usage does not appear to grow over time as seen before.

I also ran over a sample of displaced muons with no PU with the extended algorithm. With the tables disabled, as is now the default, we see good efficiency for tracks with d0 up to ~5 cm.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
@cmsbuild
Copy link
Contributor

A new Pull Request was created by @aehart (Andrew Hart) for CMSSW_11_1_X.

It involves the following packages:

L1Trigger/TrackFindingTracklet

@cmsbuild, @rekovic, @benkrikler can you please review it and eventually sign? Thanks.
@Martin-Grunewald this is something you requested to watch as well.
@silviodonato, @dpiparo you are the release manager for this.

cms-bot commands are listed here

@dpiparo
Copy link
Contributor

dpiparo commented Jul 19, 2020

@aehart this code will be ran on the grid. For this reason, before running any production, we need to know

  1. the cpu efficiency of the production job after this change
  2. the RSS footprint of the L1 step as we imagine it in production (it has to be decoupled from Reco since it requires too many resoureces)
    thanks.

@davidlange6
Copy link
Contributor

I would propose that this PR removes the python config parameter that can enable the massive data structure via a "simple python change pull request" (TM).

@aehart
Copy link
Contributor Author

aehart commented Jul 20, 2020

@dpiparo I tested a single-threaded (because L1FPGATrackProducer is an edm::one::EDProducer) job running only the L1TrackTrigger step over 10 ttbar events with 200 PU:

  1. I'm not sure what the best way to measure CPU and real time is, but I used the FastTimerService to write out a summary JSON. With that I see the following efficiencies:

    • total: 213794.773163 / 217487.481959 = 98.3%
    • TTTracksFromTrackletEmulation: 17710.210555 / 17716.911111 = 100%
    • TTTracksFromExtendedTrackletEmulation: 21206.114799 / 21257.489168 = 99.8%

    So the extended algorithm is marginally less efficient than the baseline algorithm, which was not touched, but the overall efficiency of this step looks pretty good to me. The efficiency reported by time for the same job is 90%, I guess because it includes additional CMSSW overhead that the FastTimerService is able to account for. But if there's a better way to measure these times, please let me know.

  2. For the RSS footprint, I used igprof and the IgProfService to output the memory profile after 5/10 events, and the total usage at that point is 1.4 GB. This agrees with what I observe with top during running.

If you need any additional information or if you want the tests run again with different parameters, please let me know.

@davidlange6 I'm not sure I understand what you mean. These TED and TRE tables, which were responsible for the huge memory usage before, can now only be enabled by flipping a switch in Settings.h and recompiling.

@davidlange6
Copy link
Contributor

davidlange6 commented Jul 20, 2020 via email

@silviodonato
Copy link
Contributor

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 20, 2020

The tests are being triggered in jenkins.

@aehart
Copy link
Contributor Author

aehart commented Jul 20, 2020

But do your FastTimerService results include the framework stalling while your module to runs? Since its the slowest thing in L1 and a one module, those “CMSSW overheads” are likely partially from L1FPGATrackProducer)

I'm honestly not sure. I admit to using it without fully understanding how it works (shame 🔔).

But looking at the results that it spits out (which you can find at /afs/cern.ch/user/a/ahart/public/1thread/resources.json on LXPLUS), I see CPU and real times for each module that runs, including PoolSource and PoolOutputModule, as well as an "other" category, which includes some kind of overhead. And then the times for the "total" are just the sums for all the modules plus the "other" category.

Is there a better way to measure the timing?

@cmsbuild
Copy link
Contributor

+1
Tested at: 0fe1975
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c7382d/8136/summary.html
CMSSW: CMSSW_11_1_X_2020-07-20-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c7382d/8136/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 36
  • DQMHistoTests: Total histograms compared: 2780792
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2780739
  • DQMHistoTests: Total skipped: 50
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 35 files compared)
  • Checked 152 log files, 16 edm output root files, 36 DQM output files

@silviodonato
Copy link
Contributor

@aehart please make a PR also for master (ie. CMSSW_11_2_X)

@aehart
Copy link
Contributor Author

aehart commented Jul 21, 2020

@silviodonato Done: #30847

@cmsbuild
Copy link
Contributor

Pull request #30818 was updated. @cmsbuild, @rekovic, @benkrikler can you please check and sign again.

@silviodonato
Copy link
Contributor

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 27, 2020

The tests are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

+1
Tested at: 989cf34
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c7382d/8303/summary.html
CMSSW: CMSSW_11_1_X_2020-07-27-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c7382d/8303/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 36
  • DQMHistoTests: Total histograms compared: 2780792
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2780741
  • DQMHistoTests: Total skipped: 50
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 35 files compared)
  • Checked 152 log files, 16 edm output root files, 36 DQM output files

@cmsbuild
Copy link
Contributor

Pull request #30818 was updated. @cmsbuild, @rekovic, @benkrikler can you please check and sign again.

@silviodonato
Copy link
Contributor

silviodonato commented Aug 6, 2020

Since we started the re-reco and re-L1 of the HLT TDR samples, I think we can close this PR.
It was discussed at the ORP meeting to avoid to merge this backport - unless we get further memory issues - because it might affect the tracking efficiency.
Please let me know if we need to open it back.
@rekovic

@tomalin
Copy link
Contributor

tomalin commented Aug 6, 2020

I leave it to @aehart to comment further. I'm surprised by your decision to reject this PR. I understood that even with single-threaded running, these tables waste about 2 GB of memory, which must make running on the GRID challenging. And Andrew Hart showed that disabling them doesn't adversely affect the displaced L1 tracking efficiency. (For prompt L1 tracking, these tables are irrelevant).

@aehart
Copy link
Contributor Author

aehart commented Aug 6, 2020

I'm also surprised by this decision. My understanding was that this was a necessary and urgent change to enable the extended L1 tracking to run for production jobs. If the extended tracking will not be run for production jobs, then it doesn't matter either way. But I thought the goal was to have it run.

@skinnari
Copy link
Contributor

@rekovic @silviodonato we are all surprised that this was not integrated, as it seemed it was an urgent and necessary requirement for running production jobs? can someone please clarify, so that we know how to proceed? should this be opened as a PR to master instead at least?

@aehart
Copy link
Contributor Author

aehart commented Aug 19, 2020

@skinnari regarding the last question, there is an open PR to the master branch: #30847

It would still be good though to get clarification on the decision made for this PR and what we need to do to get #30847 merged @rekovic @silviodonato

@aehart aehart deleted the extended_tables_11_1_X branch October 23, 2020 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants