Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DQM: Switch to edm::stream for DQMEDAnalyzer #28813

Merged
merged 5 commits into from
Mar 3, 2020

Conversation

schneiml
Copy link
Contributor

@schneiml schneiml commented Jan 28, 2020

PR description:

This PR cashes in on the work in #28622 and all the previous PRs, by changing the module type of DQMEDAnalyzer back to edm::stream. This should significantly reduce the number of modules blocking concurrent lumisections on production sequences.

This PR depends on #28622 and includes it, it should be merged once the other one is in. For now, this is to allow some validation of this change. [Edit: #28622 is in! Rebased on top of it. Validation will still take a while.] [Edit2: #28916 brought in some of the changes that used to be here/are required here, that makes this one a bit cleaner.]

DQMEDAnalyzer used to be edm::stream based from 2015 to 2018, so basically not much should break. However, over the last two years DQM has changed, and some modules grew dependencies on edm::one behaviour, while others where migrated from legacy edm::EDAnalyzer to DQMEDAnalyzer as this became possible. This PR includes a lot of changes to these modules to remove their usage of beginJob/endJob methods: While we could provide it (and call it e.g. from beginStream/endStream), it does not make much sense, and most of the modules don't do anything important there anyways. So, I rather banned and removed those methods.

In a few cases, there was non-trivial logic that I'd rather keep, those where migrated to DQMOneEDAnalyzer which still provides those methods. However, it still does not make much sense to do anything in endJob in a DQMEDAnalyzer, since by the time it is called, the DQM output is typically already written to file.

A large number of DQMOneEDAnalyzers remain, so we won't get concurrent LS immediately. These modules need to be either reviewed and rewritten in a concurrent-lumi save way (e.g. by using LuminosityBlockCaches), or removed from the production sequences.

Back in 2018, we moved to edm::one to reduce memory usage. This change gets us the best of both worlds: concurrent processing while the memory usage stays within O(#concurrentLS), compared to the O(#streams) in 2015, thanks to the new DQMStore logic added in #28622.

PR validation:

Not much so far, and it is going to be hard. The PR/IB tests barely use multi-threaded execution, leave alone concurrent LS. The big risk are race conditions, which are notoriously hard to spot, even when the right code paths are executed. A clean, bit-by-bit comparison on a larger sample might be a good idea.

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28813/13520

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @schneiml (Marcel Schneider) for master.

It involves the following packages:

Alignment/LaserDQM
Alignment/OfflineValidation
CalibTracker/SiStripChannelGain
Calibration/EcalCalibAlgos
Calibration/HcalCalibAlgos
Calibration/LumiAlCaRecoProducers
Calibration/TkAlCaRecoProducers
Configuration/DataProcessing
DQM/BeamMonitor
DQM/CastorMonitor
DQM/DTMonitorClient
DQM/DTMonitorModule
DQM/EcalCommon
DQM/EcalMonitorClient
DQM/EcalMonitorDbModule
DQM/EcalPreshowerMonitorModule
DQM/HcalCommon
DQM/HcalTasks
DQM/L1TMonitor
DQM/L1TMonitorClient
DQM/RPCMonitorDigi
DQM/SiPixelMonitorDigi
DQM/SiPixelMonitorRawData
DQM/SiStripCommissioningClients
DQM/SiStripCommon
DQM/SiStripMonitorClient
DQM/SiStripMonitorHardware
DQM/SiStripMonitorPedestals
DQM/SiStripMonitorSummary
DQM/TrackingMonitor
DQM/TrackingMonitorClient
DQM/TrigXMonitorClient
DQMOffline/Alignment
DQMOffline/Hcal
DQMOffline/JetMET
DQMOffline/L1Trigger
DQMOffline/Trigger
DQMServices/Components
DQMServices/Core
DQMServices/Demo
DQMServices/FileIO
DQMServices/FwkIO
DQMServices/StreamerIO
DataFormats/Histograms
FastSimulation/MaterialEffects
HLTrigger/Timer
RecoLocalCalo/EcalRecProducers
Validation/EcalDigis
Validation/EcalHits
Validation/GlobalHits
Validation/HcalDigis
Validation/HcalHits
Validation/Mixing
Validation/RecoMuon
Validation/RecoTau
Validation/RecoTrack
Validation/TrackerDigis
Validation/TrackerHits
Validation/TrackerRecHits
Validation/TrackingMCTruth

@andrius-k, @lveldere, @slava77, @schneiml, @kpedro88, @Martin-Grunewald, @rekovic, @fioriNTU, @tlampen, @pohsun, @perrotta, @civanch, @makortel, @cmsbuild, @fwyzard, @davidlange6, @smuzaffar, @Dr15Jones, @ssekmen, @mdhildreth, @jfernan2, @tocheng, @sbein, @fabiocos, @benkrikler, @kmaeshima, @christopheralanwest, @silviodonato, @franzoni can you please review it and eventually sign? Thanks.
@erikbutz, @rappoccio, @echabert, @felicepantaleo, @jandrea, @robervalwalsh, @argiro, @Martin-Grunewald, @mschrode, @fioriNTU, @tlampen, @ahinzmann, @threus, @mmusich, @seemasharmafnal, @venturia, @acimmino, @mmarionncern, @kreczko, @hdelanno, @apsallid, @makortel, @jhgoh, @jdolen, @HuguesBrun, @cericeci, @pieterdavid, @rociovilar, @abbiendi, @barvic, @DryRun, @GiacomoSguazzoni, @rovere, @VinInn, @jdamgov, @missirol, @nhanvtran, @gkasieczka, @tocheng, @schoef, @alesaggio, @idebruyn, @ebrondol, @mtosi, @dgulhan, @clelange, @adewit, @trocino, @battibass, @gbenelli, @Fedespring, @calderona, @hatakeyamak, @wmtford, @matt-komm, @mariadalfonso, @folguera this is something you requested to watch as well.
@davidlange6, @silviodonato, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@schneiml
Copy link
Contributor Author

please test

let's see what happens. So local tests looked ok'ish.

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 28, 2020

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/4391/console Started: 2020/01/28 19:23

@slava77
Copy link
Contributor

slava77 commented Jan 28, 2020

This PR depends on #28622 and includes it, it should be merged once the other one is in. For now, this is to allow some validation of this change.

this is rather confusing, esp considering a bunch of comments apparently pending in the other PR

@cmsbuild
Copy link
Contributor

-1

Tested at: cd7f1fe

CMSSW: CMSSW_11_1_X_2020-01-28-1100
SCRAM_ARCH: slc7_amd64_gcc820
You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c287eb/4391/summary.html

I found follow errors while testing this PR

Failed tests: UnitTests AddOn

  • Unit Tests:

I found errors in the following unit tests:

---> test TestDQMServicesDemo had ERRORS

  • AddOn:

I found errors in the following addon tests:

cmsDriver.py TTbar_8TeV_TuneCUETP8M1_cfi --conditions auto:run1_mc --fast -n 100 --eventcontent AODSIM,DQM --relval 100000,1000 -s GEN,SIM,RECOBEFMIX,DIGI:pdigi_valid,L1,DIGI2RAW,L1Reco,RECO,EI,VALIDATION --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --datatier GEN-SIM-DIGI-RECO,DQMIO --beamspot Realistic8TeVCollision : FAILED - time: date Tue Jan 28 21:33:51 2020-date Tue Jan 28 21:30:31 2020 s - exit: 34304
cmsDriver.py TTbar_13TeV_TuneCUETP8M1_cfi --conditions auto:run2_mc_l1stage1 --fast -n 100 --eventcontent AODSIM,DQM --relval 100000,1000 -s GEN,SIM,RECOBEFMIX,DIGI:pdigi_valid,L1,DIGI2RAW,L1Reco,RECO,EI,VALIDATION --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --datatier GEN-SIM-DIGI-RECO,DQMIO --beamspot NominalCollision2015 --era Run2_25ns : FAILED - time: date Tue Jan 28 21:34:59 2020-date Tue Jan 28 21:30:54 2020 s - exit: 34304
cmsDriver.py TTbar_13TeV_TuneCUETP8M1_cfi --conditions auto:run2_mc --fast -n 100 --eventcontent AODSIM,DQM --relval 100000,1000 -s GEN,SIM,RECOBEFMIX,DIGI:pdigi_valid,L1,DIGI2RAW,L1Reco,RECO,EI,VALIDATION --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --datatier GEN-SIM-DIGI-RECO,DQMIO --beamspot NominalCollision2015 --era Run2_2016 : FAILED - time: date Tue Jan 28 21:35:05 2020-date Tue Jan 28 21:31:04 2020 s - exit: 34304

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c287eb/4391/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 29 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2697090
  • DQMHistoTests: Total failures: 475
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2696269
  • DQMHistoTests: Total skipped: 346
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -32.471 KiB( 33 files compared)
  • DQMHistoSizes: changed ( 1000.0 ): -0.812 KiB HLT/EGM
  • DQMHistoSizes: changed ( 1000.0,... ): -0.340 KiB SiStrip/EventInfo
  • DQMHistoSizes: changed ( 1000.0,... ): -0.263 KiB DT/EventInfo
  • DQMHistoSizes: changed ( 1001.0 ): -0.648 KiB AlCaReco/EventInfo
  • DQMHistoSizes: changed ( 11634.0,... ): -4.422 KiB HLT/Muon
  • DQMHistoSizes: changed ( 11634.0,... ): -1.359 KiB HLT/EGM
  • DQMHistoSizes: changed ( 136.731 ): -1.250 KiB HLT/EGM
  • DQMHistoSizes: changed ( 136.788 ): -0.719 KiB HLT/EGM
  • DQMHistoSizes: changed ( 136.85 ): -1.219 KiB HLT/EGM
  • DQMHistoSizes: changed ( 140.56,... ): -0.172 KiB HLT/EGM
  • DQMHistoSizes: changed ( 4.53 ): ...
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@Dr15Jones
Copy link
Contributor

@schneiml is your conclusion then that underlying data products for CSC and JetMET are not reproducible when using threads or do you think there is a long standing problem with how the DQM for those areas is being done? When doing your test, have you tried writing all the data products out and then running DQM only on the data in a file?

@schneiml
Copy link
Contributor Author

schneiml commented Feb 27, 2020

@Dr15Jones

is your conclusion then that underlying data products for CSC and JetMET are not reproducible when using threads or do you think there is a long standing problem with how the DQM for those areas is being done?

Not sure what to say here, apart from, I am not too surprised given that some DQM output tended to be hard to reproduce in the past. I have no idea why exactly that happens; the DQM code now is a one module, so it might indeed be in the products, or it might be a matter of event order (I assume EDM never guaranteed that events are processed in a fixed order, right?)

As far as I know, we very rarely do comparisons with varying number of threads, and don't really keep an eye on such problems.

When doing your test, have you tried writing all the data products out and then running DQM only on the data in a file?

No, and I'd consider that out of scope for this PR, but it might be interesting. I have no idea how to do it, though.

Edit: In case somebody wants to try themselves:

mkdir seq/ ; cd seq/
runTheMatrix.py -l 136.8862 --command '-n 1000 --nStreams 1' -t 1
cd ..
mkdir par/ ; cd par/
runTheMatrix.py -l 136.8862 --command '-n 1000 --nStreams 10' -t 10
cd ..
compareHistograms.py -b seq/136.8862_*/DQM_V0001_R000320822__Global__CMSSW_X_Y_Z__RECO.root  -p par/136.8862_*/DQM_V0001_R000320822__Global__CMSSW_X_Y_Z__RECO.root

should reproduce the problem.

@Dr15Jones
Copy link
Contributor

@slava77 @perrotta FYI the inconsistencies in results seen when running multiple threads may be something the reconstruction group may want to follow up on.

@schneiml
Copy link
Contributor Author

+1

Seems to work as expected, though more extensive validation to make sure all subsystem code works correctly will be required. This will be performed by a special set of RelVals.

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 28, 2020

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/4927/console Started: 2020/02/28 10:08

@cmsbuild
Copy link
Contributor

+1
Tested at: ad175ed
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c287eb/4927/summary.html
CMSSW: CMSSW_11_1_X_2020-02-27-2300
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c287eb/4927/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 15 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2679706
  • DQMHistoTests: Total failures: 44
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2679343
  • DQMHistoTests: Total skipped: 319
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@silviodonato
Copy link
Contributor

@silviodonato
Copy link
Contributor

silviodonato commented Mar 2, 2020

@schneiml just to summarize and for the benefit of the reviewers. With the current configuration, we get significant differences between multi-threading and single-threding only in CSC and JetMET. However, these differences are visible also in CMSSW_11_1_X_2020-03-02-1100 and then they are not related to this PR.
There were also other differences between multi/single threaded DQM, but they have been "fixed" by using DQMOneEDAnalyzer instead of DQMEDAnalyzer (ie. the version based on edm::one::EDProducer instead of edm::stream::EDProducer ).
The plan is to merge this PR as soon as possible in order to have it in CMSSW_11_0_pre4 and eventually compare the RelVal DQM plots produced by using single and multi -threading.

@christopheralanwest
Copy link
Contributor

+1

@silviodonato
Copy link
Contributor

DQMHistoTests: Total failures: 44

This is related to the issue #29076

@silviodonato
Copy link
Contributor

merge

fastsim: @civanch @lveldere @mdhildreth @sbein @ssekmen
l1: @benkrikler @rekovic
please have a look and let me know if you have comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants