DQM: Switch to edm::stream for DQMEDAnalyzer #28813
Conversation
The code-checks are being triggered in jenkins.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28813/13520
A new Pull Request was created by @schneiml (Marcel Schneider) for master. It involves the following packages: Alignment/LaserDQM @andrius-k, @lveldere, @slava77, @schneiml, @kpedro88, @Martin-Grunewald, @rekovic, @fioriNTU, @tlampen, @pohsun, @perrotta, @civanch, @makortel, @cmsbuild, @fwyzard, @davidlange6, @smuzaffar, @Dr15Jones, @ssekmen, @mdhildreth, @jfernan2, @tocheng, @sbein, @fabiocos, @benkrikler, @kmaeshima, @christopheralanwest, @silviodonato, @franzoni can you please review it and eventually sign? Thanks. cms-bot commands are listed here
please test
Let's see what happens; local tests looked ok'ish.
The tests are being triggered in jenkins.
This is rather confusing, especially considering a bunch of comments apparently pending in the other PR.
-1 Tested at: cd7f1fe CMSSW: CMSSW_11_1_X_2020-01-28-1100 I found the following errors while testing this PR. Failed tests: UnitTests AddOn
I found errors in the following unit tests: ---> test TestDQMServicesDemo had ERRORS
I found errors in the following addon tests: cmsDriver.py TTbar_8TeV_TuneCUETP8M1_cfi --conditions auto:run1_mc --fast -n 100 --eventcontent AODSIM,DQM --relval 100000,1000 -s GEN,SIM,RECOBEFMIX,DIGI:pdigi_valid,L1,DIGI2RAW,L1Reco,RECO,EI,VALIDATION --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --datatier GEN-SIM-DIGI-RECO,DQMIO --beamspot Realistic8TeVCollision : FAILED - time: Tue Jan 28 21:30:31 2020 to Tue Jan 28 21:33:51 2020 - exit: 34304
Comparison job queued.
Comparison is ready. Comparison Summary:
@schneiml is your conclusion then that the underlying data products for CSC and JetMET are not reproducible when using threads, or do you think there is a long-standing problem with how the DQM for those areas is being done? When doing your test, have you tried writing all the data products out and then running DQM only on the data in a file?
Not sure what to say here, apart from: I am not too surprised, given that some DQM output tended to be hard to reproduce in the past. I have no idea why exactly that happens; the DQM code right now runs as `edm::one` modules, so it might indeed be in the products, or it might be a matter of event order (I assume EDM never guaranteed that events are processed in a fixed order, right?). As far as I know, we very rarely do comparisons with a varying number of threads, and don't really keep an eye on such problems.
No, and I'd consider that out of scope for this PR, but it might be interesting. I have no idea how to do it, though. Edit: In case somebody wants to try it themselves:

should reproduce the problem.
+1 Seems to work as expected, though more extensive validation to make sure all subsystem code works correctly will be required. This will be performed by a special set of RelVals.
The tests are being triggered in jenkins.
+1
Comparison job queued.
Comparison is ready. Comparison Summary:
Kind reminder to
@schneiml just to summarize, and for the benefit of the reviewers: with the current configuration, we get significant differences between multi-threading and single-threading only in CSC and JetMET. However, these differences are also visible in CMSSW_11_1_X_2020-03-02-1100, and therefore they are not related to this PR.
+1
This is related to issue #29076.
merge
fastsim: @civanch @lveldere @mdhildreth @sbein @ssekmen
PR description:
This PR cashes in on the work in #28622 and all the previous PRs by changing the module type of `DQMEDAnalyzer` back to `edm::stream`. This should significantly reduce the number of modules blocking concurrent lumisections in production sequences.

This PR depends on #28622 and includes it; it should be merged once the other one is in. For now, this is to allow some validation of this change. [Edit: #28622 is in! Rebased on top of it. Validation will still take a while.] [Edit 2: #28916 brought in some of the changes that used to be here/are required here, which makes this one a bit cleaner.]
`DQMEDAnalyzer` used to be `edm::stream`-based from 2015 to 2018, so basically not much should break. However, over the last two years DQM has changed: some modules grew dependencies on `edm::one` behaviour, while others were migrated from legacy `edm::EDAnalyzer` to `DQMEDAnalyzer` as this became possible. This PR includes a lot of changes to these modules to remove their usage of the `beginJob`/`endJob` methods: while we could provide them (and call them e.g. from `beginStream`/`endStream`), it does not make much sense, and most of the modules don't do anything important there anyway. So I rather banned and removed those methods.

In a few cases there was non-trivial logic that I'd rather keep; those modules were migrated to `DQMOneEDAnalyzer`, which still provides those methods. However, it still does not make much sense to do anything in `endJob` in a `DQMEDAnalyzer`, since by the time it is called, the DQM output is typically already written to file.
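For illustration, here is a minimal sketch of what a stream-based `DQMEDAnalyzer` looks like after this change; the module name `DemoDQMAnalyzer` and its `folder` parameter are hypothetical, not code from this PR. All booking happens in `bookHistograms`, and there is no `beginJob`/`endJob`:

```cpp
// Hypothetical example module, not part of this PR: a stream-based
// DQMEDAnalyzer books its histograms per run in bookHistograms() and
// fills them per event in analyze(); beginJob()/endJob() do not exist.
#include <string>

#include "DQMServices/Core/interface/DQMEDAnalyzer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/Framework/interface/Run.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

class DemoDQMAnalyzer : public DQMEDAnalyzer {
public:
  explicit DemoDQMAnalyzer(edm::ParameterSet const& ps)
      : folder_(ps.getParameter<std::string>("folder")) {}

  // Booking happens here, once per run, never in a job-level hook.
  void bookHistograms(DQMStore::IBooker& ibooker,
                      edm::Run const&,
                      edm::EventSetup const&) override {
    ibooker.setCurrentFolder(folder_);
    example_ = ibooker.book1D("example", "example;x;entries", 100, 0., 100.);
  }

  void analyze(edm::Event const& event, edm::EventSetup const&) override {
    // A real module would fill from event products; this just fills something.
    example_->Fill(event.id().event() % 100);
  }

private:
  std::string folder_;
  MonitorElement* example_ = nullptr;
};

DEFINE_FWK_MODULE(DemoDQMAnalyzer);
```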
A large number of `DQMOneEDAnalyzer`s remain, so we won't get concurrent lumisections immediately. These modules need to be either reviewed and rewritten in a concurrent-lumi-safe way (e.g. by using `LuminosityBlockCache`s), or removed from the production sequences.
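To illustrate the `LuminosityBlockCache` idea, here is a generic framework sketch (hypothetical module and cache names, not code from this PR): per-lumisection state lives in a framework-owned cache object, one instance per in-flight lumisection, instead of in module member data, so concurrent lumisections do not interfere:

```cpp
// Hypothetical example, not part of this PR: an edm::stream module using
// edm::LuminosityBlockCache to keep per-lumisection state safely.
#include <atomic>
#include <memory>

#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/LuminosityBlock.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/Framework/interface/stream/EDAnalyzer.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

struct LumiCounts {
  // mutable + atomic: the framework hands the cache out as const, and
  // several streams may update it concurrently within one lumisection.
  mutable std::atomic<unsigned int> nEvents{0};
};

class DemoLumiCacheAnalyzer
    : public edm::stream::EDAnalyzer<edm::LuminosityBlockCache<LumiCounts>> {
public:
  explicit DemoLumiCacheAnalyzer(edm::ParameterSet const&) {}

  // Called once per lumisection (not per stream); all streams processing
  // this lumisection share the returned cache object.
  static std::shared_ptr<LumiCounts> globalBeginLuminosityBlock(edm::LuminosityBlock const&,
                                                                edm::EventSetup const&,
                                                                RunContext const*) {
    return std::make_shared<LumiCounts>();
  }

  void analyze(edm::Event const&, edm::EventSetup const&) override {
    // Per-event work updates the per-lumi cache, not a module member.
    ++luminosityBlockCache()->nEvents;
  }

  // Called once, after all streams finished the lumisection; summary logic
  // that used to live in endLuminosityBlock() would go here.
  static void globalEndLuminosityBlock(edm::LuminosityBlock const&,
                                       edm::EventSetup const&,
                                       LuminosityBlockContext const* context) {
    unsigned int n = context->luminosityBlockCache()->nEvents;
    (void)n;  // a real module would record or report this count
  }
};

DEFINE_FWK_MODULE(DemoLumiCacheAnalyzer);
```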
Back in 2018, we moved to `edm::one` to reduce memory usage. This change gets us the best of both worlds: concurrent processing, while the memory usage stays within O(#concurrent lumisections), compared to the O(#streams) of 2015, thanks to the new `DQMStore` logic added in #28622 (histograms are now kept once per in-flight lumisection rather than once per stream).

PR validation:
Not much so far, and it is going to be hard. The PR/IB tests barely use multi-threaded execution, let alone concurrent lumisections. The big risk is race conditions, which are notoriously hard to spot even when the right code paths are executed. A clean, bit-by-bit comparison on a larger sample might be a good idea.