Noisy ECAL DQM in gpu validation workflows #42720

Closed
mmusich opened this issue Sep 5, 2023 · 24 comments

@mmusich
Contributor

mmusich commented Sep 5, 2023

When adding a new workflow for Patatrack validation on 2023 data (see PR #42674), the ECAL DQM for the GPU task is very noisy, emitting warnings like the following several times per event:

Begin processing the 1st record. Run 366727, Event 132255498, LumiSection 89 on stream 0 at 30-Aug-2023 11:19:08.112 CEST
%MSG-w EcalDQM:  EcalDQMonitorTask:ecalMonitorTaskEcalOnly  30-Aug-2023 11:19:10 CEST Run: 366727 Event: 132255498
EcalRawDataCollection does not exist. No event-type filtering will be applied
%MSG
%MSG-w EcalDQM:  EcalDQMonitorTask:ecalMonitorTaskEcalOnly  30-Aug-2023 11:19:10 CEST Run: 366727 Event: 132255498
Ecal Monitor Source::runOnCollection: EcalRawData does not exist
%MSG
%MSG-w EcalDQM:  EcalDQMonitorTask:ecalMonitorTaskEcalOnly  30-Aug-2023 11:19:10 CEST Run: 366727 Event: 132255498
Ecal Monitor Source::runOnCollection: TrigPrimEmulDigi does not exist
%MSG
%MSG-w EcalDQM:  EcalDQMonitorTask:ecalMonitorTaskEcalOnly  30-Aug-2023 11:19:10 CEST Run: 366727 Event: 132255498
Ecal Monitor Source::runOnCollection: EBGpuRecHit does not exist
%MSG
%MSG-w EcalDQM:  EcalDQMonitorTask:ecalMonitorTaskEcalOnly  30-Aug-2023 11:19:10 CEST Run: 366727 Event: 132255498
Ecal Monitor Source::runOnCollection: EEGpuRecHit does not exist
%MSG
%MSG-w EcalDQM:  EcalDQMonitorTask:ecalMonitorTaskEcalOnly  30-Aug-2023 11:19:10 CEST Run: 366727 Event: 132255498
Ecal Monitor Source::runOnCollection: EBReducedRecHit does not exist
%MSG
%MSG-w EcalDQM:  EcalDQMonitorTask:ecalMonitorTaskEcalOnly  30-Aug-2023 11:19:10 CEST Run: 366727 Event: 132255498
Ecal Monitor Source::runOnCollection: EEReducedRecHit does not exist
%MSG
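
To quantify the noise, the warnings can be tallied per missing collection. The following is a minimal sketch in plain Python, assuming only the two message shapes visible in the excerpt above; the helper name `count_missing_collections` is hypothetical and not part of CMSSW.

```python
import re
from collections import Counter

# Matches the two message shapes seen in the log excerpt:
#   "<Name> does not exist. ..."
#   "Ecal Monitor Source::runOnCollection: <Name> does not exist"
_MISSING = re.compile(
    r"^(?:Ecal Monitor Source::runOnCollection: )?(\w+) does not exist"
)

def count_missing_collections(log_text):
    """Return a Counter: collection name -> number of 'does not exist' warnings."""
    counts = Counter()
    for line in log_text.splitlines():
        match = _MISSING.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

# Two warnings copied from the excerpt above.
sample = """\
%MSG-w EcalDQM:  EcalDQMonitorTask:ecalMonitorTaskEcalOnly  30-Aug-2023 11:19:10 CEST Run: 366727 Event: 132255498
EcalRawDataCollection does not exist. No event-type filtering will be applied
%MSG
%MSG-w EcalDQM:  EcalDQMonitorTask:ecalMonitorTaskEcalOnly  30-Aug-2023 11:19:10 CEST Run: 366727 Event: 132255498
Ecal Monitor Source::runOnCollection: EBGpuRecHit does not exist
%MSG
"""

for name, n in count_missing_collections(sample).most_common():
    print(name, n)
```

Run over a full step log, the per-collection counts should show which input tags account for most of the warnings.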

These also appear in the DQM/Integration unit tests for the ECAL GPU client (introduced in PR #42542), which are run in IBs:

see e.g. a log for CMSSW_13_3_X_2023-09-04-1100

Could core DQM / ECAL DQM experts have a look?

@mmusich
Contributor Author

mmusich commented Sep 5, 2023

assign dqm, ecal-dpg

@cmsbuild
Contributor

cmsbuild commented Sep 5, 2023

A new Issue was created by @mmusich Marco Musich.

@Dr15Jones, @rappoccio, @smuzaffar, @makortel, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Sep 5, 2023

New categories assigned: dqm,ecal-dpg

@tjavaid,@micsucmed,@nothingface0,@wang0jin,@rvenditti,@emanueleusai,@syuvivida,@thomreis,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@alejands
Contributor

alejands commented Sep 5, 2023

@mmusich Do we expect the ECAL GPU validation plots to be filled for these validation tests? If not, we can suppress the warnings to avoid spam.

@mmusich
Contributor Author

mmusich commented Sep 5, 2023

@alejands

Do we expect the ECAL GPU validation plots to be filled for these validation tests?

Yes, we do. That's the whole purpose of the validation workflow.

@makortel
Contributor

makortel commented Sep 5, 2023

assign heterogeneous

@cmsbuild
Contributor

cmsbuild commented Sep 5, 2023

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@abhih1
Contributor

abhih1 commented Sep 5, 2023

@alejands

Do we expect the ECAL GPU validation plots to be filled for these validation tests?

Yes, we do. That's the whole purpose of the validation workflow.

@mmusich I think what we wanted to clarify was whether the GPU validation plots are indeed getting filled for these tests despite these warnings.
We suspect these warnings are coming from the difference in the inputsource for the unit tests, a change recently made through this commit: 56c6da4#diff-7e9b17ce9f9a19ae82d8a705a90d5fab2e021a4f0e18f0a84d7a9875621c4224L14-R15

If the plots are getting filled, then we could suppress the number of warnings to reduce the noise.

@mmusich
Contributor Author

mmusich commented Sep 5, 2023

We suspect these warnings are coming from the difference in the inputsource for the unit tests, a change recently made through this commit:

That test uses DQM streamer files as input, which is exactly what the online DQM sees at P5. So the input collections should be adapted to cope with that. Also, what about the relval validation workflow introduced in #42674, which uses regular repacked RAW data?
That cannot depend on commit 56c6da4.

was whether the GPU validation plots are indeed getting filled for these tests despite these warnings.

I can't really comment on that. I would invite you to run one of these tests locally and check. Hope this helps.

@alejands
Contributor

alejands commented Sep 5, 2023

@mmusich I was able to fix a bug causing the warnings for the ECAL GPU validation task and verified it using the tests introduced in #42542, but I was unable to reproduce the error in #42674 on the lxplus-gpu node.

I keep getting errors like this when trying to use CMSSW on the lxplus-gpu node:

/cvmfs/cms.cern.ch/slc7_amd64_gcc11/external/git/2.38.1-ee97e960a104e95b9f0c52a98b85ce43/libexec/git-core/git-remote-https: error while loading shared libraries: libssl.so.10: cannot open shared object file: No such file or directory

Some of the warnings for #42674 don't seem to be related to the GPU validation module, so I would like to reproduce the error myself in order to find the culprit.

@mmusich
Contributor Author

mmusich commented Sep 6, 2023

I keep getting errors like this when trying to use CMSSW on the lxplus-gpu node:

When developing for #42674 I was using an lxplus8-gpu node, though I don't immediately see why the arch should matter. Maybe you can try to test there.

@fwyzard
Contributor

fwyzard commented Sep 6, 2023

I just noticed that some (all?) lxplus-gpu nodes are running Red Hat Enterprise Linux release 9.2. Maybe that's why the SLC7 version of git is having trouble.

@alejands
Contributor

alejands commented Sep 6, 2023

When developing for #42674 I was using an lxplus8-gpu node, though I don't immediately see why the arch should matter. Maybe you can try to test there.

I was able to run wf 141.008583 on the lxplus-gpu and debug, thanks!

@alejands
Contributor

alejands commented Sep 6, 2023

For wf 141.008583, there were two separate but related issues concerning which collections the task was trying to use. Some collection tags are modified for online GPU validation, but not for offline.

  1. The first issue has to do with looking for ECAL GPU rec hits. As far as I know, these are not yet implemented, so naturally it won't find any GPU rec hit collections. A placeholder input tag is in place, but I believe the GPU tags in particular don't actually point to anything. Perhaps @thomreis can comment on this.

EBCpuRecHit = cms.untracked.InputTag("ecalRecHit@cpu", "EcalRecHitsEB"),
EECpuRecHit = cms.untracked.InputTag("ecalRecHit@cpu", "EcalRecHitsEE"),
EBGpuRecHit = cms.untracked.InputTag("ecalRecHit@cuda", "EcalRecHitsEB"),
EEGpuRecHit = cms.untracked.InputTag("ecalRecHit@cuda", "EcalRecHitsEE")

Rec hit GPU validation was turned on in the GPU module by default, but it is manually turned off in the Online ECAL DQM GPU client ecalgpu_dqm_sourceclient-live_cfg.py. The online client is also set to only run the GPU validation module and no other ECAL DQM modules.

process.ecalGpuTask.params.runGpuTask = True
process.ecalGpuTask.params.enableRecHit = False
process.ecalMonitorTask.workers = ['GpuTask']

  2. If we only care to run the GPU validation module here as well, then that should also solve most of the second issue. In online GPU validation, there are modified input tags for some collections, but again there is no such modification done for offline validation.

# ecalMonitorTask always looks for EcalRawData collection when running, even when not in use
# Default value is cms.untracked.InputTag("ecalDigis")
# Tag is changed below to avoid multiple warnings per event
process.ecalMonitorTask.collectionTags.EcalRawData = cms.untracked.InputTag("hltEcalDigisLegacy")
# Streams used for online GPU validation
process.ecalMonitorTask.collectionTags.EBCpuDigi = cms.untracked.InputTag("hltEcalDigisLegacy", "ebDigis")
process.ecalMonitorTask.collectionTags.EECpuDigi = cms.untracked.InputTag("hltEcalDigisLegacy", "eeDigis")
process.ecalMonitorTask.collectionTags.EBGpuDigi = cms.untracked.InputTag("hltEcalDigisFromGPU", "ebDigis")
process.ecalMonitorTask.collectionTags.EEGpuDigi = cms.untracked.InputTag("hltEcalDigisFromGPU", "eeDigis")

If we are only running the GPU validation module (except for rec hits), then the only input tag complaining is for EcalRawData.

EcalRawData = cms.untracked.InputTag("ecalDigis"),

This collection is unused by the GPU module, but this input tag is always checked to run any ECAL DQM processing for historical reasons. The ECAL digis for GPU validation use the same collection, but the input tags include the @cpu and @cuda modifiers.

EBCpuDigi = cms.untracked.InputTag("ecalDigis@cpu", "ebDigis"),
EECpuDigi = cms.untracked.InputTag("ecalDigis@cpu", "eeDigis"),
EBGpuDigi = cms.untracked.InputTag("ecalDigis@cuda", "ebDigis"),
EEGpuDigi = cms.untracked.InputTag("ecalDigis@cuda", "eeDigis"),

These tags are being used and are not complaining, so would it be okay to customize EcalRawData to ecalDigis@cpu here for GPU validation? I want to make sure there's no nuance to worry about here.

  2. (b) If we do care about running other modules, then I believe the issue also has to do with a difference in input tags. The remaining collection tags that are complaining are

TrigPrimEmulDigi = cms.untracked.InputTag("valEcalTriggerPrimitiveDigis"),

EBReducedRecHit = cms.untracked.InputTag("reducedEcalRecHitsEB"),
EEReducedRecHit = cms.untracked.InputTag("reducedEcalRecHitsEE"),

If they are being used here, are these the correct input tags? If not, then we don't need to worry about these. The warning should go away by only enabling the GPU module.
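
The retargeting proposed in this comment (pointing EcalRawData at ecalDigis@cpu) can be sketched without CMSSW. The following is a minimal illustration under stated assumptions: plain tuples stand in for cms.untracked.InputTag, the dictionary stands in for the collectionTags PSet, and the helper `retarget` is hypothetical. In a real configuration the equivalent would be an assignment on process.ecalMonitorTask.collectionTags, as in the online client lines quoted above.

```python
# Plain-Python stand-in for the collectionTags PSet:
# collection name -> (module label, product instance).
# Values are copied from the defaults quoted in this comment.
DEFAULT_TAGS = {
    "EcalRawData": ("ecalDigis", ""),
    "EBCpuDigi": ("ecalDigis@cpu", "ebDigis"),
    "EECpuDigi": ("ecalDigis@cpu", "eeDigis"),
    "EBGpuDigi": ("ecalDigis@cuda", "ebDigis"),
    "EEGpuDigi": ("ecalDigis@cuda", "eeDigis"),
}

def retarget(tags, overrides):
    """Return a copy of `tags` with `overrides` applied; reject unknown names."""
    unknown = sorted(set(overrides) - set(tags))
    if unknown:
        raise KeyError(f"unknown collection tag(s): {unknown}")
    merged = dict(tags)
    merged.update(overrides)
    return merged

# The proposal: point EcalRawData at the CPU branch of ecalDigis.
gpu_validation_tags = retarget(
    DEFAULT_TAGS, {"EcalRawData": ("ecalDigis@cpu", "")}
)
print(gpu_validation_tags["EcalRawData"])
```

Rejecting unknown names guards against a misspelled tag silently leaving the default in place, which would bring the warning back.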

@fwyzard
Contributor

fwyzard commented Sep 7, 2023

hi @alejands,
I think we should be able to run the full ECAL DQM on these workflows.

Does it work for the ECAL only configurations (.511 for MC) ?

@mmusich
Contributor Author

mmusich commented Sep 7, 2023

Does it work for the ECAL only configurations (.511 for MC) ?

as far as I can tell, similar warnings are visible in .511 workflows in IBs too: https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el9_amd64_gcc11/CMSSW_13_3_X_2023-09-06-2300/pyRelValMatrixLogs/run/11634.511_TTbar_14TeV+2021_Patatrack_ECALOnlyCPU/step3_TTbar_14TeV+2021_Patatrack_ECALOnlyCPU.log#/

@alejands
Contributor

alejands commented Sep 15, 2023

I was able to modify one of the cfg files produced by WF 141.008583 and get all the ECAL collections being produced during the RECO step.

For the ECAL Trigger Primitives, there are no emulated digi collections available, only the standard ones.

edm::SortedCollection<EcalTriggerPrimitiveDigi,edm::StrictWeakOrdering<EcalTriggerPrimitiveDigi> >    "ecalDigis"                 "EcalTriggerPrimitives"   "reRECO"

There are also no Reduced Ecal Rec Hits available. Fortunately for both of these cases, we can easily toggle these off and eliminate the warnings.

runOnEmul = cms.untracked.bool(True),

fillRecoFlagReduced = cms.untracked.bool(True)

There is the question of at what level we want to change these flags (e.g. in the ECAL DQM cfi files in DQM/EcalMonitorTasks, at the level of the cfi and cff files in DQMOffline/Ecal, or at the WF level).
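
Whichever level is chosen, the change amounts to setting the two flags to False. The following is a minimal sketch using nested dictionaries as a stand-in for the worker parameter sets. The worker names (`TrigPrimTask`, `RecoSummaryTask`) and the nesting under `params` are assumptions made for illustration, and `set_flag` is a hypothetical helper, not a CMSSW function.

```python
def set_flag(params, name, value):
    """Set every occurrence of `name` in a nested dict; return how many were changed."""
    changed = 0
    for key, val in params.items():
        if key == name:
            params[key] = value  # overwriting an existing key is safe while iterating
            changed += 1
        elif isinstance(val, dict):
            changed += set_flag(val, name, value)
    return changed

# Assumed layout: one entry per DQM worker, each with its own params.
worker_parameters = {
    "TrigPrimTask": {"params": {"runOnEmul": True}},
    "RecoSummaryTask": {"params": {"fillRecoFlagReduced": True}},
}

# Turn off the two features whose input collections are not produced.
for flag in ("runOnEmul", "fillRecoFlagReduced"):
    n = set_flag(worker_parameters, flag, False)
    print(flag, "changed in", n, "place(s)")
```

Returning the number of changes makes it easy to notice when a flag was not found at all, for example because it lives at a different level than assumed.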


However, for the case of EcalRawData, we require an EcalRawDataCollection (wrapper for an EcalDCCHeaderBlock collection) which is not available. They are, however, available for the ECAL Preshower.

edm::SortedCollection<ESDCCHeaderBlock,edm::StrictWeakOrdering<ESDCCHeaderBlock> >    "ecalPreshowerDigis"        ""                "reRECO"

It appears that this collection is more important for several ECAL DQM modules, but as far as I know, ECAL DQM is not directly in charge of which collections are produced.
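
The availability check described in this comment can be automated against an event-content listing. The following is a minimal sketch, assuming lines in the format of the two quoted above (type, then quoted module label, product instance and process name; a tool such as edmDumpEventContent prints this shape, though which tool was used is an assumption). The helpers `parse_dump_line` and `has_product` are hypothetical.

```python
import re

def parse_dump_line(line):
    """Split one listing line into (type, module label, instance, process), or None."""
    quoted = re.findall(r'"([^"]*)"', line)
    if len(quoted) != 3:
        return None
    type_name = line.split('"', 1)[0].strip()
    return type_name, quoted[0], quoted[1], quoted[2]

def has_product(lines, type_substring):
    """True if any listed product's type contains `type_substring`."""
    for line in lines:
        parsed = parse_dump_line(line)
        if parsed and type_substring in parsed[0]:
            return True
    return False

# The two lines quoted in this comment.
listing = [
    'edm::SortedCollection<EcalTriggerPrimitiveDigi,edm::StrictWeakOrdering<EcalTriggerPrimitiveDigi> >    "ecalDigis"                 "EcalTriggerPrimitives"   "reRECO"',
    'edm::SortedCollection<ESDCCHeaderBlock,edm::StrictWeakOrdering<ESDCCHeaderBlock> >    "ecalPreshowerDigis"        ""                "reRECO"',
]

print("ESDCCHeaderBlock present:  ", has_product(listing, "ESDCCHeaderBlock"))
print("EcalDCCHeaderBlock present:", has_product(listing, "EcalDCCHeaderBlock"))
```

On these two lines the preshower header block is found and the ECAL one is not, which matches the observation above.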

@mmusich
Contributor Author

mmusich commented Sep 18, 2023

There is the question of at what level we want to change these flags (e.g. in the ECAL DQM cfi files in DQM/EcalMonitorTasks, at the level of the cfi and cff files in DQMOffline/Ecal, or at the WF level).

I guess this depends on whether these flags are useful in other setups (which only ECAL DQM would know).

It appears that this collection is more important for several ECAL DQM modules, but as far as I know, ECAL DQM is not directly in charge of which collections are produced.

maybe (but ECAL DPG at large certainly is).

@thomreis
Contributor

I have updated the ECAL unpacker CPUDigis module to produce dummy collections for the EcalRawDataCollection and some other collections that the CPU unpacker can produce but the GPU one could not. See PR #42844.
This should fix the EcalRawData warning messages.

@alejands
Contributor

I implemented the changes discussed earlier in #42848. Before knowing about the fix for EcalRawData in #42844, I implemented a customization option to skip collections and not consume them on the ECAL DQM end. While this is no longer needed, we decided to leave it in as it could prove useful for something else in the future, though it's not being used right now.
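
The idea of skipping collections so that they are never consumed can be illustrated as follows. This is not the code from #42848, only a minimal sketch of the concept: the function name, the `skip` argument, and the tuple representation of input tags are all hypothetical.

```python
def collections_to_consume(collection_tags, skip=()):
    """Return the tags that should be consumed, leaving out any name listed in `skip`."""
    skipped = set(skip)
    return {
        name: tag for name, tag in collection_tags.items() if name not in skipped
    }

# Tags taken from this thread; tuples are (module label, product instance).
tags = {
    "EcalRawData": ("ecalDigis", ""),
    "EBGpuRecHit": ("ecalRecHit@cuda", "EcalRecHitsEB"),
    "EEGpuRecHit": ("ecalRecHit@cuda", "EcalRecHitsEE"),
}

# Skip the GPU rec hit collections, which are not produced yet.
active = collections_to_consume(tags, skip=["EBGpuRecHit", "EEGpuRecHit"])
print(sorted(active))
```

A collection that is never consumed is never looked up, so no "does not exist" warning would be emitted for it.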

@alejands
Contributor

I believe this issue is now resolved and can be closed.

@mmusich
Contributor Author

mmusich commented Sep 28, 2023

I believe this issue is now resolved and can be closed.

It is still lacking signatures from the involved groups.

@thomreis
Contributor

thomreis commented Oct 3, 2023

+ecal-dpg

@mmusich
Contributor Author

mmusich commented Oct 9, 2023

Solved in #42844 + #42848
