Concurrent Lumis and DQM at HLT #28341

schneiml · 2019-11-04T09:52:41Z

Dear HLT and Core Experts,

From recent discussions I hear that HLT is planning to move to concurrent lumisections by Run3. This opens new questions regarding DQM@HLT:

As I understand it, all DQM@HLT currently uses DQMGlobalEDAnalyzer-based modules. These modules hold their histograms in a Run Cache.
However, the histograms produced at HLT are shipped to online DQM on a per-lumisection basis. To my knowledge, this is the only way that these histograms are saved.
This works thanks to the edm::Service<DQMStore>, which can pull the histograms out without help from EDM. Various thread-safety concerns appear (e.g. the edm::global DQMFileSaver [1] sounds very dangerous: it does not take any locks! That is definitely racy in harvesting, unless running any edm::one module blocks global modules from running), but it is logically sound as long as lumis are processed one by one.
However, as soon as we have concurrent lumisections, this becomes incorrect and irreproducible. We would be filling the same histograms from different lumisections, and then taking a (supposedly) per-lumi snapshot at some random point.
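To make the snapshot problem concrete, here is a minimal toy model in plain Python. It is not CMSSW code: the function name and the event lists are made up for illustration, and the shared `Counter` only stands in for a histogram held in a run cache. It records which lumisection each entry came from and copies the shared object whenever a lumisection ends.

```python
from collections import Counter

def snapshot_at_lumi_end(processing_order):
    """Fill one shared run-level object and copy it whenever a lumi ends.

    processing_order: lumi number of each event, in the order events are processed.
    Returns {lumi: Counter mapping source lumi -> entries present in that snapshot}.
    """
    # Position of the last event of each lumi (hypothetical stand-in for "end of lumi").
    last_index = {lumi: i for i, lumi in enumerate(processing_order)}
    filled_from = Counter()  # stand-in for the shared run-cache histogram
    snapshots = {}
    for i, lumi in enumerate(processing_order):
        filled_from[lumi] += 1
        if last_index[lumi] == i:
            snapshots[lumi] = Counter(filled_from)  # the "per-lumi" snapshot
    return snapshots

sequential = [1, 1, 1, 2, 2, 2]    # lumis processed one by one
interleaved = [1, 2, 1, 2, 1, 2]   # two lumis in flight at the same time

seq = snapshot_at_lumi_end(sequential)
con = snapshot_at_lumi_end(interleaved)
print(dict(seq[1]))  # only entries from lumi 1
print(dict(con[1]))  # also contains entries from lumi 2
```

In the sequential case the snapshot taken at the end of lumi 1 holds only lumi 1 entries. In the interleaved case the same snapshot already contains entries from lumi 2, and how many depends purely on the processing order.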

Now, there are many ways to handle that in the future:

1. Ignore it. The resulting histograms will be perfectly fine for monitoring, even if the lumi boundaries are a bit blurry. We will also continue to need an edm::Service for online DQM in the foreseeable future, and can also keep using that in HLT. [2]
2. Migrate the DQMGlobalEDAnalyzer modules to use a lumi cache instead of a run cache, and only save run-based histograms at the end of the run (this most likely makes them completely unusable for DQM@HLT).
3. Implement a mechanism to get Run-based histograms out of HLT. This can be combined with the other solutions.
4. Migrate the DQMGlobalEDAnalyzer to DQMEDAnalyzer once we get these back to edm::stream (this should ideally happen before CMSSW11, depending on progress with the other PRs now).
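As a rough illustration of the lumi-cache idea, the sketch below keeps one histogram per lumisection in flight and only builds the run-level sum at the end of the run. This is a toy model under stated assumptions: `LumiCachedHistogram` and its methods are hypothetical names, not the DQMGlobalEDAnalyzer or DQMStore interface.

```python
from collections import Counter, defaultdict

class LumiCachedHistogram:
    """Toy model: one histogram per open lumi, run-level sum only at end of run."""

    def __init__(self):
        self._open = defaultdict(Counter)  # lumi -> histogram still being filled
        self.closed = {}                   # lumi -> final per-lumi histogram

    def fill(self, lumi, bin_index):
        # Entries go to the cache of the lumi the event belongs to.
        self._open[lumi][bin_index] += 1

    def end_lumi(self, lumi):
        # Closing a lumi yields a histogram with entries from that lumi only.
        self.closed[lumi] = self._open.pop(lumi, Counter())
        return self.closed[lumi]

    def end_run(self):
        # The run-based histogram exists only here, as the sum of closed lumis.
        total = Counter()
        for hist in self.closed.values():
            total.update(hist)
        return total

h = LumiCachedHistogram()
for lumi, bin_index in [(1, 0), (2, 1), (1, 1), (2, 1), (1, 0)]:
    h.fill(lumi, bin_index)
lumi1 = h.end_lumi(1)   # lumi 2 is still open and does not leak into this
h.fill(2, 0)
lumi2 = h.end_lumi(2)
run_total = h.end_run()
```

The per-lumi outputs are well defined even with interleaved fills, but nothing run-based is available before `end_run()`, which is the drawback noted in option 2.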

I'm looking forward to any additional opinions on this aspect; it might be fine to not do anything, but we should be aware of the implications.

[1] https://cmssdt.cern.ch/dxr/CMSSW/source/DQMServices/Components/plugins/DQMFileSaver.h?from=DQMFileSaver#15

[2] Note that for online DQM we have the same problem, twice: we have modules (e.g. specific edm::one users) that can't save per-lumi, but need their histograms shown live in online DQM, and we have modules that can save per-lumi, but online DQM traditionally can update much faster than per-lumisection. So the edm::Service or similar is unavoidable here, but also less of a problem, since online DQM can happily run single-threaded, with no concurrent lumis.


cmsbuild · 2019-11-04T09:53:08Z

A new Issue was created by @schneiml Marcel Schneider.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones · 2019-11-06T23:58:47Z

assign hlt, core

cmsbuild · 2019-11-06T23:59:16Z

New categories assigned: core,hlt

@Dr15Jones,@smuzaffar,@Martin-Grunewald,@fwyzard you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel · 2021-03-16T15:33:21Z

Is this issue still relevant? @cms-sw/dqm-l2

(I'd guess "yes")

jfernan2 · 2021-03-16T16:10:51Z

My understanding is that yes, it is still relevant.
IMHO, the main issue is the need to keep histograms shown live in online DQM. A single thread at Online DQM is still used, but I am not sure of the plan for HLT.
Thanks

fwyzard · 2021-03-16T18:07:22Z

I guess the two options are:

- drop support for DQM at HLT: no more DQM histograms coming from the HLT to the online DQM, etc.
- keep support for DQM at HLT: HLT still needs to make per-Run (and per-LS?) histograms

What I recall is that the only way to get per-Run histograms from the HLT to the DQM was to "slice" them into per-LS updates; if that is not needed/supported any more, or if there is a better way to do it, we should definitely consider the alternative.

About the actual DQM modules running at HLT: I don't know what has changed in DQM-land since 2019; maybe some of the original issues have already been addressed?

Off the top of my head, the requirements from the HLT side are:

- run multithreaded, with multiple concurrent streams and (potentially) concurrent lumisections
- keep only one single copy of each histogram (unless multiple copies are needed for correctness in the case of concurrent lumisections)
- merging/accumulating histograms from all HLT nodes should be complete (or have the possibility of accounting for missing parts)
- older histograms should not overwrite newer histograms (e.g. if a lumisection gets delayed)
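The last two requirements can be sketched with a small toy merger. Everything here is an assumption made for illustration: the class name, the node names such as `fu-01`, and the idea that each node ships cumulative run-level snapshots tagged with a lumisection number. It is not a description of the existing merging infrastructure.

```python
from collections import Counter

class RunHistogramMerger:
    """Toy model: keep the newest cumulative snapshot per node and report gaps."""

    def __init__(self, expected_nodes):
        self.expected_nodes = set(expected_nodes)
        self.latest = {}  # node -> (lumi, cumulative histogram)

    def update(self, node, lumi, histogram):
        """Accept a snapshot; return False if it is older than what we already hold."""
        seen = self.latest.get(node)
        if seen is not None and lumi <= seen[0]:
            return False  # a delayed, older snapshot must not overwrite a newer one
        self.latest[node] = (lumi, Counter(histogram))
        return True

    def merged(self):
        """Return the summed histogram and the set of nodes with no contribution."""
        total = Counter()
        for _lumi, hist in self.latest.values():
            total.update(hist)
        missing = self.expected_nodes - set(self.latest)
        return total, missing

merger = RunHistogramMerger(["fu-01", "fu-02", "fu-03"])
merger.update("fu-01", 2, {0: 5})
merger.update("fu-02", 2, {0: 3})
stale_accepted = merger.update("fu-01", 1, {0: 2})  # delayed lumi 1 snapshot
total, missing = merger.merged()
```

Here the delayed snapshot from `fu-01` is rejected, and the merged result comes with an explicit list of nodes that have not reported, so a consumer can tell a complete sum from a partial one.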

Not sure what the best means to achieve them are.

jfernan2 · 2021-03-17T18:01:58Z

I apologize in advance if my answer does not reach the desired level for this conversation, but unfortunately the real expert on this matter, @schneiml, has been out of CMS for a long time, as you probably know.

I am afraid that the original problems stated in this issue concerning DQM@HLT have not been addressed. According to the description by Marcel and the PRs that came afterwards, especially #28622, where DQMGlobalEDAnalyzer (based on edm::global::EDProducer) was modified, it is used for DQM@HLT and a few random other things, and cannot save per-lumi histograms (this conflicts with the fact that HLT typically saves only per-lumi histograms).

What seems to have changed since this issue was raised is DQMEDAnalyzer: as of #28813 it is based on edm::stream.

This links to the last solution proposed by Marcel in the description of this issue, which may be the best option: migrate the DQMGlobalEDAnalyzer to DQMEDAnalyzer, at least for HLT purposes.

But I'd like to know first the opinion of the real experts who are still alive, @makortel @Dr15Jones
Thanks

makortel · 2021-03-19T15:20:47Z

I would not trust our memory much, but from what I recall and see in the code, migrating DQMGlobalEDAnalyzer to DQMEDAnalyzer could be the least-effort way forward (and would fulfill the requirements to support multiple threads, streams, and concurrent lumisections, and to have minimal copies of histograms in memory).
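The general pattern behind a stream-based module can be sketched as follows: each stream fills a private histogram without locking, and the private copies are merged once per stream when the lumisection ends. This is only a toy model using Python threads; `process_lumi` and `stream_worker` are hypothetical names, and the sketch makes no claim about how DQMEDAnalyzer or DQMStore actually implement this.

```python
import threading
from collections import Counter

def process_lumi(events_per_stream):
    """Each stream fills a private histogram; merge once when the stream is done."""
    merged = Counter()
    merge_lock = threading.Lock()

    def stream_worker(events):
        local = Counter()  # private to this stream, so filling needs no lock
        for bin_index in events:
            local[bin_index] += 1
        with merge_lock:   # one synchronised step per stream and lumi
            merged.update(local)

    threads = [threading.Thread(target=stream_worker, args=(events,))
               for events in events_per_stream]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return merged

# Three streams, each with the bin indices of the events it processed.
result = process_lumi([[0, 1, 1], [1, 2], [0, 0, 2]])
```

The cost is one private copy per stream while a lumisection is open, in exchange for a lock-free fill path and a merged result that does not depend on thread scheduling.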

An alternative could be to add per-LS histogram support to DQMGlobalEDAnalyzer. But given the way DQMStore currently appears to assume a module instance processing all events of one LS at a time for per-LS histograms, such development would probably be quite involved (although I still believe it would be possible to craft a design that would allow that and to e.g. avoid locks in MonitorElement, but that probably goes beyond the current discussion).

I assume the latter two points of @fwyzard (completeness of merging/accumulating from all HLT nodes, and newer histograms not being overwritten by older histograms) are more for the histogram merging infrastructure outside of CMSSW.

cmsbuild added the pending-assignment label Nov 4, 2019

cmsbuild added core-pending hlt-pending pending-signatures and removed pending-assignment labels Nov 6, 2019

schneiml mentioned this issue Dec 13, 2019

DQM: new DQMStore. #28622

Merged

makortel mentioned this issue Feb 12, 2020

DQM: Disable assertLegacySafe when concurrent lumis are enabled. #28920

Merged

jfernan2 mentioned this issue Feb 25, 2022

fillDescriptions() for DQMEDAnalyzer modules uses the EDProducer type #37067

Open
