Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrent Lumis and DQM at HLT #28341

Open
schneiml opened this issue Nov 4, 2019 · 8 comments
Open

Concurrent Lumis and DQM at HLT #28341

schneiml opened this issue Nov 4, 2019 · 8 comments

Comments

@schneiml
Copy link
Contributor

schneiml commented Nov 4, 2019

Der HLT and Core Experts,

From recent discussions I hear that HLT is planning to move to concurrent lumisections by Run3. This opens new questions regarding DQM@HLT:

  • As I understand currently all DQM@HLT uses DQMGlobalEDAnalyzer based modules. These modules hold their histograms in a Run Cache.

  • However, the histograms produced at HLT are shipped to online DQM on a per-lumisection basis. To my knowledge, this is the only way that these histograms are saved.

  • This works thanks to the edm::Service<DQMStore>, which can pull the histograms out without help form EDM. Various thread-safety concerns appear (e.g. the edm::global DQMFileSaver [1] sounds very dangerous; it does not take any locks! That is definitely racy in harvesting -- unless running any one-module blocks global modules from running), but it is logically sound as long as lumis are processed one by one.

  • However, as soon as we have concurrent lumisections, this becomes incorrect, and irreproducible. We are filling the same histograms from different lumisections, and then take a (supposedly) per-lumi snapshot at some random point.

Now, there are many ways to handle that in the future:

  • Ignore it. The resulting histograms will be perfectly fine for monitoring, even if the lumi boundaries are a bit blurry. We will also continue to need an edm::Service for online DQM in the foreseeable future, and can also keep using that in HLT. [2]

  • Migrate the DQMGlobalEDAnalyzer modules to use a lumi cache instead of a run cache, and only save run-based histograms at the end of the run (this most likely makes them completely unusable for DQM@HLT).

  • Implement a mechanism to get Run-based histograms out of HLT. This can be combined with the other solutions.

  • Migrate the DQMGlobalEDAnalyzer to DQMEDAnalyzer once we get these back to edm::stream (this should ideally happen before CMSSW11, depending on progress with the other PRs now).

I'm looking forward to any additional opinions on this aspect; it might be fine to not do anything, but we should be aware of the implications.

[1] https://cmssdt.cern.ch/dxr/CMSSW/source/DQMServices/Components/plugins/DQMFileSaver.h?from=DQMFileSaver#15

[2] Note that for online DQM we have the same problem, twice: We have modules (e.g. specific edm::one users) that can't save per-lumi, but need their histograms shown live in online DQM, and we have modules that can save per-lumi, but online DQM traditionally can update much faster than per-lumisection. So the edm::Service or similar is unavoidable here, but also less of a problem, since online DQM can happily run single-threaded, no concurrent lumis.

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 4, 2019

A new Issue was created by @schneiml Marcel Schneider.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Copy link
Contributor

assign hlt, core

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 6, 2019

New categories assigned: core,hlt

@Dr15Jones,@smuzaffar,@Martin-Grunewald,@fwyzard you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

Is this issue still relevant? @cms-sw/dqm-l2

(I'd guess "yes")

@jfernan2
Copy link
Contributor

jfernan2 commented Mar 16, 2021

My understanding is that yes, it is still relevant.
IMHO, the main issue is the need to keep histograms shown live in online DQM. A single thread at Online DQM is still used, but I am not sure of the plan for HLT.
Thanks

@fwyzard
Copy link
Contributor

fwyzard commented Mar 16, 2021

I guess the two options are

  • drop support for DQM at HLT: no more DQM histograms coming from the HLT to the online DQM, etc.
  • keep support for DQM at HLT: HLT still needs to make per-Run (and per-LS ?) histograms

What I recall is that the only way to get per-Run histograms from the HLT to the DQM was to "slice" them into per-LS updates; if that is not needed/supported any more, or if there is a better way to do it, we should definitely consider the alternative.

About the actual DQM modules running at HLT: I don't know what has change in DQM-land since 2019; maybe some of the original issues have already been addressed ?

Out of my head, the requirements from the HLT side are

  • run multithreaded, with multiple concurrent streams and (potentially) concurrent lumisections
  • keep only one single copy of each histogram (unless multiple copies are needed for correctness in the case of concurrent lumisections)
  • merging/accumulating histograms from all HLT nodes should be complete (or have the possibility of accounting for missing parts)
  • older histograms should not overwrite newer histograms (e.g. if a lumisection gets delayed)

Not sure what are the best means to achieve them.

@jfernan2
Copy link
Contributor

I apologize in advance if my answer does not reach the desired levels for this conversation, but unfortunately the real expert on this matter @schneiml is out of CMS since a long time as you probably know.

I am afraid that the original problems stated in this issue concerning DQM@HLT have not been addressed. According to the description by Marcel and the PRs that came afterwards, specially #28622
where DQMGlobalEDAnalyzer based on edm::global::EDProducer was modified, it is used for DQM@HLT and a few random other things, and cannot save per-lumi histograms (this is a conflict with the fact that HLT typically saves only per lumi histograms)

What seems to have changed since this issue was raised, is the change of DQMEDAnalyzer, so that now it is based on edm::stream since #28813

And this links with the last solution proposed by Marcel at the description of this issue, which may be the best option, migrate the DQMGlobalEDAnalyzer to DQMEDAnalyzer, at least for the HLT purpouses.

But I'd like to know first the opinion of the real experts who are still alive, @makortel @Dr15Jones
Thanks

@makortel
Copy link
Contributor

I would not trust much our memory, but what I recall and see in the code, migrating DQMGlobalEDAnalyzer to DQMEDAnalyzer could be the least-effort way forward (and would fulfill the requirements to support multiple threads, streams, and concurrent lumisections, and to have minimal copies of histograms in memory).

An alternative could be to add per LS histogram support to DQMGlobalEDAnalyzer. But given the way DQMStore currently appears to assume a module instance processing all events of one LS at a time for per-LS-histograms, such development would probably be quite involved (although I still believe it would be possible to craft a design that would allow that and to e.g. avoid locks in MonitorElement, but that probably goes beyond the current dicussion).

I assume the latter two points of @fwyzard (completeness of merging/accumulating from all HLT nodes, and older histograms not being overwritten by newer histograms) are more for the histogram merging infrastructure outside of CMSSW.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants