Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DiMuonMassBiasClient does not support concurrent lumis #39180

Closed
makortel opened this issue Aug 24, 2022 · 16 comments · Fixed by #39173
Closed

DiMuonMassBiasClient does not support concurrent lumis #39180

makortel opened this issue Aug 24, 2022 · 16 comments · Fixed by #39173

Comments

@makortel
Copy link
Contributor

Many workflows in CMSSW_12_5_X_2022-08-24-1100 fail with

----- Begin Fatal Exception 24-Aug-2022 15:57:39 CEST-----------------------
An exception of category 'ModulesSynchingOnLumis' occurred while
   [0] Calling beginJob
Exception Message:
The framework is configured to use at least two streams, but the following modules
require synchronizing on LuminosityBlock boundaries:
  DiMuonMassBiasClient ALCARECOTkAlDiMuonMassBiasClient

The situation can be fixed by either
 * modifying the modules to support concurrent LuminosityBlocks (preferred), or
 * setting 'process.options.numberOfConcurrentLuminosityBlocks = 1' in the configuration file
----- End Fatal Exception -------------------------------------------------
@makortel
Copy link
Contributor Author

assign alca

@cmsbuild
Copy link
Contributor

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@cmsbuild
Copy link
Contributor

New categories assigned: alca

@yuanchao,@francescobrivio,@malbouis,@saumyaphor4252,@tvami,@ChrisMisan you have been requested to review this Pull request/Issue and eventually sign? Thanks

@tvami
Copy link
Contributor

tvami commented Aug 24, 2022

I think it's being resolved in
#39166 (comment)

@makortel
Copy link
Contributor Author

makortel commented Aug 24, 2022

Caused by #39148 and DiMuonMassBiasClient being an DQMEDHarvester.

A quick workaround would be to amend

# list of AlCa sequences that have modules that do not support concurrent LuminosityBlocks
AlCaNoConcurrentLumis = [

but given the widespread failures I'm wondering if this would cause too many job types to not use concurrent lumis?

@makortel
Copy link
Contributor Author

makortel commented Aug 24, 2022

I think it's being resolved in #39166 (comment)

How? This condition is not visible in single-thread tests.

@mmusich
Copy link
Contributor

mmusich commented Aug 24, 2022

dcc4868 in #39173 should solve for the time being.
In the long term we'll need a better solution.

@mmusich
Copy link
Contributor

mmusich commented Aug 24, 2022

unassign alca

@mmusich
Copy link
Contributor

mmusich commented Aug 24, 2022

assign dqm

@cmsbuild
Copy link
Contributor

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Copy link
Contributor

mmusich commented Aug 24, 2022

In the long term we'll need a better solution.

What are the options here?
It needs to be a harvester in order to fetch the MEs it consumes, since DQMEDAnalyzer doesn't have endJob-like methods and the fitting needs to happen after having seen all the data of the run.
AFAIU there's no concurrent lumi safe version of a DQMEDHarvester.
Also I would not know how to move it to another harvesting path, because the DQMEDAnalyzer runs together with the alcareco producer:

pathALCARECOTkAlDiMuonAndVertex = cms.Path(seqALCARECOTkAlDiMuonAndVertex*ALCARECOTkAlDiMuonAndVertexDQM)

and I am not sure in which job of the Tier0 production would the harvesting of this fit.

@makortel
Copy link
Contributor Author

In the long term we'll need a better solution.

What are the options here?
...
AFAIU there's no concurrent lumi safe version of a DQMEDHarvester.

Maybe it would be time to seriously consider one? (although I guess it could be a major effort)

@mmusich
Copy link
Contributor

mmusich commented Aug 25, 2022

Maybe it would be time to seriously consider one? (although I guess it could be a major effort)

@cms-sw/dqm-l2 can you please comment on the feasibility of that?
Would be appropriate openining an ad hoc issue to remind about it?

@emanueleusai
Copy link
Member

I'm not sure how to put together lumi safe version of a DQMEDHarvester, but as DQM we're open to look into it and discuss with the CMSSW experts. So I support opening a dedicated issue to keep track of the task.

@makortel
Copy link
Contributor Author

Let me backtrack a bit. Back in the days the stance of the DQM group was along "harvesting modules would never be run outside of specific harvesting jobs that are single-threaded". The DQM documentation in https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/README.md#processing-environments mentions ALCARECO and ALCAHARVEST. So the first question is, why could the DiMuonMassBiasClient not be run in any harvesting job? Maybe this is the same question as @mmusich had in #39180 (comment)

Also I would not know how to move it to another harvesting path, because the DQMEDAnalyzer runs together with the alcareco producer:

pathALCARECOTkAlDiMuonAndVertex = cms.Path(seqALCARECOTkAlDiMuonAndVertex*ALCARECOTkAlDiMuonAndVertexDQM)

and I am not sure in which job of the Tier0 production would the harvesting of this fit.

Question for @cms-sw/alca-l2 perhaps?


A challenge with the DQMEDHarvester is that it declares to listen many transitions

class DQMEDHarvester
: public edm::one::EDProducer<edm::EndLuminosityBlockProducer,
edm::EndRunProducer,
edm::EndProcessBlockProducer,
edm::one::WatchLuminosityBlocks,
edm::one::WatchRuns,
// for uncontrolled DQMStore access, and that EDM does not even attempt to
// run things in parallel (which would then be blocked by the booking lock).
edm::one::SharedResources,
edm::Accumulator> {

regardless what the class inheriting from DQMEDHarvester actually implements. In this case the DiMuonMassBiasClient implements only the dqmEndJob() function (i.e. endProcessBlock transition), which means the DiMuonMassBiasClient itself would be safe towards concurrent lumis and runs! It is just that the intermediate base class prevents it for no good (apparent) reason. The best way would be to have the concrete harvester modules to declare with base class template arguments the transitions they want to interact with. At minimum a separate (e.g.) DQMEDJobHarvester base class, that listens only the endProcessBlock transition (not lumi/run), would allow DiMuonMassBiasClient to run as part of non-harvesting job without preventing concurrent lumis (or runs).

(@mmusich by the way, since DiMuonMassBiasClient does ~nothing at lumi/run transitions, the consumption of the tokens in lumi and run

consumes<DQMToken, edm::InRun>(edm::InputTag("DiMuonMassBiasMonitor", "DQMGenerationDiMuonMassBiasMonitorRun"));
consumes<DQMToken, edm::InLumi>(edm::InputTag("DiMuonMassBiasMonitor", "DQMGenerationDiMuonMassBiasMonitorLumi"));

has no effect in practice; consuming edm::InProcess could, but DQMEDAnalyzer does not produce tokens in ProcessBlock)

@mmusich
Copy link
Contributor

mmusich commented Aug 26, 2022

So the first question is, why could the DiMuonMassBiasClient not be run in any harvesting job? Maybe this is the same question as @mmusich had in #39180 (comment)

So after thinking a bit, I came up with #39217
In that PR I move the DiMuonMassBiasClient to a regular harvesting path, actually into the DQMHarvestMuon sequence which as per autoDQM is run in the @muon sequence

'muon': ['DQMOfflineMuon',
'PostDQMOffline',
'DQMHarvestMuon+DQMCertMuon'],

that would be run at Tier-0 for the Muon PD (cf also: https://github.com/dmwm/T0/blob/0f4f9d27c5366da57c456c4ae01da81c75524595/etc/ProdOfflineConfiguration.py#L698).

The problem with that PR is that since DiMuonMassBiasMonitor and DiMuonMassBiasClient don't run in the same path, the construct is a bit fragile as there is no guarantee that the right harvesting sequence is run subsequently to the job running pathALCARECOTkAlDiMuonAndVertex other than per Tier-0 configuration stipulation.
Also I could not find any RelVal workflow in the matrix that mimics Tier0 in this respect (i.e. that runs ALCA:TkAlDiMuonAndVertex in the same job as regular DQM and then a regular harvesting job). @cms-sw/alca-l2 might help here constructing one.

For these reasons, I think it would still make sense to provide something along these lines:

At minimum a separate (e.g.) DQMEDJobHarvester base class, that listens only the endProcessBlock transition (not lumi/run), would allow DiMuonMassBiasClient to run as part of non-harvesting job without preventing concurrent lumis (or runs).

in order to allow DiMuonMassBiasClient to run together with it's counterpart in pathALCARECOTkAlDiMuonAndVertex, so I'll think I'll go ahead and create an issue about it.

About:

by the way, since DiMuonMassBiasClient does ~nothing at lumi/run transitions, the consumption of the tokens in lumi and run cmssw/DQMOffline/Alignment/src/DiMuonMassBiasClient.cc
has no effect in practice; consuming edm::InProcess could, but DQMEDAnalyzer does not produce tokens in ProcessBlock)

I removed the produces and consumes statements in a5bab3b.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants