
EndJob module ordering #28354

Closed
schneiml opened this issue Nov 6, 2019 · 36 comments

@schneiml
Contributor

schneiml commented Nov 6, 2019

Dear framework experts,

while debugging the observed differences in #28316 I noticed a serious conceptual problem in HARVESTING.

The situation is the following:

  • MonitorElements are read from file.
  • In endRun, QualityTester modules apply QTests (by default; they can also do that in other transitions).
  • In endJob, custom harvesting modules perform harvesting, producing new MEs.
  • MonitorElements are saved.

Why is that a problem? The MEs created in harvesting can't have QTests applied (they only come into existence after the QualityTester has already run), but some of them do have QTests configured, and that used to work fine.

How did this ever work? In the current/old DQMStore, QTests are also saved if they do not apply to any ME, and then applied when a matching ME gets booked (at least, that is how I understand it).

What could we do to make it work again?

  • Keep the old behaviour of having all QTests managed in the DQMStore. I dislike that for reasons that are probably obvious.
  • Run QTests again (or optionally, or in general) in endJob. But this does not help, unless we can enforce correct ordering of modules in the endJob transition. In endRun, this can be fixed by passing through products, but how to do it in endJob?
  • Run custom harvesting in endRun. But this is a significant semantic change, and will break multi-run harvesting. Not something we can do for CMSSW11, most likely.
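The "passing through products" fix for endRun ordering mentioned in the second option could look roughly like the following configuration sketch. The module labels and the parameter name here are hypothetical; only the DQMToken-wiring pattern is the point:

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("HARVESTING")

# Hypothetical harvester that, on the C++ side, declares something like
#   produces<DQMToken, edm::Transition::EndRun>()
process.someHarvester = cms.EDProducer("SomeHarvester")

# Hypothetical tester wiring: consuming the harvester's DQMToken makes the
# framework schedule the tester after the harvester in endRun.
process.someQTester = cms.EDProducer("QualityTester",
    tokensToConsume = cms.VInputTag(cms.InputTag("someHarvester")),
)

process.p = cms.Path(process.someHarvester + process.someQTester)
```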

Bonus problems:

  • The harvesting modules creating summary plots need to be able to access QTest results, so they need to run after the QualityTester. Which means we might need to enforce a harvester->tester->harvester sequence in endJob.
  • Some harvesting modules expect QTests on MEs that they just booked. This can only be implemented using the old behaviour, so these modules will need to be changed (hopefully there aren't many of those).

So the questions to the CMSSW experts are

  • How are modules ordered in endJob? The involved module types are edm::one::EDProducer and legacy edm::EDAnalyzer.
  • How to provide similar behaviour without global shared state?
@cmsbuild
Contributor

cmsbuild commented Nov 6, 2019

A new Issue was created by @schneiml Marcel Schneider.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor

assign core

@cmsbuild
Contributor

cmsbuild commented Nov 6, 2019

New categories assigned: core

@Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@schneiml
Contributor Author

schneiml commented Nov 7, 2019

Update:

I have now resolved some problems in #28316 by putting the QTests in endRun (this was the default already) and having them consume harvesting DQMTokens, so the QTester runs after the endRun of the harvesting modules (this dqmEndRun was actually freshly introduced for this job). The majority of harvesting then runs in endJob (as it always did), where it can read QReports from the QTests. Now, by lifting the harvesting code that produces MEs which need QTest'ing into endRun, things should be ordered correctly. This needs a few changes in subsystem code, but it seems manageable.

But I found something that looks like a bug: SiStripMonitorClient contains a legacy edm::EDAnalyzer module that does lots of custom harvesting in endRun. This custom harvesting needs QTest results, but the QTests are on MEs from RECO, so no problem there. The only issue is that this legacy module now needs to run after its matching QualityTester (in endRun). (I don't want to move the harvesting code to endJob, since that might break multi-run harvesting, where this module is probably used as well.) But that should be simple, right? I simply have the SiStripMonitorClient module (actually, its name is SiStripOfflineDQM) consume the DQMToken that its QualityTester produces, so it should order correctly.

Except (the two modules in question are siStripOfflineAnalyser and siStripQTester):

...
++++ starting: constructing module with label 'siStripQTester' id = 9
++++ finished: constructing module with label 'siStripQTester' id = 9
++++ starting: constructing module with label 'siStripOfflineAnalyser' id = 10
++++ finished: constructing module with label 'siStripOfflineAnalyser' id = 10
...
modules on path dqmHarvestingFakeHLT:
  dqmDcsInfoClient
  ecalMonitorClient
  ecalMEFormatter
  ecalPreshowerMonitorClient
  hcalOfflineHarvesting
  siStripQTester
  siStripOfflineAnalyser
  siStripBadComponentInfo
...
All modules and modules in the current process whose products they consume:
(This does not include modules from previous processes or the source)
  DQMDcsInfoClient/'dqmDcsInfoClient'
  EcalDQMonitorClient/'ecalMonitorClient'
  EcalMEFormatter/'ecalMEFormatter'
  EcalPreshowerMonitorClient/'ecalPreshowerMonitorClient'
  HcalOfflineHarvesting/'hcalOfflineHarvesting'
  QualityTester/'siStripQTester'
  SiStripOfflineDQM/'siStripOfflineAnalyser'
  SiStripBadComponentInfo/'siStripBadComponentInfo'
  QualityTester/'sipixelQTester'
...
All modules (listed by class and label) and all their consumed products.
Consumed products are listed by type, label, instance, process.
For products not in the event, 'run' or 'lumi' is added to indicate the TTree they are from.
For products that are declared with mayConsume, 'may consume' is added.
For products consumed for Views, 'element type' is added
For products only read from previous processes, 'skip current process' is added
  DQMDcsInfoClient/'dqmDcsInfoClient'
  EcalDQMonitorClient/'ecalMonitorClient'
  EcalMEFormatter/'ecalMEFormatter'
  EcalPreshowerMonitorClient/'ecalPreshowerMonitorClient'
  HcalOfflineHarvesting/'hcalOfflineHarvesting'
  QualityTester/'siStripQTester' consumes:
    DQMToken 'dqmDcsInfoClient' 'DQMGenerationHarvestingLumi' 'HARVESTING', lumi
    DQMToken 'dqmDcsInfoClient' 'DQMGenerationHarvestingRun' 'HARVESTING', run
...
  SiStripOfflineDQM/'siStripOfflineAnalyser' consumes:
    DQMToken 'siStripQTester' '' '', run
    DQMToken 'siStripQTester' '' '', lumi
  SiStripBadComponentInfo/'siStripBadComponentInfo'
  QualityTester/'sipixelQTester' consumes:
    DQMToken 'dqmDcsInfoClient' 'DQMGenerationHarvestingLumi' 'HARVESTING', lumi
    DQMToken 'dqmDcsInfoClient' 'DQMGenerationHarvestingRun' 'HARVESTING', run
...
++++ starting: begin job for module with label 'hcalOfflineHarvesting' id = 8
++++ finished: begin job for module with label 'hcalOfflineHarvesting' id = 8
++++ starting: begin job for module with label 'siStripQTester' id = 9
++++ finished: begin job for module with label 'siStripQTester' id = 9
++++ starting: begin job for module with label 'siStripOfflineAnalyser' id = 10
++++ finished: begin job for module with label 'siStripOfflineAnalyser' id = 10
++++ starting: begin job for module with label 'siStripBadComponentInfo' id = 11
++++ finished: begin job for module with label 'siStripBadComponentInfo' id = 11
...
++++++ finished: global end lumi for module: label = 'hcalOfflineHarvesting' id = 8
++++++ starting: global end lumi for module: label = 'siStripOfflineAnalyser' id = 10
++++++ finished: global end lumi for module: label = 'siStripOfflineAnalyser' id = 10
++++++ starting: global end lumi for module: label = 'siStripBadComponentInfo' id = 11
...
++++++ starting: global end lumi for module: label = 'sipixelQTester' id = 12
++++++ finished: global end lumi for module: label = 'sipixelQTester' id = 12
++++++ starting: global end lumi for module: label = 'siStripQTester' id = 9
++++++ finished: global end lumi for module: label = 'siStripQTester' id = 9
++++++ starting: global end lumi for module: label = 'dqmSaver' id = 83
++++++ finished: global end lumi for module: label = 'dqmSaver' id = 83
++++ finished: global end lumi: run = 1 lumi = 1 time = 1
++++ starting: end run: stream = 0 run = 1 time = 50000001
++++ finished: end run: stream = 0 run = 1 time = 50000001
++++ starting: global end run 1 : time = 50000001
++++++ starting: global end run for module: label = 'dqmDcsInfoClient' id = 4
++++++ finished: global end run for module: label = 'dqmDcsInfoClient' id = 4
++++++ starting: global end run for module: label = 'ecalMonitorClient' id = 5
++++++ finished: global end run for module: label = 'ecalMonitorClient' id = 5
++++++ starting: global end run for module: label = 'ecalMEFormatter' id = 6
++++++ finished: global end run for module: label = 'ecalMEFormatter' id = 6
++++++ starting: global end run for module: label = 'ecalPreshowerMonitorClient' id = 7
++++++ finished: global end run for module: label = 'ecalPreshowerMonitorClient' id = 7
++++++ starting: global end run for module: label = 'hcalOfflineHarvesting' id = 8
++++++ finished: global end run for module: label = 'hcalOfflineHarvesting' id = 8
++++++ starting: global end run for module: label = 'siStripOfflineAnalyser' id = 10
++++++ finished: global end run for module: label = 'siStripOfflineAnalyser' id = 10
++++++ starting: global end run for module: label = 'siStripBadComponentInfo' id = 11
++++++ finished: global end run for module: label = 'siStripBadComponentInfo' id = 11
...
++++++ starting: global end run for module: label = 'sipixelQTester' id = 12
++++++ finished: global end run for module: label = 'sipixelQTester' id = 12
++++++ starting: global end run for module: label = 'siStripQTester' id = 9
++++++ finished: global end run for module: label = 'siStripQTester' id = 9
++++++ starting: global end run for module: label = 'dqmSaver' id = 83
++++++ finished: global end run for module: label = 'dqmSaver' id = 83
++++ finished: global end run 1 : time = 50000001
++++ starting: end stream for module: stream = 0 label = 'dqmDcsInfoClient' id = 4
++++ finished: end stream for module: stream = 0 label = 'dqmDcsInfoClient' id = 4
...
++++ starting: end job for module with label 'hcalOfflineHarvesting' id = 8
++++ finished: end job for module with label 'hcalOfflineHarvesting' id = 8
++++ starting: end job for module with label 'siStripQTester' id = 9
++++ finished: end job for module with label 'siStripQTester' id = 9
++++ starting: end job for module with label 'siStripOfflineAnalyser' id = 10
++++ finished: end job for module with label 'siStripOfflineAnalyser' id = 10
++++ starting: end job for module with label 'siStripBadComponentInfo' id = 11
++++ finished: end job for module with label 'siStripBadComponentInfo' id = 11
...

Am I missing something, or has edm actively reordered siStripQTester to run after siStripOfflineAnalyser in endRun, even though there is an explicit dependency the other way round? Is consumes not honored in legacy analyzers?

I'll try it next with a proper edm::one::EDAnalyzer...

Edit: To reproduce, use schneiml:dqm-remove-clientconfig-order-bug and step 5 of workflow 8.0.
Edit2: Same issue with edm::one::EDAnalyzer (on PR #28316's branch). But I might update that one, so use the one above to reproduce.
Edit3: Using consumesMany instead of consumes seems to work. So maybe I just got the InputTag wrong? But wouldn't edm complain in that case? (It feels like I asked that before, but I forgot the answer).

@Dr15Jones
Contributor

@schneiml I'm afraid unscheduled processing of Run and Lumi products has not yet been implemented. We will move that next on the 'to-do' list. Sorry about that.

@Dr15Jones
Contributor

See #28364

@schneiml
Contributor Author

schneiml commented Nov 8, 2019

@Dr15Jones ah well... that is a problem, I guess?

That means in this whole "DQM sequences" cleanup, we need to not only make sure we run the correct modules, but also make sure they appear in the correct order? That is unexpected, and unfortunate.

Still, what are the rules then? To be honest, scheduled execution is something that was never really clear to me, especially in the offline workflows where everything is fed through this "convert to unscheduled" mechanism. For example, why do these two modules in question get reordered, and why do they get ordered correctly when I use consumesMany? Random luck? Some partially implemented functionality?

Or, is implementing that a matter of a few days? Then I'll just wait for that, given I found a random combination that mostly works.

@schneiml
Contributor Author

schneiml commented Nov 8, 2019

@Dr15Jones (or maybe @makortel ?) please note that there is more to this question than just implementing #28364. See the comment over at the PR: #28316 (comment)

I don't understand enough about scheduling in CMSSW to make strong claims here. Was this ever guaranteed to work? If yes, which change broke it? If no, how would we fix it?

@Dr15Jones
Contributor

@schneiml from a quick look, it appears to be just a job for a few days. However, I’m on vacation in Australia with limited access to computing time. So maybe it can be changed by the end of next week?

@schneiml
Contributor Author

@Dr15Jones end of next week is better than nothing, but that does not really answer my question:

Is manual scheduling in production workflows still possible? If yes, how is it controlled? If no, how do we handle ordering dependencies in endJob? (These do exist today, and have been in production for the last few years.)

@Dr15Jones
Contributor

@schneiml upon re-evaluation, it looks like the present framework code DOES preserve consumes ordering when requesting data from Runs and LuminosityBlocks. I created a very simple test module which requests data from the Run and/or LuminosityBlock at either begin and/or end, and set up the following ordering

process.a = cms.EDProducer("NonEventIntProducer", ivalue = cms.int32(1))

process.b = cms.EDProducer("NonEventIntProducer", ivalue = cms.int32(2), consumesEndRun = cms.InputTag("c","endRun") )

process.c = cms.EDProducer("NonEventIntProducer", ivalue = cms.int32(3), consumesEndRun = cms.InputTag("a", "endRun"))

process.p = cms.Path(process.a+process.b+process.c)

Then using the tracer I see

++++ starting: global end lumi: run = 1 lumi = 1 time = 1
++++++ starting: global end lumi for module: label = 'a' id = 3
++++++ finished: global end lumi for module: label = 'a' id = 3
++++++ starting: global end lumi for module: label = 'b' id = 4
++++++ finished: global end lumi for module: label = 'b' id = 4
++++++ starting: global end lumi for module: label = 'c' id = 5
++++++ finished: global end lumi for module: label = 'c' id = 5
++++ finished: global end lumi: run = 1 lumi = 1 time = 1
++++ starting: end run: stream = 0 run = 1 time = 15000001
++++ finished: end run: stream = 0 run = 1 time = 15000001
++++ starting: global end run 1 : time = 15000001
++++++ starting: global end run for module: label = 'a' id = 3
++++++ finished: global end run for module: label = 'a' id = 3
++++++ starting: global end run for module: label = 'c' id = 5
++++++ finished: global end run for module: label = 'c' id = 5
++++++ starting: global end run for module: label = 'b' id = 4
++++++ finished: global end run for module: label = 'b' id = 4

where you can see that the end LuminosityBlock (where there is no module dependency) runs in 'path' order, while the end Run (where there is a module dependency) runs in consumes-dependency order.

So now the question is, why didn't it work for you? Exactly how did you specify your dependency in the code?

@Dr15Jones
Contributor

Dr15Jones commented Nov 11, 2019

Is manual scheduling in production workflows still possible? If yes, how is it controlled?

Manual ordering is only enforced during processing of Events (since Paths only have meaning for filtering during processing of Events, not during processing of Runs and LuminosityBlocks). At Run and LuminosityBlock transitions the framework concurrently runs all modules (the inter-module data dependencies automatically guarantee the ordering of dependent modules).

If no, how do we handle ordering dependencies in endJob? (which do exist, today, and where in production for the last few years)

The framework only allows data to pass from one module to another via the Event, LuminosityBlock, Run and EventSetup. No other inter-module communication is allowed. As such, since endJob transitions do not get any of those objects as an argument, the framework does not (and never has) guaranteed processing order of modules at endJob (or beginJob, beginStream or endStream) transitions.
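The dependency-driven scheduling described here amounts to a topological sort of the consumes graph. A minimal sketch in plain Python (conceptual only, not framework code; graphlib requires Python 3.9+), using the a/b/c labels from the NonEventIntProducer test configuration earlier in the thread:

```python
from graphlib import TopologicalSorter

# consumes edges, mapping each module to the modules it consumes from:
# b consumes from c, c consumes from a (as in the test config)
consumes = {
    "a": set(),
    "c": {"a"},
    "b": {"c"},
}

# At Run/Lumi transitions, producers must finish before their consumers;
# any topological order of this graph is a valid execution order.
order = list(TopologicalSorter(consumes).static_order())
print(order)  # ['a', 'c', 'b']
```

Modules with no dependency edges between them (like the end-lumi case above) are unconstrained by the graph and may run in any order, which is why only explicit consumes relations give ordering guarantees.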

@Dr15Jones
Contributor

Looking at your tracer output

  SiStripOfflineDQM/'siStripOfflineAnalyser' consumes:
    DQMToken 'siStripQTester' '' '', run
    DQMToken 'siStripQTester' '' '', lumi

I see that the siStripQTester data requests are done without specifying a product instance label. Try adding a unique product instance label for both the run and lumi.

@schneiml
Contributor Author

schneiml commented Nov 12, 2019

I see that the siStripQTester data requests are done without specifying a product instance label. Try adding a unique product instance label for both the run and lumi.

See #28354 (comment), primarily the edits. schneiml:dqm-remove-clientconfig-order-bug. This is what I tried first.
[Edit: wait, maybe I am confused. This is the code: https://github.com/schneiml/cmssw/commit/21039286d3f79aa5e9ff1b69044817c80e1c6b01 . I could add a DQMGenerationQTest there to be more specific. Though, things turned out to work correctly when using consumesMany. Apart from the issues in endJob. [Edit2: after a quick test in #28379 it seems that adding the instance label works as well as consumesMany.]]
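For reference, the two InputTag forms differ only in the product instance label; a sketch (the "DQMGenerationQTest" instance label is the hypothetical one mentioned above):

```python
import FWCore.ParameterSet.Config as cms

# Tag without a product instance label, as in the original attempt:
withoutLabel = cms.InputTag("siStripQTester")

# Tag with an explicit instance label (the DQMGenerationQTest label mentioned
# above); this matches a produces<DQMToken, edm::Transition::EndRun>("...")
# declaration on the C++ side, and made the ordering resolve in testing.
withLabel = cms.InputTag("siStripQTester", "DQMGenerationQTest")
```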

The framework only allows data to pass from one module to another via the Event, LuminosityBlock, Run and EventSetup. No other inter-module communication is allowed. As such, since endJob transitions do not get any of those objects as an argument, the framework does not (and never has) guaranteed processing order of modules at endJob (or beginJob, beginStream or endStream) transitions.

Yet here we are: the SiStrip Certification code has relied on that since pretty much the beginning of time. Also, multi-run harvesting obviously relies on communication across run boundaries, and has been an official feature of DQM for years now. Do I read correctly that even if we set everything to be as legacy as possible (as we do in harvesting, to my knowledge), this was never supported, it could break at any moment, and there is no way to fix it?

@Dr15Jones
Contributor

Do I read correctly that even if we set everything to be as legacy as possible (as we do in harvesting, to my knowledge), this was never supported, it could break at any moment, and there is no way to fix it?

That is correct.

@Dr15Jones
Contributor

[Edit2: after a quick test in #28379 it seems that adding the instance label works as well as consumesMany.]

OK, so now it does work for you?

@schneiml
Contributor Author

schneiml commented Nov 13, 2019

OK, so now it does work for you?

Test results show that yes, adding the instance label works, just as consumesMany worked before.

Do I read correctly that even if we set everything to be as legacy as possible (as we do in harvesting, to my knowledge), this was never supported, it could break at any moment, and there is no way to fix it?

That is correct.

Now, for the main question in the title of this issue -- how to ensure endJob ordering -- I guess your answer is as clear as it gets. Except that it does not solve the problem at all. We somehow have to explain now to all the people who keep asking about multi-run harvesting that this was actually never supported at all, and to @mmusich and the Tracker DQM team that some parts of SiStripMonitorClient were never supposed to work at all, even though they were present and running for more than 10 years.

The only way forward that I can see is to move all the operations to endRun and tie everything down with tokens -- giving up multi-run harvesting. However, this might cause serious fallout among the users...

[Edit: another option is to restructure SiStripMonitorClient into fewer modules, and do all dependent operations in a single monolith. Which does not sound ideal either, and does not change the fact that multi-run harvesting is not really possible without undefined behavior.]

@schneiml
Contributor Author

On the concrete issue of SiStripCertificationInfo, it turns out that I was actually mistaken and the transition that needs correct ordering is indeed endRun (endJob actually does not matter for this module). So it can be fixed by passing tokens around, though that would be a lot easier if

  • EDM actually complained if consumes are not satisfied. Is there an option for that? For me, the only way to see that EDM figured out a dependency correctly is to observe the ordering with the tracer.
  • produces<DQMToken, edm::InRun>("DQMGenerationSiStripAnalyserRun"); would not silently compile and run but not actually produce anything.

Now, I only have to figure out which additional modules need to be reordered, since results changed, but are still not as expected...

Also, this does not change anything about the conceptual problem: traditionally, we do harvesting in endJob transitions, so that potentially more than one run can be aggregated (multi-run harvesting), but then we cannot have dependencies between modules, and I am not even sure whether the output module (which is not an OutputModule at all, IIRC) is guaranteed to run late enough to see all MEs. I'd be happy if we could at least specify the current behaviour of running modules ordered by their ID (which, I guess, is eventually determined by the module's place in a sequence), so that we can at least manually adjust things in the configuration to work correctly.

With all data dependencies outside runs and lumis banned, we'd have to completely abandon the concept of multi-run harvesting, as far as I can see; or would it be legal to, say, accumulate statistics within a module across multiple runs, and save the result at endJob, as long as everything remains local? I don't see how this could interfere with anything (of course runs could be processed in arbitrary order etc.). Could an edm::Service then detect the last endJob transition to collect and save the results? Are data dependencies between endJob and endRun fine, that is, is it guaranteed that endJob transitions for any module can only happen after all endRun transitions have finished?

@makortel
Contributor

  • EDM actually complained if consumes are not satisfied. Is there an option for that? For me, the only way to see that EDM figured out a dependency correctly is to observe the ordering with the tracer.

Framework complains if the consumer attempts to read a missing product.

  • produces<DQMToken, edm::InRun>("DQMGenerationSiStripAnalyserRun"); would not silently compile and run but not actually produce anything.

Such a behavior is in fact legal (although not really encouraged). Again, an exception is thrown if a consumer attempts to read a product that was never produced.

@schneiml
Contributor Author

The problem with produces<DQMToken, edm::InRun>("DQMGenerationSiStripAnalyserRun"); is that it actually has to be produces<DQMToken, edm::Transition::EndRun>("DQMGenerationSiStripAnalyserRun");. One is an enum, the other an enum class, the values have different meanings, and I am really surprised that this compiles at all...

Re failure on reading: yes, of course, though that does not help too much when trying to figure out how to get things in the right order. Though probably I am doing something rather unusual here...

@Dr15Jones
Contributor

The problem with produces<DQMToken, edm::InRun>("DQMGenerationSiStripAnalyserRun"); is that actually it has to be produces<DQMToken, edm::Transition::EndRun>("DQMGenerationSiStripAnalyserRun");. One is an enum, the other an enum class, the values have different meanings, and I am really surprised that this compiles at all...

The edm::InRun was the old legacy way of specifying where data is produced. The new edm::Transition::EndRun is the new way, since it disambiguates doing produces in a begin or end transition. Hence both compile. What can be changed is a run-time check: if one specifies via the template parameter 'EndRunProducer' but then calls produces<..., edm::InRun>, a runtime error occurs. (Unfortunately there is no way I can make it a compile-time failure.)

@schneiml
Contributor Author

@Dr15Jones interesting, good to know. Does this mean I can use edm::Transition::EndRun also for consumes or are the APIs now asymmetrical (which would sort of make sense, but is also surprising)?

@Dr15Jones
Contributor

@schneiml it looks like consumes doesn't have a way to specify in which transition one will do the consuming; it only says from which container the data will be obtained. The assumption right now is that the data is going to be requested in the earliest possible transition matching the data. This is not the greatest, but comes from the fact that 99.99999% ( :) ) of requests are in the event and very, very few ever ask for data from the Run or LuminosityBlock. In addition, Run and LuminosityBlock products are never 'conditionally' created, i.e. they are not 'made on demand' but instead will always be made. The consumes calls are just used to guarantee ordering of module calls, not to decide if a given module should be called.

I'd be all for extending the consumes API to require the explicit specification of transition in order to make everything consistent.

@schneiml
Contributor Author

@Dr15Jones ok, I can live with that behavior for now, though the two calls next to each other look a bit weird...

Now, coming back to the main issue in this thread, we need to have a longer discussion about this at some point.

  • What can we rely on in endJob harvesting? Are we sure that e.g. all MEs are produced (all modules' endJob methods have run) before saving the output file?
  • Do we need to move all harvesting to endRun (and give up multi-run harvesting!) to be safe? Note that multi-run harvesting is relied on in AlCa as well.

This should maybe go into a core software meeting at some point.

@makortel
Contributor

@schneiml We will think about a solution for multi-run harvesting that conforms to framework policies. Could you please describe the requirements for the various actions that need to be done to achieve multi-run harvesting?

@schneiml
Contributor Author

@makortel at first glance, what we need is

  • Ability to aggregate information over multiple runs (that means keeping state between runs)
  • Ability to post-process that information at the end of the job (ideally with multiple modules, in a well-defined order: e.g. compute efficiency, apply quality tests, create summary, in three different modules).
  • Ability to run an "output module" (this may be an EDAnalyzer/EDProducer writing a file, or a real output module) after all the postprocessing.

The first can be covered by a central module managing the state (the current DQMStore does that, and it is sort-of legal, as far as I understand), the second is to my knowledge not required at the moment [1] (but I might be wrong, and it is a very reasonable request), and the third is the most worrying (since I am really not sure how it works today).

[1] that is, the processing currently happens at endRun, where we can pass products around, or there are no dependencies between modules apart from "after all endRun" and "before endJob saving".
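The first requirement (keeping state between runs, finalizing only at end of job) can be sketched in plain Python; class and method names are illustrative of the pattern only, not any real CMSSW API:

```python
# Illustrative sketch of multi-run aggregation: per-run statistics are
# accumulated locally (runs may arrive in any order) and post-processed
# exactly once at end-of-job.
class MultiRunHarvester:
    def __init__(self):
        self.entries = 0
        self.passed = 0

    def end_run(self, run_entries, run_passed):
        # local accumulation only; no inter-module communication needed
        self.entries += run_entries
        self.passed += run_passed

    def end_job(self):
        # final post-processing, after all runs have been seen
        return self.passed / self.entries if self.entries else 0.0

h = MultiRunHarvester()
h.end_run(100, 90)
h.end_run(50, 45)
print(h.end_job())  # 0.9
```

Since everything stays local to the module, this pattern avoids the banned inter-module communication; the open questions above (ordering of endJob vs. endRun, and who saves the result) remain.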

@wddgit
Contributor

wddgit commented Mar 6, 2020

@schneiml Just hypothetically, say that a way to aggregate over multiple runs is developed. How is the range of runs to aggregate over determined? Is it good enough to aggregate over all the runs processed in one cmsRun job? In other words, the range is determined by the input of one cmsRun job. How is this range determined now in multi-run DQM?

@schneiml
Contributor Author

schneiml commented Mar 9, 2020

@wddgit The behaviour today is:

  • For single run harvesting, we run a job that filters out lumisections of a single run, aggregates over the entire job, saves at the end of the job.
  • For multi run harvesting, we run a job that reads a set of files, aggregates over the entire job, saves at the end of the job.

So indeed, today everything is determined by the input to the job. Also, today single-run and multi-run harvesting are basically identical.

This has the benefit that multi-run harvesting typically "just works", but the disadvantage that we can't single-run harvest multiple runs in one job. Since we don't really need the latter, this is a good deal; though a solution that would allow all combinations without duplicating the code (endRun vs. endJob) would be nicer.

From a practical point of view, the next larger quantity that multi-run harvesting usually runs over are data taking eras, which turn into datasets in computing (e.g. 2018A, 2018B, etc.). Within these, the configuration of the detector changes little enough that multi-run harvesting has a chance of working.

@makortel
Contributor

makortel commented Jun 4, 2020

#30117 should provide a solution

@makortel
Contributor

+1

#30117 was merged in CMSSW_11_2_X_2020-07-10-1100.

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@schneiml Could you start to try out the ProcessBlock? Feedback from actual use would be useful, especially in light of development towards ProcessBlock persistency. Thanks.

@schneiml
Contributor Author

@makortel I'll try to get something running this week.

@wddgit
Contributor

wddgit commented Jul 14, 2020

@schneiml There is some new documentation on the TWiki that might help, related to ProcessBlock. Let me know if you notice any problems with it.

https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideProcessBlockData

@schneiml
Contributor Author

@wddgit I tried using the new feature in #30698 but have not gotten it to work yet. Feel free to have a look. I have not looked at the docs yet, maybe that will help...

@makortel
Contributor

makortel commented Sep 9, 2020

Could this issue be closed?
