
EndJob module ordering #28354

Closed
schneiml opened this issue Nov 6, 2019 · 36 comments

@schneiml
Contributor

schneiml commented Nov 6, 2019

Dear framework experts,

while debugging the observed differences in #28316 I noticed a serious conceptual problem in HARVESTING.

The situation is the following:

  • MonitorElements are read from file.
  • In endRun, QualityTester modules apply QTests (by default; they can also do that in other transitions).
  • In endJob, custom harvesting modules perform harvesting, producing new MEs.
  • MonitorElements are saved.

Why is that a problem? The MEs created in harvesting can't have QTests applied (they only come into existence after the QualityTester has already run), but some of them do have QTests configured, and that used to work fine.

How did this ever work? In the current/old DQMStore, QTests are also saved if they do not apply to any ME, and then applied when a matching ME gets booked (at least, that is how I understand it).

What could we do to make it work again?

  • Keep the old behaviour of having all QTests managed in the DQMStore. I dislike that for reasons that are probably obvious.
  • Run QTests again (or optionally, or in general) in endJob. But this does not help, unless we can enforce correct ordering of modules in the endJob transition. In endRun, this can be fixed by passing through products, but how to do it in endJob?
  • Run custom harvesting in endRun. But this is a significant semantic change, and will break multi-run harvesting. Not something we can do for CMSSW11, most likely.
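The "passing through products" fix for endRun ordering mentioned in the second option could look roughly like the following configuration sketch. The module labels and the parameter name here are hypothetical; only the DQMToken-wiring pattern is the point:

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("HARVESTING")

# Hypothetical harvester that, on the C++ side, declares something like
#   produces<DQMToken, edm::Transition::EndRun>()
process.someHarvester = cms.EDProducer("SomeHarvester")

# Hypothetical tester wiring: consuming the harvester's DQMToken makes the
# framework schedule the tester after the harvester in endRun.
process.someQTester = cms.EDProducer("QualityTester",
    tokensToConsume = cms.VInputTag(cms.InputTag("someHarvester")),
)

process.p = cms.Path(process.someHarvester + process.someQTester)
```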

Bonus problems:

  • The harvesting modules creating summary plots need to be able to access QTest results, so they need to run after the QualityTester. Which means we might need to enforce a harvester->tester->harvester sequence in endJob.
  • Some harvesting modules expect QTests on MEs that they just booked. This can only be implemented using the old behaviour, so these modules will need to be changed (hopefully there aren't many of those).

So the questions to the CMSSW experts are

  • How are modules ordered in endJob? The involved module types are edm::one::EDProducer and legacy edm::EDAnalyzer.
  • How to provide similar behaviour without global shared state?
@cmsbuild
Contributor

cmsbuild commented Nov 6, 2019

A new Issue was created by @schneiml Marcel Schneider.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor

assign core

@cmsbuild
Contributor

cmsbuild commented Nov 6, 2019

New categories assigned: core

@Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@schneiml
Contributor Author

schneiml commented Nov 7, 2019

Update:

I have now resolved some problems in #28316 by putting the QTests in endRun (this was the default already) and having them consume harvesting DQMTokens, so the QTester runs after the endRun of the harvesting modules (this dqmEndRun was actually freshly introduced for this job). The majority of harvesting then runs in endJob (as it always did), where it can read QReports from the QTests. Now, by lifting the harvesting code that produces MEs which need QTest'ing into endRun, things should be ordered correctly. This needs a few changes in subsystem code, but it seems manageable.

But I found something that looks like a bug: SiStripMonitorClient contains a legacy edm::EDAnalyzer module that does lots of custom harvesting in endRun. This custom harvesting needs QTest results, but the QTests are on MEs from RECO, so no problem there. The only issue is that this legacy module now needs to run after its matching QualityTester (in endRun). (I don't want to move the harvesting code to endJob, since that might break multi-run harvesting, where this module is probably used as well.) But that should be simple, right? I simply have the SiStripMonitorClient module (actually, its name is SiStripOfflineDQM) consume the DQMToken that its QualityTester produces, so it should order correctly.

Except (the two modules in question are siStripOfflineAnalyser and siStripQTester):

...
++++ starting: constructing module with label 'siStripQTester' id = 9
++++ finished: constructing module with label 'siStripQTester' id = 9
++++ starting: constructing module with label 'siStripOfflineAnalyser' id = 10
++++ finished: constructing module with label 'siStripOfflineAnalyser' id = 10
...
modules on path dqmHarvestingFakeHLT:
  dqmDcsInfoClient
  ecalMonitorClient
  ecalMEFormatter
  ecalPreshowerMonitorClient
  hcalOfflineHarvesting
  siStripQTester
  siStripOfflineAnalyser
  siStripBadComponentInfo
...
All modules and modules in the current process whose products they consume:
(This does not include modules from previous processes or the source)
  DQMDcsInfoClient/'dqmDcsInfoClient'
  EcalDQMonitorClient/'ecalMonitorClient'
  EcalMEFormatter/'ecalMEFormatter'
  EcalPreshowerMonitorClient/'ecalPreshowerMonitorClient'
  HcalOfflineHarvesting/'hcalOfflineHarvesting'
  QualityTester/'siStripQTester'
  SiStripOfflineDQM/'siStripOfflineAnalyser'
  SiStripBadComponentInfo/'siStripBadComponentInfo'
  QualityTester/'sipixelQTester'
...
All modules (listed by class and label) and all their consumed products.
Consumed products are listed by type, label, instance, process.
For products not in the event, 'run' or 'lumi' is added to indicate the TTree they are from.
For products that are declared with mayConsume, 'may consume' is added.
For products consumed for Views, 'element type' is added
For products only read from previous processes, 'skip current process' is added
  DQMDcsInfoClient/'dqmDcsInfoClient'
  EcalDQMonitorClient/'ecalMonitorClient'
  EcalMEFormatter/'ecalMEFormatter'
  EcalPreshowerMonitorClient/'ecalPreshowerMonitorClient'
  HcalOfflineHarvesting/'hcalOfflineHarvesting'
  QualityTester/'siStripQTester' consumes:
    DQMToken 'dqmDcsInfoClient' 'DQMGenerationHarvestingLumi' 'HARVESTING', lumi
    DQMToken 'dqmDcsInfoClient' 'DQMGenerationHarvestingRun' 'HARVESTING', run
...
  SiStripOfflineDQM/'siStripOfflineAnalyser' consumes:
    DQMToken 'siStripQTester' '' '', run
    DQMToken 'siStripQTester' '' '', lumi
  SiStripBadComponentInfo/'siStripBadComponentInfo'
  QualityTester/'sipixelQTester' consumes:
    DQMToken 'dqmDcsInfoClient' 'DQMGenerationHarvestingLumi' 'HARVESTING', lumi
    DQMToken 'dqmDcsInfoClient' 'DQMGenerationHarvestingRun' 'HARVESTING', run
...
++++ starting: begin job for module with label 'hcalOfflineHarvesting' id = 8
++++ finished: begin job for module with label 'hcalOfflineHarvesting' id = 8
++++ starting: begin job for module with label 'siStripQTester' id = 9
++++ finished: begin job for module with label 'siStripQTester' id = 9
++++ starting: begin job for module with label 'siStripOfflineAnalyser' id = 10
++++ finished: begin job for module with label 'siStripOfflineAnalyser' id = 10
++++ starting: begin job for module with label 'siStripBadComponentInfo' id = 11
++++ finished: begin job for module with label 'siStripBadComponentInfo' id = 11
...
++++++ finished: global end lumi for module: label = 'hcalOfflineHarvesting' id = 8
++++++ starting: global end lumi for module: label = 'siStripOfflineAnalyser' id = 10
++++++ finished: global end lumi for module: label = 'siStripOfflineAnalyser' id = 10
++++++ starting: global end lumi for module: label = 'siStripBadComponentInfo' id = 11
...
++++++ starting: global end lumi for module: label = 'sipixelQTester' id = 12
++++++ finished: global end lumi for module: label = 'sipixelQTester' id = 12
++++++ starting: global end lumi for module: label = 'siStripQTester' id = 9
++++++ finished: global end lumi for module: label = 'siStripQTester' id = 9
++++++ starting: global end lumi for module: label = 'dqmSaver' id = 83
++++++ finished: global end lumi for module: label = 'dqmSaver' id = 83
++++ finished: global end lumi: run = 1 lumi = 1 time = 1
++++ starting: end run: stream = 0 run = 1 time = 50000001
++++ finished: end run: stream = 0 run = 1 time = 50000001
++++ starting: global end run 1 : time = 50000001
++++++ starting: global end run for module: label = 'dqmDcsInfoClient' id = 4
++++++ finished: global end run for module: label = 'dqmDcsInfoClient' id = 4
++++++ starting: global end run for module: label = 'ecalMonitorClient' id = 5
++++++ finished: global end run for module: label = 'ecalMonitorClient' id = 5
++++++ starting: global end run for module: label = 'ecalMEFormatter' id = 6
++++++ finished: global end run for module: label = 'ecalMEFormatter' id = 6
++++++ starting: global end run for module: label = 'ecalPreshowerMonitorClient' id = 7
++++++ finished: global end run for module: label = 'ecalPreshowerMonitorClient' id = 7
++++++ starting: global end run for module: label = 'hcalOfflineHarvesting' id = 8
++++++ finished: global end run for module: label = 'hcalOfflineHarvesting' id = 8
++++++ starting: global end run for module: label = 'siStripOfflineAnalyser' id = 10
++++++ finished: global end run for module: label = 'siStripOfflineAnalyser' id = 10
++++++ starting: global end run for module: label = 'siStripBadComponentInfo' id = 11
++++++ finished: global end run for module: label = 'siStripBadComponentInfo' id = 11
...
++++++ starting: global end run for module: label = 'sipixelQTester' id = 12
++++++ finished: global end run for module: label = 'sipixelQTester' id = 12
++++++ starting: global end run for module: label = 'siStripQTester' id = 9
++++++ finished: global end run for module: label = 'siStripQTester' id = 9
++++++ starting: global end run for module: label = 'dqmSaver' id = 83
++++++ finished: global end run for module: label = 'dqmSaver' id = 83
++++ finished: global end run 1 : time = 50000001
++++ starting: end stream for module: stream = 0 label = 'dqmDcsInfoClient' id = 4
++++ finished: end stream for module: stream = 0 label = 'dqmDcsInfoClient' id = 4
...
++++ starting: end job for module with label 'hcalOfflineHarvesting' id = 8
++++ finished: end job for module with label 'hcalOfflineHarvesting' id = 8
++++ starting: end job for module with label 'siStripQTester' id = 9
++++ finished: end job for module with label 'siStripQTester' id = 9
++++ starting: end job for module with label 'siStripOfflineAnalyser' id = 10
++++ finished: end job for module with label 'siStripOfflineAnalyser' id = 10
++++ starting: end job for module with label 'siStripBadComponentInfo' id = 11
++++ finished: end job for module with label 'siStripBadComponentInfo' id = 11
...

Am I missing something, or has edm actively reordered siStripQTester to run after siStripOfflineAnalyser in endRun, even though there is an explicit dependency the other way round? Is consumes not honored in legacy analyzers?

I'll try it next with a proper edm::one::EDAnalyzer...

Edit: To reproduce, use schneiml:dqm-remove-clientconfig-order-bug and step 5 of workflow 8.0.
Edit2: Same issue with edm::one::EDAnalyzer (on PR #28316's branch). But I might update that one, so use the one above to reproduce.
Edit3: Using consumesMany instead of consumes seems to work. So maybe I just got the InputTag wrong? But wouldn't edm complain in that case? (It feels like I asked that before, but I forgot the answer).

@Dr15Jones
Contributor

@schneiml I'm afraid unscheduled processing of Run and Lumi products has not yet been implemented. We will move that next on the 'to-do' list. Sorry about that.

@Dr15Jones
Contributor

See #28364

@schneiml
Contributor Author

schneiml commented Nov 8, 2019

@Dr15Jones ah well... that is a problem, I guess?

That means in this whole "DQM sequences" cleanup, we need to not only make sure we run the correct modules, but also make sure they appear in the correct order? That is unexpected, and unfortunate.

Still, what are the rules then? To be honest, scheduled execution is something that was never really clear to me, especially in the offline workflows where everything is fed through this "convert to unscheduled" mechanism. For example, why do these two modules in question get reordered, and why do they get ordered correctly when I use consumesMany? Random luck? Some partially implemented functionality?

Or, is implementing that a matter of a few days? Then I'll just wait for that, given I found a random combination that mostly works.

@schneiml
Contributor Author

schneiml commented Nov 8, 2019

@Dr15Jones (or maybe @makortel ?) please note that there is more to this question than just implementing #28364. See the comment over at the PR: #28316 (comment)

I don't understand enough about scheduling in CMSSW to make strong claims here. Was this ever guaranteed to work? If yes, which change broke it? If no, how would we fix it?

@Dr15Jones
Contributor

@schneiml from a quick look, it appears to be just a job for a few days. However, I’m on vacation in Australia with limited access to computing time. So maybe it can be changed by the end of next week?

@schneiml
Contributor Author

@Dr15Jones end of next week is better than nothing, but that does not really answer my question:

Is manual scheduling in production workflows still possible? If yes, how is it controlled? If no, how do we handle ordering dependencies in endJob? (These do exist today, and have been in production for the last few years.)

@Dr15Jones
Contributor

@schneiml upon re-evaluation, it looks like the present framework code DOES preserve consumes ordering when requesting data from Runs and LuminosityBlocks. I created a very simple test module which requests data from the Run and/or LuminosityBlock at either begin and/or end, and set up the following ordering

process.a = cms.EDProducer("NonEventIntProducer", ivalue = cms.int32(1))

process.b = cms.EDProducer("NonEventIntProducer", ivalue = cms.int32(2), consumesEndRun = cms.InputTag("c","endRun") )

process.c = cms.EDProducer("NonEventIntProducer", ivalue = cms.int32(3), consumesEndRun = cms.InputTag("a", "endRun"))

process.p = cms.Path(process.a+process.b+process.c)

Then using the tracer I see

++++ starting: global end lumi: run = 1 lumi = 1 time = 1
++++++ starting: global end lumi for module: label = 'a' id = 3
++++++ finished: global end lumi for module: label = 'a' id = 3
++++++ starting: global end lumi for module: label = 'b' id = 4
++++++ finished: global end lumi for module: label = 'b' id = 4
++++++ starting: global end lumi for module: label = 'c' id = 5
++++++ finished: global end lumi for module: label = 'c' id = 5
++++ finished: global end lumi: run = 1 lumi = 1 time = 1
++++ starting: end run: stream = 0 run = 1 time = 15000001
++++ finished: end run: stream = 0 run = 1 time = 15000001
++++ starting: global end run 1 : time = 15000001
++++++ starting: global end run for module: label = 'a' id = 3
++++++ finished: global end run for module: label = 'a' id = 3
++++++ starting: global end run for module: label = 'c' id = 5
++++++ finished: global end run for module: label = 'c' id = 5
++++++ starting: global end run for module: label = 'b' id = 4
++++++ finished: global end run for module: label = 'b' id = 4

where you can see that the end LuminosityBlock (where there is no module dependency) runs in 'path' order, while the end Run (where there is a module dependency) runs in consumes-dependency order.

So now the question is, why didn't it work for you? Exactly how did you specify your dependency in the code?

@Dr15Jones
Contributor

Dr15Jones commented Nov 11, 2019

Is manual scheduling in production workflows still possible? If yes, how is it controlled?

Manual ordering is only enforced during processing of Events (since Paths only have meaning for filtering during processing of Events, not during processing of Runs and LuminosityBlocks). At Run and LuminosityBlock transitions the framework concurrently runs all modules (the inter-module data dependencies automatically guarantee the ordering of dependent modules).

If no, how do we handle ordering dependencies in endJob? (which do exist, today, and where in production for the last few years)

The framework only allows data to pass from one module to another via the Event, LuminosityBlock, Run and EventSetup. No other inter-module communication is allowed. As such, since endJob transitions do not get any of those objects as an argument, the framework does not (and never has) guaranteed processing order of modules at endJob (or beginJob, beginStream or endStream) transitions.
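The dependency-driven scheduling described here amounts to a topological sort of the consumes graph. A minimal sketch in plain Python (conceptual only, not framework code; graphlib requires Python 3.9+), using the a/b/c labels from the NonEventIntProducer test configuration earlier in the thread:

```python
from graphlib import TopologicalSorter

# consumes edges, mapping each module to the modules it consumes from:
# b consumes from c, c consumes from a (as in the test config)
consumes = {
    "a": set(),
    "c": {"a"},
    "b": {"c"},
}

# At Run/Lumi transitions, producers must finish before their consumers;
# any topological order of this graph is a valid execution order.
order = list(TopologicalSorter(consumes).static_order())
print(order)  # ['a', 'c', 'b']
```

Modules with no dependency edges between them (like the end-lumi case above) are unconstrained by the graph and may run in any order, which is why only explicit consumes relations give ordering guarantees.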

@Dr15Jones
Contributor

Looking at your tracer output

  SiStripOfflineDQM/'siStripOfflineAnalyser' consumes:
    DQMToken 'siStripQTester' '' '', run
    DQMToken 'siStripQTester' '' '', lumi

I see that the siStripQTester data requests are done without specifying a product instance label. Try adding a unique product instance label for both the run and lumi.

@schneiml
Contributor Author

schneiml commented Nov 12, 2019

I see that the siStripQTester data requests are done without specifying a product instance label. Try adding a unique product instance label for both the run and lumi.

See #28354 (comment), primarily the edits. schneiml:dqm-remove-clientconfig-order-bug. This is what I tried first.
[Edit: wait, maybe I am confused. This is the code: https://github.com/schneiml/cmssw/commit/21039286d3f79aa5e9ff1b69044817c80e1c6b01 . I could add a DQMGenerationQTest there to be more specific. Though, things turned out to work correctly when using consumesMany. Apart from the issues in endJob. [Edit2: after a quick test in #28379 it seems that adding the instance label works as well as consumesMany.]]
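For reference, the two InputTag forms differ only in the product instance label; a sketch (the "DQMGenerationQTest" instance label is the hypothetical one mentioned above):

```python
import FWCore.ParameterSet.Config as cms

# Tag without a product instance label, as in the original attempt:
withoutLabel = cms.InputTag("siStripQTester")

# Tag with an explicit instance label (the DQMGenerationQTest label mentioned
# above); this matches a produces<DQMToken, edm::Transition::EndRun>("...")
# declaration on the C++ side, and made the ordering resolve in testing.
withLabel = cms.InputTag("siStripQTester", "DQMGenerationQTest")
```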

The framework only allows data to pass from one module to another via the Event, LuminosityBlock, Run and EventSetup. No other inter-module communication is allowed. As such, since endJob transitions do not get any of those objects as an argument, the framework does not (and never has) guaranteed processing order of modules at endJob (or beginJob, beginStream or endStream) transitions.

Yet here we are: the SiStrip Certification code has relied on that since pretty much the beginning of time. Also, multi-run harvesting obviously relies on communication across run boundaries, and has been an official feature of DQM for years now. Do I read correctly that even if we set everything to be as legacy as possible (as we do in harvesting, to my knowledge), this was never supported, it could break at any moment, and there is no way to fix it?

@Dr15Jones
Contributor

Do I read correctly that even if we set everything to be as legacy as possible (as we do in harvesting, to my knowledge), this was never supported, it could break at any moment, and there is no way to fix it?

That is correct.

@Dr15Jones
Contributor

[Edit2: after a quick test in #28379 it seems that adding the instance label works as well as consumesMany.]

OK, so now it does work for you?

@schneiml
Contributor Author

schneiml commented Nov 13, 2019

OK, so now it does work for you?

Test results show that yes, adding the instance label works, just as consumesMany worked before.

Do I read correctly that even if we set everything to be as legacy as possible (as we do in harvesting, to my knowledge), this was never supported, it could break at any moment, and there is no way to fix it?

That is correct.

Now, for the main question in the title of this issue -- how to ensure endJob ordering -- I guess your answer is as clear as it gets. Except that it does not solve the problem at all. We somehow have to explain now to all the people who keep asking about multi-run harvesting that this was actually never supported at all, and to @mmusich and the Tracker DQM team that some parts of SiStripMonitorClient were never supposed to work at all, even though they were present and running for more than 10 years.

The only way forward that I can see is to move all the operations to endRun and tie everything down with tokens -- giving up multi-run harvesting. However, this might cause serious fallout among the users...

[Edit: another option is to restructure SiStripMonitorClient into fewer modules, and do all dependent operations in a single monolith. Which does not sound ideal either, and does not change the fact that multi-run harvesting is not really possible without undefined behavior.]

@schneiml
Contributor Author

On the concrete issue of SiStripCertificationInfo, it turns out that I was actually mistaken and the transition that needs correct ordering is indeed endRun (endJob actually does not matter for this module). So it can be fixed by passing tokens around, though that would be a lot easier if

  • EDM actually complained if consumes are not satisfied. Is there an option for that? For me, the only way to see that EDM figured out a dependency correctly is to observe the ordering with the tracer.
  • produces<DQMToken, edm::InRun>("DQMGenerationSiStripAnalyserRun"); would not silently compile and run but not actually produce anything.

Now, I only have to figure out which additional modules need to be reordered, since results changed, but are still not as expected...

Also, this does not change anything about the conceptual problem: traditionally, we do harvesting in endJob transitions, so that potentially more than one run can be aggregated (multi-run harvesting), but then we cannot have dependencies between modules, and I am not even sure whether the output module (which is not an OutputModule at all, IIRC) is guaranteed to run late enough to see all MEs. I'd be happy if we could at least specify the current behaviour of running modules ordered by their ID (which, I guess, is eventually determined by the module's place in a sequence), so that we can at least manually adjust things in the configuration to work correctly.

With all data dependencies outside runs and lumis banned, we'd have to completely abandon the concept of multi-run harvesting, as far as I can see; or would it be legal to, say, accumulate statistics within a module across multiple runs, and save the result at endJob, as long as everything remains local? I don't see how this could interfere with anything (of course runs could be processed in arbitrary order etc.). Could an edm::Service then detect the last endJob transition to collect and save the results? Are data dependencies between endJob and endRun fine, that is, is it guaranteed that endJob transitions for any module can only happen after all endRun transitions have finished?

@makortel
Contributor

  • EDM actually complained if consumes are not satisfied. Is there an option for that? For me, the only way to see that EDM figured out a dependency correctly is to observe the ordering with the tracer.

Framework complains if the consumer attempts to read a missing product.

  • produces<DQMToken, edm::InRun>("DQMGenerationSiStripAnalyserRun"); would not silently compile and run but not actually produce anything.

Such a behavior is in fact legal (although not really encouraged). Again, an exception is thrown if a consumer attempts to read a product that was never produced.

@schneiml
Contributor Author

The problem with produces<DQMToken, edm::InRun>("DQMGenerationSiStripAnalyserRun"); is that it actually has to be produces<DQMToken, edm::Transition::EndRun>("DQMGenerationSiStripAnalyserRun");. One is an enum, the other an enum class, the values have different meanings, and I am really surprised that this compiles at all...

Re failure on reading: yes, of course, though that does not help too much when trying to figure out how to get things in the right order. Though probably I am doing something rather unusual here...

@Dr15Jones
Contributor

The problem with produces<DQMToken, edm::InRun>("DQMGenerationSiStripAnalyserRun"); is that actually it has to be produces<DQMToken, edm::Transition::EndRun>("DQMGenerationSiStripAnalyserRun");. One is an enum, the other an enum class, the values have different meanings, and I am really surprised that this compiles at all...

The edm::InRun was the old legacy way of specifying where data is produced. The new edm::Transition::EndRun is the new way, since it disambiguates doing produces in a begin or end transition. Hence both compile. What can be changed is a run-time check: if one specifies via the template parameter 'EndRunProducer' but then calls produces<..., edm::InRun>, a runtime error occurs. (Unfortunately there is no way I can make it a compile-time failure.)

@schneiml
Contributor Author

@Dr15Jones interesting, good to know. Does this mean I can use edm::Transition::EndRun also for consumes or are the APIs now asymmetrical (which would sort of make sense, but is also surprising)?

@Dr15Jones
Contributor

@schneiml it looks like consumes doesn't have a way to specify in which transition one will do the consuming; it only says from which container the data will be obtained. The assumption right now is that the data is going to be requested in the earliest possible transition matching the data. This is not the greatest, but comes from the fact that 99.99999% ( :) ) of requests are in the event and very, very few ever ask for data from the Run or LuminosityBlock. In addition, Run and LuminosityBlock products are never 'conditionally' created, i.e. they are not 'made on demand' but instead will always be made. The consumes calls are just used to guarantee ordering of module calls, not to decide if a given module should be called.

I'd be all for extending the consumes API to require the explicit specification of transition in order to make everything consistent.

@schneiml
Contributor Author

@Dr15Jones ok, I can live with that behavior for now, though the two calls next to each other look a bit weird...

Now, coming back to the main issue in this thread, we need to have a longer discussion about this at some point.

  • What can we rely on in endJob harvesting? Are we sure that e.g. all MEs are produced (all modules' endJob methods have run) before saving the output file?
  • Do we need to move all harvesting to endRun (and give up multi-run harvesting!) to be safe? Note that multi-run harvesting is relied on in AlCa as well.

This should maybe go into a core software meeting at some point.

@makortel
Contributor

@schneiml We will think about a solution for multi-run harvesting that conforms to framework policies. Could you please describe the requirements for the various actions that need to be done to achieve multi-run harvesting?

@schneiml
Contributor Author

@makortel at first glance, what we need is

  • Ability to aggregate information over multiple runs (that means keeping state between runs)
  • Ability to post-process that information at the end of the job (ideally with multiple modules, in a well-defined order: e.g. compute efficiency, apply quality tests, create summary, in three different modules).
  • Ability to run an "output module" (this may be an EDAnalyzer/EDProducer writing a file, or a real output module) after all the postprocessing.

The first can be covered by a central module managing the state (the current DQMStore does that, and it is sort-of legal, as far as I understand), the second is to my knowledge not required at the moment [1] (but I might be wrong, and it is a very reasonable request), and the third is the most worrying (since I am really not sure how it works today).

[1] that is, the processing currently happens at endRun, where we can pass products around, or there are no dependencies between modules apart from "after all endRun" and "before endJob saving".
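The first requirement (keeping state between runs, finalizing only at end of job) can be sketched in plain Python; class and method names are illustrative of the pattern only, not any real CMSSW API:

```python
# Illustrative sketch of multi-run aggregation: per-run statistics are
# accumulated locally (runs may arrive in any order) and post-processed
# exactly once at end-of-job.
class MultiRunHarvester:
    def __init__(self):
        self.entries = 0
        self.passed = 0

    def end_run(self, run_entries, run_passed):
        # local accumulation only; no inter-module communication needed
        self.entries += run_entries
        self.passed += run_passed

    def end_job(self):
        # final post-processing, after all runs have been seen
        return self.passed / self.entries if self.entries else 0.0

h = MultiRunHarvester()
h.end_run(100, 90)
h.end_run(50, 45)
print(h.end_job())  # 0.9
```

Since everything stays local to the module, this pattern avoids the banned inter-module communication; the open questions above (ordering of endJob vs. endRun, and who saves the result) remain.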

@wddgit
Contributor

wddgit commented Mar 6, 2020

@schneiml Just hypothetically, say that a way to aggregate over multiple runs is developed. How is the range of runs to aggregate over determined? Is it good enough to aggregate over all the runs processed in one cmsRun job? In other words, the range is determined by the input of one cmsRun job. How is this range determined now in multi-run DQM?

@schneiml
Contributor Author

schneiml commented Mar 9, 2020

@wddgit The behaviour today is:

  • For single run harvesting, we run a job that filters out lumisections of a single run, aggregates over the entire job, saves at the end of the job.
  • For multi run harvesting, we run a job that reads a set of files, aggregates over the entire job, saves at the end of the job.

So indeed, today everything is determined by the input to the job. Also, today single-run and multi-run harvesting are basically identical.

This has the benefit that multi-run harvesting typically "just works", but the disadvantage that we can't single-run harvest multiple runs in one job. Since we don't really need the latter, this is a good deal; though a solution that would allow all combinations without duplicating the code (endRun vs. endJob) would be nicer.

From a practical point of view, the next larger quantity that multi-run harvesting usually runs over are data taking eras, which turn into datasets in computing (e.g. 2018A, 2018B, etc.). Within these, the configuration of the detector changes little enough that multi-run harvesting has a chance of working.

@makortel
Contributor

makortel commented Jun 4, 2020

#30117 should provide a solution

@makortel
Contributor

+1

#30117 was merged in CMSSW_11_2_X_2020-07-10-1100.

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@schneiml Could you start to try out the ProcessBlock? Feedback from actual use would be useful, especially in light of development towards ProcessBlock persistency. Thanks.

@schneiml
Contributor Author

@makortel I'll try to get something running this week.

@wddgit
Contributor

wddgit commented Jul 14, 2020

@schneiml There is some new documentation on the TWiki that might help, related to ProcessBlock. Let me know if you notice any problems with it.

https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideProcessBlockData

@schneiml
Contributor Author

@wddgit I tried using the new feature in #30698 but have not gotten it to work yet. Feel free to have a look. I have not looked at the docs yet, maybe that will help...

@makortel
Contributor

makortel commented Sep 9, 2020

Could this issue be closed?
