EndJob module ordering #28354
A new Issue was created by @schneiml Marcel Schneider. @davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here

assign core

New categories assigned: core @Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
Update: I now resolved some problems in #28316 by putting QTests in But I found something that looks like a bug: Except (the two modules in question are
Am I missing something or has edm just actively reordered I'll try it next with a proper Edit: To reproduce, use
@schneiml I'm afraid unscheduled processing of Run and Lumi products has not yet been implemented. We will move that next on the 'to-do' list. Sorry about that.

See #28364
@Dr15Jones ah well... that is a problem, I guess? That means in this whole "DQM sequences" cleanup, we need to not only make sure we run the correct modules, but also make sure they appear in the correct order? That is unexpected, and unfortunate. Still, what are the rules then? Tbh, scheduled execution is something that was never really clear to me, especially in the offline workflows where everything is fed through this "convert to unscheduled" mechanism. For example, why do these two modules in question get reordered, and why do they get ordered correctly when I use Or, is implementing that a matter of a few days? Then I'll just wait for that, given I found a random combination that mostly works.
@Dr15Jones (or maybe @makortel ?) please note that there is more to this question than just implementing #28364. See the comment over at the PR: #28316 (comment) I don't understand enough about scheduling in CMSSW to make strong claims here. Was this ever guaranteed to work? If yes, which change broke it? If no, how would we fix it?
@schneiml from a quick look, it appears to be just a job for a few days. However, I'm on vacation in Australia with limited access to computing time. So maybe it can be changed by the end of next week?
@Dr15Jones end of next week is better than nothing, but that does not really answer my question: Is manual scheduling in production workflows still possible? If yes, how is it controlled? If no, how do we handle ordering dependencies in
@schneiml upon re-evaluation, it looks like the present framework code DOES preserve consumes ordering when requesting data from Runs and LuminosityBlocks. I created a very simple test module which requests data from the Run and/or LuminosityBlock at either begin and/or end, and set up the following ordering:

```python
process.a = cms.EDProducer("NonEventIntProducer", ivalue = cms.int32(1))
process.b = cms.EDProducer("NonEventIntProducer", ivalue = cms.int32(2), consumesEndRun = cms.InputTag("c","endRun"))
process.c = cms.EDProducer("NonEventIntProducer", ivalue = cms.int32(3), consumesEndRun = cms.InputTag("a","endRun"))
process.p = cms.Path(process.a+process.b+process.c)
```

Then using the tracer I see that the end LuminosityBlock (where there is no module dependency) runs in 'path' order, while the end Run (where there is a module dependency) runs in consumes-dependency order. So now the question is, why didn't it work for you? Exactly how did you specify your dependency in the code?
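The ordering behaviour described above can be sketched outside the framework as a topological sort over the consumes graph of the a/b/c example. This is illustrative Python only, not framework code; the framework may pick any linearization consistent with the dependencies.

```python
# Illustrative sketch (not framework code): at end-Run transitions the
# framework may run modules in any order consistent with their consumes
# dependencies. Reproducing the a/b/c example: b consumes c, c consumes a.
from graphlib import TopologicalSorter

consumes = {          # module -> set of modules whose products it consumes
    "a": set(),
    "b": {"c"},
    "c": {"a"},
}
order = list(TopologicalSorter(consumes).static_order())
print(order)  # a valid dependency order, e.g. ['a', 'c', 'b']
```

Any order in which each module runs after every module it consumes from is acceptable; the tracer output simply shows one such linearization.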
Manual ordering is only enforced during processing of Events (since Paths only have meaning for filtering during processing of Events, not during processing of Runs and LuminosityBlocks). At Run and LuminosityBlock transitions the framework concurrently runs all modules (the inter-module data dependencies automatically guarantee the ordering of dependent modules).
The framework only allows data to pass from one module to another via the Event, LuminosityBlock, Run and EventSetup. No other inter-module communication is allowed. As such, since endJob transitions do not get any of those objects as an argument, the framework does not (and never has) guaranteed the processing order of modules at endJob (or beginJob, beginStream or endStream) transitions.
Looking at your tracer output
I see that

See #28354 (comment), primarily the edits.
Yet here we are: the SiStrip Certification code has relied on that since pretty much the beginning of time. Also multi-run harvesting obviously relies on communication across run boundaries, and has been an official feature of DQM for years now. Do I read correctly that even if we set everything to be as legacy as possible (as we do in harvesting, to my knowledge), this was never supported, it could break at any moment, and there is no way to fix it?
That is correct.

OK, so now it does work for you?
Test results show that yes, putting the instance label works. As well as
Now, for the main question in the title of this issue -- how to ensure The only way forward that I can see is to move all the operations to [Edit: another option is to restructure

On the concrete issue of

Now, I only have to figure out which additional modules need to be reordered, since results changed, but are still not as expected... Also, this does not change anything about the conceptual problem: Traditionally, we do harvesting in With all data dependencies outside runs and lumis banned, we'd have to completely abandon the concept of multi-run harvesting, as far as I see; or would it be legal to, say, accumulate statistics within a module across multiple runs, and save the result at
Framework complains if the consumer attempts to read a missing product.
Such behavior is in fact legal (although not really encouraged). Again, an exception is thrown if a consumer attempts to read a product that was never produced.
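The pattern deemed legal above, accumulating inside a single module across runs and only emitting the result at end of job, can be sketched in plain Python. This is a minimal illustration only; `MultiRunAccumulator` and its method names are made up and are not CMSSW API.

```python
# Minimal sketch of the pattern discussed: a module-like object that
# accumulates statistics across runs internally and only emits the
# aggregated result at end of job. Purely illustrative, not CMSSW code.
class MultiRunAccumulator:
    def __init__(self):
        self.entries = 0
        self.total = 0.0

    def analyze_run(self, run_values):
        # called once per run with that run's measurements
        self.entries += len(run_values)
        self.total += sum(run_values)

    def end_job(self):
        # the aggregated result only exists here, after all runs were seen
        return self.total / self.entries if self.entries else None

acc = MultiRunAccumulator()
for run in ([1.0, 2.0], [3.0], [4.0, 5.0, 6.0]):
    acc.analyze_run(run)
print(acc.end_job())  # 3.5
```

Note that no other module can consume this state, which is exactly why the framework cannot order anything relative to it.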
The problem with Re failure on reading: yes, of course, though that does not help too much when trying to figure out how to get things in the right order. Though probably I am doing something rather unusual here...

The

@Dr15Jones interesting, good to know. Does this mean I can use

@schneiml it looks like the I'd be all for extending the
@Dr15Jones ok, I can live with that behavior for now, though the two calls next to each other look a bit weird... Now, coming back to the main issue in this thread, we need to have a longer discussion about this at some point; maybe in a core software meeting.
@schneiml We will think about a solution for the multi-run harvesting that would conform to framework policies. Could you please describe the requirements for the various actions that need to be done to achieve multi-run harvesting?
@makortel at first glance, what we need is

The first can be covered by a central module managing the state (the current DQMStore does that, and it is sort-of legal, as far as I understand), the second is to my knowledge not required at the moment [1] (but I might be wrong, and it is a very reasonable request), and the third is the most worrying (since I am really not sure how it works today).

[1] That is, the processing currently happens at
@schneiml Just hypothetically, say that a way to aggregate over multiple runs is developed. How is the range of runs to aggregate over determined? Is it good enough to aggregate over all the runs processed in one cmsRun job, i.e. the range is determined by the input of one cmsRun job? How is this range determined now in multi-run DQM?
@wddgit The behaviour today is:
So indeed, today everything is determined by the input to the job. Also, today single-run and multi-run harvesting are basically identical. This has the benefit that multi-run harvesting typically "just works", but the disadvantage that we can't single-run harvest multiple runs in one job. Since we don't really need the latter, this is a good deal; though a solution that would allow all combinations without duplicating the code (

From a practical point of view, the next larger quantity that multi-run harvesting usually runs over is data-taking eras, which turn into datasets in computing (e.g. 2018A, 2018B, etc.). Within these, the configuration of the detector changes little enough that multi-run harvesting has a chance of working.
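The aggregation semantics described here, where the range of runs is simply whatever the job was given as input, can be sketched as a plain merge over per-run histograms. Illustrative Python only, not DQM code; the run numbers and bin names are invented.

```python
# Sketch of the behaviour described above: the set of runs to merge is
# exactly the input of one cmsRun job; single-run harvesting is just the
# special case of a one-run input. Not DQM code.
from collections import Counter

def harvest(runs_in_job):
    """Merge per-run 'histograms' (bin -> count) over all input runs."""
    merged = Counter()
    for run_number, histogram in runs_in_job.items():
        merged.update(histogram)
    return dict(merged)

# e.g. a 2018A-like job: the range is fixed by the input list, nothing else
job_input = {
    315000: {"pt_bin0": 10, "pt_bin1": 4},
    315001: {"pt_bin0": 7, "pt_bin1": 9},
}
print(harvest(job_input))  # {'pt_bin0': 17, 'pt_bin1': 13}
```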
#30117 should provide a solution
+1 #30117 was merged in CMSSW_11_2_X_2020-07-10-1100. |
This issue is fully signed and ready to be closed. |
@schneiml Could you start to try out the ProcessBlock? Feedback from actual use would be useful, especially in light of development towards ProcessBlock persistency. Thanks.

@makortel I'll try to get something running this week.

@schneiml There is some new documentation on the TWiki that might help related to ProcessBlock. Let me know if you notice any problems with it. https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideProcessBlockData

Could this issue be closed?
Dear framework experts,

while debugging the observed differences in #28316 I noticed a serious conceptual problem in HARVESTING.

The situation is the following:

- In `endRun`, `QualityTesters` apply QTests (by default; they can also do that in other transitions).
- In `endJob`, custom harvesting modules perform harvesting, producing new MEs.

Why is that a problem? The MEs created in harvesting can't have QTests applied, but some do have them, and that used to work fine.

How did this ever work? In the current/old DQMStore, QTests are also saved if they do not apply to any ME, and then applied when a matching ME gets booked (at least, that is how I understand it).

What could we do to make it work again?

- Apply the QTests in `endJob`. But this does not help, unless we can enforce correct ordering of modules in the `endJob` transition. In `endRun`, this can be fixed by passing through products, but how to do it in `endJob`?
- Move the harvesting to `endRun`. But this is a significant semantic change, and will break multi-run harvesting. Not something we can do for CMSSW11, most likely.

Bonus problems:

- Some harvesting modules depend on the output of the `QualityTester`, which means we might need to enforce a harvester->tester->harvester sequence in `endJob`.

So the questions to the CMSSW experts are: how can we control the ordering of modules in `endJob`? The involved module types are `edm::one::EDProducer` and legacy `edm::EDAnalyzer`.