Failure in RECO only sample production for the 12_5_0_pre5 release validation #39287
A new Issue was created by @JinfengLiu97 JinfengLiu. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Problem 1 was caused by #38442, and is similar in nature to #38860, specifically by
- typedef l1t::TkMuonVectorRef VRl1ttkmuon;
+ // This is a std::vector<TrackerMuonRef>,
+ // and should be called TrackerMuonVectorRef upstream.
+ // The L1T group should be made aware of that
+ typedef l1t::TrackerMuonRefVector VRl1ttkmuon;
in
FYI @cms-sw/upgrade-l2 @cms-sw/l1-l2 @cms-sw/hlt-l2 |
In problem 2, the real exception message was (parts got stripped for the Unified page)
I see |
assign upgrade, l1, hlt |
New categories assigned: upgrade,hlt,l1 @epalencia,@AdrianoDee,@missirol,@srimanob,@rekovic,@Martin-Grunewald,@cecilecaillol you have been requested to review this Pull request/Issue and eventually sign? Thanks |
urgent |
Would re-generating the input file with pre5 solve the issue? |
It should, because in both cases the problem is in reading a file created with pre4. |
I wonder if it would be worthwhile to catch this situation (RECO failing on a file created with the previous pre-release) in IB tests? (and either try to fix it before the next pre-release, or have @cms-sw/pdmv-l2 know that they would fail). |
At the risk of asking trivial questions: isn't backwards incompatibility for such workflows something that is often bound to happen in pre-releases (as DataFormats change)? Could it have been avoided in any way? Since this is urgent, what's the deliverable of this issue? A workaround (if it exists) to make the job run? |
We guarantee backwards compatibility between CMSSW major releases only for RAW; for everything else backwards compatibility is kept on a best-effort basis (and thus in practice when something breaks it must happen in some pre-release). That said, in practice we have been pretty good at keeping data formats backwards compatible (which in some cases has required non-negligible effort). |
Just to note, because of the name I became concerned whether this product would be stored as part of RAW, but from
and trigger::TriggerEventWithRefs is only part of HLTDebugRAW (which is used in RAWSIMHLT , RAWRECOSIMHLT and RAWDEBUGHLT )
|
Indeed, what sort of workflow is this and why is it not part of the IB tests?
|
It might be technically possible to craft an iorule for the new versions of the classes to ignore the corresponding content of the earlier version of the class and just initialize the corresponding data members with default values. Of course those data members would be physics-wise meaningless when reading in older files, but it would be technically possible to read in an old file. Whether such a setup would make sense here (or in #38860) I don't know. |
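To illustrate the idea in the comment above, here is a minimal sketch in plain Python. It is not ROOT's I/O rule syntax and not CMSSW code; the class `TriggerSummaryV2`, its members, and the version numbers are all hypothetical. It only shows the concept: when an old record is read, the incompatible member is skipped and left at its default value.

```python
from dataclasses import dataclass, field


@dataclass
class TriggerSummaryV2:
    """Hypothetical 'new' class layout; names are illustrative only."""
    filter_ids: list = field(default_factory=list)
    muon_refs: list = field(default_factory=list)  # element type changed in V2


def read_with_rule(stored: dict, stored_version: int) -> TriggerSummaryV2:
    """Build the current layout from a stored record of a known version."""
    # Members whose type did not change are copied as they are.
    obj = TriggerSummaryV2(filter_ids=list(stored.get("filter_ids", [])))
    if stored_version >= 2:
        obj.muon_refs = list(stored.get("muon_refs", []))
    # For version 1 the stored 'muon_refs' content is incompatible, so it is
    # ignored and the member keeps its default (empty) value.
    return obj


old_record = {"filter_ids": [11, 13], "muon_refs": ["old-style-ref"]}
new_record = {"filter_ids": [11], "muon_refs": [0, 3]}
from_old = read_with_rule(old_record, stored_version=1)
from_new = read_with_rule(new_record, stored_version=2)
```

With such a rule the old file becomes readable, but `from_old.muon_refs` is empty, so anything downstream that relies on it sees no content for old inputs.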
I think for Phase-2 this is not the first time we have come across this situation. Phase-2 is in development; do we want to apply backwards compatibility, or put effort into maintaining it? |
It seems that problem 1 occurred in a Run 3 workflow (input dataset being |
Just to confirm that we see these issues in both Run3 and Phase2 wfs (as mentioned in the issue description). |
@kskovpen - please summarize the runTheMatrix numbers that should reproduce this. Thx. |
Sure, the affected wfs mentioned in this issue are 11834 and 39434. |
I tried to reproduce problem 1 by generating the
Following up the discussion in ORP, I see the job uses Run 3 GT |
I was able to reproduce the exception with
I also started to wonder about one aspect of this workflow. The problematic data product is being read by a (HLT) validation module. Is
I noticed the example job is configured to use 8 threads and 2 streams. I vaguely recall such a setup being used for Phase-2 RelVals to keep the memory under control. Do Run 3 RelVals also require that much memory? |
Put this task in #39346. There is a point to clarify first on the RAW we will use. |
Hmm, we used to guarantee only pure RAW compatibility - AFAIK, rereco is done from RAW as well... |
That should be the case. |
Fwiw..
Regarding workarounds for
(this disables all DQM outputs of that module, so it wouldn't be that different from just doing
I reproduced what was described in #39287 (comment), but even with
the errors from the dqm/validation steps continue [*].
If this refers to [*]
and then
|
Thanks for the correction. I suppose the data RelVals use the "pure RAW" rather than an output of re-HLT(?) since failures have not been seen there (or have they?). I see
process.source.inputCommands = cms.untracked.vstring(
    "keep *",
    "drop triggerTriggerEventWithRefs_*_*_*"
)
seems to be sufficient to get the step2 job to run (so apparently all modules in this job consuming are able to handle it being absent, including |
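For readers unfamiliar with keep/drop commands, the following is a simplified model in plain Python of how such a list selects branches. It is a sketch, not the CMSSW implementation. It assumes branch names have the form `type_module_instance_process` and that commands are applied in order, with later commands overriding earlier ones. The helper `select_branches` and the example branch names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_branches(branch_names, commands):
    """Apply 'keep'/'drop' commands in order; the last matching command wins."""
    kept = set()
    for command in commands:
        action, pattern = command.split(None, 1)
        if action not in ("keep", "drop"):
            raise ValueError(f"unknown command: {command!r}")
        # '*' in the pattern matches any run of characters, including none.
        matched = {name for name in branch_names
                   if fnmatchcase(name, pattern.strip())}
        if action == "keep":
            kept |= matched
        else:
            kept -= matched
    # Preserve the original branch order in the result.
    return [name for name in branch_names if name in kept]


# Hypothetical branch names, for illustration only.
branches = [
    "triggerTriggerEventWithRefs_hltTriggerSummaryRAW__HLT",
    "triggerTriggerEvent_hltTriggerSummaryAOD__HLT",
    "FEDRawDataCollection_rawDataCollector__HLT",
]
commands = ["keep *", "drop triggerTriggerEventWithRefs_*_*_*"]
selected = select_branches(branches, commands)
```

In this model only the `triggerTriggerEventWithRefs` branch is dropped. The similarly named `triggerTriggerEvent` branch stays, because the pattern requires the full type name followed by an underscore.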
Thanks for finding the workaround!
I think so, but I don't know for sure (and I'm not aware of other failures). Below some info, but please correct if needed.
|
That is real-data RAW, not RAWSIM.
|
This issue is marked as 'urgent', but seemingly dormant. Workarounds were provided, and the last part of #39287 (comment) outlined a few possible action items. What are the next steps? (and for whom) PS.
I didn't really catch this comment back then, but I think this was fixed in #39834. |
I'd imagine the need to address these specific RelVal workflows is long gone. I wonder if we should improve testing to be able to catch these situations earlier (which would, more or less, mean a representative workflow in runTheMatrix)? |
I would say 'yes', but I guess this is a question to @cms-sw/pdmv-l2 , since they are the ones who opened the issue, and likely the ones who would implement such a test. |
@kskovpen Do you find the current mode of operation, i.e. discovering data format incompatibilities in these RelVals (that either can be worked around or not), sufficient? Or would you like to catch them sooner? |
@makortel I've been trying to implement it, but it looks quite messy in the upgrade RelVal implementation if we want to make it dynamic and grab the latest pre-release production. Also, the datasets have to be produced first before they can enter workflows in IB tests, and there is always a time delay in the RelVal production. I would say let's follow the usual way. |
#40288 adds a test that should catch non-backward-compatible changes to the |
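As a generic illustration of this kind of test (not the implementation in #40288, whose details are not shown in this thread), a backward-compatibility check can keep a small payload written with an older layout and verify that the current expectations still read it. Everything below is a hypothetical sketch in plain Python, including the payload, the field names, and the helper `check_backward_compatible`.

```python
import json

# Hypothetical reference payload, standing in for a small file written by an
# earlier release and stored alongside the test.
REFERENCE_PAYLOAD = json.dumps({"version": 1, "ids": [1, 2, 3]})

# Hypothetical description of what the current code expects to read.
CURRENT_FIELDS = {"ids": list}


def check_backward_compatible(payload: str, fields: dict) -> list:
    """Return the list of problems found when reading an old payload."""
    record = json.loads(payload)
    problems = []
    for name, expected_type in fields.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"type changed: {name}")
    return problems


ok = check_backward_compatible(REFERENCE_PAYLOAD, CURRENT_FIELDS)
broken = check_backward_compatible(REFERENCE_PAYLOAD, {"ids": dict})
```

The point of the design is that the reference payload is frozen, so a later change to the expected layout makes the test fail at integration time, before a RelVal campaign reads an older file.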
+hlt
Going back to #39287 (comment)
#39287 (comment) clarified that this is already the case, and gives a possible workaround for this issue.
This is less easy, and wasn't attempted for now. It might have to be reconsidered if this issue continues to appear. For now, HLT added a simple test to catch non-backward-compatible changes to |
+l1 |
Thanks a lot! |
+upgrade |
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
Hello, we encountered a failure in the RECO-only sample production for the 12_5_0_pre5 release validation. Could you help us solve this issue?
Two kinds of errors were found:
/RelValTTbar_14TeV/CMSSW_12_5_0_pre4-PU_124X_mcRun3_2022_realistic_v10-v2/GEN-SIM-DIGI-RAW
The failure is found in the RECO step, which can be reproduced with:
https://cms-pdmv.cern.ch/relval/api/relvals/get_cmsdriver/CMSSW_12_5_0_pre5__AUTOMATED_fullsim_PU_2022_14TeV_RECOonly-TTbar_14TeV-00002
The error report can be found here:
https://cms-unified.web.cern.ch/cms-unified/showlog/?search=CMSSW_12_5_0_pre5__AUTOMATED_fullsim_PU_2022_14TeV_RECOonly-TTbar_14TeV-00002
You can also check it as below:
/RelValTTbar_14TeV/CMSSW_12_5_0_pre4-PU_124X_mcRun4_realistic_v8_2026D88PU200-v1/GEN-SIM-DIGI-RAW
The failure is also found in the RECO step, which can be reproduced with:
https://cms-pdmv.cern.ch/relval/api/relvals/get_cmsdriver/CMSSW_12_5_0_pre5__AUTOMATED_UPSG_Std_2026D88PU200_RECOonly-TTbar_14TeV-00001
The error report can be found here:
https://cms-unified.web.cern.ch/cms-unified/showlog/?search=CMSSW_12_5_0_pre5__AUTOMATED_UPSG_Std_2026D88PU200_RECOonly-TTbar_14TeV-00001
It can be checked as below:
Regards
Jinfeng (for PdmV)