Failure in RECO only sample production for the 12_5_0_pre5 release validation #39287

Closed
JinfengLiu97 opened this issue Sep 2, 2022 · 49 comments


@JinfengLiu97

Hello, we encountered a failure in the RECO-only sample production for the 12_5_0_pre5 release validation. Could you help us solve this issue?

Two kinds of errors were found:

  1. The input dataset is:
    /RelValTTbar_14TeV/CMSSW_12_5_0_pre4-PU_124X_mcRun3_2022_realistic_v10-v2/GEN-SIM-DIGI-RAW
    The failure is found in the RECO step, which can be reproduced with:
    https://cms-pdmv.cern.ch/relval/api/relvals/get_cmsdriver/CMSSW_12_5_0_pre5__AUTOMATED_fullsim_PU_2022_14TeV_RECOonly-TTbar_14TeV-00002
    The error report can be found here:
    https://cms-unified.web.cern.ch/cms-unified/showlog/?search=CMSSW_12_5_0_pre5__AUTOMATED_fullsim_PU_2022_14TeV_RECOonly-TTbar_14TeV-00002
    You can also check it as below:
Fatal Exception (Exit code: 8001)
An exception of category 'FileReadError' occurred while
[0] Processing Event run: 1 lumi: 1 event: 2 stream: 0
[1] Running path 'validation_step'
[2] Prefetching for module HLTHiggsValidator/'hltHiggsValidator'
[3] While reading from source std::vector ak4GenJets '' HLT
[4] Rethrowing an exception that happened on a different read request.
[5] Processing Event run: 1 lumi: 1 event: 1 stream: 1
[6] Running path 'dqmoffline_1_step'
[7] Prefetching for module HLTFiltersDQMonitor/'hltFiltersDQM'
[8] While reading from source trigger::TriggerEventWithRefs hltTriggerSummaryRAW '' HLT
[9] Reading branch triggerTriggerEventWithRefs_hltTriggerSummaryRAW__HLT.
Additional Info:
[a] Fatal Root Error: @SUB=TStreamerInfo::BuildOld
Cannot convert trigger::TriggerRefsCollections::l1ttkmuonRefs_ from type: vector,l1t::TkMuon,edm::refhelper::FindUsingAdvance,l1t::TkMuon> > > to type: vector,l1t::TrackerMuon,edm::refhelper::FindUsingAdvance,l1t::TrackerMuon> > >, skip element
  2. The input dataset is:
    /RelValTTbar_14TeV/CMSSW_12_5_0_pre4-PU_124X_mcRun4_realistic_v8_2026D88PU200-v1/GEN-SIM-DIGI-RAW
    The failure is also found in the RECO step, which can be reproduced with:
    https://cms-pdmv.cern.ch/relval/api/relvals/get_cmsdriver/CMSSW_12_5_0_pre5__AUTOMATED_UPSG_Std_2026D88PU200_RECOonly-TTbar_14TeV-00001
    The error report can be found here:
    https://cms-unified.web.cern.ch/cms-unified/showlog/?search=CMSSW_12_5_0_pre5__AUTOMATED_UPSG_Std_2026D88PU200_RECOonly-TTbar_14TeV-00001
    It can be checked as below:
Fatal Exception (Exit code: 8007)
An exception of category 'DictionaryNotFound' occurred while
[0] Constructing the EventProcessor
Exception Message:
No Dictionary for class: 'vector'

Regards
Jinfeng (for PdmV)

@cmsbuild
Contributor

cmsbuild commented Sep 2, 2022

A new Issue was created by @JinfengLiu97 JinfengLiu.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@kskovpen
Contributor

kskovpen commented Sep 2, 2022

Just to add that we have also encountered issues of similar nature in the past: #37013 and #38860.

@makortel
Contributor

makortel commented Sep 2, 2022

Problem 1 was caused by #38442 and is similar in nature to #38860, specifically by

-  typedef l1t::TkMuonVectorRef VRl1ttkmuon;
+  // This is a std::vector<TrackerMuonRef>,
+  // and should be called TrackerMuonVectorRef upstream.
+  // The L1T group should be made aware of that
+  typedef l1t::TrackerMuonRefVector VRl1ttkmuon;

in DataFormats/HLTReco/interface/TriggerRefsCollections.h. There is no way to convert std::vector<edm::Ref<std::vector<TkMuon>>> to std::vector<edm::Ref<std::vector<TrackerMuon>>>, so this change created a backwards incompatibility.

FYI @cms-sw/upgrade-l2 @cms-sw/l1-l2 @cms-sw/hlt-l2

@makortel
Contributor

makortel commented Sep 2, 2022

In problem 2, the real exception message was (parts got stripped for the Unified page)

----- Begin Fatal Exception 28-Aug-2022 22:55:22 CEST-----------------------
An exception of category 'DictionaryNotFound' occurred while
   [0] Constructing the EventProcessor
Exception Message:
No Dictionary for class: 'vector<l1t::TkPrimaryVertex>'
----- End Fatal Exception -------------------------------------------------

(from https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_RVCMSSW_12_5_0_pre5TTbar_14TeV__2026D88PU200_RECOonly_220828_105329_7185/8007/RecoGlobal_2026D88/ad5cd513-507a-489d-8ff2-93ef8a4b2fce-0-3-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log)

I see l1t::TkPrimaryVertex was removed by #38442. It might be possible to make the job run by dropping the corresponding products on input, but given problem 1, I'm not sure that would really be worth it.
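
The thread does not say how the Unified page lost parts of the messages. One plausible mechanism, and this is only an assumption, is that the C++ template arguments were treated as HTML tags and removed. A minimal sketch in plain Python (the function name `strip_tags` is hypothetical) shows that such naive tag stripping reproduces both truncated messages quoted in the issue description:

```python
import re


def strip_tags(text: str) -> str:
    """Remove everything that looks like an HTML tag: '<' up to the next '>'.

    Assumption: this mimics what a web page might do to unescaped text.
    It is not taken from the Unified code.
    """
    return re.sub(r"<[^>]*>", "", text)


# Problem 2: the full message quoted above loses its template argument.
message = "No Dictionary for class: 'vector<l1t::TkPrimaryVertex>'"
print(strip_tags(message))

# Problem 1: the full "from type" of l1ttkmuonRefs_ as ROOT prints it.
full = (
    "vector<edm::Ref<vector<l1t::TkMuon>,l1t::TkMuon,"
    "edm::refhelper::FindUsingAdvance<vector<l1t::TkMuon>,l1t::TkMuon> > >"
)
print(strip_tags(full))
```

The practical consequence is that class names on the Unified summary page may be incomplete, so the job's own cmsRun log is the safer place to read them.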

@makortel
Contributor

makortel commented Sep 2, 2022

assign upgrade, l1, hlt

@cmsbuild
Contributor

cmsbuild commented Sep 2, 2022

New categories assigned: upgrade,hlt,l1

@epalencia,@AdrianoDee,@missirol,@srimanob,@rekovic,@Martin-Grunewald,@cecilecaillol you have been requested to review this Pull request/Issue and eventually sign? Thanks

@perrotta
Contributor

perrotta commented Sep 2, 2022

urgent

@cmsbuild cmsbuild added the urgent label Sep 2, 2022
@Martin-Grunewald
Contributor

@trtomei

@Martin-Grunewald
Contributor

Would re-generating the input file with pre5 solve the issue?

@makortel
Contributor

makortel commented Sep 2, 2022

Would re-generating the input file with pre5 solve the issue?

It should, because in both cases the problem is in reading a file created with pre4.

@makortel
Contributor

makortel commented Sep 2, 2022

I wonder if it would be worth being able to catch this situation (RECO failing on a file created with the previous pre-release) in IB tests? (Then we could either try to fix it before the next pre-release, or @cms-sw/pdmv-l2 would at least know in advance that these jobs would fail.)

@missirol
Contributor

missirol commented Sep 2, 2022

At the cost of asking trivial questions: isn't backwards-incompatibility for such wf something that is often bound to happen in pre-releases (as DataFormats change)? Could it have been avoided in any way? Since this is urgent, what's the deliverable of this issue? A work-around (if it exists) to make the job run?

@makortel
Contributor

makortel commented Sep 2, 2022

isn't backwards-incompatibility for such wf something that is often bound to happen in pre-releases (as DataFormats change)?

We guarantee backwards compatibility between CMSSW major releases only for RAW; for everything else, backwards compatibility is kept on a best-effort basis (and thus, in practice, when something breaks it must happen in some pre-release). That said, in practice we have been pretty good at keeping data formats backwards compatible (which in some cases has required non-negligible effort).

@makortel
Contributor

makortel commented Sep 2, 2022

[8] While reading from source trigger::TriggerEventWithRefs hltTriggerSummaryRAW '' HLT

Just to note, because of the name I became concerned that this product might be stored as part of RAW, but from HLTrigger/Configuration/python/HLTrigger_EventContent_cff.py I see RAW includes only


and trigger::TriggerEventWithRefs is only part of HLTDebugRAW (which is used in RAWSIMHLT, RAWRECOSIMHLT and RAWDEBUGHLT)
'keep triggerTriggerEventWithRefs_*_*_*',

@davidlange6
Contributor

davidlange6 commented Sep 2, 2022 via email

@makortel
Contributor

makortel commented Sep 2, 2022

Could it have been avoided in any way?

It might be technically possible to craft an iorule for the new versions of the classes that ignores the corresponding content of the earlier version of the class and just initializes the corresponding data members with default values. Of course those data members would be physics-wise meaningless when reading in older files, but it would be technically possible to read in an old file.

Whether such a setup would make sense here (or in #38860) I don't know.
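
As a rough illustration of the idea above, here is a toy model in plain Python. It is not ROOT's I/O rule syntax and does not use any ROOT or CMSSW API. The function `read_old_record` and the member name `otherRefs_` are hypothetical; only `l1ttkmuonRefs_` comes from the error message. The stored value of an "ignored" member is discarded and replaced by the current default, while everything else is read as stored:

```python
import copy


def read_old_record(stored, current_defaults, ignored_members):
    """Build a current-version object from a record written by an older version.

    Members in ignored_members are not converted: their stored content is
    dropped and the current default value is used instead.
    """
    result = {}
    for name, default in current_defaults.items():
        if name in ignored_members or name not in stored:
            # Physics-wise meaningless, but the record becomes readable.
            result[name] = copy.deepcopy(default)
        else:
            result[name] = stored[name]
    return result


# Old record: l1ttkmuonRefs_ holds references of a type that no longer exists.
old_record = {"otherRefs_": ["ref0", "ref1"], "l1ttkmuonRefs_": ["old TkMuon ref"]}
defaults = {"otherRefs_": [], "l1ttkmuonRefs_": []}

new_object = read_old_record(old_record, defaults, {"l1ttkmuonRefs_"})
print(new_object)
```

The trade-off is the one stated in the comment above: the old file can be read, but the defaulted members carry no information from it.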

@srimanob
Contributor

srimanob commented Sep 2, 2022

I think for Phase-2 this is not the first time we have come across this situation. Phase-2 is in development; do we want to apply backwards compatibility or put effort into maintaining it?

@makortel
Contributor

makortel commented Sep 2, 2022

It seems that problem 1 occurred in a Run 3 workflow (input dataset being /RelValTTbar_14TeV/CMSSW_12_5_0_pre4-PU_124X_mcRun3_2022_realistic_v10-v2/GEN-SIM-DIGI-RAW)

@kskovpen
Contributor

kskovpen commented Sep 2, 2022

Just to confirm that we see these issues in both Run3 and Phase2 wfs (as mentioned in the issue description).

@davidlange6
Contributor

@kskovpen - please summarize the runTheMatrix numbers that should reproduce this. Thx.

@kskovpen
Contributor

kskovpen commented Sep 2, 2022

Sure, the affected wfs mentioned in this issue are 11834 and 39434.

@makortel
Contributor

makortel commented Sep 6, 2022

I tried to reproduce problem 1 by generating the step_2_cfg.py as in https://cms-pdmv.cern.ch/relval/api/relvals/get_cmsdriver/CMSSW_12_5_0_pre5__AUTOMATED_fullsim_PU_2022_14TeV_RECOonly-TTbar_14TeV-00002, and instead of the exception message shown in the description I get a segfault

#3  0x00007ff533491aeb in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_0_pre5/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007ff4a84f020f in HLTExoticaSubAnalysis::insertCandidates(unsigned int const&, EVTColContainer const*, std::vector<reco::LeafCandidate, std::allocator<reco::LeafCandidate> >*, std::map<int, double, std::less<int>, std::allocator<std::pair<int const, double> > >&, std::map<int, std::vector<reco::Track const*, std::allocator<reco::Track const*> >, std::less<int>, std::allocator<std::pair<int const, std::vector<reco::Track const*, std::allocator<reco::Track const*> > > > >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_0_pre5/lib/el8_amd64_gcc10/pluginHLTriggerOfflineExotica.so
#6  0x00007ff4a84f4ac2 in HLTExoticaSubAnalysis::analyze(edm::Event const&, edm::EventSetup const&, EVTColContainer*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_0_pre5/lib/el8_amd64_gcc10/pluginHLTriggerOfflineExotica.so
#7  0x00007ff4a84ff26c in HLTExoticaValidator::analyze(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_0_pre5/lib/el8_amd64_gcc10/pluginHLTriggerOfflineExotica.so
#8  0x00007ff4a8504877 in DQMOneEDAnalyzer<>::accumulate(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_0_pre5/lib/el8_amd64_gcc10/pluginHLTriggerOfflineExotica.so

Current Modules:
Module: HLTExoticaValidator:hltExoticaValidator (crashed)

Following up the discussion in ORP, I see the job uses Run 3 GT auto:phase1_2022_realistic and Run 3 files (/store/relval/CMSSW_12_5_0_pre4/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_124X_mcRun3_2022_realistic_v10-v2/10000/076acdfd-7e49-4de5-a787-de1991f5d801.root etc for primary input, /store/relval/CMSSW_12_5_0_pre4/RelValMinBias_14TeV/GEN-SIM/124X_mcRun3_2022_realistic_v10-v3/10000/15d2dd24-0756-4bfe-8d1e-821dba027836.root etc for MixingModule)

@makortel
Contributor

makortel commented Sep 6, 2022

I was able to reproduce the exception with del process.hltExoticaValidator.

I also started to wonder about one aspect of this workflow. The problematic data product is being read by a (HLT) validation module. Is the VALIDATION sequence necessary for the purposes of the workflow? Or could the goal be achieved by running as with data, i.e. with only DQM?

I noticed the example job is configured to use 8 threads and 2 streams. I vaguely recall such a setup being used for phase2 RelVals to keep the memory under control. Do Run 3 RelVals also require that much memory?

@srimanob
Contributor

srimanob commented Sep 8, 2022

  • Add workflow(s) to the runTheMatrix, to be run in IBs, that exercise the behavior of these RelVal workflows (RECO+DQM+VALIDATION reading MC DIGI-RAW files from earlier pre-release)

Put this task in #39346. There is a point to clarify first on the RAW we will use.

@Martin-Grunewald
Contributor

Hmm, we used to guarantee only pure RAW compatibility - AFAIK, rereco is done from RAW as well...

@makortel
Contributor

makortel commented Sep 8, 2022

Do I understand correctly that for production, we will not have an issue (because no DQM/Validation is running)?

That should be the case.

@missirol
Contributor

missirol commented Sep 8, 2022

Fwiw..

Make the reading of trigger::TriggerEventWithRefs optional in HLTFiltersDQMonitor (e.g. with configuration parameter) so the offending product could be dropped on input

Regarding workarounds for HLTFiltersDQMonitor, that DQM module won't access the offending collections if one uses

process.hltFiltersDQM.triggerResults.setProcessName('')

(this disables all DQM outputs of that module, so it wouldn't be that different from just doing del process.hltFiltersDQM.)
That plugin relies by construction on TriggerEventWithRefs. A flag like useTriggerEvent could be added, to produce only the outputs that don't require info from TriggerEvent* collections. That'd be better than deleting the module, but I don't think it would provide much of a solution (see below).

I reproduced what was described in #39287 (comment), but even with

del process.hltExoticaValidator
del process.hltFiltersDQM

the errors from the dqm/validation steps continue [*].

The problematic data product is being read by a (HLT) validation module. Is VALIDATION sequence necessary for the purposes of the workflow? Or could the goal be achieved by running like with data, i.e. with only DQM?

If this refers to hltFiltersDQM, I think that's a DQM (not Validation) module. The other exceptions I have seen so far do come from Validation modules, I think.

[*]

----- Begin Fatal Exception 08-Sep-2022 16:30:02 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 1
   [1] Running path 'validation_step'
   [2] Prefetching for module MuonAssociatorEDProducer/'tpToL3MuonAssociation'
   [3] While reading from source std::vector<PSimHit> g4SimHits 'MuonRPCHits' SIM
   [4] Rethrowing an exception that happened on a different read request.
   [5] Processing  Event run: 1 lumi: 1 event: 1 stream: 1
   [6] Running path 'validation_step'
   [7] Prefetching for module HLTJetMETValidation/'SingleJetMetPaths'
   [8] While reading from source trigger::TriggerEventWithRefs hltTriggerSummaryRAW '' HLT
   [9] Reading branch triggerTriggerEventWithRefs_hltTriggerSummaryRAW__HLT.
   Additional Info:
      [a] Fatal Root Error: @SUB=TStreamerInfo::BuildOld
Cannot convert trigger::TriggerRefsCollections::l1ttkmuonRefs_ from type: vector<edm::Ref<vector<l1t::TkMuon>,l1t::TkMuon,edm::refhelper::FindUsingAdvance<vector<l1t::TkMuon>,l1t::TkMuon> > > to type: vector<edm::Ref<vector<l1t::TrackerMuon>,l1t::TrackerMuon,edm::refhelper::FindUsingAdvance<vector<l1t::TrackerMuon>,l1t::TrackerMuon> > >, skip element

----- End Fatal Exception -------------------------------------------------

and then

----- Begin Fatal Exception 08-Sep-2022 17:34:10 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 1
   [1] Running path 'validation_step'
   [2] Prefetching for module MuonAssociatorEDProducer/'tpToL3MuonAssociation'
   [3] While reading from source std::vector<PSimHit> g4SimHits 'MuonRPCHits' SIM
   [4] Rethrowing an exception that happened on a different read request.
   [5] Processing  Event run: 1 lumi: 1 event: 1 stream: 1
   [6] Running path 'validation_step'
   [7] Prefetching for module HeavyFlavorValidation/'hfupsilon'
   [8] While reading from source trigger::TriggerEventWithRefs hltTriggerSummaryRAW '' HLT
   [9] Reading branch triggerTriggerEventWithRefs_hltTriggerSummaryRAW__HLT.
   Additional Info:
      [a] Fatal Root Error: @SUB=TStreamerInfo::BuildOld
Cannot convert trigger::TriggerRefsCollections::l1ttkmuonRefs_ from type: vector<edm::Ref<vector<l1t::TkMuon>,l1t::TkMuon,edm::refhelper::FindUsingAdvance<vector<l1t::TkMuon>,l1t::TkMuon> > > to type: vector<edm::Ref<vector<l1t::TrackerMuon>,l1t::TrackerMuon,edm::refhelper::FindUsingAdvance<vector<l1t::TrackerMuon>,l1t::TrackerMuon> > >, skip element

----- End Fatal Exception -------------------------------------------------

@makortel
Contributor

makortel commented Sep 8, 2022

The problematic data product is being read by a (HLT) validation module. Is VALIDATION sequence necessary for the purposes of the workflow? Or could the goal be achieved by running like with data, i.e. with only DQM?

If this refers to hltFiltersDQM, I think that's a DQM (not Validation) module. The other exceptions I have seen so far do come from Validation modules, I think.

Thanks for the correction. I suppose the data RelVals use the "pure RAW" rather than an output of re-HLT(?) since failures have not been seen there (or have they?).

I see HLTFiltersDQMonitor can already handle the case of missing trigger::TriggerEventWithRefs, so in principle one could drop the product on input. On a quick test this

process.source.inputCommands = cms.untracked.vstring(
    "keep *",
    "drop triggerTriggerEventWithRefs_*_*_*"
)

seems to be sufficient to get the step2 job to run (so apparently all modules in this job that consume the product are able to handle it being absent, including hltExoticaValidator). This workaround could, in principle, be applied already for the RECO-only jobs in 12_5_0_pre5 RelVals (since it is configuration-only).
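
To make explicit which branches the two commands above select, here is a simplified model in plain Python. It assumes the commands are applied in order with the last matching one winning, and it uses `fnmatch` as a stand-in for the framework's own wildcard matching. The helper `is_kept` is hypothetical and not part of CMSSW. The first branch name is taken from the exception message; the second is written by analogy with it and is an assumption:

```python
from fnmatch import fnmatchcase
from typing import Iterable


def is_kept(branch: str, commands: Iterable[str]) -> bool:
    """Apply 'keep'/'drop' commands in order; the last matching command wins."""
    kept = False
    for command in commands:
        action, pattern = command.split()
        if fnmatchcase(branch, pattern):
            kept = action == "keep"
    return kept


commands = ["keep *", "drop triggerTriggerEventWithRefs_*_*_*"]

branches = [
    # Branch named in the FileReadError above (without the trailing dot).
    "triggerTriggerEventWithRefs_hltTriggerSummaryRAW__HLT",
    # Assumed name for the g4SimHits 'MuonRPCHits' product, for illustration only.
    "PSimHits_g4SimHits_MuonRPCHits_SIM",
]

for branch in branches:
    print(branch, "->", "kept" if is_kept(branch, commands) else "dropped")
```

In this model only the TriggerEventWithRefs branch is dropped, which is why the unreadable member is never deserialized.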

@missirol
Contributor

missirol commented Sep 8, 2022

Thanks for finding the workaround!

I suppose the data RelVals use the "pure RAW" rather than an output of re-HLT(?) since failures have not been seen there (or have they?).

I think so, but I don't know for sure (and I'm not aware of other failures). Below is some info, but please correct me if needed.

  • A look inside relval_steps suggests RAW and RAW-RECO data sets are used. Those don't include TriggerEventWithRefs, iiuc.

  • The only mentions of HLTDEBUG in relval_steps refer to MC samples, and led me to wfs defined in relval_identity, but I'm not sure those wfs are used (I can't find them in the IB webpage).

  • Maybe it's worth mentioning that in many wfs the DQM and Validation sequences for HLT are 'fake' (when a fake HLT menu is used); I guess it's possible that, in such wfs, one would not see the issue if no modules try to consume TriggerEventWithRefs (it's speculation, I don't have a specific case in mind).

@davidlange6
Contributor

davidlange6 commented Oct 11, 2022 via email

@missirol
Contributor

missirol commented Dec 2, 2022

This issue is marked as 'urgent', but seemingly dormant.

Workarounds were provided, and the last part of #39287 (comment) outlined a few possible action items.

What are the next steps? (and for whom)

PS.

I suppose the data RelVals use the "pure RAW" rather than an output of re-HLT(?) since failures have not been seen there (or have they?).

I didn't really catch this comment back then, but I think this was fixed in #39834.

@makortel
Contributor

makortel commented Dec 2, 2022

What are the next steps? (and for whom)

I'd imagine the need to address these specific RelVal workflows is long gone. I wonder if we should improve testing to be able to catch these situations earlier (which would, more or less, mean a representative workflow in runTheMatrix)?

@missirol
Contributor

missirol commented Dec 2, 2022

I would say 'yes', but I guess this is a question to @cms-sw/pdmv-l2 , since they are the ones who opened the issue, and likely the ones who would implement such a test.

@kskovpen
Contributor

kskovpen commented Dec 2, 2022

Hi @makortel and @missirol, we regularly produce the reco-only samples (i.e. when the input dataset comes from the previous release), and every time there are such issues, we try to report them. That said, we do test for such issues in the actual production of samples.

@makortel
Contributor

makortel commented Dec 2, 2022

@kskovpen Do you find the current mode of operation, i.e. discovering data format incompatibilities in these RelVals (that either can be worked around or not), sufficient? Or would you like to catch them sooner?

@kskovpen
Contributor

kskovpen commented Dec 3, 2022

@makortel I've been trying to implement it, but it looks quite messy in the upgrade relval implementation if we want to make it dynamic and grab the latest pre-release production. Also, the datasets have to be produced first before they can enter wfs in IB tests, and there is always a time delay in the relval production. I would say let's follow the usual way.

@missirol
Contributor

#40288 adds a test that should catch non-backward-compatible changes to the TriggerEventWithRefs data format. This is not a general solution to this issue, but it should at least show when that data format is changed in non-backward-compatible ways; at that point, one can decide whether or not to fix that, before a new (pre)release is built.

@missirol
Contributor

+hlt

Going back to #39287 (comment)

Make the reading of trigger::TriggerEventWithRefs optional in HLTFiltersDQMonitor (e.g. with configuration parameter) so the offending product could be dropped on input

#39287 (comment) clarified that this is already the case, and gives a possible workaround for this issue.

Separate the Refs to Phase2 objects (or the subset that has risk on further evolution) from trigger::TriggerEventWithRefs into another event data product (IIRC this was also mentioned in ORP)

This is less easy, and wasn't attempted for now. It might have to be reconsidered if this issue continues to appear.

For now, HLT added a simple test to catch non-backward-compatible changes to TriggerEventWithRefs (#40288).

@cecilecaillol
Contributor

+l1

@kskovpen
Contributor

#40288 adds a test that should catch non-backward-compatible changes to the TriggerEventWithRefs data format. This is not a general solution to this issue, but it should at least show when that data format is changed in non-backward-compatible ways; at that point, one can decide whether or not to fix that, before a new (pre)release is built.

Thanks a lot!

@AdrianoDee
Contributor

+upgrade
(and thanks Marino)

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@cmsbuild, please close
