Fatal Exception in Express processing with CMSSW_13_0_7 #41843

Closed
francescobrivio opened this issue Jun 1, 2023 · 23 comments

Comments

@francescobrivio
Contributor

francescobrivio commented Jun 1, 2023

There are some reported crashes in the Express processing of StreamHLTMonitor for run 368318 (see CMSTalk post).

The exception is:

An exception of category 'InvalidReference' occurred while
   [0] Processing  Event run: 368318 lumi: 159 event: 237000809 stream: 0
   [1] Running path 'dqmoffline_step'
   [2] Calling method for module ParticleNetJetTagMonitor/'particleNetAK8HbbTagMonitoring'
Exception Message:
attempting to get view from an unavailable RefToBaseProd.

Recipe to reproduce the error:

cmsrel CMSSW_13_0_7
cd CMSSW_13_0_7/src
cmsenv
cp /eos/cms/store/logs/prod/recent/Express/Express_Run368318_StreamHLTMonitor/Express/vocms0314.cern.ch-441154-0-log.tar.gz .
tar -zxvf vocms0314.cern.ch-441154-0-log.tar.gz 
cd job/WMTaskSpace/cmsRun1

edit the PSet.py to be:

import FWCore.ParameterSet.Config as cms
import pickle

# Load the pickled process that the job ran
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

# Run with a single thread and a single stream
process.options.numberOfThreads = cms.untracked.uint32(1)
process.options.numberOfStreams = cms.untracked.uint32(1)
# Skip the first 169 events of the input
process.source.skipEvents = cms.untracked.uint32(169)

run it with:

cmsRun PSet.py

@cmsbuild
Contributor

cmsbuild commented Jun 1, 2023

A new Issue was created by @francescobrivio .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@francescobrivio
Contributor Author

assign reconstruction,btv-pog

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 1, 2023

New categories assigned: btv-pog,reconstruction

@mandrenguyen,@clacaputo,@soureek,@johnalison you have been requested to review this Pull request/Issue and eventually sign? Thanks

@francescobrivio
Contributor Author

A possible PR which entered in 13_0_7 and touched PNet DQM is #41704

@francescobrivio
Contributor Author

assign dqm

@cmsbuild
Contributor

cmsbuild commented Jun 1, 2023

New categories assigned: dqm

@tjavaid,@micsucmed,@nothingface0,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@francescobrivio
Contributor Author

urgent

  • the number of Tier0 jobs failing because of this keeps growing

cmsbuild added the urgent label Jun 1, 2023

@francescobrivio
Contributor Author

> A possible PR which entered in 13_0_7 and touched PNet DQM is #41704

Let me also explicitly tag the people involved in this: @marinakolosova @rgerosa @scooperstein

@francescobrivio
Contributor Author

assign hlt

@cmsbuild
Contributor

cmsbuild commented Jun 1, 2023

New categories assigned: hlt

@missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

makortel commented Jun 1, 2023

The exception gets thrown from

for (const auto& jtag : *jetPNETScoreHLTHandle) {

The dereferenced edm::RefToBaseProd points to a data product with ProductID 1:2641. Process 1 is the HLT process, and product 2641 does not exist in the input streamer file.

The module particleNetAK8HbbTagMonitoring has been configured with

requireHLTOfflineJetMatching = cms.bool(True),
jetPNETScoreHLT = cms.InputTag("hltParticleNetDiscriminatorsJetTagsAK8","HbbVsQCD"),

(among others)
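To make the failure mode above concrete, here is a minimal self-contained sketch. It does not use the real framework classes: MockProductID, MockEvent and MockRefProd are hypothetical stand-ins, and the product index 10 is an invented example of a product that was stored. The point is only that a reference carries an ID, and dereferencing it throws when the event does not hold a product with that ID.

```cpp
#include <map>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical model of a product identifier: (process index, product index),
// written as "1:2641" in the diagnosis above. Not the real edm::ProductID.
using MockProductID = std::pair<unsigned, unsigned>;

// Hypothetical model of an event: only the products that were actually
// written to the input file are present in the map.
struct MockEvent {
  std::map<MockProductID, std::vector<float>> products;
};

// Hypothetical model of a reference to a product. It stores only the ID,
// so it can outlive the product it points to.
struct MockRefProd {
  MockProductID id;

  // Resolving the reference fails if the product was never stored.
  const std::vector<float>& get(const MockEvent& event) const {
    auto it = event.products.find(id);
    if (it == event.products.end()) {
      throw std::runtime_error("referenced product is not in the event");
    }
    return it->second;
  }
};
```

In this model, a reference with ID {1, 2641} against an event that only stores {1, 10} throws on get(), which mirrors the unguarded dereference in the monitoring module.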

@francescobrivio
Contributor Author

It might be a misconfiguration of the eventContent of the HLTMonitor stream (not yet confirmed).
@trtomei is testing it now on Hilton.

@perrotta
Contributor

perrotta commented Jun 1, 2023

> A possible PR which entered in 13_0_7 and touched PNet DQM is #41704

@marinakolosova

@missirol
Contributor

missirol commented Jun 1, 2023

I think the EventContent of the HLTMonitor stream was incorrect, and it has been fixed in CMSHLT-2825. I think we only saw the crash after deploying 13_0_7 at T0 (rather than right after deploying the HLT menu with the incorrect EventContent) because #41704 exposed this mistake.

That said, I haven't verified that this fix (which is correct anyway) will solve this issue. If it does, the problem should no longer occur starting from run 368321.

@francescobrivio
Contributor Author

Indeed, starting from run 368321 we no longer see failed jobs.
I'll keep monitoring the processing and if all goes fine I'll close the issue.

@francescobrivio
Contributor Author

> Indeed, starting from run 368321 we no longer see failed jobs.
> I'll keep monitoring the processing and if all goes fine I'll close the issue.

OK, I haven't seen any other failure of this kind after run 368321 (when the new menu was deployed).

Before closing this issue, is there something we could do to:

  • avoid crashing the whole job? (some protection/check before the dereferencing?)
  • add a test for such cases?

@makortel
Contributor

makortel commented Jun 2, 2023

>   • avoid crashing the whole job? (some protection/check before the dereferencing?)

In principle, adding something along the lines of the following

if (not jetPNETScoreHLTHandle->keyProduct().isAvailable()) {
  edm::LogWarning("ParticleNetJetTagMonitor")
      << "Collection used as a key by HLT Jet tags collection is not available, will skip event";
  return;
}

before the loop

for (const auto& jtag : *jetPNETScoreHLTHandle) {

would be the appropriate check for the availability of the key collection. Unfortunately that code does not work today, because edm::RefToBaseProd (the type returned by keyProduct() above) does not have an isAvailable() member function. It looks straightforward to add, though, so I will do that.
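The guard-before-loop pattern described above can be sketched in a self-contained form. Everything here is a hypothetical stand-in (MockKeyProd, MockJetTags, analyzeTags), not the framework API: the mock only assumes that the tag collection exposes a key product with an isAvailable() query, and that a dropped key collection should lead to a logged message and a skipped event rather than an exception.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a reference to the key (jet) collection.
// An empty optional models a collection that was dropped from the input.
struct MockKeyProd {
  std::optional<std::vector<std::string>> jets;
  bool isAvailable() const { return jets.has_value(); }
};

// Hypothetical stand-in for a jet-tag collection keyed by the jets above.
struct MockJetTags {
  MockKeyProd key;
  std::vector<float> scores;
  const MockKeyProd& keyProduct() const { return key; }
};

// Returns the number of (jet, score) pairs visited, or std::nullopt when the
// event is skipped because the key collection is not available.
std::optional<std::size_t> analyzeTags(const MockJetTags& tags,
                                       std::vector<std::string>& log) {
  // Guard: check availability before anything dereferences the key product.
  if (!tags.keyProduct().isAvailable()) {
    log.push_back("key collection not available, skipping event");
    return std::nullopt;
  }

  // Safe to dereference from here on.
  const std::vector<std::string>& jets = *tags.keyProduct().jets;
  std::size_t visited = 0;
  for (std::size_t i = 0; i < tags.scores.size() && i < jets.size(); ++i) {
    ++visited;
  }
  return visited;
}
```

Returning an optional rather than throwing keeps the "skip this event" decision visible to the caller, which is roughly what the early return does in the analyzer.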

@makortel
Contributor

makortel commented Jun 2, 2023

>   • add a test for such cases?

The framework provides a TestProcessor facility that enables fairly easy testing of modules in isolation e.g. for various corner cases such as "Run had no lumis", "Lumi had no events", "Event does not contain the necessary data products". It is documented in https://github.com/cms-sw/cmssw/blob/master/FWCore/TestProcessor/Readme.md, and many examples can be found with "git grep TestProcessor". A notable example is in the skeleton templates https://github.com/cms-sw/cmssw/blob/master/FWCore/Skeletons/mkTemplates/EDProducer/test_catch2_EDProducer.cc

@makortel
Contributor

makortel commented Jun 2, 2023

> Unfortunately that code does not work today, because edm::RefToBaseProd (the type returned by keyProduct() above) does not have an isAvailable() member function. It looks straightforward to add, though, so I will do that.

The isAvailable() member function is added to edm::RefToBaseProd in #41858.

@mmusich
Contributor

mmusich commented Jun 12, 2023

> The isAvailable() member function is added to edm::RefToBaseProd in #41858.

Given the thread in #41858 (comment), it's not clear to me whether the recommendation is to NOT protect the DQM code with this check because of disk-reading costs.

@makortel
Contributor

> > The isAvailable() member function is added to edm::RefToBaseProd in #41858.
>
> Given the thread in #41858 (comment), it's not clear to me whether the recommendation is to NOT protect the DQM code with this check because of disk-reading costs.

The DQM code de-references the corresponding RefToBase's in

float dR = reco::deltaR(selectedJets[jetPNETScoreSortedIndices.at(jreco)].p4(),
                        jetHLTRefs.at(jetPNETScoreSortedIndicesHLT.at(jhlt))->p4());

so the data would be read from storage in any case. The cost of an isAvailable() check there should therefore be negligible.
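The cost argument can be illustrated with a toy model. This is only a sketch under explicit assumptions: MockLazyProduct is hypothetical, and it assumes that the first access (whether an availability check or a dereference) triggers the read and that the result is cached afterwards. It says nothing about how the framework actually implements delayed reading.

```cpp
#include <optional>
#include <vector>

// Hypothetical lazily-read product. ASSUMPTION: the first access performs the
// storage read and later accesses reuse the cached result.
class MockLazyProduct {
public:
  explicit MockLazyProduct(bool onDisk) : onDisk_(onDisk) {}

  // Availability check: may trigger the (single) read attempt.
  bool isAvailable() {
    load();
    return cache_.has_value();
  }

  // Dereference: also triggers the read attempt if it has not happened yet.
  // Throws std::bad_optional_access if the product is not on disk.
  const std::vector<float>& get() {
    load();
    return cache_.value();
  }

  // Number of storage reads attempted so far.
  int reads() const { return reads_; }

private:
  void load() {
    if (attempted_) return;  // already read (or found missing) once
    attempted_ = true;
    ++reads_;
    if (onDisk_) cache_ = std::vector<float>{1.0f, 2.0f};
  }

  bool onDisk_;
  bool attempted_ = false;
  int reads_ = 0;
  std::optional<std::vector<float>> cache_;
};
```

Under these assumptions, calling isAvailable() and then get() performs exactly one read, the same as calling get() alone, so the check adds no extra storage access when the code dereferences the product anyway.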

@francescobrivio
Contributor Author

I'll close this given it has been addressed in #41858 + #41930. Thanks everyone!

@missirol
Contributor

+hlt

The EventContent of the HLTMonitor stream was fixed in CMSHLT-2825.
