Abort signal in PromptReco for JetMET in run 361468 #40032

Closed
mpresill opened this issue Nov 9, 2022 · 20 comments

@mpresill

mpresill commented Nov 9, 2022

Hello,

We have a paused Reco job in run 361468. The problem seems to be in the module BoostedDoubleSVProducer.
The CMS Talk post from T0 shows some relevant lines from the job’s log.

Stack trace here:

Thread 1 (Thread 0x2b17ead50940 (LWP 570) "cmsRun"):
#0  0x00002b17ea01eae1 in poll () from /lib64/libc.so.6
#1  0x00002b17efb4072f in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2  0x00002b17efb410bc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  0x00002b17efb43a0b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002b17e9f48a4f in raise () from /lib64/libc.so.6
#6  0x00002b17e9f1bdb5 in abort () from /lib64/libc.so.6
#7  0x00002b17e9f1bc89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#8  0x00002b17e9f413a6 in __assert_fail () from /lib64/libc.so.6
#9  0x00002b1845109cb2 in fastjet::contrib::MeasureDefinition::get_partition(std::vector<fastjet::PseudoJet, std::allocator<fastjet::PseudoJet> > const&, std::vector<fastjet::PseudoJet, std::allocator<fastjet::PseudoJet> > const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/external/el8_amd64_gcc10/lib/libfastjetcontribfragile.so
#10 0x00002b18450ffc48 in fastjet::contrib::Njettiness::getTauComponents(unsigned int, std::vector<fastjet::PseudoJet, std::allocator<fastjet::PseudoJet> > const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/external/el8_amd64_gcc10/lib/libfastjetcontribfragile.so
#11 0x00002b189b4a25b1 in BoostedDoubleSVProducer::calcNsubjettiness(edm::RefToBase<reco::Jet> const&, float&, float&, std::vector<fastjet::PseudoJet, std::allocator<fastjet::PseudoJet> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/pluginRecoBTagSecondaryVertexProducer.so
#12 0x00002b189b4a3785 in BoostedDoubleSVProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/pluginRecoBTagSecondaryVertexProducer.so
#13 0x00002b17e78e86f3 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/libFWCoreFramework.so
#14 0x00002b17e78cdc2f in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/libFWCoreFramework.so
#15 0x00002b17e7825e55 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_10/lib/el8_amd64_gcc10/libFWCoreFramework.so
...
Current Modules:

Module: BoostedDoubleSVProducer:pfBoostedDoubleSVAK8TagInfosSlimmedAK8DeepTags (crashed)
Module: MuonIdProducer:muons1stStep
Module: AlCaHcalIsotrkProducer:alcaHcalIsotrkProducer
Module: AlCaHcalIsotrkProducer:alcaHcalIsotrkProducer
Module: PoolOutputModule:write_AOD
Module: GsfElectronProducer:lowPtGsfElectronsPreRegression
Module: SiStripRecHitConverter:siStripMatchedRecHits
Module: LowPtGsfElectronSeedProducer:lowPtGsfElectronSeeds

A fatal system signal has occurred: abort signal
Complete

Probably linked to https://github.com/cms-externals/fastjet-contrib/blob/283910e44f2c3c81133fc68c8f4942b9c53da6e3/Nsubjettiness/Njettiness.cc#L179-L204

Can experts please take a look at this error?

@cmsbuild
Contributor

cmsbuild commented Nov 9, 2022

A new Issue was created by @mpresill .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@francescobrivio
Contributor

assign btv-pog,reconstruction

@cmsbuild
Contributor

cmsbuild commented Nov 9, 2022

New categories assigned: reconstruction,btv-pog

@mandrenguyen,@clacaputo,@soureek,@johnalison you have been requested to review this Pull request/Issue and eventually sign? Thanks

@francescobrivio
Contributor

Minimal recipe to reproduce the crash:

cmsrel CMSSW_12_4_10_patch3
cd CMSSW_12_4_10_patch3/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022F/SegFault/job/WMTaskSpace/cmsRun1/PSet.p* .

Edit PSet.py to be:

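import FWCore.ParameterSet.Config as cms
import pickle

# Load the pickled process configuration copied from the paused job.
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

# Run with a single thread and a single stream.
process.options.numberOfThreads = cms.untracked.uint32(1)
process.options.numberOfStreams = cms.untracked.uint32(1)
# Skip the first 4320 events of the input.
process.source.skipEvents = cms.untracked.uint32(4320)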

Run:

cmsRun PSet.py

@Dr15Jones
Contributor

Dr15Jones commented Nov 9, 2022

The actual assert message appears just before the stack trace in the log and is

cmsRun: Nsubjettiness/MeasureDefinition.cc:168: fastjet::contrib::TauPartition fastjet::contrib::MeasureDefinition::get_partition(const std::vector<fastjet::PseudoJet>&, const std::vector<fastjet::PseudoJet>&) const: Assertion `has_beam()' failed.

which seems to correspond to

https://github.com/cms-externals/fastjet-contrib/blob/03f2fb3c7e26248f5cab3d6c52fab3e112342113/Nsubjettiness/MeasureDefinition.cc#L167-L170

which seems to be called from here (at least based on what version of fast jet is used in CMSSW_12_6)

https://github.com/cms-externals/fastjet-contrib/blob/03f2fb3c7e26248f5cab3d6c52fab3e112342113/Nsubjettiness/Njettiness.cc#L94-L95
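Since the assertion message sits just before the stack trace, it can be easy to miss in a long job log. A minimal Python sketch for pulling such lines out of a log is shown here. It assumes the usual glibc layout `program: file:line: function: Assertion `expr' failed.` on a single line; `find_assertions` and `ASSERT_RE` are hypothetical names for illustration, not part of any CMSSW tool.

```python
import re

# Assumed glibc layout of a failed assertion, all on one line:
#   program: file:line: function: Assertion `expr' failed.
ASSERT_RE = re.compile(
    r"^(?P<prog>[^:\s]+): (?P<file>[^:]+):(?P<line>\d+): "
    r"(?P<func>.*): Assertion `(?P<expr>.*)' failed\.$"
)


def find_assertions(log_text):
    """Return (file, line, expression) for each failed-assertion line found."""
    hits = []
    for raw in log_text.splitlines():
        match = ASSERT_RE.match(raw.strip())
        if match:
            hits.append(
                (match.group("file"), int(match.group("line")), match.group("expr"))
            )
    return hits


# Sample text built from the message quoted in this thread.
sample = (
    "pt: 0 eta: 100000 phi: 0\n"
    "cmsRun: Nsubjettiness/MeasureDefinition.cc:168: fastjet::contrib::TauPartition "
    "fastjet::contrib::MeasureDefinition::get_partition("
    "const std::vector<fastjet::PseudoJet>&, "
    "const std::vector<fastjet::PseudoJet>&) const: "
    "Assertion `has_beam()' failed.\n"
)
print(find_assertions(sample))
```

If the log viewer wraps long lines, the message would need to be rejoined before matching.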

@francescobrivio
Contributor

Thanks Chris!
Indeed, @mmusich suggested (privately) that this looks similar to an old (2015!) issue, #12680 (comment), which, if I understand the old PR correctly, you fixed in the externals :)

@soureek

soureek commented Nov 10, 2022

@francescobrivio As mentioned in the thread it corresponds to Nsubjettiness definitions. These are typically managed by @cms-sw/jetmet-pog-l2 . We only use the existing definitions in a given release.

@francescobrivio
Contributor

@cmsbuild
Contributor

New categories assigned: jetmet-pog

@alkaloge,@kirschen,@miquork you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mpresill
Author

@alkaloge @kirschen @miquork, any feedback from your side? Thanks.
(cc the next ORM @ebrondol )

@laurenhay
Contributor

laurenhay commented Nov 15, 2022

@mpresill hello!
I looked into this with the help of @rappoccio and we found that this fails when fjParticles has unrealistic kinematic values.
For example, when printing the pt, eta, and phi of the particles just before they are passed to njettiness.getTau() at line 720, we find:

pt: 0.582031 eta: 0.189154 phi: 0.889279
pt: 0.307129 eta: 0.185125 phi: 1.37514
pt: 0 eta: 100000 phi: 0
cmsRun: Nsubjettiness/MeasureDefinition.cc:168: fastjet::contrib::TauPartition fastjet::contrib::MeasureDefinition::get_partition(const std::vector<fastjet::PseudoJet>&, const std::vector<fastjet::PseudoJet>&) const: Assertion `has_beam()' failed.

We suggest protecting against these inputs in the b-tagging code, as in #40081.
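To illustrate the kind of input protection being suggested, here is a minimal, self-contained Python sketch. It is not the code from the linked PR: `Particle`, `is_physical`, `select_physical`, and both thresholds are hypothetical, chosen only so that the `pt: 0 eta: 100000` entry from the printout is rejected.

```python
import math
from dataclasses import dataclass

# Hypothetical thresholds, for illustration only.
MIN_PT = 1e-6        # rejects the pt = 0 entry seen in the printout
MAX_ABS_ETA = 10.0   # assumed bound, far below the eta = 100000 entry


@dataclass
class Particle:
    pt: float
    eta: float
    phi: float


def is_physical(p):
    """Return True if all values are finite and within the thresholds."""
    if not all(math.isfinite(v) for v in (p.pt, p.eta, p.phi)):
        return False
    return p.pt > MIN_PT and abs(p.eta) < MAX_ABS_ETA


def select_physical(particles):
    """Keep only particles that pass the kinematic sanity check."""
    return [p for p in particles if is_physical(p)]


# The three entries printed above, in the same order.
particles = [
    Particle(0.582031, 0.189154, 0.889279),
    Particle(0.307129, 0.185125, 1.37514),
    Particle(0.0, 100000.0, 0.0),  # the entry that precedes the assertion
]
good = select_physical(particles)
print(len(good), "of", len(particles), "particles kept")
```

Filtering before the N-subjettiness call keeps the check local to the producer; it does not explain where the unphysical candidate comes from.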

@mandrenguyen
Contributor

Thank you @laurenhay !
@kdlong @juska Do you know how a PF candidate with beam kinematics arises?

@francescobrivio
Contributor

I tested the solution proposed in #40088 using the recipe from #40032 (comment) and the job isn't crashing anymore.

@laurenhay
Contributor

Are we okay to close this issue?

@alkaloge
Contributor

+1

@mandrenguyen
Contributor

@laurenhay Well, if I understood correctly there are PF candidates being created with essentially beam kinematics, and passed to FastJet. We ought to understand the root of this problem. I believe we can close this issue as solved and open a new one to be assigned to PF.

@mandrenguyen
Contributor

+1

@mandrenguyen
Contributor

unassign btv-pog

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@mandrenguyen
Contributor

please close

Resolved in #40088
