Segmentation violation in PromptReco for FastjetJetProducer:ak4PFJets #41397

Closed
malbouis opened this issue Apr 24, 2023 · 40 comments
Comments

@malbouis
Contributor

malbouis commented Apr 24, 2023

There is one job failing Reco for Run 366451, dataset ParkingDoubleElectronLowMass, with a segmentation violation, as described in https://cms-talk.web.cern.ch/t/segmentation-error-in-promptreco-for-run-366451-dataset-parkingdoubleelectronlowmass/23152

The crash seems to be from module FastjetJetProducer:

%MSG-w TrackProducerBase:  TrackRefitter:hltTrackRefitterForSiStripMonitorTrack  24-Apr-2023 18:58:38 CEST Run: 366451 Event: 418574346
 BeamSpot is not valid
%MSG
%MSG-e TrackRefitter:  TrackRefitter:hltTrackRefitterForSiStripMonitorTrack  24-Apr-2023 18:58:38 CEST Run: 366451 Event: 418574346
 BeamSpot is (0,0,0), it is probably because is not valid in the event
%MSG

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

...

Current Modules:

Module: FastjetJetProducer:ak4PFJets (crashed)
Module: MultiHitFromChi2EDProducer:pixelLessStepHitTriplets
Module: PFClusterProducer:particleFlowClusterHBHE
Module: RecHitTask:recHitTask
Module: TrackProducer:mixedTripletStepTracks
Module: MuonIdProducer:muons1stStep
Module: TrackProducer:initialStepTracks
Module: CAHitQuadrupletEDProducer:detachedQuadStepHitQuadruplets

A fatal system signal has occurred: segmentation violation

The full log is at /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2023B/job_248341/job/WMTaskSpace/cmsRun1 as described in the original email.

I was able to reproduce the failure locally.
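
For anyone triaging similar paused jobs, the "Current Modules" block can be scanned mechanically. Below is a minimal sketch in plain Python, assuming the log keeps the `Module: <type>:<label>` layout shown above; the helper names `parse_current_modules` and `crashed_modules` are made up for illustration and are not part of CMSSW.

```python
import re

# Matches lines such as "Module: FastjetJetProducer:ak4PFJets (crashed)".
# Assumption: module type and label contain only word characters.
MODULE_RE = re.compile(r"^Module:\s+(\w+):(\w+)(\s+\(crashed\))?\s*$")


def parse_current_modules(log_text):
    """Return a list of (module_type, label, crashed) tuples found in the log."""
    modules = []
    for line in log_text.splitlines():
        match = MODULE_RE.match(line.strip())
        if match:
            crashed = match.group(3) is not None
            modules.append((match.group(1), match.group(2), crashed))
    return modules


def crashed_modules(log_text):
    """Return "type:label" strings for modules flagged as (crashed)."""
    return [
        f"{mod_type}:{label}"
        for mod_type, label, crashed in parse_current_modules(log_text)
        if crashed
    ]


# Excerpt of the module list quoted in this issue.
SAMPLE = """\
Current Modules:

Module: FastjetJetProducer:ak4PFJets (crashed)
Module: MultiHitFromChi2EDProducer:pixelLessStepHitTriplets
Module: TrackProducer:initialStepTracks
"""

print(crashed_modules(SAMPLE))
```

The other modules in the list were only running concurrently when the signal arrived, so the sketch reports just the entry carrying the `(crashed)` flag.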

@malbouis changed the title from "Segmentation error in PromptReco" to "Segmentation violation in PromptReco" on Apr 24, 2023
@cmsbuild
Contributor

A new Issue was created by @malbouis .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@malbouis
Contributor Author

assign reconstruction

@makortel
Contributor

Full stack trace from the log

Thread 9 (Thread 0x2b854cc00700 (LWP 656) "cmsRun"):
#3  0x00002b8503ef333b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  fastjet::LazyTiling9::_tj_set_jetinfo (_jets_index=1409, jet=0x2b888c4f9688, this=0x2b854cbf8cd0) at LazyTiling9.cc:258
#6  fastjet::LazyTiling9::run (this=this@entry=0x2b854cbf8cd0) at LazyTiling9.cc:509
#7  0x00002b854bf863e4 in fastjet::ClusterSequence::_initialise_and_run_no_decant (this=0x2b854cbf9050) at ClusterSequence.cc:412
#8  0x00002b854bf09d9c in fastjet::ClusterSequenceActiveAreaExplicitGhosts::_initialise<fastjet::PseudoJet> (this=0x2b854cbf9050, pseudojets=..., jet_def_in=..., ghost_spec=<optimized out>, ghosts=<optimized out>, ghost_area=<optimized out>, writeout_combinations=@0x2b854cbf8fbf: false) at ./../include/fastjet/ClusterSequenceActiveAreaExplicitGhosts.hh:224
#9  0x00002b854bfb1e72 in fastjet::ClusterSequenceActiveAreaExplicitGhosts::ClusterSequenceActiveAreaExplicitGhosts<fastjet::PseudoJet> (writeout_combinations=@0x2b854cbf8fbf: false, ghost_spec=..., jet_def_in=..., pseudojets=..., this=0x2b854cbf9050) at ./../include/fastjet/ClusterSequenceActiveAreaExplicitGhosts.hh:69
#10 fastjet::ClusterSequenceActiveArea::_run_AA (this=0x2b888aea0800, ghost_spec=...) at ClusterSequenceActiveArea.cc:133
#11 0x00002b854bfb215b in fastjet::ClusterSequenceActiveArea::_initialise_and_run_AA (this=0x2b888aea0800, jet_def_in=..., ghost_spec=..., writeout_combinations=<optimized out>) at ClusterSequenceActiveArea.cc:61
#12 0x00002b85758bde3c in void fastjet::ClusterSequenceArea::initialize_and_run_cswa<fastjet::PseudoJet>(std::vector<fastjet::PseudoJet, std::allocator<fastjet::PseudoJet> > const&, fastjet::JetDefinition const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoJetsJetProducers_plugins.so
#13 0x00002b85758bea51 in fastjet::ClusterSequenceArea::ClusterSequenceArea<fastjet::PseudoJet>(std::vector<fastjet::PseudoJet, std::allocator<fastjet::PseudoJet> > const&, fastjet::JetDefinition const&, fastjet::AreaDefinition const&) [clone .lto_priv.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoJetsJetProducers_plugins.so
#14 0x00002b85758d82f6 in FastjetJetProducer::runAlgorithm(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoJetsJetProducers_plugins.so
#15 0x00002b8575916a16 in VirtualJetProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoJetsJetProducers_plugins.so
#16 0x00002b85758d32dd in FastjetJetProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoJetsJetProducers_plugins.so
#17 0x00002b84fb5fa95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so
#18 0x00002b84fb5e1072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so


Thread 8 (Thread 0x2b854ba00700 (LWP 655) "cmsRun"):
#2  0x00002b8503eefed0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b852842377f in HelixForwardPlaneCrossing::position(double) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackingToolsGeomPropagators.so
#5  0x00002b852ea2120e in CompositeTECWedge::groupedCompatibleDetsV(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&, std::vector<DetGroup, std::allocator<DetGroup> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libRecoTrackerTkDetLayers.so
#6  0x00002b852ea216cb in CompatibleDetToGroupAdder::add(GeometricSearchDet const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&, std::vector<DetGroup, std::allocator<DetGroup> >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libRecoTrackerTkDetLayers.so
#7  0x00002b852ea224d2 in CompositeTECPetal::groupedCompatibleDetsV(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&, std::vector<DetGroup, std::allocator<DetGroup> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libRecoTrackerTkDetLayers.so
#8  0x00002b852ea216cb in CompatibleDetToGroupAdder::add(GeometricSearchDet const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&, std::vector<DetGroup, std::allocator<DetGroup> >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libRecoTrackerTkDetLayers.so
#9  0x00002b852ea2b3fa in TECLayer::groupedCompatibleDetsV(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&, std::vector<DetGroup, std::allocator<DetGroup> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libRecoTrackerTkDetLayers.so
#10 0x00002b850a0302fb in GeometricSearchDet::compatibleDetsV(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&, std::vector<std::pair<GeomDet const*, TrajectoryStateOnSurface>, std::allocator<std::pair<GeomDet const*, TrajectoryStateOnSurface> > >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackingToolsDetLayers.so
#11 0x00002b850a02f6d4 in GeometricSearchDet::compatibleDets(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackingToolsDetLayers.so
#12 0x00002b856a834b1f in TrackProducerBase<reco::Track>::setSecondHitPattern(Trajectory*, reco::Track&, Propagator const*, MeasurementTrackerEvent const*, TrackerTopology const*)::{lambda(std::vector<DetLayer const*, std::allocator<DetLayer const*> > const&, TrajectoryStateOnSurface const&)#1}::operator()(std::vector<DetLayer const*, std::allocator<DetLayer const*> > const&, TrajectoryStateOnSurface const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libRecoTrackerTrackProducer.so
#13 0x00002b856a835060 in TrackProducerBase<reco::Track>::setSecondHitPattern(Trajectory*, reco::Track&, Propagator const*, MeasurementTrackerEvent const*, TrackerTopology const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libRecoTrackerTrackProducer.so
#14 0x00002b856a837e85 in KfTrackProducerBase::putInEvt(edm::Event&, Propagator const*, MeasurementTracker const*, std::unique_ptr<edm::OwnVector<TrackingRecHit, edm::ClonePolicy<TrackingRecHit> >, std::default_delete<edm::OwnVector<TrackingRecHit, edm::ClonePolicy<TrackingRecHit> > > >&, std::unique_ptr<std::vector<reco::Track, std::allocator<reco::Track> >, std::default_delete<std::vector<reco::Track, std::allocator<reco::Track> > > >&, std::unique_ptr<std::vector<reco::TrackExtra, std::allocator<reco::TrackExtra> >, std::default_delete<std::vector<reco::TrackExtra, std::allocator<reco::TrackExtra> > > >&, std::unique_ptr<std::vector<Trajectory, std::allocator<Trajectory> >, std::default_delete<std::vector<Trajectory, std::allocator<Trajectory> > > >&, std::unique_ptr<std::vector<int, std::allocator<int> >, std::default_delete<std::vector<int, std::allocator<int> > > >&, std::vector<AlgoProductTraits<reco::Track>::AlgoProduct, std::allocator<AlgoProductTraits<reco::Track>::AlgoProduct> >&, TransientTrackingRecHitBuilder const*, TrackerTopology const*, int) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libRecoTrackerTrackProducer.so
#15 0x00002b856a7540f4 in TrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoTrackerTrackProducerPlugins.so
#16 0x00002b84fb5fa95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so
#17 0x00002b84fb5e1072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 7 (Thread 0x2b854ac02700 (LWP 654) "cmsRun"):
#2  0x00002b8503eefed0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b852f07ac53 in CellularAutomaton::createAndConnectCells(std::vector<HitDoublets const*, std::allocator<HitDoublets const*> > const&, TrackingRegion const&, CACut const&, CACut const&, float) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libRecoPixelVertexingPixelTriplets.so
#5  0x00002b852f0746a8 in CAHitQuadrupletGenerator::hitNtuplets(IntermediateHitDoublets const&, std::vector<OrderedHitSeeds, std::allocator<OrderedHitSeeds> >&, SeedingLayerSetsHits const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libRecoPixelVertexingPixelTriplets.so
#6  0x00002b858c452c8f in CAHitNtupletEDProducerT<CAHitQuadrupletGenerator>::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoPixelVertexingPixelTripletsPlugins.so
#7  0x00002b84fb5fa95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so
#8  0x00002b84fb5e1072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 6 (Thread 0x2b854a201700 (LWP 653) "cmsRun"):
#2  0x00002b8503eefed0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b85a355538d in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoParticleFlowPFClusterProducerPlugins.so
#5  0x00002b85a3554df7 in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoParticleFlowPFClusterProducerPlugins.so
#6  0x00002b85a3554df7 in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoParticleFlowPFClusterProducerPlugins.so
#7  0x00002b85a3554df7 in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoParticleFlowPFClusterProducerPlugins.so
#8  0x00002b85a3554df7 in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoParticleFlowPFClusterProducerPlugins.so
#9  0x00002b85a3554df7 in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoParticleFlowPFClusterProducerPlugins.so
#10 0x00002b85a3554df7 in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoParticleFlowPFClusterProducerPlugins.so
#11 0x00002b85a3554df7 in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoParticleFlowPFClusterProducerPlugins.so
#12 0x00002b85a3555c1e in Basic2DGenericPFlowClusterizer::buildClusters(std::vector<reco::PFCluster, std::allocator<reco::PFCluster> > const&, std::vector<bool, std::allocator<bool> > const&, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoParticleFlowPFClusterProducerPlugins.so
#13 0x00002b85a3579689 in PFClusterProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoParticleFlowPFClusterProducerPlugins.so
#14 0x00002b84fb5fa95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so
#15 0x00002b84fb5e1072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 5 (Thread 0x2b8549600700 (LWP 652) "cmsRun"):
#2  0x00002b8503eefed0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b850427f406 in TH2F::AddBinContent(int, double) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/external/el8_amd64_gcc11/lib/libHist.so
#5  0x00002b85042768c0 in TH2::Fill(double, double, double) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/external/el8_amd64_gcc11/lib/libHist.so
#6  0x00002b8504765295 in dqm::impl::MonitorElement::Fill(double, double) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libDQMServicesCore.so
#7  0x00002b856a2822eb in RecHitTask::_process(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginDQMHcalTasksAuto.so
#8  0x00002b856a218f60 in non-virtual thunk to DQMOneEDAnalyzer<edm::LuminosityBlockCache<hcaldqm::Cache> >::accumulate(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginDQMHcalTasksAuto.so
#9  0x00002b84fb5f165e in edm::one::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so
#10 0x00002b84fb5d94f2 in edm::WorkerT<edm::one::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 4 (Thread 0x2b8548413700 (LWP 651) "cmsRun"):
#2  0x00002b8503eefed0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b852af78fc6 in HcalGeometry::getGeometryRawPtr(unsigned int) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libGeometryHcalTowerAlgo.so
#5  0x00002b852b082c28 in CaloSubdetectorGeometry::cellGeomPtr(unsigned int) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libGeometryCaloGeometry.so
#6  0x00002b852af7a3fe in HcalGeometry::getGeometry(DetId const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libGeometryHcalTowerAlgo.so
#7  0x00002b852f57d073 in CaloDetIdAssociator::getDetIdPoints(DetId const&, std::vector<Point3DBase<float, GlobalTag>, std::allocator<Point3DBase<float, GlobalTag> > >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginTrackingToolsTrackAssociatorPlugins.so
#8  0x00002b852f580683 in CaloDetIdAssociator::crossedElement(Point3DBase<float, GlobalTag> const&, Point3DBase<float, GlobalTag> const&, DetId const&, double, SteppingHelixStateInfo const*) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginTrackingToolsTrackAssociatorPlugins.so
#9  0x00002b852f5b2e93 in DetIdAssociator::getCrossedDetIds(std::set<DetId, std::less<DetId>, std::allocator<DetId> > const&, std::vector<Point3DBase<float, GlobalTag>, std::allocator<Point3DBase<float, GlobalTag> > > const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackingToolsTrackAssociator.so
#10 0x00002b852f5c6eda in TrackDetectorAssociator::fillHcal(edm::Event const&, TrackDetMatchInfo&, TrackAssociatorParameters const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackingToolsTrackAssociator.so
#11 0x00002b852f5c9cb2 in TrackDetectorAssociator::associate(edm::Event const&, edm::EventSetup const&, TrackAssociatorParameters const&, FreeTrajectoryState const*, FreeTrajectoryState const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackingToolsTrackAssociator.so
#12 0x00002b852f5ca379 in TrackDetectorAssociator::associate(edm::Event const&, edm::EventSetup const&, reco::Track const&, TrackAssociatorParameters const&, TrackDetectorAssociator::Direction) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackingToolsTrackAssociator.so
#13 0x00002b858a3a22be in MuonIdProducer::fillMuonId(edm::Event&, edm::EventSetup const&, reco::Muon&, TrackDetectorAssociator::Direction) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoMuonMuonIdentificationPlugins.so
#14 0x00002b858a3a4c89 in MuonIdProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoMuonMuonIdentificationPlugins.so
#15 0x00002b84fb5fa95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so
#16 0x00002b84fb5e1072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 3 (Thread 0x2b8547a12700 (LWP 650) "cmsRun"):
#2  0x00002b8503eefed0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b852f0854b7 in ThirdHitPredictionFromCircle::phi(float, float) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libRecoPixelVertexingPixelTriplets.so
#5  0x00002b858b65973c in MultiHitGeneratorFromChi2::hitSets(TrackingRegion const&, OrderedMultiHits&, HitDoublets const&, RecHitsSortedInPhi const**, std::vector<DetLayer const*, std::allocator<DetLayer const*> > const&, int, std::vector<std::unique_ptr<BaseTrackerRecHit, std::default_delete<BaseTrackerRecHit> >, std::allocator<std::unique_ptr<BaseTrackerRecHit, std::default_delete<BaseTrackerRecHit> > > >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoTrackerTkSeedGeneratorPlugins.so
#6  0x00002b858b65018a in MultiHitGeneratorFromChi2::hitSets(TrackingRegion const&, OrderedMultiHits&, HitDoublets const&, std::vector<SeedingLayerSetsHits::SeedingLayer, std::allocator<SeedingLayerSetsHits::SeedingLayer> > const&, LayerHitMapCache&, std::vector<std::unique_ptr<BaseTrackerRecHit, std::default_delete<BaseTrackerRecHit> >, std::allocator<std::unique_ptr<BaseTrackerRecHit, std::default_delete<BaseTrackerRecHit> > > >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoTrackerTkSeedGeneratorPlugins.so
#7  0x00002b858b651003 in MultiHitFromChi2EDProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoTrackerTkSeedGeneratorPlugins.so
#8  0x00002b84fb5fa95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so
#9  0x00002b84fb5e1072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so
#10 0x00002b84fb56d6da in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so
#11 0x00002b84fb56db88 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 1 (Thread 0x2b84fea871c0 (LWP 532) "cmsRun"):
#2  0x00002b8503eefed0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b85284248d2 in HelixArbitraryPlaneCrossing::positionInDouble(double) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackingToolsGeomPropagators.so
#5  0x00002b8528425169 in HelixArbitraryPlaneCrossing::pathLength(Plane const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackingToolsGeomPropagators.so
#6  0x00002b852f1b8872 in RKPropagatorInS::propagateWithPath(FreeTrajectoryState const&, Plane const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackPropagationRungeKutta.so
#7  0x00002b852f1b2c65 in Propagator::propagateWithPath(TrajectoryStateOnSurface const&, Plane const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackPropagationRungeKutta.so
#8  0x00002b852f1a5a72 in PropagatorWithMaterial::propagateWithPath(TrajectoryStateOnSurface const&, Plane const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackingToolsMaterialEffects.so
#9  0x00002b852842397c in Propagator::propagateWithPath(TrajectoryStateOnSurface const&, Surface const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackingToolsGeomPropagators.so
#10 0x00002b85404ee8b7 in KFTrajectorySmoother::trajectory(Trajectory const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libTrackingToolsTrackFitters.so
#11 0x00002b85404a046f in (anonymous namespace)::KFFittingSmoother::smoothingStep(Trajectory&&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginTrackingToolsTrackFittersPlugins.so
#12 0x00002b85404a3c4e in (anonymous namespace)::KFFittingSmoother::fitOne(TrajectorySeed const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > > const&, TrajectoryStateOnSurface const&, TrajectoryFitter::fitType) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginTrackingToolsTrackFittersPlugins.so
#13 0x00002b854049df97 in (anonymous namespace)::FlexibleKFFittingSmoother::fitOne(TrajectorySeed const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > > const&, TrajectoryStateOnSurface const&, TrajectoryFitter::fitType) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginTrackingToolsTrackFittersPlugins.so
#14 0x00002b856a835cd2 in TrackProducerAlgorithm<reco::Track>::buildTrack(TrajectoryFitter const*, Propagator const*, std::vector<AlgoProductTraits<reco::Track>::AlgoProduct, std::allocator<AlgoProductTraits<reco::Track>::AlgoProduct> >&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > >&, TrajectoryStateOnSurface&, TrajectorySeed const&, float, reco::BeamSpot const&, edm::RefToBase<TrajectorySeed>, int, signed char) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libRecoTrackerTrackProducer.so
#15 0x00002b856a7559dd in TrackProducerAlgorithm<reco::Track>::runWithCandidate(TrackingGeometry const*, MagneticField const*, std::vector<TrackCandidate, std::allocator<TrackCandidate> > const&, TrajectoryFitter const*, Propagator const*, TransientTrackingRecHitBuilder const*, reco::BeamSpot const&, std::vector<AlgoProductTraits<reco::Track>::AlgoProduct, std::allocator<AlgoProductTraits<reco::Track>::AlgoProduct> >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoTrackerTrackProducerPlugins.so
#16 0x00002b856a7542c2 in TrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/pluginRecoTrackerTrackProducerPlugins.so
#17 0x00002b84fb5fa95d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so
#18 0x00002b84fb5e1072 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_3/lib/el8_amd64_gcc11/libFWCoreFramework.so

Current Modules:
Module: FastjetJetProducer:ak4PFJets (crashed)
Module: MultiHitFromChi2EDProducer:pixelLessStepHitTriplets
Module: PFClusterProducer:particleFlowClusterHBHE
Module: RecHitTask:recHitTask
Module: TrackProducer:mixedTripletStepTracks
Module: MuonIdProducer:muons1stStep
Module: TrackProducer:initialStepTracks
Module: CAHitQuadrupletEDProducer:detachedQuadStepHitQuadruplets
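
In the trace above, the thread that took the signal is the one whose stack contains `sig_dostack_then_abort`; the other threads sit in `sig_pause_for_stacktrace`. A rough sketch of picking out that thread and the first frame after `<signal handler called>` follows. It is plain Python, the function name `crashing_frames` is hypothetical, and the sample is abbreviated from the trace (library paths shortened to the file name).

```python
import re

THREAD_RE = re.compile(r"^Thread (\d+) \(")
# Frame lines look like "#5  0x... in func (...)" or "#5  func (...) at file:line".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(.+)$")


def crashing_frames(trace):
    """Map thread number -> function name of the first frame after
    '<signal handler called>', only for threads whose stack contains
    sig_dostack_then_abort."""
    result = {}
    thread = None
    aborting = False
    after_handler = False
    for line in trace.splitlines():
        thread_match = THREAD_RE.match(line)
        if thread_match:
            thread = int(thread_match.group(1))
            aborting = False
            after_handler = False
            continue
        frame_match = FRAME_RE.match(line)
        if frame_match is None or thread is None:
            continue
        text = frame_match.group(2)
        if "sig_dostack_then_abort" in text:
            aborting = True
        elif text.startswith("<signal handler called>"):
            after_handler = True
        elif aborting and after_handler and thread not in result:
            # Keep only the function name, dropping arguments and location.
            result[thread] = text.split(" (")[0]
    return result


# Abbreviated from the stack trace in this issue.
SAMPLE = """\
Thread 9 (Thread 0x2b854cc00700 (LWP 656) "cmsRun"):
#3  0x00002b8503ef333b in sig_dostack_then_abort () from pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  fastjet::LazyTiling9::_tj_set_jetinfo (_jets_index=1409, jet=0x2b888c4f9688, this=0x2b854cbf8cd0) at LazyTiling9.cc:258
#6  fastjet::LazyTiling9::run (this=this@entry=0x2b854cbf8cd0) at LazyTiling9.cc:509

Thread 8 (Thread 0x2b854ba00700 (LWP 655) "cmsRun"):
#2  0x00002b8503eefed0 in sig_pause_for_stacktrace () from pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b852842377f in HelixForwardPlaneCrossing::position(double) const () from libTrackingToolsGeomPropagators.so
"""

print(crashing_frames(SAMPLE))
```

On this excerpt the sketch points at Thread 9 and `fastjet::LazyTiling9::_tj_set_jetinfo`, consistent with the `(crashed)` flag on `FastjetJetProducer:ak4PFJets`.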

@makortel
Contributor

assign reconstruction

@cmsbuild
Contributor

New categories assigned: reconstruction

@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Contributor

mmusich commented Apr 24, 2023

> There is one job failing Reco for Run 366351

that's not a global run. I think the original message on the Tier-0 cmstalk is about 366451

@malbouis
Contributor Author

> > There is one job failing Reco for Run 366351
>
> that's not a global run. I think the original message on the Tier-0 cmstalk is about 366451

Thanks Marco! I have updated the description.

@malbouis
Contributor Author

malbouis commented Apr 25, 2023

Let me add a recipe to reproduce the error, as discussed at the OPR meeting today.

cmsrel CMSSW_13_0_3
cd CMSSW_13_0_3/src/
cmsenv
cp -r /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2023B/job_248341/job/WMTaskSpace/ .
cd WMTaskSpace/cmsRun1/
cmsRun -e PSet.py 

@mandrenguyen
Contributor

I can't reproduce this error using the pickled configuration (PSet.pkl).
Unfortunately lxplus was killing my interactive job, so I copied the input file to my T2.
On the LLR machine, the job runs to completion in 13_0_3.
The memory report gives the following:
MemoryReport> Peak virtual size 18036.2 Mbytes
MemoryReport> Peak rss size 9978.31 Mbytes

@malbouis
Contributor Author

Thanks, @mandrenguyen !

I could reproduce it on lxplus when I tried it. Could someone else double-check that the crash can be reproduced on lxplus with the recipe posted above?

@germanfgv
Contributor

@mandrenguyen just to confirm, were you using scram arch el8_amd64_gcc11?

@mmusich
Contributor

mmusich commented Apr 26, 2023

were you using scram arch el8_amd64_gcc11?

I also tried last night: with the default arch one gets on lxplus (not lxplus8), slc7_amd64_gcc11, the crash is not there.

@malbouis
Contributor Author

were you using scram arch el8_amd64_gcc11?

I also tried last night: with the default arch one gets on lxplus (not lxplus8), slc7_amd64_gcc11, the crash is not there.

Thanks Marco!
I tried it on lxplus8 and I reproduced the crash, but I had not tried it on regular lxplus.

@mmusich
Contributor

mmusich commented Apr 26, 2023

I tried it on lxplus8 and I reproduced the crash, but I had not tried it on regular lxplus.

for the record, on an lxplus8 node, using the recipe above, and a slightly modified PSet:

import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
    process.options.numberOfThreads = 1
    process.source.skipEvents=cms.untracked.uint32(586)

it will segfault consistently at the first event processed.

@mandrenguyen
Contributor

The offending line is:

ClusterSequencePtr(new fastjet::ClusterSequenceArea(fjInputs_, *fjJetDefinition_, *fjAreaDefinition_));

The problem appears to come from fjAreaDefinition_.
That's as far as I've understood for the moment. If @cms-sw/jetmet-pog-l2 or @laurenhay have any ideas, feel free to chime in.

@Dr15Jones
Contributor

Looking at where fjAreaDefinition_ is checked in the constructor, it seems that useConstituentSubtraction_ is also supposed to be true if fjAreaDefinition_ is used. The code causing the problem does not first check that useConstituentSubtraction_ == true.

@Dr15Jones
Contributor

The value of fjAreaDefinition_ is set here, and only if certain criteria are met:

if (doAreaFastjet_ || doRhoFastjet_) {
  if (voronoiRfact_ <= 0) {
    fjActiveArea_ = std::make_shared<fastjet::GhostedAreaSpec>(ghostEtaMax_, activeAreaRepeats_, ghostArea_);
    if (!useExplicitGhosts_) {
      fjAreaDefinition_ = std::make_shared<fastjet::AreaDefinition>(fastjet::active_area, *fjActiveArea_);
    } else {
      fjAreaDefinition_ =
          std::make_shared<fastjet::AreaDefinition>(fastjet::active_area_explicit_ghosts, *fjActiveArea_);
    }
  }
  fjSelector_ = std::make_shared<fastjet::Selector>(fastjet::SelectorAbsRapMax(rhoEtaMax_));
}

@mandrenguyen
Contributor

Since it's ak4PFJets that's crashing, I believe useConstituentSubtraction_ should not be set to true, but fjAreaDefinition_ does indeed need to be defined. Based on the snippet above though, I think the conditions are met. doAreaFastjet_ is true and voronoiRfact_ is indeed set to a negative value.

@mandrenguyen
Contributor

mandrenguyen commented Apr 26, 2023

I looped over the jet inputs (fjInputs_) in the event on which the code is crashing.

    for (auto const& input : fjInputs_) {
      // !(E > 0) is also true when E is NaN, so this prints NaN inputs as well
      if (!(input.E() > 0))
        std::cout << "e " << input.e() << " phi " << input.phi() << " rap " << input.rap()
                  << " px " << input.px() << " py " << input.py() << " pz " << input.pz() << std::endl;
    }
    fjClusterSeq_ = ClusterSequencePtr(
        new fastjet::ClusterSequenceArea(fjInputs_, *fjJetDefinition_, *fjAreaDefinition_));

Out of the 3080 jet constituents, one of them has NaN for e() and rap().
I guess that's what's causing FastJet to choke.
Perhaps we should see if that's coming from the input PFCandidate collection.

For what it's worth px,py,pz are set correctly:
e -nan phi 0.343687 rap -nan px 2.27882 py 0.815566 pz -1.90158
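
As a side note on why the `!(input.E() > 0)` check catches this input: every ordered comparison with NaN is false, so the negated test is true for NaN as well as for zero or negative energies. A minimal Python sketch of that behaviour, using the momentum components printed above. The helper names are hypothetical, and `rapidity` uses the textbook definition rather than FastJet's actual implementation:

```python
import math


def is_suspicious(energy):
    """Mirror of the C++ check !(E > 0): true for zero, negative and NaN energies."""
    return not (energy > 0)


def rapidity(energy, pz):
    """Textbook rapidity y = 0.5 * ln((E + pz) / (E - pz)); a NaN energy gives NaN."""
    if math.isnan(energy):
        return math.nan
    return 0.5 * math.log((energy + pz) / (energy - pz))


# Momentum components of the anomalous input, as printed in the log line above.
px, py, pz = 2.27882, 0.815566, -1.90158
bad_energy = math.nan

# phi depends only on px and py, so it stays finite even though E is NaN.
phi = math.atan2(py, px)
print("suspicious:", is_suspicious(bad_energy))
print("phi:", phi, "rap:", rapidity(bad_energy, pz))
```

This is consistent with the printout: phi is well defined from px and py alone, while anything derived from the energy (here the rapidity) inherits the NaN.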

@mandrenguyen
Contributor

Some more observations.
I can find the anomalous PFCandidate in PFLinker.cc.
It's of type 1, so it's a charged hadron.

cand.trackRef() is non-null, and has the following values, which I'm not immediately finding in generalTracks (but I didn't check super carefully):
px = 6.52489e+08 py 2.33519e+08 pz -5.44475e+08
The linked calo energies are all -nan: hoEnergy(), hcalEnergy(), ecalEnergy().

I guess my next step would be to see if I can track the NaN back to where charged hadrons are first created, in PFAlgo.cc, but I won't be able to get to it immediately.
If anyone else wants to have a look, feel free of course.

@makortel
Contributor

Here is an issue from 2022 of a PFCandidate with NaN #39110 (I did not attempt to understand if it would be related though)

Let's anyway tag @cms-sw/pf-l2

@mandrenguyen
Contributor

In case it's useful to examine the output, one can get the job to finish successfully by inserting the following in the loop over PF candidates in PFLinker.cc

    if (!(cand.energy() > 0)) continue;

@malbouis
Contributor Author

malbouis commented Apr 27, 2023

We have 3 more occurrences of this error in pp runs, for dataset EphemeralZeroBias:

  • 2 paused jobs in run 366495
  • 1 paused job in run 366497

I post here the links for the tar files, in case someone would like to try to reproduce them (I did not yet have the chance)

https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run366495_EphemeralZeroBias17/Reco
https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run366495_EphemeralZeroBias13/Reco

https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run366497_EphemeralZeroBias18/Reco

@kdlong
Contributor

kdlong commented Apr 27, 2023

Thanks for all the info, will take a look ASAP

@mandrenguyen
Contributor

Thanks @kdlong
The furthest I've been able to track the nan so far is to:

chargedHadronsTotalEnergy += chargedHadron.energy();

chargedHadron.energy() is returning -nan for index = 1411

@mmusich
Contributor

mmusich commented Apr 27, 2023

type pf

@cmsbuild added the pf label Apr 27, 2023
@malbouis changed the title from "Segmentation violation in PromptReco" to "Segmentation violation in PromptReco for FastjetJetProducer:ak4PFJets" Apr 27, 2023
@malbouis
Contributor Author

We have yet another paused job in Tier0 due to this crash.

It is occurring for run 366729 in dataset EphemeralZeroBias10.

The tar ball can be found in https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run366729_EphemeralZeroBias10/Reco

Is there any further progress in debugging this issue?

@swagata87
Contributor

swagata87 commented Apr 30, 2023

Here is an issue from 2022 of a PFCandidate with NaN #39110 (I did not attempt to understand if it would be related though)

Yes, there was a similar finding last year, where the photon isolation became NaN when the bad PF candidate ended up in the photon's isolation cone. A preliminary fix was to loop over the PF candidate collection, check for NaN, remove the affected candidates, and build a pfCandNoNaN collection, which was then passed on to the isolation calculation. This is where it was done: https://github.com/cms-sw/cmssw/pull/39120/files

Maybe something similar can be done for jet/MET, if this is easier and quicker to do. But of course the real issue needs to be solved upstream.

Even if it's fixed at PF level, such extra protections in POG code are probably not a bad idea, as the PF code (and logic) is complex and can go wrong in various unforeseen ways, especially in the startup phase, where alignment/calibrations are not perfect and several special checks/tests are ongoing using special modes (the interplay of those with the PF logic can be hard to predict).
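
To make the idea of such a protection concrete, here is a minimal Python sketch of the filtering step described above: copy the candidates into a new collection and skip any with a non-finite kinematic value. The `Candidate` class, its field names and the sample values are assumptions for illustration only, not the CMSSW data format or the code from the linked PR:

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for a PF candidate: just a four-momentum."""
    px: float
    py: float
    pz: float
    energy: float


def drop_non_finite(candidates):
    """Return a new list that keeps only candidates whose four-momentum is fully finite."""
    return [
        cand
        for cand in candidates
        if all(math.isfinite(v) for v in (cand.px, cand.py, cand.pz, cand.energy))
    ]


# Made-up input: the second candidate mimics the NaN-energy case from this issue.
pf_cands = [
    Candidate(1.0, 0.5, -0.2, 1.15),
    Candidate(2.27882, 0.815566, -1.90158, math.nan),
    Candidate(-0.3, 0.1, 4.0, 4.02),
]
pf_cands_no_nan = drop_non_finite(pf_cands)
print(len(pf_cands), "->", len(pf_cands_no_nan))
```

Checking all four components with `math.isfinite` also rejects infinities, which is slightly stricter than a NaN-only check.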

@kdlong
Contributor

kdlong commented May 2, 2023

I was trying to reproduce this yesterday, and I couldn't get the failure. Now I can't access /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2023B/job_248341/job/WMTaskSpace/. Was it removed? Is there a simple recipe someone can point me to?

@mandrenguyen
Contributor

Hi @kdlong
You can copy over the relevant files from my area:
/afs/cern.ch/work/m/mnguyen/public/test/CMSSW_13_0_3/src

In PSet.py I skip directly to the crashing event, so you should find it immediately.
Note that the crash only occurs on lxplus8; you won't see it on SL7.

@kdlong
Contributor

kdlong commented May 3, 2023

Thanks @mandrenguyen. Unfortunately it seems the file has already been removed from disk. Does anyone have other examples of the failure with a file that's still accessible?

@mandrenguyen
Contributor

@kdlong Taking one of the other examples from
#41397 (comment)
I copied the input root file for safe keeping, as well as the tarball to:
/eos/cms/store/group/phys_heavyions/mnguyen/PFcrash/

@mandrenguyen
Contributor

@kdlong You can use the following PSet.py to skip directly to the crashing event:

import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
    process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck")
    process.maxEvents.input = -1
    process.source.fileNames=cms.untracked.vstring('file:/eos/cms/store/group/phys_heavyions/mnguyen/PFcrash/a9175998-1945-4443-b085-8960314354a9.root')
    process.source.skipEvents=cms.untracked.uint32(5693)
    process.options.numberOfThreads = 1

You can bypass the crash in FastJet by merging this one-liner PR: #41474

@kdlong
Contributor

kdlong commented May 5, 2023

Thanks @mandrenguyen. I finally reproduced the issue and understood that it comes from the mass-aware scaling that I introduced in #39368. In the case of a track with a huge momentum but huge uncertainty (1e7 in the example given above), the scale factor is very small and the energy rescaling computation has numeric issues. The fix is simple: remove the large ratios by calculating the energy from the rescaled momentum rather than calculating a scaling factor.
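
A minimal Python sketch of the idea behind the fix, not the actual PFAlgo code: the momentum is scaled first, and the energy is then rebuilt from the scaled momentum and the mass, so no ratio of very large to very small numbers is involved. The function name, the charged-pion mass hypothesis and the scale value are assumptions for illustration; only the track momentum is taken from the values quoted earlier in this thread:

```python
import math

PION_MASS = 0.13957  # GeV; assumed mass hypothesis for a charged hadron


def rescaled_four_momentum(px, py, pz, scale, mass=PION_MASS):
    """Scale the momentum, then recompute E = sqrt(|p|^2 + m^2) directly."""
    spx, spy, spz = scale * px, scale * py, scale * pz
    energy = math.sqrt(spx * spx + spy * spy + spz * spz + mass * mass)
    return spx, spy, spz, energy


# Track momentum quoted earlier in the thread; the scale value is made up.
px, py, pz, energy = rescaled_four_momentum(6.52489e+08, 2.33519e+08, -5.44475e+08, scale=1e-7)
print("rescaled energy:", energy)
```

Because the energy is obtained from a sum of squares, the result is finite and the mass shell relation E^2 - |p|^2 = m^2 holds up to rounding for any finite scaled momentum.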

@makortel
Contributor

Just to make sure, is the problem described in this issue fixed now?

@laurenhay
Contributor

Just to make sure, is the problem described in this issue fixed now?

Yes this issue can be closed.

@makortel
Contributor

@cmsbuild, please close

@mandrenguyen
Contributor

+1

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.
