PromptReco failure PromptReco_Run381379_ParkingSingleMuon4 #45162

Closed
Dr15Jones opened this issue Jun 6, 2024 · 38 comments

@Dr15Jones
Contributor

From https://cms-talk.web.cern.ch/t/paused-job-for-promptreco-run381379-parkingsinglemuon4/42082

----- Begin Fatal Exception 06-Jun-2024 16:58:22 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 381379 lumi: 819 event: 1742750619 stream: 2
   [1] Running path 'write_AOD_step'
   [2] Prefetching for module PoolOutputModule/'write_AOD'
   [3] While reading from source GlobalObjectMapRecord hltGtStage2ObjectMap '' HLT
   [4] Rethrowing an exception that happened on a different read request.
   [5] Processing  Event run: 381379 lumi: 819 event: 1742683577 stream: 4
   [6] Running path 'dqmoffline_step'
   [7] Prefetching for module DQMMessageLogger/'DQMMessageLogger'
   [8] Prefetching for module LogErrorHarvester/'logErrorHarvester'
   [9] Prefetching for module CSCRecHitDProducer/'csc2DRecHits'
   [10] Prefetching for module CSCDCCUnpacker/'muonCSCDigis'
   [11] While reading from source FEDRawDataCollection rawDataCollector '' LHC
   [12] Reading branch FEDRawDataCollection_rawDataCollector__LHC.
Exception Message:
vector::_M_default_append
----- End Fatal Exception -------------------------------------------------

The tarball can be found here:

/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/FileReadError/job/WMTaskSpace/cmsRun1
From the logs it seems to crash at event 1742503164. The error is reproducible locally.

@cmsbuild
Contributor

cmsbuild commented Jun 6, 2024

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Jun 6, 2024

A new Issue was created by @Dr15Jones.

@antoniovilela, @sextonkennedy, @smuzaffar, @makortel, @rappoccio, @Dr15Jones can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor Author

The job can be reproduced by setting up a CMSSW_14_0_7 area and downloading the tarball, which is at
/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/FileReadError/a406cf00-00a4-498e-b7e2-9ec39b964fac-216-3-logArchive.tar.gz

After untarring, go to the directory job/WMTaskSpace/cmsRun1 and run

cmsRun PSet.py

@Dr15Jones
Contributor Author

There appear to be lots of extraneous exceptions being thrown (and caught) in this job. The first one encountered is

%MSG-e SiStripMonitorTrack:  SiStripMonitorTrack:HLTSiStripMonitorTrack  06-Jun-2024 17:43:09 CEST Run: 381379 Event: 1741662696
ClusterCollection is not valid!!
%MSG
[Switching to Thread 0x7fffa05fe640 (LWP 3001818)]

Thread 7 "cmsRun" hit Catchpoint 1 (exception thrown), 0x00007ffff5ead0f1 in __cxxabiv1::__cxa_throw (obj=0x7ffde5f68b00, tinfo=0x7ffff79a0650 <typeinfo for edm::Exception>,
    dest=0x7ffff796a010 <edm::Exception::~Exception()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
81      ../../../../libstdc++-v3/libsupc++/eh_throw.cc: No such file or directory.
(gdb) where
#0  0x00007ffff5ead0f1 in __cxxabiv1::__cxa_throw (obj=0x7ffde5f68b00, tinfo=0x7ffff79a0650 <typeinfo for edm::Exception>, dest=0x7ffff796a010 <edm::Exception::~Exception()>)
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1  0x00007ffff7b7e0b2 in throwInvalidRefFromNullOrInvalidRef(edm::TypeID const&) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libDataFormatsCommon.so
#2  0x00007ffff7b7ed6f in edm::RefCore::tryToGetProductPtr(std::type_info const&, edm::EDProductGetter const*) const [clone .cold] ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libDataFormatsCommon.so
#3  0x00007fffa557aa1a in reco::Track::recHitsBegin() const ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so
#4  0x00007fffa55bd779 in SingleLongTrackProducer::produce(edm::Event&, edm::EventSetup const&) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so
#5  0x00007ffff7e483c1 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libFWCoreFramework.so
#6  0x00007ffff7e2c04e in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libFWCoreFramework.so
#7  0x00007ffff7db9159 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libFWCoreFramework.so
#8  0x00007ffff7db96c4 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libFWCoreFramework.so
#9  0x00007ffff7f3af28 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) ()
   from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el9_amd64_gcc12/libFWCoreConcurrency.so
#10 0x00007ffff6f1091b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7ffeafe74400, waiter=..., this=0x7ffff41c3b00)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#11 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7ffff41c3b00)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#12 tbb::detail::r1::arena::process (tls=..., this=<optimized out>)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/arena.cpp:137
#13 tbb::detail::r1::market::process (this=<optimized out>, j=...)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/market.cpp:599
#14 0x00007ffff6f12ace in tbb::detail::r1::rml::private_worker::run (this=0x7ffff2486f00)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#15 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7ffff2486f00)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#16 0x00007ffff5a89c02 in start_thread () from /lib64/libc.so.6
#17 0x00007ffff5b0ec40 in clone3 () from /lib64/libc.so.6

Which is caught here

try {  // (Un)Comment this line with /* to (not) allow for events with not valid hits
  auto hb = track.recHitsBegin();
  for (unsigned int h = 0; h < track.recHitsSize(); h++) {
    auto recHit = *(hb + h);
    auto const &hit = *recHit;
    if (onlyValidHits && !hit.isValid()) {
      hitIsNotValid = true;
      continue;
    }
  }
} catch (cms::Exception const &e) {
  deref += 1;
  if (debug)
    std::cerr << e.explainSelf() << std::endl;

which is problematic as the tracks are the generalTracks which are being made in this job and SHOULD have accessible hits!

@Dr15Jones
Contributor Author

assign tracking

@Dr15Jones
Contributor Author

Dr15Jones commented Jun 6, 2024

The next group of exceptions comes from

#0  0x00007ffff5b9d2f1 in __cxxabiv1::__cxa_throw (obj=0x7ffdca082400, tinfo=0x7ffff79a5628 <typeinfo for cms::Exception>, dest=0x7ffff796ee30 <cms::Exception::~Exception()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1  0x00007fffc37f8a8d in PerigeeConversions::ftsToPerigeeParameters(FreeTrajectoryState const&, Point3DBase<float, GlobalTag> const&, double&) [clone .cold] ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libTrackingToolsTrajectoryState.so
#2  0x00007fffc3806a5a in TrajectoryStateClosestToPoint::TrajectoryStateClosestToPoint(FreeTrajectoryState const&, Point3DBase<float, GlobalTag> const&) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libTrackingToolsTrajectoryState.so
#3  0x00007fffc38725a5 in TSCPBuilderNoMaterial::operator()(TrajectoryStateOnSurface const&, Point3DBase<float, GlobalTag> const&) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libTrackingToolsPatternTools.so
#4  0x00007fffbe679dd2 in PerigeeLinearizedTrackState::computeJacobians() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexVertexTools.so
#5  0x00007fffbe67a456 in PerigeeLinearizedTrackState::isValid() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexVertexTools.so
#6  0x00007fffbc5ac58f in KalmanVertexUpdator<5u>::positionUpdate(VertexState const&, ReferenceCountingPointer<LinearizedTrackState<5u> >, float, int) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexKalmanVertexFit.so
#7  0x00007fffbc5ae20d in KalmanVertexUpdator<5u>::update(CachingVertex<5u> const&, ReferenceCountingPointer<VertexTrack<5u> >, float, int) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexKalmanVertexFit.so
#8  0x00007fffbc5ae89a in KalmanVertexUpdator<5u>::add(CachingVertex<5u> const&, ReferenceCountingPointer<VertexTrack<5u> >) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexKalmanVertexFit.so
#9  0x00007fffbc5ae90d in KalmanVertexTrackCompatibilityEstimator<5u>::estimateNFittedTrack(CachingVertex<5u> const&, ReferenceCountingPointer<VertexTrack<5u> >) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexKalmanVertexFit.so
#10 0x00007fffbc5b023f in KalmanVertexTrackCompatibilityEstimator<5u>::estimate(CachingVertex<5u> const&, ReferenceCountingPointer<VertexTrack<5u> >, unsigned int) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexKalmanVertexFit.so
#11 0x00007fffbc5aa80e in KalmanVertexTrackCompatibilityEstimator<5u>::estimate(CachingVertex<5u> const&, ReferenceCountingPointer<LinearizedTrackState<5u> >, unsigned int) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexKalmanVertexFit.so
#12 0x00007fffbc5d101c in AdaptiveVertexFitter::reWeightTracks(std::vector<ReferenceCountingPointer<LinearizedTrackState<5u> >, std::allocator<ReferenceCountingPointer<LinearizedTrackState<5u> > > > const&, CachingVertex<5u> const&) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexAdaptiveVertexFit.so
#13 0x00007fffbc5d1e65 in AdaptiveVertexFitter::reWeightTracks(std::vector<ReferenceCountingPointer<VertexTrack<5u> >, std::allocator<ReferenceCountingPointer<VertexTrack<5u> > > > const&, CachingVertex<5u> const&) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexAdaptiveVertexFit.so
#14 0x00007fffbc5d32ed in AdaptiveVertexFitter::fit(std::vector<ReferenceCountingPointer<VertexTrack<5u> >, std::allocator<ReferenceCountingPointer<VertexTrack<5u> > > > const&, VertexState const&, bool) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexAdaptiveVertexFit.so
#15 0x00007fffbc5d46e1 in AdaptiveVertexFitter::vertex(std::vector<reco::TransientTrack, std::allocator<reco::TransientTrack> > const&, Point3DBase<float, GlobalTag> const&) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoVertexAdaptiveVertexFit.so
#16 0x00007fff4035710a in TemplatedInclusiveVertexFinder<edm::View<reco::Candidate>, reco::VertexCompositePtrCandidate>::produce(edm::Event&, edm::EventSetup const&) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginRecoVertexAdaptiveVertexFinderPlugins.so
#17 0x00007ffff7ce1e91 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so

the exception originates here

if (pt == 0.)
  throw cms::Exception("PerigeeConversions", "Track with pt=0");

and is caught here

try {
  theParameters = PerigeeConversions::ftsToPerigeeParameters(originalFTS, referencePoint, thePt);
  if (theFTS.hasError()) {
    thePerigeeError = PerigeeConversions::ftsToPerigeeError(originalFTS);
    errorIsAvailable = true;
  } else {
    errorIsAvailable = false;
  }
  theField = &(originalFTS.parameters().magneticField());
} catch (const cms::Exception& ex) {
  if (ex.category() != "PerigeeConversions")
    throw;
  edm::LogInfo("TrajectoryStateClosestToPoint_PerigeeConversions")
      << "Caught exception " << ex.explainSelf() << ".\n";
  valid = false;
}
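
The throw-then-filter pattern quoted above can be illustrated outside CMSSW. The following is a minimal sketch, assuming a hypothetical `CategorizedError` class in place of `cms::Exception` and hypothetical `inversePt`, `PerigeeState` and `buildState` names that do not exist in the framework. It only shows the control flow: the expected category is swallowed and the state is marked invalid, while any other category is rethrown.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for cms::Exception: an error that carries a category.
class CategorizedError : public std::runtime_error {
public:
  CategorizedError(std::string category, const std::string& message)
      : std::runtime_error(message), category_(std::move(category)) {}
  const std::string& category() const { return category_; }

private:
  std::string category_;
};

// Hypothetical conversion that rejects pt == 0, like the check quoted above.
double inversePt(double pt) {
  if (pt == 0.)
    throw CategorizedError("PerigeeConversions", "Track with pt=0");
  return 1.0 / pt;
}

// Hypothetical result type: a value plus a validity flag.
struct PerigeeState {
  bool valid;
  double invPt;
};

// Swallow only the expected category and mark the state invalid;
// anything else propagates to the caller unchanged.
PerigeeState buildState(double pt) {
  PerigeeState state{true, 0.};
  try {
    state.invPt = inversePt(pt);
  } catch (const CategorizedError& ex) {
    if (ex.category() != "PerigeeConversions")
      throw;
    state.valid = false;
  }
  return state;
}
```

Every track with pt equal to zero costs one throw and one catch in this pattern, which would be consistent with the repeated catchpoint hits reported in this comment.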

@Dr15Jones
Contributor Author

assign reconstruction

@cmsbuild
Contributor

cmsbuild commented Jun 6, 2024

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Contributor Author

By skipping the first events, I was able to get the traceback for the exception that ultimately ended the job

#0  0x00007ffff5b9d2f1 in __cxxabiv1::__cxa_throw (obj=0x7ffe9579d1a0, tinfo=0x7ffff5d03190 <typeinfo for std::length_error>, dest=0x7ffff5bb2220 <std::length_error::~length_error()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1  0x00007ffff5b942d9 in std::__throw_length_error(char const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib64/libstdc++.so.6
#2  0x00007fffc38c8346 in ROOT::Detail::TCollectionProxyInfo::Pushback<std::vector<unsigned char, std::allocator<unsigned char> > >::resize(void*, unsigned long) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libDataFormatsStdDictionaries.so
#3  0x00007ffff7193701 in void TGenCollectionStreamer::ReadBufferVectorPrimitives<unsigned char>(TBuffer&, void*, TClass const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#4  0x00007ffff7110e09 in TBufferFile::ReadFastArray(void*, TClass const*, int, TMemberStreamer*, TClass const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#5  0x00007ffff735e073 in int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#6  0x00007ffff7211e4c in TStreamerInfoActions::VectorLooper::GenericRead(TBuffer&, void*, void const*, TStreamerInfoActions::TLoopConfiguration const*, TStreamerInfoActions::TConfiguration const*) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#7  0x00007ffff710f5fc in TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*, void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#8  0x00007ffff725f38f in int TStreamerInfoActions::ReadSTL<&TStreamerInfoActions::ReadSTLMemberWiseSameClass, &TStreamerInfoActions::ReadSTLObjectWiseFastArray>(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#9  0x00007ffff7117eae in TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#10 0x00007ffff735cdcc in int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#11 0x00007ffff71de94d in TStreamerInfoActions::GenericReadAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#12 0x00007ffff710fbb5 in TBufferFile::ApplySequence(TStreamerInfoActions::TActionSequence const&, void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#13 0x00007ffff7873b87 in TBranchElement::ReadLeavesMember(TBuffer&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libTree.so
#14 0x00007ffff786c429 in TBranch::GetEntry(long long, int) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libTree.so
#15 0x00007ffff787ed44 in TBranchElement::GetEntry(long long, int) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libTree.so
#16 0x00007ffff787ecfd in TBranchElement::GetEntry(long long, int) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libTree.so
#17 0x00007fff9d66585c in edm::RootTree::getEntry(TBranch*, long long) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolInput.so
#18 0x00007fff9d64639c in edm::RootDelayedReader::getProduct_(edm::BranchID const&, edm::EDProductGetter const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolInput.so
#19 0x00007ffff7bc111f in edm::DelayedReader::getProduct(edm::BranchID const&, edm::EDProductGetter const*, edm::ModuleCallingContext const*) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x00007ffff7c6a35b in edm::DelayedReaderInputProductResolver::prefetchAsync_(edm::WaitingTaskHolder, edm::Principal const&, bool, edm::ServiceToken const&, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#21 0x00007ffff7c6b7cc in edm::DelayedReaderInputProductResolver::prefetchAsync_(edm::WaitingTaskHolder, edm::Principal const&, bool, edm::ServiceToken const&, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const::{lambda()#1}::operator()() const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#22 0x00007ffff7c6b918 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::DelayedReaderInputProductResolver::prefetchAsync_(edm::WaitingTaskHolder, edm::Principal const&, bool, edm::ServiceToken const&, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const::{lambda()#1}&>(tbb::detail::d1::task_group&, edm::DelayedReaderInputProductResolver::prefetchAsync_(edm::WaitingTaskHolder, edm::Principal const&, bool, edm::ServiceToken const&, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const::{lambda()#1}&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#23 0x00007ffff7e031d0 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#24 0x00007ffff63fe95b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7fff08c3ec00, waiter=..., this=0x7ffff3963b00)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#25 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7ffff3963b00)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#26 tbb::detail::r1::arena::process (tls=..., this=<optimized out>)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:137
#27 tbb::detail::r1::market::process (this=<optimized out>, j=...)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/market.cpp:599
#28 0x00007ffff6400b0e in tbb::detail::r1::rml::private_worker::run (this=0x7ffff17e9100)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#29 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7ffff17e9100)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#30 0x00007ffff55341ca in start_thread () from /lib64/libpthread.so.0
#31 0x00007ffff518f8d3 in clone () from /lib64/libc.so.6
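
Frames #0 to #2 show a `std::length_error` thrown from a `resize` on a `std::vector<unsigned char>`. The sketch below reproduces only that mechanism in isolation, under the assumption that the requested size exceeded `max_size()`; the thread does not establish why the size read from the file was that large. The helper names `resizeThrowsLengthError` and `sizeIsPlausible` are hypothetical and are not part of ROOT or CMSSW.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Returns true if resizing to `requested` elements throws std::length_error.
bool resizeThrowsLengthError(std::vector<unsigned char>& buffer, std::size_t requested) {
  try {
    buffer.resize(requested);
  } catch (const std::length_error&) {
    return true;
  }
  return false;
}

// Hypothetical guard: reject sizes that the container could never hold.
bool sizeIsPlausible(const std::vector<unsigned char>& buffer, std::size_t requested) {
  return requested <= buffer.max_size();
}
```

With libstdc++ the `what()` text of this exception names the internal function that performed the check, which matches the `vector::_M_default_append` message in the original report. Other standard libraries use different wording, so the text should not be relied on.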

@Dr15Jones
Contributor Author

assign root

@Dr15Jones
Contributor Author

@pcanal how can we understand better what happened during the read?

@Dr15Jones
Contributor Author

type root

@Dr15Jones
Contributor Author

type tracking

@slava77
Contributor

slava77 commented Jun 6, 2024

Which is caught here

try {  // (Un)Comment this line with /* to (not) allow for events with not valid hits
  auto hb = track.recHitsBegin();
  for (unsigned int h = 0; h < track.recHitsSize(); h++) {
    auto recHit = *(hb + h);
    auto const &hit = *recHit;
    if (onlyValidHits && !hit.isValid()) {
      hitIsNotValid = true;
      continue;
    }
  }
} catch (cms::Exception const &e) {
  deref += 1;
  if (debug)
    std::cerr << e.explainSelf() << std::endl;

That just looks like poorly written code, where try/catch is used instead of checking whether the trackExtra is present.
Tracks are apparently not pure generalTracks, see

if (chiNdof < fitProb) {
  fitProb = chiNdof;
  bestTrack = track;
  bestTrack.setExtra(track.extra());

a proper copy is made conditionally, while the rest in selTracks is going to be default-constructed reco::Tracks

@slava77
Contributor

slava77 commented Jun 7, 2024

@borzari
please check #45162 (comment)
to possibly remove the try/catch pattern related to just the access to track.extra in the track.recHitsBegin() call.
It should be a combination of validity checks for extra() and then extra()->recHitsProduct(), checking isNonnull() && isAvailable() for each, sequentially.
This could even be packed into a new helper method, e.g. bool reco::Track::recHitsOk()

Please clarify if you are available to check this.
Thank you.

@borzari
Contributor

borzari commented Jun 8, 2024

Hi @slava77

I applied what you suggested in this commit, used the opportunity to remove some duplicated code, and tested it with RelValZMM and RelValTTbar events by comparing the results of the try/catch version with those of the validity-check version. Everything worked as intended and, as expected, no changes to the output were observed.

Just to clarify two points:

  • I added a method inside the SingleLongTrackProducer module to check the validity of the track. Thinking out loud about what you suggested, I think you meant that the method could be included in https://github.com/cms-sw/cmssw/blob/master/DataFormats/TrackReco/interface/Track.h. If this is what you meant, I can modify the branch to have the recHitsOk method there;
  • I couldn't check the validity of the recHitsProduct(). There doesn't seem to be something similar to isNonnull() or isAvailable() for it. However, just checking track.extra() seemed enough. Was it supposed to be like this? Am I missing something about the recHitsProduct()?

@slava77
Contributor

slava77 commented Jun 8, 2024

I couldn't check the validity of the recHitsProduct(). There doesn't seem to be something similar to isNonnull() or isAvailable() for it. However, just checking track.extra() seemed enough. Was it supposed to be like this? Am I missing something about the recHitsProduct()?

I misread the TrackExtraBase; edm::RefCore m_hitCollection; is the one that has isNonnull() and isAvailable(), but it is not publicly exposed.

So, I would add this bool recHitsOk() const {return m_hitCollection.isNonnull() && m_hitCollection.isAvailable();} in TrackExtraBase.h
And then in Track.h bool recHitsOk() const {return extra_.isNonnull() && extra_.isAvailable() && extra_->recHitsOk();}
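
The two one-liners above delegate from the track to its extra. A self-contained sketch of that delegation follows, assuming hypothetical `MockRef`, `MockHitCollection`, `MockTrackExtra` and `MockTrack` types that stand in for `edm::RefCore`/`edm::Ref`, the hit collection, `reco::TrackExtraBase` and `reco::Track`. None of these mock types exist in CMSSW; they only model the "null" and "dropped product" states that `isNonnull()` and `isAvailable()` distinguish.

```cpp
#include <cassert>

// Hypothetical stand-in for an EDM reference. A default-constructed ref is
// null; a ref built from nullptr is non-null but its product is unavailable.
template <typename T>
class MockRef {
public:
  MockRef() = default;
  explicit MockRef(const T* product) : nonnull_(true), product_(product) {}
  bool isNonnull() const { return nonnull_; }
  bool isAvailable() const { return product_ != nullptr; }
  const T* operator->() const { return product_; }

private:
  bool nonnull_ = false;
  const T* product_ = nullptr;
};

struct MockHitCollection {};

class MockTrackExtra {
public:
  MockTrackExtra() = default;
  explicit MockTrackExtra(MockRef<MockHitCollection> hits) : hitCollection_(hits) {}
  // Inner check: the reference to the hit collection must resolve.
  bool recHitsOk() const { return hitCollection_.isNonnull() && hitCollection_.isAvailable(); }

private:
  MockRef<MockHitCollection> hitCollection_;
};

class MockTrack {
public:
  MockTrack() = default;
  explicit MockTrack(MockRef<MockTrackExtra> extra) : extra_(extra) {}
  // Outer check: short-circuit evaluation ensures extra_ is dereferenced
  // only after it is known to be non-null and available.
  bool recHitsOk() const { return extra_.isNonnull() && extra_.isAvailable() && extra_->recHitsOk(); }

private:
  MockRef<MockTrackExtra> extra_;
};
```

The order of the three conditions in `MockTrack::recHitsOk()` matters: evaluating `extra_->recHitsOk()` first would dereference a reference that may not resolve, which is the situation the try/catch was masking.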

Even though in the current setup a track without an extra is enough, there can still be cases where SingleLongTrackProducer uses input tracks where hits got dropped.

@mmusich
Contributor

mmusich commented Jun 8, 2024

Tracks are apparently not pure generalTracks, see

if (chiNdof < fitProb) {
  fitProb = chiNdof;
  bestTrack = track;
  bestTrack.setExtra(track.extra());

a proper copy is made conditionally, while the rest in selTracks is going to be default-constructed reco::Tracks

Out of curiosity why is that? Can't the selTracks just contain the tracks we can actually refit?

@borzari
Contributor

borzari commented Jun 8, 2024

Tracks are apparently not pure generalTracks, see

if (chiNdof < fitProb) {
  fitProb = chiNdof;
  bestTrack = track;
  bestTrack.setExtra(track.extra());

a proper copy is made conditionally, while the rest in selTracks is going to be default-constructed reco::Tracks

Out of curiosity why is that? Can't the selTracks just contain the tracks we can actually refit?

Hi @mmusich
The selTracks collection will only have one track, the one with the smallest chiNdof. I also want to check if the rechits and hits from the hitpattern are valid to say that it is a goodTrack that can be used for the shortened tracks pT resolution. Especially because of what @slava77 mentioned here:

Even though in the current setup a track without an extra is enough, there can still be cases where SingleLongTrackProducer uses input tracks where hits got dropped.

The hit checks are there to make sure that this track won't have missing layers with measurement. They are not 100% effective, as I already showed during the presentations about this topic, but this also doesn't have much impact on the final result because it doesn't happen often. I don't think restricting selTracks to tracks that can be refitted would have a large impact on what happens in the SingleLongTrackProducer or after it, unless it is included as an extra "safety check".

Here I added the suggestions from @slava77. Again, I tested with RelValZMM and RelValTTbar events, and things are working as expected. If you don't have other suggestions, I can open a PR with it and we can continue the discussion there

@mmusich
Contributor

mmusich commented Jun 8, 2024

@borzari

also want to check if the rechits and hits from the hitpattern are valid to say that it is a goodTrack that can be used for the shortened tracks pT resolution.

Exactly, can't you do that before filling the vector? Default constructed tracks can't be used for refit.

@borzari
Contributor

borzari commented Jun 8, 2024

@borzari

also want to check if the rechits and hits from the hitpattern are valid to say that it is a goodTrack that can be used for the shortened tracks pT resolution.

Exactly, can't you do that before filling the vector? Default constructed tracks can't be used for refit.

Alright, so instead of only getting the track with the smallest chiNdof, I also want it to have recHitsOk(), right?

@mmusich
Contributor

mmusich commented Jun 8, 2024

I also want it to have recHitsOk(), right?

Right, this is what I had in mind.
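
The selection being agreed on here can be sketched as follows. This is an illustration only, not the code in either branch: `MockTrack` is a hypothetical stand-in for `reco::Track`, its `hitsOk` flag plays the role of the proposed `recHitsOk()`, and `selectBestTrack` is a hypothetical free function. The skip for `ndof <= 0` is an assumption added to keep the example well defined.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical stand-in for reco::Track, reduced to what the selection needs.
struct MockTrack {
  double chi2;
  double ndof;
  bool hitsOk;  // plays the role of the proposed recHitsOk()
  bool recHitsOk() const { return hitsOk; }
};

// Returns at most one track: the smallest chi2/ndof among tracks whose hits
// are accessible. Nothing is pushed when no track qualifies, so the output
// never contains a default-constructed track.
std::vector<MockTrack> selectBestTrack(const std::vector<MockTrack>& tracks) {
  std::vector<MockTrack> selTracks;
  double fitProb = std::numeric_limits<double>::max();
  const MockTrack* best = nullptr;
  for (const auto& track : tracks) {
    if (track.ndof <= 0.)
      continue;  // assumption: ignore tracks without a usable fit
    const double chiNdof = track.chi2 / track.ndof;
    if (chiNdof < fitProb && track.recHitsOk()) {
      fitProb = chiNdof;
      best = &track;
    }
  }
  if (best != nullptr)
    selTracks.push_back(*best);
  return selTracks;
}
```

Checking `recHitsOk()` inside the comparison means a track with a better chi2/ndof but no accessible hits cannot displace a usable candidate.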

@borzari
Contributor

borzari commented Jun 8, 2024

I also want it to have recHitsOk(), right?

Right, this is what I had in mind.

It didn't work. If I move the validity check from the rechits/hitpattern check to where I select tracks (I did if (chiNdof < fitProb && track.recHitsOk())), I get the message as if I was not checking the tracks:

----- Begin Fatal Exception 08-Jun-2024 19:16:37 CEST-----------------------
An exception of category 'InvalidReference' occurred while
   [0] Processing  Event run: 1 lumi: 76 event: 7503 stream: 6
   [1] Running path 'dqmoffline_step'
   [2] Calling method for module SingleLongTrackProducer/'SingleLongTrackProducer'
Exception Message:
BadRefCore RefCore: Request to resolve a null or invalid reference to a product of type 'std::vector<reco::TrackExtra>' has been detected.
Please modify the calling code to test validity before dereferencing.
----- End Fatal Exception -------------------------------------------------

@mmusich
Contributor

mmusich commented Jun 8, 2024

I get the message as if I was not checking the tracks:

Isn't track.recHitsOk() checking that the TrackExtra is valid?

@borzari
Contributor

borzari commented Jun 8, 2024

Isn't track.recHitsOk() checking that the TrackExtra is valid?

Should be. I implemented it like Slava suggested here

Could it be that, although I am adding only tracks with a valid TrackExtra to selTracks, the framework still needs me to check that I am looking at a valid track (one that has a TrackExtra) before I check whether it has valid hits/hitpattern? I am not sure how the "not valid TrackExtra" exception works, which is why I am asking.

@Dr15Jones
Contributor Author

The check I used was

if (track.extra().isAvailable()) {

@borzari
Contributor

borzari commented Jun 10, 2024

> The check I used was
>
> if (track.extra().isAvailable()) {

Alright @Dr15Jones, but does it happen every time I use a reco::Track anywhere?

Well, in any case, I would suggest opening a PR with these changes, at least to remove the try/catch pattern.

@mmusich
Contributor

mmusich commented Jun 11, 2024

> get the message as if I was not checking the tracks:

Maybe I am missing something, but with CMSTrackingPOG@5318549 on top of borzari@95ecc4b I can run this test:

<test name="testTrackingResolution" command="testTrackingResolution.sh"/>

(even using the whole input file) without crashes.

@borzari
Contributor

borzari commented Jun 11, 2024

@mmusich most probably I was missing something. The main differences I see (besides your better organization of the code) are that I included track.recHitsOk() here in the condition to select the best track and that, instead of using isNonnull() here, I would use bestTrack.recHitsOk(). Also, and maybe this was my mistake, I removed this condition, which you didn't. That is why I asked @Dr15Jones whether the availability of the TrackExtra is checked every time a reco::Track is used.

@borzari
Contributor

borzari commented Jun 12, 2024

@mmusich I started from your branch and tested what I mentioned above:

  • Replaced if (bestTrack.extra().isNonnull()) with if (bestTrack.recHitsOk()): this didn't have any effect, as expected, and should be "safer".
  • Removed the extra check from here, and it also didn't fail, as I was thinking. I really don't know why that is the case or what is different from what I did, except for adding the check together with the chi2ndof condition to fill selTracks; I would also keep the extra check for safety reasons.
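
The selection logic discussed in this thread (smallest chi2/ndof among tracks passing recHitsOk(), with a second guard kept before hit information is used) can be sketched as follows. This is only an illustration under stated assumptions: `MockTrack`, `selectBestTrack` and `canUseHits` are hypothetical names, `MockTrack` is not `reco::Track`, and recHitsOk() is assumed to mean that the TrackExtra is present and readable, which the thread suggests but does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for reco::Track with only the fields this sketch needs.
struct MockTrack {
  double chi2;
  double ndof;
  bool extraAvailable;  // whether the TrackExtra-like product can be read

  // Assumed meaning of recHitsOk(): the extra is present and readable.
  bool recHitsOk() const { return extraAvailable; }
};

// Returns the index of the track with the smallest chi2/ndof among those
// passing recHitsOk(), or -1 when no track qualifies.
int selectBestTrack(const std::vector<MockTrack>& tracks) {
  int bestIndex = -1;
  double bestChiNdof = std::numeric_limits<double>::max();
  for (std::size_t i = 0; i < tracks.size(); ++i) {
    const MockTrack& track = tracks[i];
    if (track.ndof <= 0.0) {
      continue;  // skip tracks with an undefined chi2/ndof
    }
    const double chiNdof = track.chi2 / track.ndof;
    if (chiNdof < bestChiNdof && track.recHitsOk()) {
      bestChiNdof = chiNdof;
      bestIndex = static_cast<int>(i);
    }
  }
  return bestIndex;
}

// Keeps the second guard before any hit information is touched.
bool canUseHits(const std::vector<MockTrack>& tracks, int bestIndex) {
  if (bestIndex < 0) {
    return false;  // nothing was selected
  }
  assert(static_cast<std::size_t>(bestIndex) < tracks.size());
  return tracks[static_cast<std::size_t>(bestIndex)].recHitsOk();
}
```

In this sketch the second guard is redundant by construction, since only tracks passing recHitsOk() are ever selected; keeping it costs little and protects against later changes to the selection.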

May I start a PR to include your changes and the recHitsOk() method to CMSSW?

@mmusich
Contributor

mmusich commented Jun 13, 2024

> May I start a PR to include your changes and the recHitsOk() method to CMSSW?

Here it is: #45213. I used the CMSTrackingPOG VO so you should be able to push more commits if necessary.

@borzari
Contributor

borzari commented Jun 13, 2024

> here it is: #45213. I used the CMSTrackingPOG VO so you should be able to push more commits if necessary.

Great! I don't think any other modifications are needed. Just FYI, I also checked the output DQM histograms of that branch using RelValZMM events, and they are the same as before the changes, as expected.

@slava77
Contributor

slava77 commented Jun 17, 2024

@Dr15Jones
Most of the recent discussion was about the try/catch: this is now fixed in #45213.
I don't expect, though, that this would address the underlying cause that led to this GitHub issue (the PromptReco failure).
If I understand correctly, the fix in tracking is mostly a convenience for debugging with catch throw, so as not to be distracted by unrelated exceptions.
The actual problem is then likely more related to ROOT handling the data.

Is my understanding correct?

@Dr15Jones
Contributor Author

@slava77 I'm on vacation until Thursday. The try/catch fixes were only there to make it easier to get to the underlying problem in the debugger. It does look like the underlying problem is in ROOT.
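
As a rough illustration of why exception-driven control flow gets in the way when a debugger is told to stop on every throw (for example with gdb's `catch throw`), the sketch below contrasts the two styles. All names here (`readHits`, `hitsWithTryCatch`, `hitsWithCheck`, `strategiesAgree`) are hypothetical and are not taken from the CMSSW code; the two strategies return the same result and differ only in how many exceptions are raised along the way.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical accessor that throws when the requested data is missing.
int readHits(bool available) {
  if (!available) {
    throw std::runtime_error("invalid reference");
  }
  return 7;
}

// Exception-driven control flow: every missing product raises a throw that
// a debugger stopping on all throws would report.
int hitsWithTryCatch(bool available, int& throwCount) {
  try {
    return readHits(available);
  } catch (const std::runtime_error&) {
    ++throwCount;
    return 0;
  }
}

// Check-first control flow: the missing product is handled without a throw.
int hitsWithCheck(bool available) {
  return available ? readHits(available) : 0;
}

// Both strategies must agree on the result; only the number of throws differs.
bool strategiesAgree(bool available) {
  int throwCount = 0;
  const int viaCatch = hitsWithTryCatch(available, throwCount);
  const int viaCheck = hitsWithCheck(available);
  assert(throwCount == (available ? 0 : 1));
  return viaCatch == viaCheck;
}
```

With the check-first variant, any throw the debugger stops on belongs to a genuine failure rather than to routine handling of a missing product.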

@makortel makortel moved this from New to Work in CMS in ROOT prioritization Sep 10, 2024
@makortel
Contributor

Coming back to the problem itself, in https://cms-talk.web.cern.ch/t/paused-job-for-promptreco-run381379-parkingsinglemuon4/42082/7 a corrupted file was mentioned as the likely cause. I suppose there were no further similar failures? Under the file corruption hypothesis, maybe we could just close the issue?

@jfernan2
Contributor

+1
Issue seems solved

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@cmsbuild, please close

@github-project-automation github-project-automation bot moved this from Work in CMS to Done in ROOT prioritization Oct 30, 2024