Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assertion failure in TrackCollectionCloner::Producer::operator() #37018

Closed
makortel opened this issue Feb 21, 2022 · 23 comments · Fixed by #37154
Closed

Assertion failure in TrackCollectionCloner::Producer::operator() #37018

makortel opened this issue Feb 21, 2022 · 23 comments · Fixed by #37154

Comments

@makortel
Copy link
Contributor

In CMSSW_12_3_X_2022-02-21-1100 workflow 23234.9 step 3 fails with

cmsRun: /pool/condor/dir_285643/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/299e6f2e63951755801bd8d1d3286a02/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-02-21-1100/src/RecoTracker/FinalTrackSelectors/src/TrackCollectionCloner.cc:49: void TrackCollectionCloner::Producer::operator()(const Tokens&, const std::vector<unsigned int>&): Assertion `selTrajs_->back().measurements().size() == trk.recHitsSize()' failed.

Thread 3 (Thread 0x2b9ae944e700 (LWP 6473) "cmsRun"):
#3  0x00002b9a9783956b in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-02-21-1100/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002b9a9225c387 in raise () from /lib64/libc.so.6
#6  0x00002b9a9225da78 in abort () from /lib64/libc.so.6
#7  0x00002b9a922551a6 in __assert_fail_base () from /lib64/libc.so.6
#8  0x00002b9a92255252 in __assert_fail () from /lib64/libc.so.6
#9  0x00002b9ad61af81e in TrackCollectionCloner::Producer::operator()(TrackCollectionTokens const&, std::vector<unsigned int, std::allocator<unsigned int> > const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-02-21-1100/lib/slc7_amd64_gcc10/libRecoTrackerFinalTrackSelectors.so
#10 0x00002b9ad609f990 in (anonymous namespace)::DuplicateListMerger::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-02-21-1100/lib/slc7_amd64_gcc10/pluginRecoTrackerFinalTrackSelectorsPlugins.so
#11 0x00002b9a8fca1b73 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-02-21-1100/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00002b9a8fc9c7df in edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-02-21-1100/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00002b9a8fbf7f15 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-02-21-1100/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00002b9a8fbf820b in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-02-21-1100/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#15 0x00002b9a8fbfa7f5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-02-21-1100/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#16 0x00002b9a8f9907d5 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-02-21-1100/lib/slc7_amd64_gcc10/libFWCoreConcurrency.so

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc10/CMSSW_12_3_X_2022-02-21-1100/pyRelValMatrixLogs/run/23234.9_TTbar_14TeV+2026D49_vectorHits+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+DigiTrigger+RecoGlobal+HARVESTGlobal/step3_TTbar_14TeV+2026D49_vectorHits+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+DigiTrigger+RecoGlobal+HARVESTGlobal.log#/

@makortel
Copy link
Contributor Author

assign reconstruction

@makortel
Copy link
Contributor Author

FYI @cms-sw/tracking-pog-l2

@cmsbuild
Copy link
Contributor

New categories assigned: reconstruction

@slava77,@jpata,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

I guess this was caused by #36793 (because I can't imagine how the mkFit PR could have caused it)

@perrotta
Copy link
Contributor

@fabiocos

@mmusich
Copy link
Contributor

mmusich commented Feb 22, 2022

adding some more circumstantial evidence, the assertion crash doesn't happen at the first track nor at the first event processed in wf 23234.9, but rather in the second event.

adding:

-      assert(selTrajs_->back().measurements().size() == trk.recHitsSize());
+      std::cout << selTrajs_->back().measurements().size() << " ====  " << trk.recHitsSize() << std::endl;
+      assert(selTrajs_->back().measurements().size() == trk.recHitsSize());   

one gets:

...
10 ====  10
10 ====  10
4 ====  4
6 ====  6
15 ====  15
11 ====  11
5 ====  5
9 ====  8
cmsRun: /tmp/musich/CMSSW_12_3_X_2022-02-21-2300/src/RecoTracker/FinalTrackSelectors/src/TrackCollectionCloner.cc:50: void TrackCollectionCloner::Producer::operator()(const Tokens&, const std::vector<unsigned int>&): Assertion `selTrajs_->back().measurements().size() == trk.recHitsSize()' failed.

also the fact that the only wf affected concerns vector hits seems to be pointing to something done explicitly for that case.

@jpata
Copy link
Contributor

jpata commented Feb 22, 2022

also adding @JanFSchulte here (for the vectorhits-related discussion)

@fabiocos
Copy link
Contributor

Indeed 23234.9 looks to be the only vectorHits-enabled test of the IB. And simply removing that process modifier prevents the failure. It is interesting the comment in the code just above it https://github.com/cms-sw/cmssw/blob/master/RecoTracker/FinalTrackSelectors/src/TrackCollectionCloner.cc#L47 : "// we assume tracks and trajectories are one-to-one and the assocMap is useless" . Well, that "useless" map is exactly what I am using in #36793 see https://github.com/cms-sw/cmssw/blob/master/RecoMTD/TrackExtender/plugins/TrackExtenderWithMTD.cc#L724

@makortel
Copy link
Contributor Author

assign upgrade

@cmsbuild
Copy link
Contributor

New categories assigned: upgrade

@AdrianoDee,@srimanob you have been requested to review this Pull request/Issue and eventually sign? Thanks

@jpata
Copy link
Contributor

jpata commented Feb 23, 2022

It is my understanding that VectorHits is an ongoing R&D effort and not yet converged.
How would the tracking POG (and others) feel about switching off that workflow (with the VectorHits modifier) from the IB test for the moment, and treating it as something to be addressed as part of those developments in the medium term, rather than trying to find a quick fix now?

@mmusich
Copy link
Contributor

mmusich commented Feb 23, 2022

How would the tracking POG (and others) feel about switching off that workflow (with the VectorHits modifier) from the IB test for the moment, and treating it as something to be addressed as part of those developments in the medium term, rather than trying to find a quick fix now?

The VectorHit reconstruction is needed to pursue seeding in the Outer Tracker, which needless to say is a must for Phase-2.
I am expecting the activity to pick up pace again this year towards the goal of making it the baseline in the not too distant future.

@mmusich
Copy link
Contributor

mmusich commented Feb 24, 2022

Adding some more anedoctal evidence the track that triggers the assertion is the first one without any Inner Tracker hit:

trkParams.hitsSize 8 trkParams.hitsValid 8 trkParams.hitsInvalid 0 trkParams.layersMissed 0 trkParams.hitsPixel 0 trkParams.hitsStrip 8 trkParams.charge 1 trkParams.chi2 28.9307 trkParams.ndof 27 trkParams.norChi2 1.9307 trkParams.eta -1.24922 trkParams.etaErr 0.00306871 trkParams.theta 2.58312 trkParams.phi -0.865122 trkParams.phiErr 0.00280953 trkParams.d0 1.9976 trkParams.dz -0.858613 trkParams.dzErr 0.142761 trkParams.p 2.78861 trkParams.pt 1.47767 trkParams.ptErr 0.0132421

@JanFSchulte
Copy link
Contributor

I have started to look into this now and can confirm that it only happens for tracks reconstructed by the pixelLess iteration that is added with the vectorHits modifier. Since the other tracking iterations are still using the vectorHits, the conclusion is that the hits themselves are fine and the issue is introduced in the code specific to the new iteration. IIRC the only piece of code that is unique to that iteration is the seed generator https://github.com/cms-sw/cmssw/blob/master/RecoTracker/TkSeedGenerator/plugins/SeedingOTEDProducer.cc, so I will start digging there.

@mmusich
Copy link
Contributor

mmusich commented Feb 28, 2022

indeed excluding pixelLess with this ugly patch

diff --git a/RecoTracker/FinalTrackSelectors/python/earlyGeneralTracks_cfi.py b/RecoTracker/FinalTrackSelectors/python/earlyGeneralTracks_cfi.py
index c33f6cbfd71..5b8d9de60fc 100644
--- a/RecoTracker/FinalTrackSelectors/python/earlyGeneralTracks_cfi.py
+++ b/RecoTracker/FinalTrackSelectors/python/earlyGeneralTracks_cfi.py
@@ -114,5 +114,5 @@ def _extend_pixelLess(x):
     x.indivShareFrac += [0.095]
     x.selectedTrackQuals += ['pixelLessStepSelector:pixelLessStep']
     x.setsToMerge[0].tLists += [6]
-(trackingPhase2PU140 & vectorHits).toModify(earlyGeneralTracks, _extend_pixelLess)
+#(trackingPhase2PU140 & vectorHits).toModify(earlyGeneralTracks, _extend_pixelLess)
 
diff --git a/RecoTracker/IterativeTracking/python/InitialStep_cff.py b/RecoTracker/IterativeTracking/python/InitialStep_cff.py
index 2bf9f026faf..f70d6ed189c 100644
--- a/RecoTracker/IterativeTracking/python/InitialStep_cff.py
+++ b/RecoTracker/IterativeTracking/python/InitialStep_cff.py
@@ -424,7 +424,9 @@ trackingPhase2PU140.toModify(initialStepSelector,
 
 
 # Final sequence
-InitialStepTask = cms.Task(initialStepSeedLayers,
+from RecoLocalTracker.SiPhase2VectorHitBuilder.siPhase2VectorHits_cfi import *
+InitialStepTask = cms.Task(siPhase2VectorHits,
+                           initialStepSeedLayers,
                            initialStepTrackingRegions,
                            initialStepHitDoublets,
                            initialStepHitTriplets,
diff --git a/RecoTracker/IterativeTracking/python/iterativeTkConfig.py b/RecoTracker/IterativeTracking/python/iterativeTkConfig.py
index 9790ee16d26..868ca252184 100644
--- a/RecoTracker/IterativeTracking/python/iterativeTkConfig.py
+++ b/RecoTracker/IterativeTracking/python/iterativeTkConfig.py
@@ -61,8 +61,8 @@ _iterations_trackingPhase2PU140 = [
     "DetachedQuadStep",
     "PixelPairStep",
 ]
-from Configuration.ProcessModifiers.vectorHits_cff import vectorHits
-vectorHits.toModify(_iterations_trackingPhase2PU140, func=lambda x: x.append('PixelLessStep'))
+#from Configuration.ProcessModifiers.vectorHits_cff import vectorHits
+#vectorHits.toModify(_iterations_trackingPhase2PU140, func=lambda x: x.append('PixelLessStep'))

makes 23234.9 step 3 run to completion (of course defying the whole purpose of VH)

@jpata
Copy link
Contributor

jpata commented Mar 1, 2022

@JanFSchulte @cms-sw/tracking-pog-l2 note that according to the discussion at the ORP today, this is blocking for the upcoming release next week.
Will a fix be possible by that timescale?

The other options are to revert the MTD TrackExtender PR, or to disable the vectorhits workflow to be addressed in a follow-up.

@mmusich
Copy link
Contributor

mmusich commented Mar 1, 2022

note that according to the discussion at the ORP today, this is blocking for the upcoming release next week.

why is that? Is that a technical shortcoming of the infrastructure or just policy.

The other options are to revert the MTD TrackExtender PR, or to disable the vectorhits workflow to be addressed in a follow-up.

neither of the two seem wise IMHO.

@perrotta
Copy link
Contributor

perrotta commented Mar 1, 2022

@mmusich this issue is crashing wf 23234.9 in the IB.
I agree that we could also live with it for a while, but we'd prefer closing the release with all monitored workflows clean, at least from crashes.

@AdrianoDee
Copy link
Contributor

AdrianoDee commented Mar 7, 2022

This seems to be originated from the fact that the PixelLess iteration is the only one (for Phase2) for which (since #36793) we don't store the Trajectory in the event. Then when we do the merge with the TrackListMerger, all the tracks that have been selected by the merger are stored. On the other hand, we store only the trajectories from collections which have a TrajTrackAssociation map (see here) And here the mismatch between trak and traj collections happens.

A simple solution would be to add:

--- a/RecoTracker/IterativeTracking/python/PixelLessStep_cff.py
+++ b/RecoTracker/IterativeTracking/python/PixelLessStep_cff.py
@@ -355,6 +355,8 @@ pixelLessStepTracks = RecoTracker.TrackProducer.TrackProducer_cfi.TrackProducer.
 )
 fastSim.toModify(pixelLessStepTracks, TTRHBuilder = 'WithoutRefit')
 
+from Configuration.Eras.Modifier_phase2_timing_layer_cff import phase2_timing_layer
+phase2_timing_layer.toModify(pixelLessStepTracks, TrajectoryInEvent = True)

 # TRACK SELECTION AND QUALITY FLAG SETTING.

which restore the alignment of the collections (and makes the comment valid ).

@mmusich
Copy link
Contributor

mmusich commented Mar 7, 2022

@AdrianoDee

This seems to be originated from the fact that the PixelLess iteration is the only one (for Phase2) for which (since #36793) we don't store the Trajectory in the event.

thanks for this. Indeed this escaped my attention.
I've opened #37154 to test the commit.
I guess @fabiocos didn't bother to add those lines in that file, because PixelLess is not run in the Phase-2 setup, excepted when the vectorHits modifier is active.

@fabiocos
Copy link
Contributor

fabiocos commented Mar 7, 2022

@mmusich @AdrianoDee in my PR I focussed only on the collections that appeared to be needed for the generalTracks used by the TrackExtenderWithMTD, so if there is more than this which is needed I definitely missed it

@AdrianoDee
Copy link
Contributor

+upgrade
solved by #37154 (as soon as integrated).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants