Segmentation fault with G4 Refitter #31920

Closed
mmusich opened this issue Oct 23, 2020 · 29 comments

Comments

@mmusich
Contributor

mmusich commented Oct 23, 2020

Dear all,
I'd like to surface here the discussion which has been ongoing for some time in this Hypernews thread.
For reasons linked to avoiding biases in the alignment of the Tracker End Caps (apparently induced by an inaccurate description of the Tracker RECO material in the Barrel / Endcap regions), we'd like to run the Geant4e track refitter so that we can use the more accurate description of the material from the simulation geometry.
This approach works fine with MC, but unfortunately when trying it on real data several issues have been found.
A first issue, connected to sim::Field::GetFieldValue, was solved by PR #31203.
After applying the fix, several other segmentation faults have appeared, all generally related to the Geant4e refitter.
More information can be found in this talk.
A minimal reproducer (which does only the G4 refit) is available here.
Testing it in CMSSW_10_6_X_2020-10-22-1100 a segfault occurs after ~45 minutes of processing around:
186101st record. Run 316153, Event 817473248, LumiSection 606.
When skipping directly to that particular event with the PoolSource, the segmentation fault no longer appears, which seems to indicate that the issue is linked to the previous processing history.
The stack trace proceeds as follows at this link with the relevant part being:

#5  0x00007f2f1e1dcc54 in G4VPhysicalVolume::GetTranslation() const () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_X_2020-10-22-1100/external/slc7_amd64_gcc700/lib/libG4geometry.so
#6  0x00007f2f1e22ae5e in G4VoxelNavigation::ComputeStep(CLHEP::Hep3Vector const&, CLHEP::Hep3Vector const&, double, double&, G4NavigationHistory&, bool&, CLHEP::Hep3Vector&, bool&, bool&, G4VPhysicalVolume**, int&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_X_2020-10-22-1100/external/slc7_amd64_gcc700/lib/libG4geometry.so
#7  0x00007f2f1e20b690 in G4Navigator::ComputeStep(CLHEP::Hep3Vector const&, CLHEP::Hep3Vector const&, double, double&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_X_2020-10-22-1100/external/slc7_amd64_gcc700/lib/libG4geometry.so
#8  0x00007f2f1e1e737e in G4ErrorPropagationNavigator::ComputeStep(CLHEP::Hep3Vector const&, CLHEP::Hep3Vector const&, double, double&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_X_2020-10-22-1100/external/slc7_amd64_gcc700/lib/libG4geometry.so
#9  0x00007f2f1e219b1c in G4PropagatorInField::ComputeStep(G4FieldTrack&, double, double&, G4VPhysicalVolume*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_X_2020-10-22-1100/external/slc7_amd64_gcc700/lib/libG4geometry.so
#10 0x00007f2f1d20d3b3 in G4Transportation::AlongStepGetPhysicalInteractionLength(G4Track const&, double, double, double&, G4GPILSelection*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_X_2020-10-22-1100/external/slc7_amd64_gcc700/lib/libG4processes.so
#11 0x00007f2f1c190886 in G4SteppingManager::DefinePhysicalStepLength() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_X_2020-10-22-1100/external/slc7_amd64_gcc700/lib/libG4tracking.so
#12 0x00007f2f1c18ecd8 in G4SteppingManager::Stepping() () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_X_2020-10-22-1100/external/slc7_amd64_gcc700/lib/libG4tracking.so
#13 0x00007f2f1e6d07af in G4ErrorPropagator::MakeOneStep(G4ErrorFreeTrajState*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_X_2020-10-22-1100/external/slc7_amd64_gcc700/lib/libG4error_propagation.so
#14 0x00007f2f1e6d28f8 in G4ErrorPropagator::PropagateOneStep(G4ErrorTrajState*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_X_2020-10-22-1100/external/slc7_amd64_gcc700/lib/libG4error_propagation.so
#15 0x00007f2f1eb00970 in std::pair<TrajectoryStateOnSurface, double> Geant4ePropagator::propagateGeneric<Plane>(FreeTrajectoryState const&, Plane const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02651/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2020-10-18-0000/lib/slc7_amd64_gcc700/libTrackPropagationGeant4e.so
#16 0x00007f2f1eaffb03 in Geant4ePropagator::propagateWithPath(TrajectoryStateOnSurface const&, Plane const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02651/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2020-10-18-0000/lib/slc7_amd64_gcc700/libTrackPropagationGeant4e.so
#17 0x00007f2f268de9af in Propagator::propagateWithPath(TrajectoryStateOnSurface const&, Surface const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02651/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2020-10-18-0000/lib/slc7_amd64_gcc700/libTrackingToolsGeomPropagators.so
#18 0x00007f2f1eb2ab95 in KFTrajectorySmoother::trajectory(Trajectory const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02651/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2020-10-18-0000/lib/slc7_amd64_gcc700/libTrackingToolsTrackFitters.so
#19 0x00007f2f1eb6f2e2 in (anonymous namespace)::KFFittingSmoother::smoothingStep(Trajectory&&) const () from /cvmfs/cms-ib.cern.ch/nweek-02651/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2020-10-18-0000/lib/slc7_amd64_gcc700/pluginTrackingToolsTrackFittersPlugins.so
#20 0x00007f2f1eb70c2d in (anonymous namespace)::KFFittingSmoother::fitOne(TrajectorySeed const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > > const&, TrajectoryStateOnSurface const&, TrajectoryFitter::fitType) const () from /cvmfs/cms-ib.cern.ch/nweek-02651/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2020-10-18-0000/lib/slc7_amd64_gcc700/pluginTrackingToolsTrackFittersPlugins.so
#21 0x00007f2eef0e5692 in TrackProducerAlgorithm<reco::Track>::buildTrack(TrajectoryFitter const*, Propagator const*, std::vector<AlgoProductTraits<reco::Track>::AlgoProduct, std::allocator<AlgoProductTraits<reco::Track>::AlgoProduct> >&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > >&, TrajectoryStateOnSurface&, TrajectorySeed const&, float, reco::BeamSpot const&, edm::RefToBase<TrajectorySeed>, int, signed char) () from /cvmfs/cms-ib.cern.ch/nweek-02651/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2020-10-18-0000/lib/slc7_amd64_gcc700/libRecoTrackerTrackProducer.so
#22 0x00007f2eef20bbdf in TrackProducerAlgorithm<reco::Track>::runWithTrack(TrackingGeometry const*, MagneticField const*, edm::View<reco::Track> const&, TrajectoryFitter const*, Propagator const*, TransientTrackingRecHitBuilder const*, reco::BeamSpot const&, std::vector<AlgoProductTraits<reco::Track>::AlgoProduct, std::allocator<AlgoProductTraits<reco::Track>::AlgoProduct> >&) () from /cvmfs/cms-ib.cern.ch/nweek-02651/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2020-10-18-0000/lib/slc7_amd64_gcc700/pluginRecoTrackerTrackProducerPlugins.so
#23 0x00007f2eef20561b in TrackRefitter::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02651/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_X_2020-10-18-0000/lib/slc7_amd64_gcc700/pluginRecoTrackerTrackProducerPlugins.so
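
Frames like these are easier to scan once the library paths and argument lists are stripped. The following is a minimal sketch, assuming gdb-style frames of the form `#N 0xADDR in FUNCTION () from LIBRARY` as in the trace above. The helper name `summarize_frames` is hypothetical and not part of CMSSW or gdb.

```python
import re

# Matches gdb frames such as:
#   #5  0x00007f2f1e1dcc54 in G4VPhysicalVolume::GetTranslation() const () from /path/libG4geometry.so
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+ in (.+?) \(\) from (\S+)$")


def summarize_frames(trace):
    """Return (frame number, function without arguments, library file name) tuples."""
    summary = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # skip lines in other formats (e.g. frames with source line info)
        number, function, library = match.groups()
        # Protect the "(anonymous namespace)" prefix, then drop the argument list
        # by keeping everything before the first remaining '('.
        function = function.replace("(anonymous namespace)", "{anonymous}")
        name = function.split("(", 1)[0]
        summary.append((int(number), name, library.rsplit("/", 1)[-1]))
    return summary
```

Applied to frame #5 above, this yields `(5, 'G4VPhysicalVolume::GetTranslation', 'libG4geometry.so')`. For templated functions the return type stays in the name, since only the argument list is removed.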

Attention from the Simulation (and possibly Reconstruction) groups might be needed.
Any help with this issue is highly appreciated.

cc:
@rmankel @vbotta @connorpa

@cmsbuild
Contributor

A new Issue was created by @mmusich Marco Musich.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign simulation, reconstruction

@cmsbuild
Contributor

New categories assigned: reconstruction,simulation

@mdhildreth,@slava77,@perrotta,@jpata,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Contributor Author

mmusich commented Oct 23, 2020

Additionally, when trying to move the analysis to the master cycle (CMSSW_11_2_0_pre7), one immediately gets this error from the framework:

----- Begin Fatal Exception 23-Oct-2020 09:34:21 CEST-----------------------
An exception of category 'MustUseESGetToken' occurred while
   [0] Processing  Event run: 316153 lumi: 606 event: 817482875 stream: 0
   [1] Running path 'g4RefitPath'
   [2] Calling method for module TrackRefitter/'Geant4eTrackRefitter'
   [3] Using EventSetup component KFFittingSmootherESProducer/'G4eFitterSmoother' to make data TrajectoryFitter/'G4eFitterSmoother' in record TrajectoryFitterRecord
   [4] Running EventSetup component GeantPropagatorESProducer/'Geant4ePropagator
Exception Message:
Called EventSetupRecord::get without using a ESGetToken.
 While requesting data type:MagneticField label:''
----- End Fatal Exception -------------------------------------------------

I have tried to solve by applying this patch:

diff --git a/TrackPropagation/Geant4e/plugins/GeantPropagatorESProducer.cc b/TrackPropagation/Geant4e/plugins/GeantPropagatorESProducer.cc
index 44c960fabdb..c1e0232bf6c 100644
--- a/TrackPropagation/Geant4e/plugins/GeantPropagatorESProducer.cc
+++ b/TrackPropagation/Geant4e/plugins/GeantPropagatorESProducer.cc
@@ -1,6 +1,4 @@
 #include "GeantPropagatorESProducer.h"
-#include "MagneticField/Engine/interface/MagneticField.h"
-#include "MagneticField/Records/interface/IdealMagneticFieldRecord.h"
 #include "TrackPropagation/Geant4e/interface/Geant4ePropagator.h"
 
 #include "FWCore/Framework/interface/ESHandle.h"
@@ -13,21 +11,17 @@
 
 using namespace edm;
 
-GeantPropagatorESProducer::GeantPropagatorESProducer(const edm::ParameterSet &p) {
-  std::string myname = p.getParameter<std::string>("ComponentName");
+GeantPropagatorESProducer::GeantPropagatorESProducer(const edm::ParameterSet &p):
+  magFieldToken_(setWhatProduced(this, p.getParameter<std::string>("ComponentName")).consumesFrom<MagneticField, IdealMagneticFieldRecord>(edm::ESInputTag("","")))
+{
   pset_ = p;
-  setWhatProduced(this, myname);
 }
 
 GeantPropagatorESProducer::~GeantPropagatorESProducer() {}
 
 std::unique_ptr<Propagator> GeantPropagatorESProducer::produce(const TrackingComponentsRecord &iRecord) {
-  ESHandle<MagneticField> magfield;
-  iRecord.getRecord<IdealMagneticFieldRecord>().get(magfield);
-
   std::string pdir = pset_.getParameter<std::string>("PropagationDirection");
   std::string particleName = pset_.getParameter<std::string>("ParticleName");
-
   PropagationDirection dir = alongMomentum;
 
   if (pdir == "oppositeToMomentum")
@@ -37,5 +31,5 @@ std::unique_ptr<Propagator> GeantPropagatorESProducer::produce(const TrackingCom
   if (pdir == "anyDirection")
     dir = anyDirection;
 
-  return std::make_unique<Geant4ePropagator>(&(*magfield), particleName, dir);
+  return std::make_unique<Geant4ePropagator>(&(iRecord.get(magFieldToken_)), particleName, dir);
 }
diff --git a/TrackPropagation/Geant4e/plugins/GeantPropagatorESProducer.h b/TrackPropagation/Geant4e/plugins/GeantPropagatorESProducer.h
index 1b41fdd4680..272380e1dd7 100644
--- a/TrackPropagation/Geant4e/plugins/GeantPropagatorESProducer.h
+++ b/TrackPropagation/Geant4e/plugins/GeantPropagatorESProducer.h
@@ -5,6 +5,9 @@
 #include "FWCore/ParameterSet/interface/ParameterSet.h"
 #include "TrackingTools/GeomPropagators/interface/Propagator.h"
 #include "TrackingTools/Records/interface/TrackingComponentsRecord.h"
+#include "MagneticField/Engine/interface/MagneticField.h"
+#include "MagneticField/Records/interface/IdealMagneticFieldRecord.h"
+
 #include <memory>
 
 /*
@@ -23,6 +26,7 @@ public:
 
 private:
   edm::ParameterSet pset_;
+  edm::ESGetToken<MagneticField, IdealMagneticFieldRecord> magFieldToken_;
 };
 
 #endif

but I am not sure whether these changes are appropriate at all: when running with them, it results in an immediate segmentation fault at the 1st event.

@makortel
Contributor

I have tried to solve by applying this patch:

Looks correct to me by eye.

it results in an immediate segmentation fault at the 1st event.

Could you share the stack trace, or is it the same as in the description?

@mmusich
Contributor Author

mmusich commented Oct 26, 2020

@makortel

Could you share the stack trace, or is it the same as in the description?

no, it's different, here is a link to the stack trace.
Relevant lines seem to be:

#7  0x00007ff6ea910c93 in G4Exception(char const*, char const*, G4ExceptionSeverity, char const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/external/slc7_amd64_gcc820/lib/libG4global.so
#8  0x00007ff6eb0d6601 in G4ErrorPropagatorManager::InitGeant4e() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/external/slc7_amd64_gcc820/lib/libG4error_propagation.so
#9  0x00007ff6eb52d74d in Geant4ePropagator::ensureGeant4eIsInitilized(bool) const () from /tmp/musich/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/libTrackPropagationGeant4e.so
#10 0x00007ff6eb52dbfa in Geant4ePropagator::Geant4ePropagator(MagneticField const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, PropagationDirection) () from /tmp/musich/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/libTrackPropagationGeant4e.so
#11 0x00007ff6eb549335 in GeantPropagatorESProducer::produce(TrackingComponentsRecord const&) () from /tmp/musich/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/pluginTrackPropagatorsGeant4ePlugins.so
#12 0x00007ff6eb5522bd in decltype ({parm#1}()) edm::convertException::wrap<edm::eventsetup::Callback<GeantPropagatorESProducer, std::unique_ptr<Propagator, std::default_delete<Propagator> >, TrackingComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<TrackingComponentsRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}>(edm::eventsetup::Callback<GeantPropagatorESProducer, std::unique_ptr<Propagator, std::default_delete<Propagator> >, TrackingComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<TrackingComponentsRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}) () from /tmp/musich/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/pluginTrackPropagatorsGeant4ePlugins.so
#13 0x00007ff6eb552511 in edm::eventsetup::Callback<GeantPropagatorESProducer, std::unique_ptr<Propagator, std::default_delete<Propagator> >, TrackingComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<TrackingComponentsRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const () from /tmp/musich/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/pluginTrackPropagatorsGeant4ePlugins.so
#14 0x00007ff6eb55354f in void edm::SerialTaskQueueChain::actionToRun<edm::eventsetup::Callback<GeantPropagatorESProducer, std::unique_ptr<Propagator, std::default_delete<Propagator> >, TrackingComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<TrackingComponentsRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&>(edm::eventsetup::Callback<GeantPropagatorESProducer, std::unique_ptr<Propagator, std::default_delete<Propagator> >, TrackingComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<TrackingComponentsRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&) () from /tmp/musich/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/pluginTrackPropagatorsGeant4ePlugins.so
#15 0x00007ff6eb5535c1 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::eventsetup::Callback<GeantPropagatorESProducer, std::unique_ptr<Propagator, std::default_delete<Propagator> >, TrackingComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<TrackingComponentsRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}>(edm::eventsetup::Callback<GeantPropagatorESProducer, std::unique_ptr<Propagator, std::default_delete<Propagator> >, TrackingComponentsRecord, edm::eventsetup::CallbackSimpleDecorator<TrackingComponentsRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&&)::{lambda()#1}>::execute() () from /tmp/musich/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/pluginTrackPropagatorsGeant4ePlugins.so
#16 0x00007ff715a8ebfd in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7ff7117ab200, context_guard=..., t=t@entry=0x7ff6b8e3a840, isolation=isolation@entry=0) at ../../src/tbb/custom_scheduler.h:393
#17 0x00007ff715a8eef5 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7ff7117ab200, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#18 0x00007ff715a89bc1 in tbb::interface7::internal::task_arena_base::internal_execute (this=0x7ff717409040 <edm::esTaskArena()::s_arena>, d=...) at ../../src/tbb/arena.cpp:1105
#19 0x00007ff7171e2621 in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, bool, edm::ActivityRegistry const*, edm::EventSetupImpl const*) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#20 0x00007ff71724f116 in edm::eventsetup::EventSetupRecordImpl::getFromProxy(edm::eventsetup::DataKey const&, edm::eventsetup::ComponentDescription const*&, bool, edm::EventSetupImpl const*) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#21 0x00007ff6b939ebe1 in TrackProducerBase<reco::Track>::getFromES(edm::EventSetup const&, edm::ESHandle<TrackerGeometry>&, edm::ESHandle<MagneticField>&, edm::ESHandle<TrajectoryFitter>&, edm::ESHandle<Propagator>&, edm::ESHandle<MeasurementTracker>&, edm::ESHandle<TransientTrackingRecHitBuilder>&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/pluginRecoTrackerTrackProducerPlugins.so
#22 0x00007ff6b93dc707 in TrackRefitter::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/pluginRecoTrackerTrackProducerPlugins.so
#23 0x00007ff717342b84 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#24 0x00007ff71731d15e in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#25 0x00007ff717283245 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#26 0x00007ff7172833fd in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#27 0x00007ff717283706 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#28 0x00007ff717284e0a in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#29 0x00007ff715a8ebfd in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7ff7117ab200, context_guard=..., t=t@entry=0x7ff6b91d0a40, isolation=isolation@entry=0) at ../../src/tbb/custom_scheduler.h:393
#30 0x00007ff715a8eef5 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7ff7117ab200, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#31 0x00007ff717204685 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#32 0x00007ff71720c7fe in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_0_pre7/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#33 0x000000000040f5ad in tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()() const ()
#34 0x00007ff715a89bc1 in tbb::interface7::internal::task_arena_base::internal_execute (this=0x7ffe20aeab40, d=...) at ../../src/tbb/arena.cpp:1105
#35 0x0000000000410564 in main::{lambda()#1}::operator()() const ()
#36 0x000000000040f005 in main ()

@Dr15Jones
Contributor

How many threads are you using to run the job?

@mmusich
Contributor Author

mmusich commented Oct 26, 2020

How many threads are you using to run the job?

process.options = cms.untracked.PSet()
process.options.numberOfThreads = cms.untracked.uint32(1)

@civanch
Contributor

civanch commented Nov 23, 2020

@mmusich, fixes are proposed in #32239 and #32240. Sorry that it has taken so long, but I was extremely busy and debugging takes significant time. How I see the situation now:

  1. There is a harmless warning about the initialisation of Geant4e. It may be removed when Geant4e is updated, but for now it can be ignored.
  2. The Geant4e code is not fully thread safe; again, it requires an update, so it is better to run in one thread for the time being.
  3. The crash happens when a low-energy forward track is propagated backward. Because low-energy particles cannot help the alignment, an extra threshold is added in the proposed PRs; with it, the failing job runs fine.
  4. The crash itself is not fully understood. In theory, a track with momentum 0.3 GeV/c should be propagated without a problem. The crash means that the navigation and geometry of Geant4e fail to find a volume pointer in this particular case. The reason is not clear to me: it could be overlaps in the geometry (we have fixed a few problems for Run-3), a problem in VecGeom, or some lack of protections inside Geant4e itself.
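
Item 3 amounts to a momentum cut applied before the propagation is attempted. The following is a minimal Python sketch of that idea only. The names `momentum_magnitude` and `select_for_refit` are hypothetical, and the threshold must be supplied by the caller because the value chosen in the PRs is not stated in this comment. The real protection is implemented in the C++ code touched by the PRs.

```python
import math


def momentum_magnitude(px, py, pz):
    """Magnitude of the momentum vector, in the same units as its components."""
    return math.sqrt(px * px + py * py + pz * pz)


def select_for_refit(tracks, threshold):
    """Keep only tracks whose momentum is at or above the threshold.

    Each track is a (px, py, pz) tuple in GeV/c; `threshold` is in GeV/c.
    Tracks below the threshold are dropped before propagation instead of
    being handed to the propagator.
    """
    return [t for t in tracks if momentum_magnitude(*t) >= threshold]
```

Dropping such tracks avoids the failing case without explaining it, which matches item 4: the cut is a protection, not a diagnosis.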

In summary: I would propose to include #32240 in the next patch release and try to run. When Geant4 10.7 is out (next week), I will try to work on a Geant4e patch making it thread safe and more robust.

@mmusich
Contributor Author

mmusich commented Nov 24, 2020

@civanch thanks.
I confirm that with the modifications in PR #32239 the track refitting configuration at my gist does not fail anymore in CMSSW_11_6_X.
Anyway, for the final word we should wait for detailed validation from @rmankel.
Since we have samples produced also in 11.1.X I propose a backport in that release cycle too: #32255.

@civanch
Contributor

civanch commented Nov 25, 2020

@mmusich, in #32260 you mentioned a newly uncovered problem. What is it?

@mmusich
Contributor Author

mmusich commented Nov 25, 2020

@civanch I describe it in #32260 (comment)
After applying the changes needed to overcome the ESconsumes issues, there is a segfault. I don't observe that in 10.6.x (where the esConsumes migration is not compulsory).

@civanch
Contributor

civanch commented Nov 25, 2020

@mmusich, could you please point me to how G4Refitter.py is created? The geometry instantiation was changed some time after 10_6, and we likely need a slightly different configuration.

@mmusich
Contributor Author

mmusich commented Nov 25, 2020

@civanch Rainer Mankel (@rmankel) created it some time ago (I think by hand). If you tell me how to modify it accordingly, I will update it. Thanks.

@civanch
Contributor

civanch commented Nov 25, 2020

OK, I will try to make a G4Refitter for 11_2.

@rmankel
Contributor

rmankel commented Nov 25, 2020 via email

@civanch
Contributor

civanch commented Dec 2, 2020

@mmusich, when I tried G4Refitter.py with one of the recent IBs, I got the following problem:

%MSG-i HCAL: (NoModuleName) 02-Dec-2020 13:47:41 CET pre-events
HcalHardcodeCalibrations::HcalHardcodeCalibrations->...
%MSG
----- Begin Fatal Exception 02-Dec-2020 13:47:42 CET-----------------------
An exception of category 'PluginNotFound' occurred while
[0] Constructing the EventProcessor
[1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
Unable to find plugin 'PGeometricDetExtraRcd@NewProxy' in category 'CondProxyFactory'. Please check spelling of name.
----- End Fatal Exception -------------------------------------------------
(END)

G4Refitter itself seems to be relatively simple:

import FWCore.ParameterSet.Config as cms

process = cms.Process("G4eRefit")

process.load('Configuration.StandardSequences.Services_cff')
process.load('Configuration.StandardSequences.GeometryDB_cff')
process.load("Configuration.EventContent.EventContent_cff")
process.load("Configuration.StandardSequences.Reconstruction_cff")
process.load('Configuration.StandardSequences.MagneticField_cff')
process.load('Configuration.StandardSequences.EndOfProcess_cff')
process.load("Configuration.StandardSequences.FrontierConditions_GlobalTag_cff")

from Configuration.AlCa.GlobalTag import GlobalTag
process.GlobalTag.globaltag = "102X_dataRun2_TkAlSummerCamp_SG_v4"

# this is necessary to get the simulation geometry

process.GlobalTag.toGet = cms.VPSet(
    cms.PSet(record = cms.string("GeometryFileRcd"),
             tag = cms.string("XMLFILE_Geometry_101YV4_Extended2018_mc"),
             label = cms.untracked.string('Extended'),
             )
    )
....... lines for MessageLogger and event source ....
process.load("TrackPropagation.Geant4e.geantRefit_cff")

process.MeasurementTrackerEvent.pixelClusterProducer = 'ALCARECOTkAlMuonIsolated'
process.MeasurementTrackerEvent.stripClusterProducer = 'ALCARECOTkAlMuonIsolated'
process.MeasurementTrackerEvent.inactivePixelDetectorLabels = cms.VInputTag()
process.MeasurementTrackerEvent.inactiveStripDetectorLabels = cms.VInputTag()

process.Geant4eTrackRefitter.src = cms.InputTag("ALCARECOTkAlMuonIsolated")
process.Geant4eTrackRefitter.usePropagatorForPCA = cms.bool(True)
process.g4RefitPath = cms.Path( process.MeasurementTrackerEvent * process.geant4eTrackRefit )

The rest of it should not affect this crash. What should be changed in 11_2 in order to run on Run-2 data?

@mmusich
Contributor Author

mmusich commented Dec 2, 2020

@civanch
you need to change the Global Tag. Please use:

from Configuration.AlCa.GlobalTag import GlobalTag
process.GlobalTag.globaltag = "112X_dataRun2_v7"

By the way, I have already provided the full updated configuration once here.
In addition, as I mentioned to you earlier here (the comment was apparently ignored), I had to add the following extra parameters to the configuration:

    DeltaChordTracker = cms.double(0.001), ## in mm
    DeltaOneStepTracker = cms.double(1e-4),## in mm
    DeltaIntersectionTracker = cms.double(1e-6),## in mm
    RmaxTracker = cms.double(8000),        ## in mm
    ZmaxTracker = cms.double(11000),       ## in mm
    EnergyThTracker = cms.double(0.2),     ## in GeV

@mmusich
Contributor Author

mmusich commented Dec 14, 2020

@civanch I was wondering if in the past two weeks there has been any progress on this.
Thanks,
M.

@mmusich
Contributor Author

mmusich commented Jan 13, 2021

@civanch, could you please let me know if you have further debugged this problem?
Thanks

@civanch
Contributor

civanch commented Jan 13, 2021

@mmusich, before the winter break I started to fix the problem, but maybe not in an optimal way: the fix was too general and requires more coherent changes in base classes. I will restart soon with a simple fix.

@mmusich
Contributor Author

mmusich commented Feb 8, 2021

@civanch I tested CMSSW_11_3_X_2021-02-08-1100 which contains #32833 and indeed the configuration at https://gist.github.com/mmusich/12227a1a1d90e13ebae9502ac512c7d3 runs fine.
Thanks for providing a fix.
I guess this issue can be signed and closed.

@civanch
Contributor

civanch commented Feb 8, 2021

@mmusich , should I backport the fix to 11_2?

@mmusich
Contributor Author

mmusich commented Feb 8, 2021

@civanch indeed, that would be great.

@slava77
Contributor

slava77 commented Feb 11, 2021

+1

based on #31920 (comment)

@civanch
Contributor

civanch commented Mar 23, 2021

+1

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@qliphy qliphy closed this as completed Mar 25, 2021
@bendavid
Contributor

Maybe this issue is fully resolved in more recent CMSSW/geant4 versions, but I've done some more debugging of some rare pathological cases I've still encountered in CMSSW_10_6_X.

The problem is that in G4VoxelNavigation::ComputeStep in some cases somehow fVoxelNode becomes out of sync with motherPhysical, and then things get messed up here https://github.com/cms-externals/geant4/blob/922600305b80a2b23e97abd352b68508cf419540/source/geometry/navigation/src/G4VoxelNavigation.cc#L197-L205 where sampleNo from the voxel node no longer corresponds to the current motherLogical volume. If one is "unlucky", sampleNo can be larger than the number of daughters, and one gets a garbage memory address for samplePhysical, leading to the crash. (if one is "lucky" then things will continue but possibly with unintended consequences due to the mismatch.)

In g4 cms/v10.4.3 a brute force fix is to resynchronize the voxel node:

diff --git a/source/geometry/navigation/src/G4VoxelNavigation.cc b/source/geometry/navigation/src/G4VoxelNavigation.cc
index 75a722c..32f396d 100644
--- a/source/geometry/navigation/src/G4VoxelNavigation.cc
+++ b/source/geometry/navigation/src/G4VoxelNavigation.cc
@@ -106,6 +106,19 @@ G4VoxelNavigation::ComputeStep( const G4ThreeVector& localPoint,
   motherLogical = motherPhysical->GetLogicalVolume();
   motherSolid = motherLogical->GetSolid();
 
+  if (motherLogical->GetVoxelHeader() != fVoxelHeaderStack[0]) {
+    std::cout << "mismatched voxel header!\n";
+    std::cout << "before update: motherLogical->GetVoxelHeader() = " << motherLogical->GetVoxelHeader() << std::endl;
+    for (unsigned int idepth = 0; idepth <= fVoxelDepth; ++idepth) {
+      std::cout << "fVoxelHeaderStack[" << idepth << "] = " << fVoxelHeaderStack[idepth] << std::endl;
+    }
+    VoxelLocate(motherLogical->GetVoxelHeader(), localPoint);
+    std::cout << "after update: motherLogical->GetVoxelHeader() = " << motherLogical->GetVoxelHeader() << std::endl;
+    for (unsigned int idepth = 0; idepth <= fVoxelDepth; ++idepth) {
+      std::cout << "fVoxelHeaderStack[" << idepth << "] = " << fVoxelHeaderStack[idepth] << std::endl;
+    }
+  }
+  
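
The failure mode and the guard in the patch above can be modelled in a few lines. This is a toy Python sketch, not Geant4 code: `Volume`, `Navigator`, and their fields are hypothetical stand-ins for the logical volume, its voxel header, and the cached voxel state. Python raises an `IndexError` where the C++ code would read a garbage pointer.

```python
class Volume:
    """Toy logical volume: a list of daughters plus a 'voxel' list of daughter indices."""

    def __init__(self, name, daughters):
        self.name = name
        self.daughters = daughters
        # Stand-in for the voxel header: indices of candidate daughters.
        self.voxel = list(range(len(daughters)))


class Navigator:
    """Caches the voxel of the last located mother volume."""

    def __init__(self):
        self.cached_voxel = None
        self.cached_owner = None

    def locate(self, mother):
        """Refresh the cache for the given mother volume."""
        self.cached_voxel = mother.voxel
        self.cached_owner = mother

    def candidates(self, mother, resync=False):
        """Return the candidate daughters of `mother` using the cached voxel."""
        # The guard: if the cache belongs to another mother, relocate first.
        if resync and self.cached_owner is not mother:
            self.locate(mother)
        # Without the guard, stale indices are applied to the wrong daughter list.
        return [mother.daughters[i] for i in self.cached_voxel]
```

With the cache located on a three-daughter volume, asking for the candidates of a one-daughter volume fails on an out-of-range index unless `resync=True` is passed. That mirrors the comparison of `motherLogical->GetVoxelHeader()` against `fVoxelHeaderStack[0]` followed by `VoxelLocate` in the patch. Like the patch, the guard repairs the symptom and does not explain how the cache went stale.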

@bendavid
Contributor

While the original underlying issue should be fixed by #40543, there are some remaining segfaults when trying to propagate very low pt states (the current minimum momentum protections should avoid this, but this should be followed up with some lower level protections in the future).
