Failures in NanoDQMIO production #40676

kskovpen · 2023-02-02T17:39:27Z

We've been recently producing a new type of DQMIO datasets, as was requested here. While the local tests went fine, there are many failures seen in the production, also at different sites. Here is the example with the crash info:

cmsRun1
CMSSWStepFailure (Exit Code: 139)
 Adding last 25 lines of CMSSW stdout:
#18 0x00002b4e00301c2f in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so
#19 0x00002b4e00259e55 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so
#20 0x00002b4e0025a14b in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so
#21 0x00002b4e0025c735 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so
#22 0x00002b4dffff37b5 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreConcurrency.so
#23 0x00002b4e01aadbec in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x2b4e93b81600, this=0x2b4e04623d80) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:322
#24 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x2b4e04623d80) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:463
#25 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.cpp:168
#26 0x00002b4e001ca2d8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so
#27 0x00002b4e001d51fb in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so
#28 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#29 0x00002b4e01a9c0eb in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/arena.cpp:698
#30 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#31 0x000000000040970c in main ()

Current Modules:

Module: HLTElePhoTagAndProbeOfflineSource:egHLTElePhoHighEtaDQMOfflineTnPSource (crashed)
Module: TrackProducer:lowPtTripletStepTracks
Module: CkfTrackCandidateMaker:lowPtQuadStepTrackCandidates
Module: CkfTrackCandidateMaker:muonSeededTrackCandidatesInOut

In this example, the input dataset is /EGamma/Run2022E-v1/RAW and the full log at /store/unmerged/logs/prod/2023/1/26/pdmvserv_Run2022E_EGamma_19Jan2023_230119_090450_268/DataProcessing/0002/3/33f0e71f-9554-40ba-87f3-80522baac221-0-3-logArchive.tar.gz.

The text was updated successfully, but these errors were encountered:

cmsbuild · 2023-02-02T17:39:43Z

A new Issue was created by @kskovpen .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel · 2023-02-02T17:57:35Z

assign dqm

FYI @cms-sw/hlt-l2

cmsbuild · 2023-02-02T17:57:57Z

New categories assigned: dqm

@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel · 2023-02-02T18:00:01Z

@kskovpen Would you have a pointer to the full log? The snippet in the description is too short to be useful (beyond pointing the segfaulting module)

rvenditti · 2023-02-03T09:15:36Z

Hi @kskovpen I see that nanoDQMIO datasets for EGamma RunB, RunC, RunD have been successfully produced (or, at least, are in VALID status in DAS). Does the failure affect just the RunA-E-F-G, or all the eras are affected?

kskovpen · 2023-02-03T09:17:01Z

@kskovpen Would you have a pointer to the full log? The snippet in the description is too short to be useful (beyond pointing the segfaulting module)

We are waiting for the instructions from our computing colleagues on how to access it.

kskovpen · 2023-02-08T11:16:30Z

Hi All, here is an example with the full log containing the reported failure.

kskovpen · 2023-02-08T11:17:50Z

We are stopping this reprocessing campaign until the issue is resolved. Please let us know, if you have any ideas.

makortel · 2023-02-08T14:26:44Z

Thanks @kskovpen. The relevant stack traces of the TBB threads are

Thread 5 (Thread 0x2b166f200700 (LWP 658) "cmsRun"):
#2  0x00002b162545d2c0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b164d07a323 in CaloGeometry::getPosition(DetId const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libGeometryCaloGeometry.so
#5  0x00002b168ec90909 in EcalHaloAlgo::getPosition(DetId const&, ROOT::Math::PositionVector3D<ROOT::Math::Cartesian3D<double>, ROOT::Math::DefaultCoordinateSystemTag>) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libRecoMETMETAlgorithms.so
#6  0x00002b168ec917ab in EcalHaloAlgo::GetHaloClusterCandidateEB(edm::Handle<edm::SortedCollection<EcalRecHit, edm::StrictWeakOrdering<EcalRecHit> > >&, edm::Handle<edm::SortedCollection<HBHERecHit, edm::StrictWeakOrdering<HBHERecHit> > >&, float) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libRecoMETMETAlgorithms.so
#7  0x00002b168ec93efb in EcalHaloAlgo::Calculate(CaloGeometry const&, edm::Handle<std::vector<reco::Photon, std::allocator<reco::Photon> > >&, edm::Handle<std::vector<reco::SuperCluster, std::allocator<reco::SuperCluster> > >&, edm::Handle<edm::SortedCollection<EcalRecHit, edm::StrictWeakOrdering<EcalRecHit> > >&, edm::Handle<edm::SortedCollection<EcalRecHit, edm::StrictWeakOrdering<EcalRecHit> > >&, edm::Handle<edm::SortedCollection<EcalRecHit, edm::StrictWeakOrdering<EcalRecHit> > >&, edm::Handle<edm::SortedCollection<HBHERecHit, edm::StrictWeakOrdering<HBHERecHit> > >&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libRecoMETMETAlgorithms.so
#8  0x00002b169768da3d in reco::EcalHaloDataProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoMETMETProducers.so
#9  0x00002b161d2076f3 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so

Thread 4 (Thread 0x2b166e2ae700 (LWP 657) "cmsRun"):
#2  0x00002b162545d2c0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b16b0ecc46e in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#5  0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#6  0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#7  0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#8  0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#9  0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#10 0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#11 0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#12 0x00002b16b0eccbbe in Basic2DGenericPFlowClusterizer::buildClusters(std::vector<reco::PFCluster, std::allocator<reco::PFCluster> > const&, std::vector<bool, std::allocator<bool> > const&, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#13 0x00002b16b0eff2e8 in PFClusterProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#14 0x00002b161d2076f3 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so

Thread 3 (Thread 0x2b166d8ad700 (LWP 656) "cmsRun"):
#2  0x00002b162545d2c0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b1651789e43 in CellularAutomaton::createAndConnectCells(std::vector<HitDoublets const*, std::allocator<HitDoublets const*> > const&, TrackingRegion const&, CACut const&, CACut const&, float) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libRecoPixelVertexingPixelTriplets.so
#5  0x00002b16517737c7 in CAHitQuadrupletGenerator::hitNtuplets(IntermediateHitDoublets const&, std::vector<OrderedHitSeeds, std::allocator<OrderedHitSeeds> >&, SeedingLayerSetsHits const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libRecoPixelVertexingPixelTriplets.so
#6  0x00002b16a0ce8219 in CAHitNtupletEDProducerT<CAHitQuadrupletGenerator>::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoPixelVertexingPixelTripletsPlugins.so
#7  0x00002b161d2076f3 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so

Thread 1 (Thread 0x2b162066e940 (LWP 528) "cmsRun"):
#3  0x00002b1625460a0b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002b168e7f0d5e in HLTDQMHist1D<reco::Photon, float>::fill(reco::Photon const&, edm::Event const&, edm::EventSetup const&, VarRangeCutColl<reco::Photon> const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginDQMOfflineTriggerPlugins.so
#6  0x00002b168e80c7b2 in HLTDQMFilterEffHists<reco::Photon>::fillHists(reco::Photon const&, edm::Event const&, edm::EventSetup const&, trigger::TriggerEvent const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginDQMOfflineTriggerPlugins.so
#7  0x00002b168e810bc4 in HLTDQMTagAndProbeEff<reco::GsfElectron, std::vector<reco::GsfElectron, std::allocator<reco::GsfElectron> >, reco::Photon, std::vector<reco::Photon, std::allocator<reco::Photon> > >::fill(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginDQMOfflineTriggerPlugins.so
#8  0x00002b168e810d5d in HLTTagAndProbeOfflineSource<reco::GsfElectron, std::vector<reco::GsfElectron, std::allocator<reco::GsfElectron> >, reco::Photon, std::vector<reco::Photon, std::allocator<reco::Photon> > >::analyze(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginDQMOfflineTriggerPlugins.so
#9  0x00002b161d2076f3 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so

Current Modules:
Module: HLTElePhoTagAndProbeOfflineSource:egHLTElePhoHighEtaDQMOfflineTnPSource (crashed)
Module: PFClusterProducer:particleFlowClusterHBHE
Module: CAHitQuadrupletEDProducer:lowPtQuadStepHitQuadruplets
Module: EcalHaloDataProducer:EcalHaloData

makortel · 2023-02-13T22:16:07Z

I took a quick look on the code, and spotted a dangerous pattern. The HLTDQMHist1D takes a pointer toTH1 in its constructor

cmssw/DQMOffline/Trigger/interface/HLTDQMHist.h

Lines 52 to 58 in 4b49c89

    
           class HLTDQMHist1D : public HLTDQMHist<ObjType> { 
        
           public: 
        
             HLTDQMHist1D(TH1* hist, 
        
                          std::string varName, 
        
                          std::function<ValType(const ObjType&)> func, 
        
                          VarRangeCutColl<ObjType> rangeCuts) 
        
                 : var_(std::move(func)), varName_(std::move(varName)), localRangeCuts_(std::move(rangeCuts)), hist_(hist) {}

and the TH1* is obtained from MonitorElement

cmssw/DQMOffline/Trigger/interface/HLTDQMFilterEffHists.h

Line 142 in 4b49c89

    
           hist = std::make_unique<HLTDQMHist1D<ObjType, float> >(mePass->getTH1(), vsVar, vsVarFunc, rangeCuts);

cmssw/DQMOffline/Trigger/interface/HLTDQMFilterEffHists.h

Line 148 in 4b49c89

    
           hist = std::make_unique<HLTDQMHist1D<ObjType, float> >(meTot->getTH1(), vsVar, vsVarFunc, rangeCuts);

As far as I have understood (and still remember) the internals of DQMStore the unprotected use of getTH1() is dangerous because of how DQMStore internally handles the raw ROOT objects in presence of concurrency. Prolonged use of the once-obtained raw ROOT objects in an edm::stream module in the presence of concurrent LuminosityBlocks is even more so and basically asking for trouble.

I think we (=CMS) should officially deprecate the raw ROOT object access via getTH1() (and others), and find and replace every call with a safe alternative (as already discussed in some other issue(s)). For example here it seems to me the HLTDQMHist1D and ?HLTDQMHist2D could easily be made to just use MonitorElement instead of TH1.

I can't tell if what I described above is the source of the problem (code looks like only debugger could tell), but it looks important-enough to be addressed anyhow.

missirol · 2023-02-14T11:12:44Z

@makortel

Thanks for the info in #40676 (comment). #40760 tries to implement the suggestion from that comment.

(code looks like only debugger could tell)

Last week, I tried to reproduce the crash (based on the tarball from PdmV), but failed. Could you please suggest in more detail how to debug this?

LukaLambrecht · 2023-02-14T13:53:00Z

@makortel, @missirol thanks for having a look! I also tried to reproduce the crash locally just now, but could not. I'm running the same cmsDriver command as in the production, on the same input file and same run/lumisection/event that caused the crash according to the log file linked above, and with the same number of streams and threads. But it still finishes without crashes. (This is Luka by the way, involved in the nanoDQMIO production from the DQM side.)

makortel · 2023-02-14T14:49:04Z

The difficulties in reproducing the failures are compatible with a data race. If the failures are rare (as it seems), I'm not aware of a better way than fixing the obvious problems and then testing massively.

@cms-sw/dqm-l2 By the way, do we have any runTheMatrix workflows or other tests in IBs exercising NanoDQMIO?

missirol · 2023-02-16T10:05:31Z

Assuming #40760 is merged, should it be backported down to 12_4_X ?

kskovpen · 2023-02-16T10:09:34Z

Yes, please (at least from the PdmV point of you). This way, we can build a new release (a note to @cms-sw/dqm-l2 to request it), and resubmit the whole massive thing.

rvenditti · 2023-02-16T10:16:22Z

Hi @missirol thank you for the prompt feedback! Yes, a backport to 12_4 is needed in order to launch the production again. We will ask to build a new 12_4 release after 40760 has been merged.

missirol · 2023-02-16T12:16:03Z

@rvenditti , the backports are in place, but DQM has to review/sign these PRs, starting from #40760. Can you please have a look?

missirol · 2023-03-21T16:13:33Z

@kskovpen , out of curiosity, did this NanoDQMIO issue get resolved ? Did #40787 help ?

kskovpen · 2023-03-21T16:31:43Z

It looks like this fix has resolved the DQM problem, thanks. @cms-sw/dqm-l2 can also confirm.

However, there are now other issues (not connected to DQM, it seems). Here is an example of the failed workflow: https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022D-ZeroBias-23Feb2023-00001. In case someone has an idea.

makortel · 2023-03-21T17:02:58Z

Seems like a harvesting job is failing with

----- Begin Fatal Exception 18-Mar-2023 18:42:59 CET-----------------------
An exception of category 'NoRecord' occurred while
   [0] Processing global end Run run: 357894
   [1] Prefetching for module SiStripBadComponentInfo/'siStripBadComponentInfo'
   [2] Calling method for EventSetup module SiStripQualityESProducer/'mergedSiStripQualityProducer'
   [3] While getting dependent Record from Record SiStripQualityRcd
Exception Message:
No "SiStripDetVOffRcd" record found in the EventSetup.

 The Record is delivered by an ESSource or ESProducer but there is no valid IOV for the synchronization value.
 Please check 
   a) if the synchronization value is reasonable and report to the hypernews if it is not.
   b) else check that all ESSources have been properly configured. 
----- End Fatal Exception -------------------------------------------------

<snip>

Thread 1 (Thread 0x2b3a96bac940 (LWP 524) "cmsRun"):
#3  0x00002b3a9be92a0b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002b3abf5969e0 in SiStripQuality::cleanUp(bool) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libCalibFormatsSiStripObjects.so
#6  0x00002b3adac92d86 in SiStripQualityWithFromFedErrorsHelper::getMergedQuality(dqm::implementation::IGetter&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libCalibTrackerSiStripQuality.so
#7  0x00002b3adab38ef4 in SiStripBadComponentInfo::dqmEndJob(dqm::implementation::IBooker&, dqm::implementation::IGetter&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/pluginDQMSiStripMonitorClientPlugins.so
#8  0x00002b3adab3a304 in non-virtual thunk to DQMEDHarvester::endProcessBlockProduce(edm::ProcessBlock&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/pluginDQMSiStripMonitorClientPlugins.so
#9  0x00002b3a9373e060 in edm::one::EDProducerBase::doEndProcessBlock(edm::ProcessBlockPrincipal const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#10 0x00002b3a93723ab0 in edm::WorkerT<edm::one::EDProducerBase>::implDoEndProcessBlock(edm::ProcessBlockPrincipal const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#11 0x00002b3a9362d687 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#12 0x00002b3a9362da37 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#13 0x00002b3a9362df7a in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#14 0x00002b3a9362e041 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(tbb::detail::d1::task_group&, edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#15 0x00002b3a9341b055 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreConcurrency.so
#16 0x00002b3a94ed6bec in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x2b3ae56ed000, this=0x2b3a97a63e00) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:322
#17 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x2b3a97a63e00) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:463
#18 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.cpp:168
#19 0x00002b3a935f9d67 in edm::EventProcessor::endProcessBlock(bool, bool) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#20 0x00002b3a935fe3d9 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#21 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00002b3a94ec50eb in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/arena.cpp:698
#23 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040970c in main ()

Current Modules:
Module: SiStripBadComponentInfo:siStripBadComponentInfo (crashed)

https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022D_ZeroBias_23Feb2023_230223_095833_1345/139/DataProcessingMergeDQMoutputEndOfRunDQMHarvestMerged/b7b16153-39cf-4d4e-892f-f829f3542421-EndOfRun-Harvest-20-3-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

makortel · 2023-03-21T17:04:04Z

(in general it would be good open a new issue for each new problem, but since this seemed to be connected to DQM I kept it here)

mmusich · 2023-03-21T18:27:45Z

However, there are now other issues (not connected to DQM, it seems)

This is still very much connected to DQM.

makortel · 2023-03-21T20:24:54Z

The segfault is caused by a helper class consuming the SiStripQuality product in endRun

cmssw/CalibTracker/SiStripQuality/src/SiStripQualityWithFromFedErrorsHelper.cc

Lines 186 to 199 in 4c4bee9

    
           bool SiStripQualityWithFromFedErrorsHelper::endRun(const edm::EventSetup& iSetup) { 
        
             if (stripQualityWatcher_.check(iSetup)) { 
        
               if (keepCopy_) { 
        
                 mergedQuality_ = std::make_unique<SiStripQuality>(iSetup.getData(stripQualityToken_)); 
        
                 if (addBadCompFromFedErr_) { 
        
                   fedCabling_ = std::make_unique<SiStripFedCabling>(iSetup.getData(fedCablingToken_)); 
        
                 } 
        
                 merged_ = false; 
        
               } 
        
               return true; 
        
             } else { 
        
               return false; 
        
             } 
        
           }

(called from

cmssw/DQM/SiStripMonitorClient/plugins/SiStripBadComponentInfo.cc

Lines 39 to 45 in 4c4bee9

    
           void SiStripBadComponentInfo::endRun(edm::Run const& run, edm::EventSetup const& eSetup) { 
        
             LogDebug("SiStripBadComponentInfo") << "SiStripBadComponentInfo:: End of Run"; 
        
             // Retrieve tracker topology from geometry 
        
             tTopo_ = std::make_unique<TrackerTopology>(eSetup.getData(tTopoToken_)); 
        
             if (withFedErrHelper_.endRun(eSetup) && !withFedErrHelper_.addBadCompFromFedErr()) {

)

and then using the product in endProcessBlock

cmssw/CalibTracker/SiStripQuality/src/SiStripQualityWithFromFedErrorsHelper.cc

Lines 201 to 223 in 4c4bee9

    
           const SiStripQuality& SiStripQualityWithFromFedErrorsHelper::getMergedQuality( 
        
               dqm::harvesting::DQMStore::IGetter& getter) { 
        
             if (!merged_) { 
        
               if (addBadCompFromFedErr_) { 
        
                 std::unique_ptr<SiStripQuality> fedErrQuality{}; 
        
                 if (fedErrLegacyFile_.empty()) { 
        
                   edm::LogInfo("SiStripQuality") << "Adding bad components from FED errors in DQM store"; 
        
                   fedErrQuality = sistrip::badStripFromFedErr(getter, *fedCabling_, fedErrCutoff_); 
        
                 } else { 
        
                   edm::LogInfo("SiStripQuality") << "Adding bad components from FED errors in legacy DQM file " 
        
                                                  << fedErrLegacyFile_; 
        
                   fedErrQuality = sistrip::badStripFromFedErrLegacyDQMFile( 
        
                       fedErrLegacyFile_, fedErrFileRunNumber_, *fedCabling_, fedErrCutoff_); 
        
                 } 
        
                 mergedQuality_->add(fedErrQuality.get()); 
        
                 mergedQuality_->cleanUp(); 
        
                 mergedQuality_->fillBadComponents(); 
        
               } 
        
               merged_ = true; 
        
             } 
        
             return *mergedQuality_; 
        
           }

(called from

cmssw/DQM/SiStripMonitorClient/plugins/SiStripBadComponentInfo.cc

Lines 142 to 145 in 4c4bee9

    
           void SiStripBadComponentInfo::dqmEndJob(DQMStore::IBooker& ibooker, DQMStore::IGetter& igetter) { 
        
             LogDebug("SiStripBadComponentInfo") << "SiStripBadComponentInfo::dqmEndRun"; 
        
             if (withFedErrHelper_.addBadCompFromFedErr()) { 
        
               fillBadComponentMaps(&withFedErrHelper_.getMergedQuality(igetter));

)

without checking that the copy is there (i.e. the endRun() function was called). In this case the SiStripBadComponentInfo::endRun() can not be called because an ES data product it would need can not be produced, but as part of the framework's shutdown, the endProcessBlockProduce() is called.

makortel · 2023-03-21T20:26:52Z

The log has many printouts like

%MSG-w L1TdeCSCTPGClient:  L1TdeCSCTPGClient:l1tdeCSCTPGClient@endLumi  18-Mar-2023 18:42:56 CET Run: 357894 Lumi: 1
void L1TdeCSCTPGClient::processHistograms(dqm::legacy::DQMStore::IGetter&) could not load the necessary histograms for harvesting alct_quality_data_ME11 / alct_quality_emul_ME11
%MSG

and

%MSG-e DQMGenericClient:  DQMGenericClient:trackingEffFromHitPattern@endRun  18-Mar-2023 18:42:59 CET End Run: 357894
 DQMGenericClient::findAllSubdirectories ==> Missing folder Tracking/TrackParameters/highPurityTracks/dzPV0p1 !!!
%MSG

Are all of these really necessary (in production)? They make the log file very cumbersome to load in a browser.

cmsbuild added the pending-assignment label Feb 2, 2023

cmsbuild added dqm-pending pending-signatures and removed pending-assignment labels Feb 2, 2023

This was referenced Feb 14, 2023

stop using bare ROOT objects in HLTDQMHist* classes #40760

Merged

Error in THashTable::Rehash in HLTFiltersDQMonitor #40419

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failures in NanoDQMIO production #40676

Failures in NanoDQMIO production #40676

kskovpen commented Feb 2, 2023

cmsbuild commented Feb 2, 2023

makortel commented Feb 2, 2023

cmsbuild commented Feb 2, 2023

makortel commented Feb 2, 2023

rvenditti commented Feb 3, 2023

kskovpen commented Feb 3, 2023

kskovpen commented Feb 8, 2023

kskovpen commented Feb 8, 2023

makortel commented Feb 8, 2023

makortel commented Feb 13, 2023

missirol commented Feb 14, 2023

LukaLambrecht commented Feb 14, 2023

makortel commented Feb 14, 2023

missirol commented Feb 16, 2023

kskovpen commented Feb 16, 2023

rvenditti commented Feb 16, 2023

missirol commented Feb 16, 2023

missirol commented Mar 21, 2023

kskovpen commented Mar 21, 2023

makortel commented Mar 21, 2023

makortel commented Mar 21, 2023

mmusich commented Mar 21, 2023

makortel commented Mar 21, 2023

makortel commented Mar 21, 2023

Failures in NanoDQMIO production #40676

Failures in NanoDQMIO production #40676

Comments

kskovpen commented Feb 2, 2023

cmsbuild commented Feb 2, 2023

makortel commented Feb 2, 2023

cmsbuild commented Feb 2, 2023

makortel commented Feb 2, 2023

rvenditti commented Feb 3, 2023

kskovpen commented Feb 3, 2023

kskovpen commented Feb 8, 2023

kskovpen commented Feb 8, 2023

makortel commented Feb 8, 2023

makortel commented Feb 13, 2023

missirol commented Feb 14, 2023

LukaLambrecht commented Feb 14, 2023

makortel commented Feb 14, 2023

missirol commented Feb 16, 2023

kskovpen commented Feb 16, 2023

rvenditti commented Feb 16, 2023

missirol commented Feb 16, 2023

missirol commented Mar 21, 2023

kskovpen commented Mar 21, 2023

makortel commented Mar 21, 2023

makortel commented Mar 21, 2023

mmusich commented Mar 21, 2023

makortel commented Mar 21, 2023

makortel commented Mar 21, 2023