Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failures in NanoDQMIO production #40676

Open
kskovpen opened this issue Feb 2, 2023 · 24 comments
Open

Failures in NanoDQMIO production #40676

kskovpen opened this issue Feb 2, 2023 · 24 comments

Comments

@kskovpen
Copy link
Contributor

kskovpen commented Feb 2, 2023

We've been recently producing a new type of DQMIO datasets, as was requested here. While the local tests went fine, there are many failures seen in the production, also at different sites. Here is the example with the crash info:

cmsRun1
CMSSWStepFailure (Exit Code: 139)
 Adding last 25 lines of CMSSW stdout:
#18 0x00002b4e00301c2f in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so
#19 0x00002b4e00259e55 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so
#20 0x00002b4e0025a14b in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so
#21 0x00002b4e0025c735 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so
#22 0x00002b4dffff37b5 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreConcurrency.so
#23 0x00002b4e01aadbec in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x2b4e93b81600, this=0x2b4e04623d80) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:322
#24 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x2b4e04623d80) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:463
#25 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.cpp:168
#26 0x00002b4e001ca2d8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so
#27 0x00002b4e001d51fb in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so
#28 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#29 0x00002b4e01a9c0eb in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/arena.cpp:698
#30 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#31 0x000000000040970c in main ()

Current Modules:

Module: HLTElePhoTagAndProbeOfflineSource:egHLTElePhoHighEtaDQMOfflineTnPSource (crashed)
Module: TrackProducer:lowPtTripletStepTracks
Module: CkfTrackCandidateMaker:lowPtQuadStepTrackCandidates
Module: CkfTrackCandidateMaker:muonSeededTrackCandidatesInOut

In this example, the input dataset is /EGamma/Run2022E-v1/RAW and the full log at /store/unmerged/logs/prod/2023/1/26/pdmvserv_Run2022E_EGamma_19Jan2023_230119_090450_268/DataProcessing/0002/3/33f0e71f-9554-40ba-87f3-80522baac221-0-3-logArchive.tar.gz.

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 2, 2023

A new Issue was created by @kskovpen .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

makortel commented Feb 2, 2023

assign dqm

FYI @cms-sw/hlt-l2

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 2, 2023

New categories assigned: dqm

@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

makortel commented Feb 2, 2023

@kskovpen Would you have a pointer to the full log? The snippet in the description is too short to be useful (beyond pointing the segfaulting module)

@rvenditti
Copy link
Contributor

Hi @kskovpen I see that nanoDQMIO datasets for EGamma RunB, RunC, RunD have been successfully produced (or, at least, are in VALID status in DAS). Does the failure affect just the RunA-E-F-G, or all the eras are affected?

@kskovpen
Copy link
Contributor Author

kskovpen commented Feb 3, 2023

@kskovpen Would you have a pointer to the full log? The snippet in the description is too short to be useful (beyond pointing the segfaulting module)

We are waiting for the instructions from our computing colleagues on how to access it.

@kskovpen
Copy link
Contributor Author

kskovpen commented Feb 8, 2023

Hi All, here is an example with the full log containing the reported failure.

@kskovpen
Copy link
Contributor Author

kskovpen commented Feb 8, 2023

We are stopping this reprocessing campaign until the issue is resolved. Please let us know, if you have any ideas.

@makortel
Copy link
Contributor

makortel commented Feb 8, 2023

Thanks @kskovpen. The relevant stack traces of the TBB threads are

Thread 5 (Thread 0x2b166f200700 (LWP 658) "cmsRun"):
#2  0x00002b162545d2c0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b164d07a323 in CaloGeometry::getPosition(DetId const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libGeometryCaloGeometry.so
#5  0x00002b168ec90909 in EcalHaloAlgo::getPosition(DetId const&, ROOT::Math::PositionVector3D<ROOT::Math::Cartesian3D<double>, ROOT::Math::DefaultCoordinateSystemTag>) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libRecoMETMETAlgorithms.so
#6  0x00002b168ec917ab in EcalHaloAlgo::GetHaloClusterCandidateEB(edm::Handle<edm::SortedCollection<EcalRecHit, edm::StrictWeakOrdering<EcalRecHit> > >&, edm::Handle<edm::SortedCollection<HBHERecHit, edm::StrictWeakOrdering<HBHERecHit> > >&, float) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libRecoMETMETAlgorithms.so
#7  0x00002b168ec93efb in EcalHaloAlgo::Calculate(CaloGeometry const&, edm::Handle<std::vector<reco::Photon, std::allocator<reco::Photon> > >&, edm::Handle<std::vector<reco::SuperCluster, std::allocator<reco::SuperCluster> > >&, edm::Handle<edm::SortedCollection<EcalRecHit, edm::StrictWeakOrdering<EcalRecHit> > >&, edm::Handle<edm::SortedCollection<EcalRecHit, edm::StrictWeakOrdering<EcalRecHit> > >&, edm::Handle<edm::SortedCollection<EcalRecHit, edm::StrictWeakOrdering<EcalRecHit> > >&, edm::Handle<edm::SortedCollection<HBHERecHit, edm::StrictWeakOrdering<HBHERecHit> > >&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libRecoMETMETAlgorithms.so
#8  0x00002b169768da3d in reco::EcalHaloDataProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoMETMETProducers.so
#9  0x00002b161d2076f3 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so

Thread 4 (Thread 0x2b166e2ae700 (LWP 657) "cmsRun"):
#2  0x00002b162545d2c0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b16b0ecc46e in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#5  0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#6  0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#7  0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#8  0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#9  0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#10 0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#11 0x00002b16b0ecbfbd in Basic2DGenericPFlowClusterizer::growPFClusters(reco::PFCluster const&, std::vector<bool, std::allocator<bool> > const&, unsigned int, unsigned int, double, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#12 0x00002b16b0eccbbe in Basic2DGenericPFlowClusterizer::buildClusters(std::vector<reco::PFCluster, std::allocator<reco::PFCluster> > const&, std::vector<bool, std::allocator<bool> > const&, std::vector<reco::PFCluster, std::allocator<reco::PFCluster> >&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#13 0x00002b16b0eff2e8 in PFClusterProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoParticleFlowPFClusterProducerPlugins.so
#14 0x00002b161d2076f3 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so

Thread 3 (Thread 0x2b166d8ad700 (LWP 656) "cmsRun"):
#2  0x00002b162545d2c0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b1651789e43 in CellularAutomaton::createAndConnectCells(std::vector<HitDoublets const*, std::allocator<HitDoublets const*> > const&, TrackingRegion const&, CACut const&, CACut const&, float) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libRecoPixelVertexingPixelTriplets.so
#5  0x00002b16517737c7 in CAHitQuadrupletGenerator::hitNtuplets(IntermediateHitDoublets const&, std::vector<OrderedHitSeeds, std::allocator<OrderedHitSeeds> >&, SeedingLayerSetsHits const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libRecoPixelVertexingPixelTriplets.so
#6  0x00002b16a0ce8219 in CAHitNtupletEDProducerT<CAHitQuadrupletGenerator>::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginRecoPixelVertexingPixelTripletsPlugins.so
#7  0x00002b161d2076f3 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so

Thread 1 (Thread 0x2b162066e940 (LWP 528) "cmsRun"):
#3  0x00002b1625460a0b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002b168e7f0d5e in HLTDQMHist1D<reco::Photon, float>::fill(reco::Photon const&, edm::Event const&, edm::EventSetup const&, VarRangeCutColl<reco::Photon> const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginDQMOfflineTriggerPlugins.so
#6  0x00002b168e80c7b2 in HLTDQMFilterEffHists<reco::Photon>::fillHists(reco::Photon const&, edm::Event const&, edm::EventSetup const&, trigger::TriggerEvent const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginDQMOfflineTriggerPlugins.so
#7  0x00002b168e810bc4 in HLTDQMTagAndProbeEff<reco::GsfElectron, std::vector<reco::GsfElectron, std::allocator<reco::GsfElectron> >, reco::Photon, std::vector<reco::Photon, std::allocator<reco::Photon> > >::fill(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginDQMOfflineTriggerPlugins.so
#8  0x00002b168e810d5d in HLTTagAndProbeOfflineSource<reco::GsfElectron, std::vector<reco::GsfElectron, std::allocator<reco::GsfElectron> >, reco::Photon, std::vector<reco::Photon, std::allocator<reco::Photon> > >::analyze(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/pluginDQMOfflineTriggerPlugins.so
#9  0x00002b161d2076f3 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_12/lib/el8_amd64_gcc10/libFWCoreFramework.so

Current Modules:
Module: HLTElePhoTagAndProbeOfflineSource:egHLTElePhoHighEtaDQMOfflineTnPSource (crashed)
Module: PFClusterProducer:particleFlowClusterHBHE
Module: CAHitQuadrupletEDProducer:lowPtQuadStepHitQuadruplets
Module: EcalHaloDataProducer:EcalHaloData

@makortel
Copy link
Contributor

I took a quick look on the code, and spotted a dangerous pattern. The HLTDQMHist1D takes a pointer toTH1 in its constructor

class HLTDQMHist1D : public HLTDQMHist<ObjType> {
public:
HLTDQMHist1D(TH1* hist,
std::string varName,
std::function<ValType(const ObjType&)> func,
VarRangeCutColl<ObjType> rangeCuts)
: var_(std::move(func)), varName_(std::move(varName)), localRangeCuts_(std::move(rangeCuts)), hist_(hist) {}

and the TH1* is obtained from MonitorElement
hist = std::make_unique<HLTDQMHist1D<ObjType, float> >(mePass->getTH1(), vsVar, vsVarFunc, rangeCuts);

hist = std::make_unique<HLTDQMHist1D<ObjType, float> >(meTot->getTH1(), vsVar, vsVarFunc, rangeCuts);

As far as I have understood (and still remember) the internals of DQMStore the unprotected use of getTH1() is dangerous because of how DQMStore internally handles the raw ROOT objects in presence of concurrency. Prolonged use of the once-obtained raw ROOT objects in an edm::stream module in the presence of concurrent LuminosityBlocks is even more so and basically asking for trouble.

I think we (=CMS) should officially deprecate the raw ROOT object access via getTH1() (and others), and find and replace every call with a safe alternative (as already discussed in some other issue(s)). For example here it seems to me the HLTDQMHist1D and ?HLTDQMHist2D could easily be made to just use MonitorElement instead of TH1.

I can't tell if what I described above is the source of the problem (code looks like only debugger could tell), but it looks important-enough to be addressed anyhow.

@missirol
Copy link
Contributor

@makortel

Thanks for the info in #40676 (comment). #40760 tries to implement the suggestion from that comment.

(code looks like only debugger could tell)

Last week, I tried to reproduce the crash (based on the tarball from PdmV), but failed. Could you please suggest in more detail how to debug this?

@LukaLambrecht
Copy link
Contributor

@makortel, @missirol thanks for having a look! I also tried to reproduce the crash locally just now, but could not. I'm running the same cmsDriver command as in the production, on the same input file and same run/lumisection/event that caused the crash according to the log file linked above, and with the same number of streams and threads. But it still finishes without crashes. (This is Luka by the way, involved in the nanoDQMIO production from the DQM side.)

@makortel
Copy link
Contributor

The difficulties in reproducing the failures are compatible with a data race. If the failures are rare (as it seems), I'm not aware of a better way than fixing the obvious problems and then testing massively.

@cms-sw/dqm-l2 By the way, do we have any runTheMatrix workflows or other tests in IBs exercising NanoDQMIO?

@missirol
Copy link
Contributor

Assuming #40760 is merged, should it be backported down to 12_4_X ?

@kskovpen
Copy link
Contributor Author

Yes, please (at least from the PdmV point of you). This way, we can build a new release (a note to @cms-sw/dqm-l2 to request it), and resubmit the whole massive thing.

@rvenditti
Copy link
Contributor

Hi @missirol thank you for the prompt feedback! Yes, a backport to 12_4 is needed in order to launch the production again. We will ask to build a new 12_4 release after 40760 has been merged.

@missirol
Copy link
Contributor

@rvenditti , the backports are in place, but DQM has to review/sign these PRs, starting from #40760. Can you please have a look?

@missirol
Copy link
Contributor

@kskovpen , out of curiosity, did this NanoDQMIO issue get resolved ? Did #40787 help ?

@kskovpen
Copy link
Contributor Author

It looks like this fix has resolved the DQM problem, thanks. @cms-sw/dqm-l2 can also confirm.

However, there are now other issues (not connected to DQM, it seems). Here is an example of the failed workflow: https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022D-ZeroBias-23Feb2023-00001. In case someone has an idea.

@makortel
Copy link
Contributor

Seems like a harvesting job is failing with

----- Begin Fatal Exception 18-Mar-2023 18:42:59 CET-----------------------
An exception of category 'NoRecord' occurred while
   [0] Processing global end Run run: 357894
   [1] Prefetching for module SiStripBadComponentInfo/'siStripBadComponentInfo'
   [2] Calling method for EventSetup module SiStripQualityESProducer/'mergedSiStripQualityProducer'
   [3] While getting dependent Record from Record SiStripQualityRcd
Exception Message:
No "SiStripDetVOffRcd" record found in the EventSetup.

 The Record is delivered by an ESSource or ESProducer but there is no valid IOV for the synchronization value.
 Please check 
   a) if the synchronization value is reasonable and report to the hypernews if it is not.
   b) else check that all ESSources have been properly configured. 
----- End Fatal Exception -------------------------------------------------

<snip>

Thread 1 (Thread 0x2b3a96bac940 (LWP 524) "cmsRun"):
#3  0x00002b3a9be92a0b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002b3abf5969e0 in SiStripQuality::cleanUp(bool) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libCalibFormatsSiStripObjects.so
#6  0x00002b3adac92d86 in SiStripQualityWithFromFedErrorsHelper::getMergedQuality(dqm::implementation::IGetter&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libCalibTrackerSiStripQuality.so
#7  0x00002b3adab38ef4 in SiStripBadComponentInfo::dqmEndJob(dqm::implementation::IBooker&, dqm::implementation::IGetter&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/pluginDQMSiStripMonitorClientPlugins.so
#8  0x00002b3adab3a304 in non-virtual thunk to DQMEDHarvester::endProcessBlockProduce(edm::ProcessBlock&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/pluginDQMSiStripMonitorClientPlugins.so
#9  0x00002b3a9373e060 in edm::one::EDProducerBase::doEndProcessBlock(edm::ProcessBlockPrincipal const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#10 0x00002b3a93723ab0 in edm::WorkerT<edm::one::EDProducerBase>::implDoEndProcessBlock(edm::ProcessBlockPrincipal const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#11 0x00002b3a9362d687 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#12 0x00002b3a9362da37 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#13 0x00002b3a9362df7a in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#14 0x00002b3a9362e041 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(tbb::detail::d1::task_group&, edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#15 0x00002b3a9341b055 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreConcurrency.so
#16 0x00002b3a94ed6bec in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x2b3ae56ed000, this=0x2b3a97a63e00) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:322
#17 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x2b3a97a63e00) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:463
#18 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.cpp:168
#19 0x00002b3a935f9d67 in edm::EventProcessor::endProcessBlock(bool, bool) () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#20 0x00002b3a935fe3d9 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_13/lib/el8_amd64_gcc10/libFWCoreFramework.so
#21 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00002b3a94ec50eb in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/arena.cpp:698
#23 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040970c in main ()

Current Modules:
Module: SiStripBadComponentInfo:siStripBadComponentInfo (crashed)

https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022D_ZeroBias_23Feb2023_230223_095833_1345/139/DataProcessingMergeDQMoutputEndOfRunDQMHarvestMerged/b7b16153-39cf-4d4e-892f-f829f3542421-EndOfRun-Harvest-20-3-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

@makortel
Copy link
Contributor

(in general it would be good open a new issue for each new problem, but since this seemed to be connected to DQM I kept it here)

@mmusich
Copy link
Contributor

mmusich commented Mar 21, 2023

However, there are now other issues (not connected to DQM, it seems)

This is still very much connected to DQM.

@makortel
Copy link
Contributor

The segfault is caused by a helper class consuming the SiStripQuality product in endRun

bool SiStripQualityWithFromFedErrorsHelper::endRun(const edm::EventSetup& iSetup) {
if (stripQualityWatcher_.check(iSetup)) {
if (keepCopy_) {
mergedQuality_ = std::make_unique<SiStripQuality>(iSetup.getData(stripQualityToken_));
if (addBadCompFromFedErr_) {
fedCabling_ = std::make_unique<SiStripFedCabling>(iSetup.getData(fedCablingToken_));
}
merged_ = false;
}
return true;
} else {
return false;
}
}

(called from
void SiStripBadComponentInfo::endRun(edm::Run const& run, edm::EventSetup const& eSetup) {
LogDebug("SiStripBadComponentInfo") << "SiStripBadComponentInfo:: End of Run";
// Retrieve tracker topology from geometry
tTopo_ = std::make_unique<TrackerTopology>(eSetup.getData(tTopoToken_));
if (withFedErrHelper_.endRun(eSetup) && !withFedErrHelper_.addBadCompFromFedErr()) {

)

and then using the product in endProcessBlock

const SiStripQuality& SiStripQualityWithFromFedErrorsHelper::getMergedQuality(
dqm::harvesting::DQMStore::IGetter& getter) {
if (!merged_) {
if (addBadCompFromFedErr_) {
std::unique_ptr<SiStripQuality> fedErrQuality{};
if (fedErrLegacyFile_.empty()) {
edm::LogInfo("SiStripQuality") << "Adding bad components from FED errors in DQM store";
fedErrQuality = sistrip::badStripFromFedErr(getter, *fedCabling_, fedErrCutoff_);
} else {
edm::LogInfo("SiStripQuality") << "Adding bad components from FED errors in legacy DQM file "
<< fedErrLegacyFile_;
fedErrQuality = sistrip::badStripFromFedErrLegacyDQMFile(
fedErrLegacyFile_, fedErrFileRunNumber_, *fedCabling_, fedErrCutoff_);
}
mergedQuality_->add(fedErrQuality.get());
mergedQuality_->cleanUp();
mergedQuality_->fillBadComponents();
}
merged_ = true;
}
return *mergedQuality_;
}

(called from
void SiStripBadComponentInfo::dqmEndJob(DQMStore::IBooker& ibooker, DQMStore::IGetter& igetter) {
LogDebug("SiStripBadComponentInfo") << "SiStripBadComponentInfo::dqmEndRun";
if (withFedErrHelper_.addBadCompFromFedErr()) {
fillBadComponentMaps(&withFedErrHelper_.getMergedQuality(igetter));

)

without checking that the copy is there (i.e. the endRun() function was called). In this case the SiStripBadComponentInfo::endRun() can not be called because an ES data product it would need can not be produced, but as part of the framework's shutdown, the endProcessBlockProduce() is called.

@makortel
Copy link
Contributor

The log has many printouts like

%MSG-w L1TdeCSCTPGClient:  L1TdeCSCTPGClient:l1tdeCSCTPGClient@endLumi  18-Mar-2023 18:42:56 CET Run: 357894 Lumi: 1
void L1TdeCSCTPGClient::processHistograms(dqm::legacy::DQMStore::IGetter&) could not load the necessary histograms for harvesting alct_quality_data_ME11 / alct_quality_emul_ME11
%MSG

and

%MSG-e DQMGenericClient:  DQMGenericClient:trackingEffFromHitPattern@endRun  18-Mar-2023 18:42:59 CET End Run: 357894
 DQMGenericClient::findAllSubdirectories ==> Missing folder Tracking/TrackParameters/highPurityTracks/dzPV0p1 !!!
%MSG

Are all of these really necessary (in production)? They make the log file very cumbersome to load in a browser.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants