Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in THashTable::Rehash in HLTFiltersDQMonitor #40419

Open
makortel opened this issue Jan 3, 2023 · 19 comments
Open

Error in THashTable::Rehash in HLTFiltersDQMonitor #40419

makortel opened this issue Jan 3, 2023 · 19 comments

Comments

@makortel
Copy link
Contributor

makortel commented Jan 3, 2023

Workflow 139.003 step 3 failed in CMSSW_13_0_X_2023-01-03-1100 on el8_amd64_gcc11 with

----- Begin Fatal Exception 03-Jan-2023 14:09:46 CET-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Processing  stream begin Run run: 346512 stream: 0
   [1] Calling method for module HLTFiltersDQMonitor/'hltFiltersDQM'
   Additional Info:
      [a] Fatal Root Error: @SUB=THashTable::Rehash
During the rehash of 0x2b908da53120 one or more element was added or removed. The initalize size was 748 and now it is 749

----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_0_X_2023-01-03-1100/pyRelValMatrixLogs/run/139.003_RunHLTPhy2021/step3_RunHLTPhy2021.log#/

(I think I saw this error also earlier, but couldn't find it now)

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 3, 2023

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

makortel commented Jan 3, 2023

assign dqm

FYI @cms-sw/hlt-l2

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 3, 2023

New categories assigned: dqm

@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor Author

makortel commented Jan 3, 2023

As far as I can tell, the only ROOT operations HLTFiltersDQMonitor::bookHistograms() does are creation of TProfile and calling SetBinLabel() repeatedly for the x TAxis of the profile in three places

meMenu_ = iBooker.bookProfile(efficPlotNamePrefix_ + hltMenuName,
"Path Efficiency",
triggerNames.size(),
0.,
triggerNames.size(),
-0.1,
1.1,
"");
if (meMenu_ and meMenu_->getTProfile() and meMenu_->getTProfile()->GetXaxis()) {
for (size_t idx = 0; idx < triggerNames.size(); ++idx) {
meMenu_->getTProfile()->GetXaxis()->SetBinLabel(idx + 1, triggerNames.at(idx).c_str());
}
}

const std::string meDatasetName(efficPlotNamePrefix_ + idset);
meDatasetMap_[meDatasetName] = iBooker.bookProfile(
meDatasetName.c_str(), meDatasetName.c_str(), dsetPathNames.size(), 0., dsetPathNames.size(), -0.1, 1.1, "");
TProfile* meDatasetTProf(nullptr);
if (meDatasetMap_.at(meDatasetName)) {
meDatasetTProf = meDatasetMap_.at(meDatasetName)->getTProfile();
}
for (size_t idxPath = 0; idxPath < dsetPathNames.size(); ++idxPath) {
auto const& iPathName(dsetPathNames.at(idxPath));
if (std::find(triggerNames.begin(), triggerNames.end(), iPathName) == triggerNames.end()) {
continue;
}
if (meDatasetTProf and meDatasetTProf->GetXaxis()) {
meDatasetTProf->GetXaxis()->SetBinLabel(idxPath + 1, iPathName.c_str());
}

const std::string mePathName(efficPlotNamePrefix_ + idset + "_" + iPathName);
mePathMap_[mePathName] = iBooker.bookProfile(mePathName.c_str(),
iPathName.c_str(),
mePath_binLabels.size(),
0.,
mePath_binLabels.size(),
-0.1,
1.1,
"");
if (mePathMap_.at(mePathName)) {
auto* const mePathTProf(mePathMap_.at(mePathName)->getTProfile());
if (mePathTProf and mePathTProf->GetXaxis()) {
for (size_t iMod = 0; iMod < mePath_binLabels.size(); ++iMod) {
mePathTProf->GetXaxis()->SetBinLabel(iMod + 1, mePath_binLabels.at(iMod).c_str());
}
}
}

Poking around in ROOT code I see TAxis::SetBinLabel() can lead to call to THashTable::Rehash() if the label of the given bin was already set
https://github.com/cms-sw/root/blob/cms/v6-24-00-patches/b5aa8fd/hist/hist/src/TAxis.cxx#L832-L843

If I recall/understood correctly the DQMStore's behavior on "local MonitorElements" that the module instance sees (described here), all "local MonitorElements" of the module's stream instances use the data of the connected "global MonitorElement" underneath, which means that in a multi-stream job the axis labels are re-set for all but first stream to call HLTFiltersDQMonitor::bookHistograms() (in the log quoted in the issue description the streams 1-3 had started processing events before the exception is printed for stream 0).

The locking in DQMStore::bookTransaction() guarantees that only one call to any bookHistograms() is active at any time, so there should be no concurrent calls to TAxis::SetBinLabel(). But given the log, something that could be happening are concurrent calls to HLTFiltersDQMonitor::bookHistograms() and HLTFiltersDQMonitor::analyze().

The relevant member functions in THashList (and THashTable() that is used underneath) that TAxis uses seem to have proper locking, but I'm not sure if e.g. TAxis::SetBinLabel() and TAxis::FindBin() are safe to be called concurrently.

(or I'm completely wrong of the call chain)

Adding @pcanal

@missirol
Copy link
Contributor

missirol commented Jan 3, 2023

Fwiw, I couldn't reproduce the crash locally yet, and I haven't spotted it in other recent IBs.

Trying to follow the analysis in #40419 (comment) ..

  • I'd suspect that the exception came from meMenu_

    for (size_t idx = 0; idx < triggerNames.size(); ++idx) {
    meMenu_->getTProfile()->GetXaxis()->SetBinLabel(idx + 1, triggerNames.at(idx).c_str());
    }

    because, in this case, triggerNames.size() == 748, which suspiciously matches the 748 mentioned in the error message.

  • Afaiu, one way to improve the behaviour of the client might be to (1) do something like

    if(meMenu_->getTProfile()->GetXaxis()->GetBinLabel(idx + 1) != triggerNames.at(idx).c_str());
       meMenu_->getTProfile()->GetXaxis()->SetBinLabel(idx + 1, triggerNames.at(idx).c_str());

    in bookHistograms, and (2) avoid the many calls to FindBin in analyze (for example, adding a class member with a triggerName-to-binIndex map, to be filled per run in bookHistograms). But please let me know if I'm missing something.

@makortel
Copy link
Contributor Author

makortel commented Jan 3, 2023

While testing locally (trying to insert an artificial delay to stream 0, although I didn't succeed in that yet), I was able to get a segfault

Thread 5 (Thread 0x7f6f847ff700 (LWP 4308) "cmsRun"):
#2  0x00007f6fc9414630 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign (this=0x7f6f1d400738, __str=...) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/slc7_amd64_gcc11/external/gcc/11.2.1-f9b9dfdd886f71cd63f5538223d8f161/gcc-11.2.1/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:254
#5  0x00007f6f886ff1b8 in l1t::L1TGlobalUtil::retrieveL1Event(edm::Event const&, edm::EventSetup const&, edm::EDGetToken) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/libL1TriggerL1TGlobal.so
#6  0x00007f6f79760f57 in HLTPrescaleProvider::prescaleSet(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/libHLTriggerHLTcore.so
#7  0x00007f6f578fd9ea in pat::PATTriggerProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/pluginPhysicsToolsPatAlgos_plugins.so
#8  0x00007f6fd335d13d in edm::stream::EDProducerAdaptorBase::doEvent (this=0x7f6f3ff041e0, info=..., act=0x7f6fc4057810, mcc=mcc@entry=0x7f6f33c9e168) at /build/mkortela/debug/CMSSW_13_0_X_2023-01-03-1100/src/FWCore/Framework/src/stream/EDProducerAdaptorBase.cc:82

Thread 4 (Thread 0x7f6f85928700 (LWP 4307) "cmsRun"):
#2  0x00007f6fc9414630 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f6fd22c9bcd in TListIter::Next() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libCore.so
#5  0x00007f6fc86eb297 in TAxis::SetBinLabel(int, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libHist.so
#6  0x00007f6f70c83138 in HLTFiltersDQMonitor::bookHistograms (this=0x7f6f751e5900, iBooker=..., iRun=..., iSetup=...) at /build/mkortela/debug/CMSSW_13_0_X_2023-01-03-1100/src/DQMOffline/Trigger/plugins/HLTFiltersDQMonitor.cc:158
#7  0x00007f6f70bca008 in DQMEDAnalyzer::beginRun(edm::Run const&, edm::EventSetup const&)::{lambda(dqm::implementation::IBooker&)#1}::operator()(dqm::implementation::IBooker&) const (booker=..., __closure=<synthetic pointer>) at /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/src/DQMServices/Core/interface/DQMEDAnalyzer.h:87
#8  dqm::implementation::DQMStore::bookTransaction<DQMEDAnalyzer::beginRun(edm::Run const&, edm::EventSetup const&)::{lambda(dqm::implementation::IBooker&)#1}>(DQMEDAnalyzer::beginRun(edm::Run const&, edm::EventSetup const&)::{lambda(dqm::implementation::IBooker&)#1}, unsigned long, bool) (canSaveByLumi=<optimized out>, moduleId=144, f=..., this=0x7f6fc252c480) at /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/src/DQMServices/Core/interface/DQMStore.h:705
#9  DQMEDAnalyzer::beginRun (this=0x7f6f751e5900, run=..., setup=...) at /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/src/DQMServices/Core/interface/DQMEDAnalyzer.h:84
#10 0x00007f6fd335e7ad in edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::doStreamBeginRun (this=0x7f6f85921b40, id=..., info=..., mcc=0x7f6f712528a8) at /build/mkortela/debug/CMSSW_13_0_X_2023-01-03-1100/src/FWCore/Framework/src/stream/ProducingModuleAdaptorBase.cc:224

Thread 3 (Thread 0x7f6f86329700 (LWP 4306) "cmsRun"):
#2  0x00007f6fc9414630 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f6fd0b3e54d in __lll_lock_wait () from /lib64/libpthread.so.0
#5  0x00007f6fd0b39eb6 in _L_lock_941 () from /lib64/libpthread.so.0
#6  0x00007f6fd0b39daf in pthread_mutex_lock () from /lib64/libpthread.so.0
#7  0x00007f6fca06ad8c in dqm::implementation::DQMStore::enterLumi(unsigned int, unsigned int, unsigned long) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/libDQMServicesCore.so
#8  0x00007f6f710490bd in non-virtual thunk to DQMOneEDAnalyzer<>::accumulate(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/pluginDQMServicesComponentsPlugins.so
#9  0x00007f6fd3355a50 in edm::one::EDProducerBase::doEvent (this=0x7f6f71f32fa0, info=..., act=0x7f6fc4057810, mcc=mcc@entry=0x7f6f33668568) at /build/mkortela/debug/CMSSW_13_0_X_2023-01-03-1100/src/FWCore/Framework/src/one/EDProducerBase.cc:61

Thread 1 (Thread 0x7f6fce99f740 (LWP 4221) "cmsRun"):
#3  0x00007f6fc9417deb in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/cms/cmssw/CMSSW_13_0_X_2023-01-01-0000/lib/slc7_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f6fd22ccafc in TList::Clear(char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libCore.so
#6  0x00007f6fd22c695e in THashTable::Clear(char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libCore.so
#7  0x00007f6fd22c7ddd in THashTable::Rehash(int, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libCore.so
#8  0x00007f6fd22c84bd in THashTable::Add(TObject*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libCore.so
#9  0x00007f6fd22c50bd in THashList::AddLast(TObject*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libCore.so
#10 0x00007f6fc86eb0ed in TAxis::FindBin(char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/slc7_amd64_gcc11/cms/cmssw-patch/CMSSW_13_0_X_2023-01-03-1100/external/slc7_amd64_gcc11/lib/libHist.so
#11 0x00007f6f70c8165e in HLTFiltersDQMonitor::analyze (this=0x7f6f751e6800, iEvent=..., iSetup=...) at /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/external/gcc/11.2.1-f9b9dfdd886f71cd63f5538223d8f161/include/c++/11.2.1/bits/basic_string.h:194
#12 0x00007f6fd335d13d in edm::stream::EDProducerAdaptorBase::doEvent (this=0x7f6f71251680, info=..., act=0x7f6fc4057810, mcc=mcc@entry=0x7f6f32c460a8) at /build/mkortela/debug/CMSSW_13_0_X_2023-01-03-1100/src/FWCore/Framework/src/stream/EDProducerAdaptorBase.cc:82
#13 0x00007f6fd334167f in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo (this=0x7f6f32c46080, info=..., mcc=0x7f6f32c460a8) at /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02766/slc7_amd64_gcc11/external/gcc/11.2.1-f9b9dfdd886f71cd63f5538223d8f161/include/c++/11.2.1/bits/shared_ptr_base.h:1295

Current Modules:
Module: HLTFiltersDQMonitor:hltFiltersDQM (crashed)
Module: HLTFiltersDQMonitor:hltFiltersDQM
Module: DQMEventInfo:dqmEnvHLT
Module: PATTriggerProducer:patTrigger

which demonstrates that somehow in this case TAxis::FindBin() (via analyze()) ends up in adding a new element, which leads to call to THashTable::Rehash() (via bookHistograms()), which then has a race condition with the iteration over labels in TAxis::SetBinLabel().

@pcanal
Copy link
Contributor

pcanal commented Jan 3, 2023

but I'm not sure if e.g. TAxis::SetBinLabel() and TAxis::FindBin() are safe to be called concurrently.

I don't recall that we made it so. @lmoneta can you confirm whether this is the case or not?

@makortel
Copy link
Contributor Author

makortel commented Jan 3, 2023

  • one way to improve the behaviour of the client might be to (1) do something like
    if(meMenu_->getTProfile()->GetXaxis()->GetBinLabel(idx + 1) != triggerNames.at(idx).c_str());
       meMenu_->getTProfile()->GetXaxis()->SetBinLabel(idx + 1, triggerNames.at(idx).c_str());
    in bookHistograms

A better way would be to move the SetBinLabel() to be called within the booking, i.e. something along

meMenu_ = iBooker.bookProfile(efficPlotNamePrefix_ + hltMenuName,
                                "Path Efficiency",
                                triggerNames.size(),
                                0.,
                                triggerNames.size(),
                                -0.1,
                                1.1,
                                "",
                                [&triggerNames](TProfile *profile) {
                                  for (size_t idx = 0; idx < triggerNames.size(); ++idx) {
                                    profile->GetXaxis()->SetBinLabel(idx + 1, triggerNames.[idx].c_str());
                                  }
                                });

Then the DQMStore would call it only when a new TProfile object is created.

@makortel
Copy link
Contributor Author

makortel commented Jan 3, 2023

It seems to me that one general issue that has not been fully addressed yet is the DQM infrastructure allowing access to bare ROOT objects (which is the case here). I see some earlier discussion in #25106 and in #29096.

Maybe now would be a good time to finally move away from that?

@dan131riley
Copy link

(I think I saw this error also earlier, but couldn't find it now)

Seen around a month ago,

Begin processing the 2nd record. Run 346512, Event 272599352, LumiSection 280 on stream 0 at 28-Nov-2022 05:55:48.212 CET
----- Begin Fatal Exception 28-Nov-2022 05:55:48 CET-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Processing  stream begin Run run: 346512 stream: 2
   [1] Calling method for module HLTFiltersDQMonitor/'hltFiltersDQM'
   Additional Info:
      [a] Fatal Root Error: @SUB=THashTable::Rehash
During the rehash of 0x4001a2ff9e40 one or more element was added or removed. The initalize size was 756 and now it is 757

----- End Fatal Exception -------------------------------------------------

Same run #, not sure if it is the same WF.

@missirol
Copy link
Contributor

missirol commented Jan 4, 2023

@makortel, thanks for the suggestions in #40419 (comment) and #40419 (comment). I tried to implement a fix in #40426.

@missirol
Copy link
Contributor

As far as I know, #40426 fixed this problem.

I wonder if it should be backported, though (for example, to 12_6_X, which will be used online in the near future).

@perrotta
Copy link
Contributor

As far as I know, #40426 fixed this problem.

I wonder if it should be backported, though (for example, to 12_6_X, which will be used online in the near future).

If it can help the few MWGR's in which 12_6_X will be used, why not?

@missirol
Copy link
Contributor

Okay. Any reason to consider even lower cycles (12_4, 12_5) ?

@perrotta
Copy link
Contributor

Okay. Any reason to consider even lower cycles (12_4, 12_5) ?

I don't think so, but I'd let it up to you

@missirol
Copy link
Contributor

Okay. #40536 is the 12_6_X backport. If others think other backports are useful, it can be discussed here.

@missirol
Copy link
Contributor

The proposed fix was integrated in 13_0_X (#40426) and 12_6_X (#40536). Unless there are doubts about the fix, or requests for other backports, this issue can be closed.

@makortel
Copy link
Contributor Author

It seems to me that one general issue that has not been fully addressed yet is the DQM infrastructure allowing access to bare ROOT objects (which is the case here). I see some earlier discussion in #25106 and in #29096.

Maybe now would be a good time to finally move away from that?

I think we should follow up with the bare ROOT object access from MonitorElement, but probably in a separate issue. @cms-sw/dqm-l2 @cms-sw/orp-l2 what do you think?

@missirol
Copy link
Contributor

It seems to me that one general issue that has not been fully addressed yet is the DQM infrastructure allowing access to bare ROOT objects (which is the case here). I see some earlier discussion in #25106 and in #29096.
Maybe now would be a good time to finally move away from that?

I think we should follow up with the bare ROOT object access from MonitorElement, but probably in a separate issue. @cms-sw/dqm-l2 @cms-sw/orp-l2 what do you think?

@cms-sw/dqm-l2

Based on #40676 (comment) , it's probably time to open a dedicated issue about deprecating/forbidding access to bare ROOT objects via MonitorElement. And then this issue could also be closed (?).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants