Segmentation fault in HcalOfflineHarvesting in recent IB #29605
A new issue was created by @Dr15Jones (Chris Jones). @Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign dqm
New categories assigned: dqm. @jfernan2, @andrius-k, @schneiml, @fioriNTU, @kmaeshima you have been requested to review this Pull request/Issue and eventually sign. Thanks
The workflows in question are 4.28, 134.807, and 136.757. Given that the segfault occurs in the HARVESTING step, I would guess #29586 to be an unlikely cause (still possible, though, if running an "unexpected" module in the DQM step could somehow trigger a segfault in HARVESTING). Also, all DQM modules use the
@lwang046 It seems your PR is causing trouble that we did not detect.
The code crashes here because the request from the IGetter is failing. The value for
Hi @jfernan2 @Dr15Jones, may I ask whether there are some principal differences between these 4 workflows and the other workflows? I'm a bit puzzled why the other workflows can get this ME but these 4 cannot. The problem indeed is related to
Hi @lwang046 , A few comments:
[1] https://github.com/cms-sw/cmssw/tree/master/DQMServices/Core
Just a thought, would it make sense for
@makortel We used to have an API that did that, but nobody used it, so I removed it. We should not change the default behaviour, since a lot of code handles missing MEs just fine (and I think there are valid reasons for MEs to be missing in harvesting, e.g. conditional booking in RECO). Also, for general robustness the HARVESTING code should not blow up just because the RECO config changed (pragmatically; arguably we should always make sure that the harvesting is configured to match the RECO config, but in practice we'd blow up Tier0 a lot if we assumed that).

@lwang046 Reading a lumi-saved ME in endLumi harvesting code really should work. Since my recent PR (#29321) it should also be possible to per-lumi save MEs in
Ok, this looks a lot like a bug with #29321, maybe caused by the fact that these WFs now actually do concurrent lumis? Investigating...
Ok, the issue is actually pretty obvious if you think about it the right way. The symptom is that the per-lumi MEs in question are not produced. The reason for that is that they are booked in a DQMOneEDAnalyzer, which only calls enterLumi as needed per event, so a lumisection with no events never creates them. Now, there are a lot of possible solutions:
I don't like the first option much, because it violates the invariant "all lumis in a run have the same MEs". We don't really rely on that anywhere (and legacy/harvesting modules can always violate it), but it is nice to have. I like the second option, though we should not go for that if @makortel or @Dr15Jones have concerns about it. The last option might be a minimally invasive fix, but it is not as clean as I'd like.
@schneiml Thanks for the detailed analysis! I share the dislike of option 1, pretty much for the same reasons. About option 2:
I did not fully understand your concern. In beginLumi (endLumi) transitions (callbacks) the modules are run concurrently (following the data dependencies). |
A service must be able to handle all ActivityRegistry service callbacks in a thread-safe manner. For this case, multiple module level 'endLuminosityBlock' callbacks (i.e. PreModuleGlobalBeginLumi and PostModuleGlobalBeginLumi) can be happening to the Services simultaneously since multiple modules could be running concurrently as well as multiple LuminosityBlocks could be ending concurrently. |
@makortel As Chris correctly guessed, this is about these callbacks: https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/src/DQMStore.cc#L680-L694. From @Dr15Jones' answer I'd say it is ok to do more work there (apart from the edm::Service interface maybe being deprecated in general at some point?). I wanted to get away without these callbacks (apart from supporting legacy modules), but the

Note that while these callbacks are threadsafe, they are of course threadsafe by taking a lock for essentially the entire duration. I don't expect
As reported in cms-sw#29605, it can happen that a DQMOneEDAnalyzer does not produce its per-lumi MEs, because there were no events in the lumisection and it only calls enterLumi as needed per event. To prevent this, we need to make sure that lumi MEs are always created for every ME, but we can't have lumi transitions in DQMOneEDAnalyzer, so it needs to happen in a global callback. But there we can't safely use enterLumi, since that would corrupt the module's local MEs.

The solution is to have a new method, initLumi, dedicated to initializing global MEs but not touching local MEs. This is actually a nice thing in general, since now initLumi/cleanupLumi form a symmetrical pair (create/destroy global MEs), as do enterLumi/leaveLumi (update local MEs). All of these are idempotent, as before.

There are a bunch of corner cases around booking: initLumi *must* be called before enterLumi, but the global callback that triggers it globally might (will) happen before booking has happened for the module, so we also need to do it after booking. There is a race related to lumi MEs if globalBeginLumi can happen *before* beginRun finishes for all plugins. This should not happen for now.
We are seeing segmentation faults in 3 of the integration build workflows. The stack trace is:
The likely culprits are either #29544 or #29586. The former changed code in that same package while the latter could be causing a module that was not being run before to now be run.