Segmentation fault in HcalOfflineHarvesting in recent IB #29605
A new issue was created by @Dr15Jones (Chris Jones). @Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign dqm
New categories assigned: dqm. @jfernan2, @andrius-k, @schneiml, @fioriNTU, @kmaeshima you have been requested to review this Pull request/Issue and eventually sign. Thanks
The workflows in question are 4.28, 134.807, and 136.757. Given that the segfault occurs in the HARVESTING step, I would guess #29586 to be an unlikely cause (still possible, though, if running an "unexpected" module in the DQM step could somehow trigger a segfault in HARVESTING). Also, all DQM modules use the
@lwang046 It seems your PR is causing trouble that we did not detect.
The code crashes here because the request from the IGetter is failing. The value for
Hi @jfernan2 @Dr15Jones, may I ask whether there are some principal differences between these 4 workflows and the other workflows? I'm a bit puzzled why the other workflows can get this ME but these 4 cannot. The problem indeed is related to
Hi @lwang046 , A few comments:
[1] https://github.com/cms-sw/cmssw/tree/master/DQMServices/Core
Just a thought, would it make sense for
@makortel We used to have an API that did that, but nobody used it, so I removed it. We should not change the default behaviour, since a lot of code handles missing MEs just fine (and I think there are valid reasons for MEs to be missing in harvesting, e.g. conditional booking in RECO). Also, for general robustness the HARVESTING code should not blow up just because the RECO config changed (pragmatically; arguably we should always make sure that the harvesting is configured to match the RECO config, but in practice we'd blow up Tier0 a lot if we assumed that).

@lwang046 Reading a lumi-saved ME in endLumi harvesting code really should work. Since my recent PR (#29321) it should also be possible to per-lumi save MEs in
Ok, this looks a lot like a bug with #29321, maybe caused by the fact that these WFs now actually do concurrent lumis? Investigating...
Ok, the issue is actually pretty obvious if you think about it the right way. The symptom is that the per-lumi MEs in question are not produced. The reason for that is that they are booked in a DQMOneEDAnalyzer, which only calls enterLumi as needed per event, so a lumisection with no events never creates them. Now, there are a lot of possible solutions:
I don't like the first option much, because it violates the invariant "all lumis in a run have the same MEs". We don't really rely on that anywhere (and legacy/harvesting modules can always violate it), but it is nice to have. I like the second option, though we should not go for that if @makortel or @Dr15Jones have concerns about it. The last option might be a minimally invasive fix, but it is not as clean as I'd like.
@schneiml Thanks for the detailed analysis! I share the dislike of option 1, pretty much for the same reasons. About option 2:
I did not fully understand your concern. In beginLumi (endLumi) transitions (callbacks) the modules are run concurrently (following the data dependencies). |
A service must be able to handle all ActivityRegistry service callbacks in a thread-safe manner. For this case, multiple module level 'endLuminosityBlock' callbacks (i.e. PreModuleGlobalBeginLumi and PostModuleGlobalBeginLumi) can be happening to the Services simultaneously since multiple modules could be running concurrently as well as multiple LuminosityBlocks could be ending concurrently. |
@makortel As Chris correctly guessed, this is about these callbacks: https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/src/DQMStore.cc#L680-L694. From @Dr15Jones' answer I'd say it is ok to do more work there (apart from the edm::Service interface maybe being deprecated in general at some point?). I wanted to get away without these callbacks (apart from supporting legacy modules), but the

Note that while these callbacks are threadsafe, they are of course threadsafe by taking a lock for essentially the entire duration. I don't expect
As reported in cms-sw#29605, it can happen that a DQMOneEDAnalyzer does not produce its per-lumi MEs, because there were no events in the lumisection and it only calls enterLumi as needed per event. To prevent this, we need to make sure that lumi MEs are always created for every ME, but we can't have lumi transitions in DQMOneEDAnalyzer, so it needs to happen in a global callback. But there we can't safely use enterLumi, since that would corrupt the module's local MEs.

The solution is to have a new method, initLumi, dedicated to initializing global MEs but not touching local MEs. This is actually a nice thing in general, since now initLumi/cleanupLumi form a symmetrical pair (create/destroy global MEs), as do enterLumi/leaveLumi (update local MEs). All of these are idempotent, as before.

There are a bunch of corner cases around booking: initLumi *must* be called before enterLumi, but the global callback that triggers it globally might (will) happen before booking has happened for the module, so we also need to do it after booking. There is a race related to lumi MEs if globalBeginLumi can happen *before* beginRun finishes for all plugins. This should not happen for now.
We are seeing segmentation faults in 3 of the integration build workflows. The stack trace is:
The likely culprits are either #29544 or #29586. The former changed code in that same package while the latter could be causing a module that was not being run before to now be run.