Segmentation fault in DQMStore #29743
Comments
assign dqm |
New categories assigned: dqm @jfernan2,@andrius-k,@schneiml,@fioriNTU,@kmaeshima you have been requested to review this Pull request/Issue and eventually sign? Thanks |
A new Issue was created by @silviodonato Silvio Donato. @Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
In |
Well ok, that means that #29738 did not catch all cases. I'll have a look; in the worst case there is a simple workaround but I'd prefer to get it right. |
4.52 passes for me locally in CMSSW_11_1_X_2020-05-05-2300 IB. Let me look at the IB logs... |
@schneiml I'm not sure, but it might be related to the multithreading |
@schneiml it's weird. We are getting fewer and different errors in |
Yes, that is my guess as well. Can you point me to the precise commands used? |
It could well be a race, maybe there is no synchronization between framework callbacks where I expected one. In the worst case we can just run |
The exact command should be @schneiml The weird thing is that even about the issue #29744, we see fewer and different errors in |
yes, you should use |
@silviodonato I have a proposed fix. I have no idea if it actually fixes the problem, but it addresses a very obvious bug that looks suspicious. It should also not be very dangerous to merge. I tried to reproduce the problem with varying numbers of threads, but never succeeded. I suspect this is a race that is sort of hard to win, so it only fails a handful of times out of hundreds (thousands?) of IB tests, even though it probably affects nearly all workflows. |
Can you explain what expectations you had for the callback? |
#29745 solved the issue. We no longer see the errors in CMSSW_11_1_X_2020-05-06-1300! |
@Dr15Jones The assumption is "All begin run things will be done before the first begin lumi starts". If this is violated, the per-lumi plots (which are still booked in beginRun, via bookHistograms) could be initialized via the global hook before they exist and so, effectively, not be initialized. This is a problem I was afraid of with the new change. However, the crashes were completely unrelated to that, and simply due to a stupid copy-paste mistake. So, I think this condition holds, and we also need it to hold, else the DQMStore needs another change. (We could simply initialize on demand, like before, but then we risk losing a set of lumi MEs if the module never calls enterLumi, which was the very original issue here. I also think it is architecturally cleaner to not do things on demand that can be done ahead of time.) |
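To make the ordering assumption above concrete, here is a minimal toy model of it. This is only a sketch: `ToyStore`, `book`, `initLumi` and `hasInstance` are hypothetical names invented for illustration and are not the DQMStore API. It only shows why a begin-lumi hook that runs before booking leaves the per-lumi plot without an instance for that lumi.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Toy model only: ToyStore and its methods are hypothetical, not the DQMStore API.
struct ToyStore {
  // Prototypes of per-lumi plots, booked once per run (value stands in for contents).
  std::map<std::string, int> booked;
  // Per-lumi instances, keyed by (lumi, plot name).
  std::map<std::pair<std::uint32_t, std::string>, int> perLumi;

  // Stands in for booking a per-lumi plot during begin run.
  void book(const std::string& name) { booked[name] = 0; }

  // Stands in for the global begin-lumi hook: create an instance of every
  // plot booked so far. Returns how many instances were initialized.
  std::size_t initLumi(std::uint32_t lumi) {
    std::size_t n = 0;
    for (const auto& entry : booked) {
      perLumi[std::make_pair(lumi, entry.first)] = entry.second;
      ++n;
    }
    return n;
  }

  // True if the named plot has an instance for the given lumi.
  bool hasInstance(std::uint32_t lumi, const std::string& name) const {
    return perLumi.count(std::make_pair(lumi, name)) != 0;
  }
};

// Assumed order: all booking finishes before the first begin-lumi hook.
inline bool expectedOrder() {
  ToyStore s;
  s.book("perLumiPlot");
  s.initLumi(1);
  return s.hasInstance(1, "perLumiPlot");
}

// Violated order: the hook runs before the plot has been booked,
// so the plot never gets an instance for lumi 1.
inline bool violatedOrder() {
  ToyStore s;
  s.initLumi(1);
  s.book("perLumiPlot");
  return s.hasInstance(1, "perLumiPlot");
}
```

In this toy, `expectedOrder()` finds the per-lumi instance and `violatedOrder()` does not. Initializing on demand would hide the second case, at the cost described above for modules that never enter the lumi.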
We are getting segmentation faults in workflows 4.52, 122.0, 136.732, 136.87, 1362.18, 11650.0, 12634.0, 25212.17
Please also note the assertion failure
cmsRun: /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/f57771e3fd2a453450b4e62a76c14438/opt/cmssw/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_1_X_2020-05-05-2300/src/DQMServices/Core/src/DQMStore.cc:480: void dqm::implementation::DQMStore::enterLumi(edm::RunNumber_t, edm::LuminosityBlockNumber_t, uint64_t): Assertion `anyme && checkScope(anyme->getScope()) == false' failed.
It seems related to #29738 (and issue #29605). Regarding the warnings
%MSG-e HLTConfigProvider: METAnalyzer:pfMetDQMAnalyzerMiniAOD@beginRun 06-May-2020 06:04:04 CEST Run: 194533 Process name 'RECO' not found in registry! %MSG
, they might be related to #29254.
https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_amd64_gcc820/CMSSW_11_1_X_2020-05-05-2300/pyRelValMatrixLogs/run/4.52_RunMu2012B+RunMu2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT/step3_RunMu2012B+RunMu2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT.log