Failures in NanoDQMIO production #40676
Comments
A new Issue was created by @kskovpen . @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign dqm FYI @cms-sw/hlt-l2 |
New categories assigned: dqm @ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@kskovpen Would you have a pointer to the full log? The snippet in the description is too short to be useful (beyond pointing to the segfaulting module) |
Hi @kskovpen I see that nanoDQMIO datasets for EGamma RunB, RunC, RunD have been successfully produced (or, at least, are in VALID status in DAS). Does the failure affect just RunA-E-F-G, or are all eras affected? |
We are waiting for the instructions from our computing colleagues on how to access it. |
Hi All, here is an example with the full log containing the reported failure. |
We are stopping this reprocessing campaign until the issue is resolved. Please let us know if you have any ideas. |
Thanks @kskovpen. The relevant stack traces of the TBB threads are
|
I took a quick look at the code, and spotted a dangerous pattern. The cmssw/DQMOffline/Trigger/interface/HLTDQMHist.h Lines 52 to 58 in 4b49c89
and the TH1* is obtained from MonitorElement
As far as I have understood (and still remember) the internals of I think we (=CMS) should officially deprecate the raw ROOT object access via I can't tell if what I described above is the source of the problem (the code looks like only a debugger could tell), but it looks important enough to be addressed anyhow. |
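A minimal sketch of the kind of hazard discussed above, under a stated assumption: the concern is that a raw histogram pointer lets callers bypass whatever synchronization the owning wrapper provides. `ToyHist`, `GuardedHist` and `fillConcurrently` are hypothetical names for illustration only; none of them are CMSSW or ROOT types, and this is not the change made in #40760.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a histogram shared between threads.
// Its fill() is not thread safe on its own.
class ToyHist {
public:
  explicit ToyHist(std::size_t nBins) : bins_(nBins, 0.0) {}
  void fill(std::size_t bin) { bins_[bin % bins_.size()] += 1.0; }
  double entries() const {
    double sum = 0.0;
    for (double b : bins_) sum += b;
    return sum;
  }

private:
  std::vector<double> bins_;
};

// Hypothetical wrapper: every access goes through the mutex, and the
// underlying ToyHist is never handed out as a raw pointer.
class GuardedHist {
public:
  explicit GuardedHist(std::size_t nBins) : hist_(nBins) {}
  void fill(std::size_t bin) {
    std::lock_guard<std::mutex> guard(mutex_);
    hist_.fill(bin);
  }
  double entries() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return hist_.entries();
  }

private:
  mutable std::mutex mutex_;
  ToyHist hist_;
};

// Fill one shared histogram from several threads and return the total
// number of entries, which is deterministic because all fills are locked.
double fillConcurrently(unsigned nThreads, unsigned fillsPerThread) {
  assert(nThreads > 0);
  GuardedHist hist(10);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < nThreads; ++t) {
    workers.emplace_back([&hist, fillsPerThread, t] {
      for (unsigned i = 0; i < fillsPerThread; ++i) hist.fill(t + i);
    });
  }
  for (auto& w : workers) w.join();
  return hist.entries();
}
```

If `GuardedHist` instead exposed a `ToyHist*` getter, callers could fill through that pointer without taking the lock, and concurrent fills would then be a data race of the hard-to-reproduce kind.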
Thanks for the info in #40676 (comment). #40760 tries to implement the suggestion from that comment.
Last week, I tried to reproduce the crash (based on the tarball from PdmV), but failed. Could you please suggest in more detail how to debug this? |
@makortel, @missirol thanks for having a look! I also tried to reproduce the crash locally just now, but could not. I'm running the same cmsDriver command as in the production, on the same input file and same run/lumisection/event that caused the crash according to the log file linked above, and with the same number of streams and threads. But it still finishes without crashes. (This is Luka by the way, involved in the nanoDQMIO production from the DQM side.) |
The difficulties in reproducing the failures are compatible with a data race. If the failures are rare (as it seems), I'm not aware of a better way than fixing the obvious problems and then testing massively. @cms-sw/dqm-l2 By the way, do we have any runTheMatrix workflows or other tests in IBs exercising NanoDQMIO? |
Assuming #40760 is merged, should it be backported down to |
Yes, please (at least from the PdmV point of view). This way, we can build a new release (a note to @cms-sw/dqm-l2 to request it), and resubmit the whole massive thing. |
Hi @missirol thank you for the prompt feedback! Yes, a backport to 12_4 is needed in order to launch the production again. We will ask to build a new 12_4 release after 40760 has been merged. |
@rvenditti , the backports are in place, but DQM has to review/sign these PRs, starting from #40760. Can you please have a look? |
It looks like this fix has resolved the DQM problem, thanks. @cms-sw/dqm-l2 can also confirm. However, there are now other issues (not connected to DQM, it seems). Here is an example of the failed workflow: https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022D-ZeroBias-23Feb2023-00001. In case someone has an idea. |
Seems like a harvesting job is failing with
|
(in general it would be good to open a new issue for each new problem, but since this seemed to be connected to DQM I kept it here) |
This is still very much connected to DQM. |
The segfault is caused by a helper class consuming the cmssw/CalibTracker/SiStripQuality/src/SiStripQualityWithFromFedErrorsHelper.cc Lines 186 to 199 in 4c4bee9
(called from cmssw/DQM/SiStripMonitorClient/plugins/SiStripBadComponentInfo.cc Lines 39 to 45 in 4c4bee9
) and then using the product in endProcessBlock cmssw/CalibTracker/SiStripQuality/src/SiStripQualityWithFromFedErrorsHelper.cc Lines 201 to 223 in 4c4bee9
(called from cmssw/DQM/SiStripMonitorClient/plugins/SiStripBadComponentInfo.cc Lines 142 to 145 in 4c4bee9
) without checking that the copy is there (i.e. the |
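A hedged sketch of the check described above: a helper caches a copy of a product in one transition and must confirm the copy exists before using it at the end of the process, since the caching transition may never have run. `ToyQuality`, `ToyHelper` and their members are hypothetical names; this is not the code in `SiStripQualityWithFromFedErrorsHelper`.

```cpp
#include <memory>
#include <optional>

// Hypothetical stand-in for the cached product.
struct ToyQuality {
  int nBadComponents;
};

// Hypothetical helper: the copy is made in one transition and read in a
// later one, which may be reached without the first ever having happened.
class ToyHelper {
public:
  // Would be called from the transition that sees the product.
  void cacheCopy(const ToyQuality& q) { copy_ = std::make_unique<ToyQuality>(q); }

  // Would be called at the end of the process. Returns std::nullopt instead
  // of dereferencing a null pointer when no copy was ever made.
  std::optional<int> badComponentsAtEnd() const {
    if (!copy_) {
      return std::nullopt;
    }
    return copy_->nBadComponents;
  }

private:
  std::unique_ptr<ToyQuality> copy_;
};
```

The caller can then skip its end-of-process work, or log a message, when `badComponentsAtEnd()` returns no value.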
The log has many printouts like
and
Are all of these really necessary (in production)? They make the log file very cumbersome to load in a browser. |
We've been recently producing a new type of DQMIO datasets, as was requested here. While the local tests went fine, there are many failures seen in the production, also at different sites. Here is the example with the crash info:
In this example, the input dataset is /EGamma/Run2022E-v1/RAW and the full log is at /store/unmerged/logs/prod/2023/1/26/pdmvserv_Run2022E_EGamma_19Jan2023_230119_090450_268/DataProcessing/0002/3/33f0e71f-9554-40ba-87f3-80522baac221-0-3-logArchive.tar.gz.