Crash of hcalgpu DQM client in run-362239 #40115
Comments
A new Issue was created by @missirol Marino Missiroli. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign dqm FYI @cms-sw/hcal-dpg-l2
New categories assigned: dqm @jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks
@missirol Thanks for this check. We have contacted the HCAL DQM experts and pointed them to this github issue.
The error message comes from https://github.com/cms-sw/cmssw/blob/30c906888163b0cf2616db5ff14e262b68d53e1c/DQM/HcalTasks/plugins/HcalGPUComparisonTask.cc#L101_L102 Throwing an exception there seems intentional? By the way, the configuration seems inverted between GPU and CPU (target vs. reference): https://github.com/cms-sw/cmssw/blob/98f2cad9f7996c581a38682059d4d72aebb0f6b6/DQM/Integration/python/clients/hcalgpu_dqm_sourceclient-live_cfg.py#L105_L106
@missirol just to note, hcalgpu had already crashed in run 362221 as well. I couldn't find out whether the pixel trigger prescale for this GPUvsCPU trigger path was also wrong in that run.
Yes. As noted in the 1st line of the description, this happened for a few runs up until run-362239 (the latter being the last one with this crash).
The prescales were wrong in that run as well (which confirms the initial diagnosis).
@lwang046 Maybe Long Wang (HCAL DQM developer) could comment on the code in the hcalgpu client?
Hi, thank you all for the investigation. A quick follow-up from our side: we are now preparing a protection to bypass the "collection not found" exception. Indeed, the warning logs are a bit confusing because the ref/target names are mistakenly mixed up here.
please close
This has been integrated in 13_0_X (#40117), and backported to 12_6_X (#40257), 12_5_X (#40118), and 12_4_X (#40119).
The hcalgpu client of the online DQM crashed in run-362239 (and a few runs before then, during tests with the HLT HIon menu without collisions).
https://cmsweb.cern.ch/dqm/dqm-square/tmp/tmp/content_parser_productionPARSER_run362239PARSER_job16.log
The exception message is in [1], but the event content of the relevant stream in the HLT menu (DQMGPUvsCPU) does include the collection hltHbherecoFromGPU.
What I think triggered the crash is that those streamer files included events where that HCAL collection was not produced. This happened because one of the triggers going to that stream (but not the trigger running the HCAL reco) mistakenly had a looser prescale. This explanation is consistent with the fact that (1) the issue went away after prescales were updated such that the HCAL trigger fired for all events of the stream, and (2) this issue was not observed before, afaik (because the prescales and seeds of the triggers of that stream were always in sync in the past). Some more details in [2].
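The mechanism described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not HLT code: all three triggers are assumed to share the same seed, a prescale of N is modelled as "accept every N-th event", and the prescale value of 100 is hypothetical (the issue only states that the Pixel trigger was unprescaled). The `Trigger` class and `events_missing_hcal` function are names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Trigger:
    name: str
    prescale: int        # accept 1 out of every `prescale` events (1 = unprescaled)
    produces_hcal: bool  # True only for the path that runs the HCAL reco


def events_missing_hcal(triggers, n_events):
    """Return (events written to the stream, events without the HCAL collection)."""
    in_stream = 0
    missing = 0
    for i in range(n_events):
        fired = [t for t in triggers if i % t.prescale == 0]
        if not fired:
            continue  # no trigger accepted the event: it is not in the stream
        in_stream += 1
        if not any(t.produces_hcal for t in fired):
            missing += 1  # in the stream, but the HCAL reco never ran
    return in_stream, missing


# Hypothetical prescales: Pixel unprescaled, ECAL and HCAL prescaled by 100.
out_of_sync = [
    Trigger("DQM_HIPixelReconstruction_v", 1, False),
    Trigger("DQM_HIEcalReconstruction_v", 100, False),
    Trigger("DQM_HIHcalReconstruction_v", 100, True),
]
# Same triggers with equal prescales.
in_sync = [Trigger(t.name, 100, t.produces_hcal) for t in out_of_sync]

print(events_missing_hcal(out_of_sync, 1000))
print(events_missing_hcal(in_sync, 1000))
```

In this model, equal prescales mean that every event reaching the stream was also accepted by the HCAL trigger, so a client that requires the HCAL collection never sees an event without it.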
The exception likely comes from
https://github.com/cms-sw/cmssw/blob/CMSSW_12_5_2_patch1/DQM/HcalTasks/plugins/HcalGPUComparisonTask.cc#L100
which suggests the HCAL client has a hard requirement on the existence of its input collections (I guess the DQM plugins in the analogous ecalgpu client handle this more gracefully, since the latter client was not crashing in these runs).
I'd suggest that DQM and HCAL double-check this, and implement improvements if needed.
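The difference between a hard requirement and a more graceful handling of a missing input can be sketched as follows. This is a language-neutral illustration in Python, not the actual C++ plugin code: events are modelled as plain dictionaries keyed by collection label, and `ProductNotFound`, `count_hits_strict` and `count_hits_tolerant` are hypothetical names introduced only for this example.

```python
class ProductNotFound(Exception):
    """Stand-in for a 'collection not found' error (not a CMSSW class)."""


def count_hits_strict(event, label):
    """Hard requirement: a missing collection aborts the whole job."""
    if label not in event:
        raise ProductNotFound(f"collection '{label}' not found")
    return len(event[label])


def count_hits_tolerant(event, label, warnings):
    """Graceful handling: record a warning and skip the event."""
    hits = event.get(label)
    if hits is None:
        warnings.append(f"collection '{label}' not found, skipping event")
        return None
    return len(hits)


# Toy events: the second one lacks the HCAL collection, as in the affected runs.
events = [{"hltHbherecoFromGPU": [1.0, 2.0]}, {}]
warnings = []
results = [count_hits_tolerant(e, "hltHbherecoFromGPU", warnings) for e in events]
print(results, warnings)
```

With the tolerant variant the job keeps running and the missing input is still visible in the warnings, at the cost of silently producing fewer comparison entries if the warnings are never inspected.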
FYI: @cms-sw/heterogeneous-l2 (GPUs are mentioned, but this is not really related to heterogeneity)
[1]
(By the way, the message seems confusing, since the collection hltHbherecoFromGPU is from GPU, but the message says it is the "CPU" collection.)
[2]
The triggers sending events to the DQMGPUvsCPU stream are DQM_HIPixelReconstruction_v, DQM_HIEcalReconstruction_v, and DQM_HIHcalReconstruction_v. In run-362239, the Pixel trigger was mistakenly unprescaled, firing at a much higher rate than the ECAL and HCAL ones; in this run the HCAL-GPU DQM client crashed. In run-362243, the HLT prescales were fixed, setting them equal for the 3 triggers; in this and following runs, the HCAL-GPU DQM client did not crash.
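A consistency check of the kind that would have flagged the run-362239 configuration could look like the sketch below. It assumes the prescales are available as a simple mapping from trigger name to prescale value; that mapping, the function name `out_of_sync_triggers`, and the value 100 are hypothetical and do not correspond to any existing HLT tool or to the real prescale table.

```python
def out_of_sync_triggers(prescales, reference):
    """Return the triggers whose prescale differs from that of `reference`."""
    ref_value = prescales[reference]
    return sorted(name for name, value in prescales.items() if value != ref_value)


# Hypothetical values: Pixel unprescaled (1), ECAL and HCAL prescaled by 100.
before_fix = {
    "DQM_HIPixelReconstruction_v": 1,
    "DQM_HIEcalReconstruction_v": 100,
    "DQM_HIHcalReconstruction_v": 100,
}
# After the fix, all three triggers share the same prescale.
after_fix = {name: 100 for name in before_fix}

print(out_of_sync_triggers(before_fix, "DQM_HIHcalReconstruction_v"))
print(out_of_sync_triggers(after_fix, "DQM_HIHcalReconstruction_v"))
```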