Crash of hcalgpu DQM client in run-362239 #40115

Closed
missirol opened this issue Nov 18, 2022 · 10 comments

@missirol
Contributor

missirol commented Nov 18, 2022

The hcalgpu client of the online DQM crashed in run-362239 (and a few runs before then, during tests with the HLT HIon menu without collisions).
https://cmsweb.cern.ch/dqm/dqm-square/tmp/tmp/content_parser_productionPARSER_run362239PARSER_job16.log

The exception message is in [1], but the event content of the relevant stream in the HLT menu (DQMGPUvsCPU) does include the collection hltHbherecoFromGPU.

What I think triggered the crash is that those streamer files included events where that HCAL collection was not produced. This happened because one of the triggers going to that stream (but not the trigger running the HCAL reco) mistakenly had a looser prescale. This explanation is consistent with the fact that (1) the issue went away after prescales were updated such that the HCAL trigger fired for all events of the stream, and (2) this issue was not observed before, afaik (because the prescales and seeds of the triggers of that stream were always in sync in the past). Some more details in [2].

The exception likely comes from
https://github.com/cms-sw/cmssw/blob/CMSSW_12_5_2_patch1/DQM/HcalTasks/plugins/HcalGPUComparisonTask.cc#L100
which suggests the HCAL client has a hard requirement on the existence of its input collections (I guess the DQM plugins in the analogous ecalgpu client handle this more gracefully, since the latter client was not crashing in these runs).
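
To illustrate what "handling this more gracefully" could look like, here is a minimal standalone sketch. It does not use the CMSSW framework: `Event`, `RecHit`, `ComparisonTask` and `Outcome` are hypothetical stand-ins invented for this example, and `std::optional` only mimics a handle that may or may not be valid. The idea is simply to skip the event with a warning instead of throwing when an input collection is absent.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a rechit; this is not the CMSSW type.
struct RecHit {
  unsigned int detId;
  float energy;
};
using RecHitCollection = std::vector<RecHit>;

// Hypothetical stand-in for an event: each collection is either present or absent.
struct Event {
  std::optional<RecHitCollection> cpuHits;
  std::optional<RecHitCollection> gpuHits;
};

enum class Outcome { Compared, Skipped };

// Hypothetical comparison task that tolerates missing inputs.
struct ComparisonTask {
  unsigned int nCompared = 0;
  unsigned int nSkipped = 0;
  std::vector<std::string> warnings;

  Outcome process(const Event& event) {
    if (!event.cpuHits || !event.gpuHits) {
      // Record a warning and skip the event instead of throwing.
      // If both collections are missing, only the CPU one is named.
      const char* missing = event.cpuHits ? "GPU" : "CPU";
      warnings.push_back(std::string("skipping event: ") + missing +
                         " collection is not available");
      ++nSkipped;
      return Outcome::Skipped;
    }
    // The actual CPU-vs-GPU comparison would go here.
    ++nCompared;
    return Outcome::Compared;
  }
};
```

With this shape, an event that reached the stream through another trigger is counted as skipped rather than aborting the whole job. Whether a skip, a warning, or a dedicated monitoring histogram is the right reaction is a choice for the client developers.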

I'd suggest that DQM and HCAL double-check this, and implement improvements if needed.

FYI: @cms-sw/heterogeneous-l2 (GPUs are mentioned, but this is not really related to heterogeneity)

[1]

----- Begin Fatal Exception 17-Nov-2022 19:24:17 CET-----------------------
An exception of category 'HCALDQM' occurred while
   [0] Processing  Event run: 362239 lumi: 2 event: 769 stream: 0
   [1] Running path 'tasksPath'
   [2] Calling method for module HcalGPUComparisonTask/'hcalGPUComparisonTask'
Exception Message:
HcalGPUComparisonTask::The CPU HBHERecHitCollection "hltHbherecoFromGPU" is not available
----- End Fatal Exception -------------------------------------------------

(By the way, the message seems confusing, since the collection hltHbherecoFromGPU is from GPU, but the message says it is the "CPU" collection.)

[2]
The triggers sending events to the DQMGPUvsCPU stream are DQM_HIPixelReconstruction_v, DQM_HIEcalReconstruction_v, and DQM_HIHcalReconstruction_v. In run-362239, the Pixel trigger was mistakenly unprescaled, firing at a much higher rate than the ECAL and HCAL ones; in this run the HCAL-GPU DQM client crashed. In run-362243, the HLT prescales were fixed, setting them equal for the 3 triggers; in this and following runs, the HCAL-GPU DQM client did not crash.
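
The effect of out-of-sync prescales can be sketched with a toy model. This is an assumption-laden simplification, not how the HLT actually decides: here a path with prescale N accepts every N-th event, L1 seeds and filter decisions are ignored, and the HCAL collection is taken to exist only in events where the HCAL path was not prescaled away. The helpers `accepts` and `countStream` and the prescale values in the usage note are hypothetical.

```cpp
#include <cstdint>

// Toy model (assumption): a path with prescale N accepts every N-th event;
// a prescale of 0 disables the path.
bool accepts(std::uint64_t eventIndex, unsigned int prescale) {
  return prescale != 0 && eventIndex % prescale == 0;
}

struct StreamCounts {
  std::uint64_t inStream = 0;     // events written to the stream
  std::uint64_t withoutHcal = 0;  // of those, events lacking the HCAL collection
};

// The stream is the OR of the three paths; the HCAL collection is assumed
// to be produced only when the HCAL path itself accepts the event.
StreamCounts countStream(std::uint64_t nEvents, unsigned int pixelPrescale,
                         unsigned int ecalPrescale, unsigned int hcalPrescale) {
  StreamCounts counts;
  for (std::uint64_t i = 0; i < nEvents; ++i) {
    const bool pixel = accepts(i, pixelPrescale);
    const bool ecal = accepts(i, ecalPrescale);
    const bool hcal = accepts(i, hcalPrescale);
    if (pixel || ecal || hcal) {
      ++counts.inStream;
      if (!hcal) {
        ++counts.withoutHcal;
      }
    }
  }
  return counts;
}
```

In this model, `countStream(1000, 1, 100, 100)` (Pixel unprescaled) puts 1000 events in the stream, 990 of which lack the HCAL collection, while `countStream(1000, 100, 100, 100)` (equal prescales) puts 10 events in the stream and none of them lack it.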

@cmsbuild
Contributor

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign dqm

FYI @cms-sw/hcal-dpg-l2

@cmsbuild
Contributor

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@syuvivida
Contributor

@missirol Thanks for this check. We have contacted the HCAL DQM experts and pointed them to this github issue.

@fwyzard
Contributor

fwyzard commented Nov 19, 2022

@syuvivida
Contributor

@missirol just to note: hcalgpu actually crashed already also in run 362221. I couldn't find if the pixel trigger prescale for this GPUvsCPU trigger path was also wrong for this run.
https://cmsweb.cern.ch/dqm/dqm-square/tmp/content_parser_productionPARSER_run362221

@missirol
Contributor Author

hcalgpu actually crashed already also in run 362221

Yes. As noted in the 1st line of the description, this happened for a few runs up until run-362239 (the latter being the last one with this crash).

I couldn't find if the pixel trigger prescale for this GPUvsCPU trigger path was also wrong for this run.

They were (which confirms the initial diagnosis).

@syuvivida
Contributor

@lwang046 Maybe Long Wang (HCAL DQM developers) could comment on the code in the hcalgpu client?

@lwang046
Contributor

Hi, thank you all for the investigation. A quick follow-up from our side: we are now preparing a protection to bypass the "collection not found" exception, and indeed the warning logs are a bit confusing because of the mistaken mixing of ref/target names here.
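
One way to keep the ref/target wording and the collection label from drifting apart is to build the message from a single (role, label) pair. The following is only a sketch: the `Role` enum and the `missingCollectionMessage` helper are hypothetical names for this example, not part of the HCAL DQM code, and the mapping of CPU/GPU to reference/target is deliberately left out.

```cpp
#include <string>

// Hypothetical role tag carried together with each input label.
enum class Role { Reference, Target };

// Hypothetical helper: derive the wording from the role, so the text
// cannot describe one collection while naming the other.
std::string missingCollectionMessage(Role role, const std::string& label) {
  const char* roleName = (role == Role::Reference) ? "reference" : "target";
  return std::string("The ") + roleName + " HBHERecHitCollection \"" + label +
         "\" is not available";
}
```

Calling `missingCollectionMessage(Role::Target, "hltHbherecoFromGPU")` then yields a message whose role word and label come from the same call site.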

@missirol changed the title from "Crash of HCAL-GPU DQM client in run-362239" to "Crash of hcalgpu DQM client in run-362239" on Nov 20, 2022
@missirol
Contributor Author

please close

Hi, thank you all for the investigation. A quick follow-up from our side: we are now preparing a protection to bypass the "collection not found" exception, and indeed the warning logs are a bit confusing because of the mistaken mixing of ref/target names here.

This has been integrated in 13_0_X (#40117), and backported to 12_6_X (#40257), 12_5_X (#40118), and 12_4_X (#40119).
