HLT Crashes in HCAL PFClustering in Cosmics Run 383219 #45477
A new Issue was created by @Sam-Harper. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
@cms-sw/pf-l2 FYI
type pf
(in the meantime) "HCAL had just come back into global after doing tests" - according to the HCAL OPS conveners, the HCAL test in local was a standard sanity check (deployment) of the new L1 TriggerKey (TP LUT) with a single channel response correction update. It then went on to be used in Global. No other special tests or changes (e.g. configuration changes) were done. The HCAL DQM conveners have been asked to carefully review the plots of the Cosmics run in question.
Allocation of the PF rechit fraction SoA is currently based on the number of rechits. I remember that during the Alpaka clustering development the recommendation was to allocate as much memory as needed in the configuration, since dynamic allocations had a notable detriment to performance. Has anything changed in this regard? Otherwise the "safest" configuration is
It depends on what is needed for the dynamic allocation. If the only requirement is to change a configuration value with a runtime value, I don't expect any impact. If it also requires splitting a kernel in two, it may add some overhead. |
In CUDA we had a
Thank you, Salavat, for the correction. Indeed there was a small game of telephone which led to my misunderstanding that the laser alignment tests were ongoing, when in fact they started just after this run.
@Sam-Harper
Update:
It's possible to implement the same logic in Alpaka, but (like for CUDA) you also need to split the EDProducer in two, to introduce the synchronisation after the
assign hlt, reconstruction
New categories assigned: hlt, reconstruction. @Martin-Grunewald, @mmusich, @jfernan2, @mandrenguyen you have been requested to review this Pull request/Issue and eventually sign. Thanks
Just to clarify, given that
+1 |
proposed solutions:
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
There were widespread crashes (~900) in cosmics run 383219.
No changes to the HLT / release had been made around this time, and no other runs had this issue either immediately before or after. It should be noted that HCAL had just come back into global after doing tests. Thus it seems plausible that HCAL came back in a weird state and that this is the cause of the crashes. I therefore think HCAL experts should review this event (and this run) to ensure they were sending us good data.
The crash is fully reproducible on the hilton and also on my local CPU only machine. The crash happens if the PFClustering is run, if this is not run, the crash does not happen.
An example event which crashes is at
/eos/cms/store/group/tsg/FOG/debug/240715_run383219/run383219_ls013_29703372.root
The cosmics menu run is at
/eos/cms/store/group/tsg/FOG/debug/240715_run38219/hlt.py
A minimal menu with just the HCAL reco is
/eos/cms/store/group/tsg/FOG/debug/240715_run383219/hltMinimal.py
The release was CMSSW_14_0_11_MULTIARCHS, but the crash is also reproduced in CMSSW_14_0_11.
The error on CPU is
The error on GPU is
gpuCrash.log
@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI
@cms-sw/hcal-dpg-l2 FYI