Segmentation violation in PromptReco for FastjetJetProducer:ak4PFJets #41397
Comments
A new Issue was created by @malbouis . @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign reconstruction |
Full stack trace from the log
|
assign reconstruction |
New categories assigned: reconstruction @mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks |
that's not a global run. I think the original message on the Tier-0 cmstalk is about 366451 |
Thanks Marco! I have updated the description. |
Let me add a recipe to reproduce the error, as discussed at the OPR meeting today.
|
I don't reproduce this error using the Pkl. |
Thanks, @mandrenguyen! I could reproduce it on lxplus when I tried it. Could someone else double-check that the crash can be reproduced on lxplus with the recipe that was posted above? |
@mandrenguyen just to confirm, were you using scram arch |
i also tried last night, and if you use the regular arch one gets in lxplus (not lxplus8): |
Thanks Marco! |
for the record, on an
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
process.options.numberOfThreads = 1
process.source.skipEvents = cms.untracked.uint32(586)
it will segfault consistently at the first event processed. |
The offending line is:
The problem appears to come from fjAreaDefinition_. That's as far as I have understood for the moment. If @cms-sw/jetmet-pog-l2 or @laurenhay have any ideas, feel free to chime in. |
Looking at where |
The value of cmssw/RecoJets/JetProducers/plugins/VirtualJetProducer.cc Lines 238 to 250 in a346606
|
Since it's ak4PFJets that's crashing, I believe |
I looped over the jet on which the code is crashing.
Out of the 3080 jet constituents, one of them has NaN for For what it's worth px,py,pz are set correctly: |
Some more observations.
I guess my next step would be to see if I can track the NaN back to where charged hadrons are first created |
Here is an issue from 2022 about a PFCandidate with NaN: #39110 (I did not attempt to understand whether it is related, though). Let's tag @cms-sw/pf-l2 anyway |
In case it's useful to examine the output, one can get the job to finish successfully by inserting the following in the loop over PF candidates in
|
We have 3 more occurrences of this error in pp runs, for dataset EphemeralZeroBias:
I post here the links for the tar files, in case someone would like to try to reproduce them (I did not yet have the chance) https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run366495_EphemeralZeroBias17/Reco |
Thanks for all the info, will take a look ASAP |
Thanks @kdlong cmssw/RecoParticleFlow/PFProducer/src/PFAlgo.cc Line 2746 in 9fa6185
chargedHadron.energy() is returning -nan for index = 1411
|
type pf |
We have yet another paused job in Tier0 due to this crash. It is occurring for run 366729 in dataset The tar ball can be found in https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run366729_EphemeralZeroBias10/Reco Is there any further progress in debugging this issue? |
Yes, there was a similar finding last year, which caused the photon isolation to be NaN when a bad PF candidate ended up in the photon's isolation cone. A preliminary fix was to loop over the PF candidate collection, check for NaN, remove the affected candidates, and build a pfCandNoNaN collection, which was then passed on to the isolation calculation. This is where it was done: https://github.com/cms-sw/cmssw/pull/39120/files Maybe something similar can be done for jet/MET, if this is easier and quicker to do. But of course the real issue needs to be solved upstream. Even if it is fixed at the PF level, such extra protections in POG code are probably not a bad idea, as the PF code (and logic) is complex and can go wrong in various unforeseen ways, especially in the startup phase, where alignment/calibrations are not perfect and several special checks/tests are ongoing using special modes (the interplay of those with the PF logic can be hard to predict). |
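For readers who want to see the filtering idea from the comment above in isolation, here is a minimal Python sketch. It is not the CMSSW code from the linked PR: the `Candidate` class, the function names, and the sample values are all hypothetical stand-ins, and the only assumption is that a candidate exposes px, py, pz and energy as plain floats.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical stand-in for a PF candidate; NOT the CMSSW reco::PFCandidate class.
    px: float
    py: float
    pz: float
    energy: float


def is_finite(cand):
    """Return True when every kinematic component is a finite number."""
    return all(math.isfinite(v) for v in (cand.px, cand.py, cand.pz, cand.energy))


def drop_non_finite(candidates):
    """Split the input into a cleaned list and the indices of rejected candidates."""
    clean, rejected = [], []
    for index, cand in enumerate(candidates):
        if is_finite(cand):
            clean.append(cand)
        else:
            rejected.append(index)
    return clean, rejected


# Illustrative values only: the second candidate mimics the case seen in this
# thread, where the momentum components look fine but the energy is NaN.
candidates = [
    Candidate(1.0, 2.0, 3.0, 4.0),
    Candidate(0.5, -0.2, 7.0, float("nan")),
    Candidate(-3.0, 0.1, 0.4, 3.1),
]
clean, rejected = drop_non_finite(candidates)
print(len(clean), rejected)
```

Returning the rejected indices alongside the cleaned list keeps the bad candidates traceable, which matters here because the filter only hides the symptom and the upstream cause still has to be found.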
Thanks @swagata87 ! This seems like a good solution in order to get rid of these crashes for now. We have indeed 4 more occurrences today. run 366729: run 366727: |
I was trying to reproduce this yesterday, and I couldn't get the failure. Now I can't access |
Hi @kdlong In PSet.py I skip directly to the crashing event, so you should find it immediately. |
Thanks @mandrenguyen. Unfortunately it seems the file has already been removed from disk. Does anyone have other examples of the failure with a file that's still accessible? |
@kdlong Taking one of the other examples from |
@kdlong You can use the following PSet.py to skip directly to the crashing event:
You can bypass the crash in FastJet by merging this one-liner PR: #41474 |
Thanks @mandrenguyen. I finally reproduced the issue and understood that it comes from the mass-aware scaling that I introduced in #39368. In the case of a track with a huge momentum but huge uncertainty (1e7 in the example given above), the scale factor is very small and the energy rescaling computation has numerical issues. The fix is simple: avoid the large ratios by calculating the energy from the rescaled momentum rather than by calculating a scaling factor. |
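The actual PFAlgo change is not quoted in this thread, so the following is only a rough Python sketch of the approach described above: scale the three-momentum first, then recompute the energy as E = sqrt(|p|^2 + m^2) instead of multiplying the old energy by a separately derived factor. The function name, the momentum value and the scale factor are hypothetical; 0.13957 GeV is the charged pion mass.

```python
import math


def rescale_candidate(px, py, pz, mass, momentum_scale):
    """Scale the three-momentum, then recompute the energy from the result.

    Hypothetical helper for illustration; it is not the function used in PFAlgo.
    """
    new_px, new_py, new_pz = (momentum_scale * c for c in (px, py, pz))
    # The energy is rebuilt from the rescaled momentum and the mass, so no
    # ratio of very large to very small numbers enters the computation.
    new_energy = math.sqrt(new_px**2 + new_py**2 + new_pz**2 + mass**2)
    return new_px, new_py, new_pz, new_energy


# Hypothetical numbers: a very large track momentum and a very small scale factor.
PION_MASS = 0.13957  # GeV
px, py, pz, energy = rescale_candidate(1.0e7, 0.0, 0.0, PION_MASS, 1.0e-6)
print(px, energy, math.isfinite(energy))
```

Because the energy comes from a square root of a sum of squares, it stays finite and is never smaller than the magnitude of the rescaled momentum, whatever the size of the scale factor.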
Just to make sure, is the problem described in this issue fixed now? |
Yes this issue can be closed. |
@cmsbuild, please close |
+1 |
This issue is fully signed and ready to be closed. |
There is one job failing Reco for Run 366451, dataset ParkingDoubleElectronLowMass, with a segmentation violation, as described in https://cms-talk.web.cern.ch/t/segmentation-error-in-promptreco-for-run-366451-dataset-parkingdoubleelectronlowmass/23152
The crash seems to be from module FastjetJetProducer:
The full log is at /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2023B/job_248341/job/WMTaskSpace/cmsRun1 as described in the original email.
I was able to reproduce the failure locally.