
reco::Candidate pt = -nan in some rare cases #39110

Closed
swagata87 opened this issue Aug 19, 2022 · 55 comments

@swagata87
Contributor

This is to keep track of (and hopefully solve at some point) an issue where reco::Candidate pt is -nan in some rare cases. It seems that this problem can also lead to the egamma object's PF isolation being -nan, if the problematic candidate ends up in the egamma object's isolation cone.

One such problematic event is in this AOD file:
root://xrootd-cms.infn.it//store/mc/Run3Winter22DR/GJet_Pt-10to40_DoubleEMEnriched_TuneCP5_13p6TeV_pythia8/AODSIM/FlatPU0to70_122X_mcRun3_2021_realistic_v9-v2/2430000/6cd37543-62ec-4f62-9fa9-23b7c66f9c20.root ,
the exact run:event number is:
eventsToProcess = cms.untracked.VEventRange('1:78326956-1:78326956'),
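
For reference, a minimal sketch of how this can be plugged into a cmsRun configuration; the process name is hypothetical, while the file and event range are the ones quoted above:

import FWCore.ParameterSet.Config as cms

process = cms.Process("SCAN")  # hypothetical process name; any analyzer/step can follow
process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring(
        'root://xrootd-cms.infn.it//store/mc/Run3Winter22DR/GJet_Pt-10to40_DoubleEMEnriched_TuneCP5_13p6TeV_pythia8/AODSIM/FlatPU0to70_122X_mcRun3_2021_realistic_v9-v2/2430000/6cd37543-62ec-4f62-9fa9-23b7c66f9c20.root'
    ),
    # restrict the job to the single problematic event
    eventsToProcess = cms.untracked.VEventRange('1:78326956-1:78326956'),
)
process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(-1))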

If we run the AOD->MiniAOD step (I ran in CMSSW_12_2_1) for this event, we get the following warning:

%MSG-w PrimaryVertexSorting:   PFCandidatePrimaryVertexSorter:primaryVertexAssociationJME  19-Aug-2022 13:06:05 CEST Run: 1 Event: 78326956
Scaling is NAN ignoring this candidate/track
%MSG

triggered from here:

if (edm::isNotFinite(scale)) {
edm::LogWarning("PrimaryVertexSorting") << "Scaling is NAN ignoring this candidate/track" << std::endl;

I checked that, in that loop, c->pt() = -nan, and the pdgId is -211.
In this event, we have 2 reconstructed gedPhotons, both have chargedHadronIso()=-nan.

Another example of such a problematic event is in this GEN-SIM-DIGI-RAW file:
/eos/cms/store/relval/CMSSW_12_1_0_pre4/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/121X_mcRun3_2021_realistic_v10_HighStat-v2/10000/22c2fc3b-069f-4437-ab6d-edf0a9c0dfc7.root ,
the exact event is:
eventsToProcess = cms.untracked.VEventRange('1:36727-1:36727')

Running the RECO step (in CMSSW_12_1_0_pre4) on this event, we get the same warning

%MSG-w PrimaryVertexSorting:   PFCandidatePrimaryVertexSorter:primaryVertexAssociationJME  19-Aug-2022 13:18:45 CEST Run: 1 Event: 36727
Scaling is NAN ignoring this candidate/track
%MSG

In this case, reco::Candidate pt = -nan and its pdgId is 211.
And the event has 2 reconstructed ged photons with -nan value of charged hadron isolation.

Next I plan to check whether the issue of the photon's charged hadron isolation being -nan also happens in data.

@cmsbuild
Contributor

A new Issue was created by @swagata87 Swagata Mukherjee.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@swagata87
Contributor Author

Thanks to @shdutta16 and @Prasant1993 for bringing this to my notice; this came up while training new photon IDs for Run 3.

FYI @cms-sw/pf-l2

@slava77
Contributor

slava77 commented Aug 19, 2022

@swagata87
Do you see NaNs in the persisted data, or do they appear as part of running some algorithm (e.g. in or upstream of PFCandidatePrimaryVertexSorter)?
Having a specific PF candidate with a NaN would be more direct to trace.

@makortel
Contributor

assign reconstruction

@cmsbuild
Contributor

New categories assigned: reconstruction

@jpata,@clacaputo,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@swagata87
Contributor Author

Looping over particleFlow candidates (vector<reco::PFCandidate> "particleFlow" "" "RECO") in those problematic events, I see that the pt, eta, and phi of one PF candidate are -nan. @slava77
So I guess it's a general problem, not specific to the PFCandidatePrimaryVertexSorter algo.
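
For completeness, a minimal FWLite sketch of this kind of scan (the local file name is hypothetical; the collection label is the one quoted above):

import math
from DataFormats.FWLite import Events, Handle

events = Events("AOD.root")  # hypothetical local copy of the problematic AOD event
handle = Handle("std::vector<reco::PFCandidate>")
label = ("particleFlow", "", "RECO")

for event in events:
    event.getByLabel(label, handle)
    for cand in handle.product():
        # flag any candidate whose kinematics are not finite (NaN or inf)
        if not all(math.isfinite(x) for x in (cand.pt(), cand.eta(), cand.phi())):
            print("non-finite PF candidate: pdgId=%d pt=%s eta=%s phi=%s"
                  % (cand.pdgId(), cand.pt(), cand.eta(), cand.phi()))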

@slava77
Contributor

slava77 commented Aug 19, 2022

Looping over particleFlow candidates (vector<reco::PFCandidate> "particleFlow" "" "RECO") in those problematic events, I see that the pt, eta, and phi of one PF candidate are -nan. @slava77 So I guess it's a general problem, not specific to the PFCandidatePrimaryVertexSorter algo.

considering that it was a pion, is there a nan in the generalTracks collection?

@swagata87
Contributor Author

Yes, the -nan PF candidates in both events were pions.

I do not see any NaN in the generalTracks pt/eta/phi in those events.
The issue seems to be in the PF charged hadrons.
This is also reflected in the fact that, while the photon's charged hadron isolation is -nan,
the track isolation values make sense:

In one event, both photons have:
trkSumPtSolidCone=0 trkSumPtHollowCone=0

In the other event:
photon1: trkSumPtSolidCone=17.5619 trkSumPtHollowCone=17.5619
photon2: trkSumPtSolidCone=38.5868 trkSumPtHollowCone=38.5868

@slava77
Contributor

slava77 commented Aug 19, 2022

assign pf

@cmsbuild
Contributor

New categories assigned: pf

@kdlong,@juska you have been requested to review this Pull request/Issue and eventually sign? Thanks

@swagata87
Contributor Author

swagata87 commented Aug 19, 2022

As an intermediate solution, what we can do is not consider such problematic PF candidates in the isolation sum of photons. This way we won't have NaN isolation for photons anymore. This does not address the root cause of why such PF candidates are reconstructed in the first place, but it's better than nothing and serves the purpose for egamma.
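
Schematically, the mitigation is just to skip non-finite candidates when summing; an illustrative Python-only sketch (the real change is in the C++ isolation code, #39120):

import math

def charged_hadron_iso(photon_eta, photon_phi, candidates, cone=0.3):
    # Toy isolation sum that ignores candidates with non-finite kinematics;
    # a single NaN candidate would otherwise poison the whole sum.
    iso = 0.0
    for cand in candidates:
        if not all(math.isfinite(x) for x in (cand.pt(), cand.eta(), cand.phi())):
            continue
        dphi = (cand.phi() - photon_phi + math.pi) % (2.0 * math.pi) - math.pi
        dr = math.hypot(cand.eta() - photon_eta, dphi)
        if dr < cone:
            iso += cand.pt()
    return iso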

@swagata87
Contributor Author

Right, so now I checked that the issue also happens (though rarely) in data (checked in 2022C).
I don't quite like the photon's charged isolation being -nan,
so I prepared a patch to fix the photon's isolation: #39120

@swagata87
Contributor Author

Okay, after a few more checks, the real issue seems to be here:

math::XYZTLorentzVectorD momentumSec;
momentumSec = momentumPrim / momentumPrim.E() * (primaryCand.ecalEnergy() + primaryCand.hcalEnergy());

Before doing that division, we should check that the divisor is >0.
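
A quick numpy illustration of the failure mode, assuming (as observed here) a primary candidate whose four-momentum is entirely zero; the calo energy value is arbitrary:

import numpy as np

p4 = np.zeros(4)        # (px, py, pz, E) all zero, as seen for the problematic candidate
calo_energy = 12.3      # arbitrary non-zero ecalEnergy + hcalEnergy, for illustration only

with np.errstate(divide="ignore", invalid="ignore"):
    momentum_sec = p4 / p4[3] * calo_energy   # 0/0 -> nan, the failure mode above
print(momentum_sec)     # [nan nan nan nan]

# guarded version, in the spirit of the patch below: only rescale if the energy is positive
momentum_sec = p4 / p4[3] * calo_energy if p4[3] > 0.0 else np.zeros(4)
print(momentum_sec)     # [0. 0. 0. 0.]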

with this patch, the issue seems to be solved:

diff --git a/RecoParticleFlow/PFProducer/src/PFCandConnector.cc b/RecoParticleFlow/PFProducer/src/PFCandConnector.cc
index 6d3d1fa9ec8..bd96361516e 100644
--- a/RecoParticleFlow/PFProducer/src/PFCandConnector.cc
+++ b/RecoParticleFlow/PFProducer/src/PFCandConnector.cc
@@ -156,9 +156,10 @@ void PFCandConnector::analyseNuclearWPrim(PFCandidateCollection& pfCand,
 
   const math::XYZTLorentzVectorD& momentumPrim = primaryCand.p4();
 
-  math::XYZTLorentzVectorD momentumSec;
+  math::XYZTLorentzVectorD momentumSec(0,0,0,0);
 
-  momentumSec = momentumPrim / momentumPrim.E() * (primaryCand.ecalEnergy() + primaryCand.hcalEnergy());
+  if ( (momentumPrim.E() * (primaryCand.ecalEnergy() + primaryCand.hcalEnergy())) > 0.0 )
+      momentumSec = momentumPrim / momentumPrim.E() * (primaryCand.ecalEnergy() + primaryCand.hcalEnergy());
 
   map<double, math::XYZTLorentzVectorD> candidatesWithTrackExcess;
   map<double, math::XYZTLorentzVectorD> candidatesWithoutCalo;

I did not check the effect of the above patch on other objects (jet/met/tau etc)

@perrotta
Contributor

@swagata87 wouldn't it be enough to check that momentumPrim.E() > 0?

@kdlong
Contributor

kdlong commented Aug 23, 2022

Thanks a lot @swagata87. I don't really understand how the energy can be zero here, is the p4 also fully zero? Can you share the setup you're using for testing (just rerunning reco on the RAW event you shared above)?

@swagata87
Contributor Author

wouldn't it be enough to check that momentumPrim.E() > 0?

Hello @perrotta, you are right:
the divisor is just momentumPrim.E(), so checking that it is >0 is sufficient to solve the issue.

@swagata87
Contributor Author

I don't really understand how the energy can be zero here, is the p4 also fully zero?

Yeah... no idea why momentumPrim.E() is zero;
actually px, py, pz are all zero in the 2 events where I am checking these things.

Can you share the setup you're using for testing (just rerunning reco on the RAW event you shared above)?

Yes, I am running the RECO step on RAW. Below is the setup I have:

  1. Run3 DATA
    CMSSW_12_4_6
    Config file obtained with this cmsDriver command:
    cmsDriver.py --filein /store/data/Run2022C/EGamma/RAW/v1/000/355/872/00000/3b478527-d206-4b8e-8004-08e9aff7758b.root --fileout file:AOD.root --data --eventcontent AOD --runUnscheduled --customise Configuration/DataProcessing/Utils.addMonitoring --datatier AOD --conditions 124X_dataRun3_Prompt_v4 --step RAW2DIGI,RECO --geometry DB:Extended --era Run3 --python_filename aod_cfg.py --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --no_exec -n -1

Added eventsToProcess = cms.untracked.VEventRange('355872:546279379-355872:546279379'), to just run on the problematic event.

  2. RelVal MC
    CMSSW_12_1_0_pre4
    Config file obtained with this cmsDriver command:
    cmsDriver.py --filein 'file:/eos/cms/store/relval/CMSSW_12_1_0_pre4/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/121X_mcRun3_2021_realistic_v10_HighStat-v2/10000/22c2fc3b-069f-4437-ab6d-edf0a9c0dfc7.root' --fileout file:AOD.root --mc --eventcontent AODSIM --runUnscheduled --customise Configuration/DataProcessing/Utils.addMonitoring --datatier AODSIM --conditions 121X_mcRun3_2021_realistic_v10 --step RAW2DIGI,RECO --geometry DB:Extended --era Run3 --python_filename aod_cfg.py --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --no_exec -n -1

Added eventsToProcess = cms.untracked.VEventRange('1:36727-1:36727'), to just run on the problematic event.

@jpata
Contributor

jpata commented Aug 24, 2022

Even though #39120 will mitigate it for photons, I could imagine there's a high likelihood that a NaN PFCandidate will also mess up other things.

Therefore, @cms-sw/pf-l2 @laurenhay it might be useful to consider a fix in PF on a short timescale (or understanding if the issue comes from somewhere upstream in reco).

@kdlong
Contributor

kdlong commented Aug 24, 2022

Hi @jpata, yes, I am taking a look. We seem to have candidates with zeros for kinematics before the corrections, which turns them into NaNs. Trying to understand and fix this early in the chain; I will have news later this week.

@kdlong
Contributor

kdlong commented Sep 1, 2022

Hi all, sorry, it took me a bit longer to reproduce the error and figure out a suitable debug setup to chase down what's happening, but I think I have some useful info now.

The NaN comes from the line identified by Swagata, in a step where the nuclear interactions are recombined into a single candidate. The original issue, however, comes from a candidate with zero four-momentum.

Looking at event 355872:546279379 from the file /store/data/Run2022C/EGamma/RAW/v1/000/355/872/00000/92d09274-5b85-48a0-9ed9-78c0b33ba560.root (it's mixed up above): the track associated to this candidate is displaced (hence the nuclear interaction recovery). It has valid kinematics, but a high qoverpError wrt qoverp, which causes its momentum to be rescaled for consistency with the total calo energy; this is where the zero comes in. The trackRef key is 418, with track.qoverpError() = 0.03167 and track.qoverp() = -0.02654. I checked whether this is fixed by the mkFit pixelless revert in CMSSW_12_4_7, but it isn't, as the track isn't from a pixelless iteration: track.algoName() = "highPtTripletStep".

I believe what's happening in these lines is a rescaling of the kinematics of all charged hadron tracks simultaneously, such that their sum is consistent with the sum of calo energy. Because the error of this track is so high, it gets a negative value from the fit and is reset to zero here: https://github.com/cms-sw/cmssw/blob/master/RecoParticleFlow/PFProducer/src/PFAlgo.cc#L2510-L2511
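
As a toy picture of that rescaling (not the actual PFAlgo code, just a schematic constrained least-squares sketch of the same idea): when the track momenta are pulled towards the total calo energy weighted by their resolutions, the track with the huge uncertainty absorbs essentially all of the shift and can come out negative, which is then clipped to zero.

import numpy as np

def rescale_to_calo(p, sigma, e_calo):
    # Toy constrained least squares: minimize sum(((p_new - p) / sigma)**2)
    # subject to sum(p_new) = e_calo.  Closed form via a Lagrange multiplier:
    # p_new_i = p_i + sigma_i**2 * lam, with lam = (e_calo - sum(p)) / sum(sigma**2).
    p = np.asarray(p, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    lam = (e_calo - p.sum()) / (sigma**2).sum()
    return p + sigma**2 * lam

p = [10.0, 8.0, 37.7]      # GeV; the last entry mimics the p ~ 38 GeV track above
sigma = [0.5, 0.5, 45.0]   # hypothetical resolutions; the last track is very poorly measured
e_calo = 10.0              # total calo energy well below the summed track momenta

rescaled = rescale_to_calo(p, sigma, e_calo)
print(rescaled)                        # the poorly measured track comes out negative
print(np.clip(rescaled, 0.0, None))    # clipping it to zero leaves a "ghost" candidate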

In general I don't think zero is a reasonable fallback, as it gives a "ghost" candidate with no kinematics or mass. I guess it would be better to remove it at that point, or at the very least make sure the mass is taken into account to avoid problems downstream. At the moment I'm not sure how technically difficult it would be to remove a candidate at this point in the code; certainly one has to be careful. On the other hand, it might make sense to exclude these tracks from being promoted to PF candidates in the first place, but I need to think/investigate more on that.

I guess this issue is happening more often now if mkFit is giving a higher fraction of displaced tracks with high uncertainty. Is it known or expected? Maybe the mkFit experts want to provide some feedback on this particular track?

@slava77
Contributor

slava77 commented Sep 1, 2022

I guess this issue is happening more often now if mkFit is giving a higher fraction of displaced tracks with high uncertainty. Is it known or expected? Maybe the mkFit experts want to provide some feedback on this particular track?

What is the direction of this track, and also the number of hits?
mkFit tracks are on average shorter than they previously were in CKF; this can affect the resolution.
This may also be a very forward 3-hit (actually 3-layer) track, in which case it probably doesn't matter whether it's mkFit or CKF.

@kdlong
Contributor

kdlong commented Sep 1, 2022

Hi Slava, here is some more info on the track

In [9]: print(track.pt(), track.eta(), track.phi(), track.p())
5.456 -2.620 -1.916 37.67
In [12]: track.numberOfValidHits()
Out[12]: 5
In [13]: track.numberOfLostHits()
Out[13]: 0

@slava77
Contributor

slava77 commented Sep 1, 2022

Hi Slava, here is some more info on the track

In [9]: print(track.pt(), track.eta(), track.phi(), track.p())
5.456 -2.620 -1.916 37.67
In [12]: track.numberOfValidHits()
Out[12]: 5
In [13]: track.numberOfLostHits()
Out[13]: 0

Also, you mention in your original post that the track is displaced.
How large is it in dxy(bs)? It's a bit odd to have a large displacement in a pixel-based iteration; the seeding region is limited to 0.02 cm.

@mmusich
Contributor

mmusich commented Sep 5, 2022

There is a relVal MC where the issue was seen, for that we have the RAW,
not sure if it is useful, but this is the raw: /eos/cms/store/relval/CMSSW_12_1_0_pre4/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/121X_mcRun3_2021_realistic_v10_HighStat-v2/10000/22c2fc3b-069f-4437-ab6d-edf0a9c0dfc7.root

Thanks @swagata87, it is useful.

@mmusich
Contributor

mmusich commented Sep 5, 2022

Using the setup at #39110 (comment),
but changing the global tag to 121X_mcRun3_2021_design_v9 (which includes ideal alignment / calibrations), the warning

%MSG-w PrimaryVertexSorting:   PFCandidatePrimaryVertexSorter:primaryVertexAssociationJME  05-Sep-2022 12:18:59 CEST Run: 1 Event: 36727
Scaling is NAN ignoring this candidate/track
%MSG

is gone. So it seems it's a feature of the calibrations used in the realistic GT.
Adding @cms-sw/trk-dpg-l2 to the thread.

@mmusich
Contributor

mmusich commented Sep 5, 2022

Can one of you prepare a list of the DetIds of the hits associated to the problematic track in the MC event?

@kdlong
Contributor

kdlong commented Sep 7, 2022

Can one of you prepare a list of the DetIds of the hits associated to the problematic track in the MC event?

@mmusich was this for me or for the dpg conveners? I spent some time messing with it: the trackRef key = 3 and ID = 1463 in the AOD that I produce from the RAW, but tracks.recHits() hasn't been stored. Can you give me the keep statement to store all the tracks so this works?

@mmusich
Contributor

mmusich commented Sep 7, 2022

was this for me or for the dpg conveners?

to anyone who can devote some time to this :)

I spent some time messing with it: the trackRef key = 3 and ID = 1463 in the AOD that I produce from the RAW

thanks, this is already helping.

Can you give me the keep statement to store all the tracks so this works?

I think that if you retain the RECO data-tier instead of AOD, you will have all the info needed.

@kdlong
Contributor

kdlong commented Sep 9, 2022

Thanks, good point. I'm still not 100% sure about the interface: if I do track.recHit(i) I get a SiPixelRecHit which always returns null from its det() function. I can give you the collection, keys, and raw IDs from these objects, though:

for i in range(track.recHitsSize()):
     rh = track.recHit(i)
     print(rh.id().id(), rh.key(), rh.get().rawId())

gives

454 40 303063048
454 41 344241156
454 42 344242180
454 43 344503300
454 44 344504324
454 45 344765444
454 46 344766468
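
In case it helps, which of these rawIds sit in BPix vs FPix can be read off from the top-level DetId bit fields (detector in bits 28-31, tracker subdetector in bits 25-27, with PixelBarrel = 1 and PixelEndcap = 2); a small sketch using the rawIds above:

raw_ids = [303063048, 344241156, 344242180, 344503300, 344504324, 344765444, 344766468]
subdet_names = {1: "BPix", 2: "FPix"}

for raw in raw_ids:
    det = (raw >> 28) & 0xF   # top-level detector field (1 = Tracker)
    sub = (raw >> 25) & 0x7   # subdetector field within the tracker
    print(raw, "det=%d" % det, subdet_names.get(sub, "other(%d)" % sub))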

@slava77
Contributor

slava77 commented Sep 9, 2022

Thanks, good point. I'm still not 100% sure about the interface: if I do track.recHit(i) I get a SiPixelRecHit which always returns null from its det() function.

can you try to print details from the reco job itself?

@kdlong
Contributor

kdlong commented Sep 12, 2022

@slava77 I'm not immediately sure where I should add the printouts; probably someone from the tracking POG would be able to do this quicker than me.

Addressed from the PF side in #39368

@clacaputo
Contributor

type pf

@cmsbuild added the pf label Sep 12, 2022
@shdutta16
Contributor

I noticed that one of the algorithms I was running over Run 3 GJet samples was quitting with the error that the charged hadron isolation had a NaN value. This is actually when I noticed this issue.
When I had run the same algorithm on Run 2 GJet samples earlier, this did not crop up, so most likely the Run 2 samples did not have this issue.

@mmusich
Contributor

mmusich commented Oct 7, 2022

I had a further look at this:

454 40 303063048
454 41 344241156
454 42 344242180
454 43 344503300
454 44 344504324
454 45 344765444
454 46 344766468

these are all Pixel DetIds:

Info on the DetIds
Det raw: 303063048 0 1 1
Barrel layer, ladder, module 1 6 2
shell mO(1) 4 1 -3 -3 0 BPix_BmO_SEC4_LYR1_LDR3F_MOD3 v2x8 303063048 
303063048 BPix_BmO_SEC4_LYR1_LDR3F_MOD3 r/phi/z = 2.73876/2.8958/-16.75 cmssw layer/ladder/module 1/6/2 E-phi -0.245795
Position 2.73876 -10.05 2.8958 0.0285 160 416

Det: 344241156 0 1 2
endcap, side=1 disk=1, blade=11, panel=1, plaq=1
mO FPix_BmO_D1_BLD5_PNL1_RNG1 1 5 1 1 1 v2x8 344241156
344241156 FPix_BmO_D1_BLD5_PNL1_RNG1 r/phi/z = 7.77775/2.78724/-33.6689 cmssw side/disk/blade/pannel/plaq=1/1/11/1/1 E-z 0.919158
Position 7.77775 -32.069 3.07284 0.029 160 416

Det: 344242180 0 1 2
endcap, side=1 disk=1, blade=11, panel=2, plaq=1
mO FPix_BmO_D1_BLD5_PNL2_RNG1 1 5 2 1 1 v2x8 344242180
344242180 FPix_BmO_D1_BLD5_PNL2_RNG1 r/phi/z = 7.81674/2.92065/-33.4758 cmssw side/disk/blade/pannel/plaq=1/1/11/2/1 E-z -0.919158
Position 7.81674 -31.8759 -3.07694 0.029 160 416

Det: 344503300 0 1 2
endcap, side=1 disk=2, blade=11, panel=1, plaq=1
mO FPix_BmO_D2_BLD5_PNL1_RNG1 2 5 1 1 1 v2x8 344503300
344503300 FPix_BmO_D2_BLD5_PNL1_RNG1 r/phi/z = 7.77775/2.78724/-41.1689 cmssw side/disk/blade/pannel/plaq=1/2/11/1/1 E-z 0.919158
Position 7.77775 -39.569 3.07284 0.029 160 416

Det: 344504324 0 1 2
endcap, side=1 disk=2, blade=11, panel=2, plaq=1
mO FPix_BmO_D2_BLD5_PNL2_RNG1 2 5 2 1 1 v2x8 344504324
344504324 FPix_BmO_D2_BLD5_PNL2_RNG1 r/phi/z = 7.81674/2.92065/-40.9758 cmssw side/disk/blade/pannel/plaq=1/2/11/2/1 E-z -0.919158
Position 7.81674 -39.3759 -3.07694 0.029 160 416

Det: 344765444 0 1 2
endcap, side=1 disk=3, blade=11, panel=1, plaq=1
mO FPix_BmO_D3_BLD5_PNL1_RNG1 3 5 1 1 1 v2x8 344765444
344765444 FPix_BmO_D3_BLD5_PNL1_RNG1 r/phi/z = 7.77775/2.78724/-50.6689 cmssw side/disk/blade/pannel/plaq=1/3/11/1/1 E-z 0.919158
Position 7.77775 -49.069 3.07284 0.029 160 416

Det: 344766468 0 1 2
endcap, side=1 disk=3, blade=11, panel=2, plaq=1
mO FPix_BmO_D3_BLD5_PNL2_RNG1 3 5 2 1 1 v2x8 344766468
344766468 FPix_BmO_D3_BLD5_PNL2_RNG1 r/phi/z = 7.81674/2.92065/-50.4758 cmssw side/disk/blade/pannel/plaq=1/3/11/2/1 E-z -0.919158
Position 7.81674 -48.8759 -3.07694 0.029 160 416

Namely, it's a track with 1 hit in BPix layer 1 and 6 hits on side 1 of FPix (suggesting it's very forward).

Is it obvious that the issue is not e.g. in misalignment or some other source of mis-/displaced hits in this data?

Comparing the rigid-body alignment parameters (the 3 Cartesian coordinates of the sensor active-area center and the corresponding Euler angles) of these DetIds between the "realistic" misaligned geometry and the "ideal" one, I don't see exceedingly worrisome deviations:

[image: comparison of the rigid-body alignment parameters of these DetIds between the realistic and ideal geometries]

albeit there is one FPix DetId that has a somewhat high global-z displacement (around 60 μm), considering that the RMS of the whole set is about 20 μm:

[image: global-z displacements of these DetIds, with one FPix module at about 60 μm]

On the other hand I have seen that using this overridden set of conditions on top of the recipe at #39110 (comment)

from Configuration.AlCa.GlobalTag import GlobalTag
process.GlobalTag = GlobalTag(process.GlobalTag, '121X_mcRun3_2021_realistic_v10', '')
process.GlobalTag.toGet = cms.VPSet(                                                                                                                                                                      
    cms.PSet(record = cms.string("TrackerSurfaceDeformationRcd"),
             tag = cms.string("TrackerSurfaceDeformations_zero")
    )
)

is sufficient to remove the warning "Scaling is NAN ignoring this candidate/track".
So it looks like the issue is actually caused by the non-rigid-body degrees of freedom.

@swagata87
Contributor Author

+1

@stahlleiton
Contributor

@cms-sw/reconstruction-l2 : Is there something pending to close this issue?

@cmsbuild
Contributor

cmsbuild commented Jun 24, 2024

cms-bot internal usage

@mandrenguyen
Contributor

+1

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@stahlleiton
Contributor

It would be appreciated if someone could close this issue. Thanks.

@mandrenguyen
Contributor

please close
