Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VertexException in Tier0 PromptReco for JetMET1 PD #45520

Closed
germanfgv opened this issue Jul 22, 2024 · 31 comments
Closed

VertexException in Tier0 PromptReco for JetMET1 PD #45520

germanfgv opened this issue Jul 22, 2024 · 31 comments

Comments

@germanfgv
Copy link
Contributor

Hi all,

We have one paused job for workflow PromptReco_Run383323_JetMET1 with the following error message:

cmsRun1
Fatal Exception (Exit Code: 8001)
An exception of category 'VertexException' occurred while
   [0] Processing  Event run: 383323 lumi: 177 event: 347254359 stream: 5
   [1] Running path 'dqmoffline_11_step'
   [2] Prefetching for module SMPDQM/'SMPDQM'
   [3] Prefetching for module MuonProducer/'muons'
   [4] Prefetching for module PFProducer/'particleFlowTmp'
   [5] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [6] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [7] Prefetching for module PFDisplacedTrackerVertexProducer/'pfDisplacedTrackerVertex'
   [8] Calling method for module PFDisplacedVertexProducer/'particleFlowDisplacedVertex'
Exception Message:
BasicSingleVertexState::could not invert weight matrix

You can find tarball for the job here:

/eos/user/c/cmst0/public/PausedJobs/Run2024F/VertexException/job_4996299

This issue was also reported in CMS Talk:
https://cms-talk.web.cern.ch/t/vertexexception-in-promptreco-run383323-jetmet1-during-dqm-offline-step/44019

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 22, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

A new Issue was created by @germanfgv.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Copy link
Contributor

mmusich commented Jul 22, 2024

assign reconstruction

@mmusich
Copy link
Contributor

mmusich commented Jul 22, 2024

type pf

@cmsbuild
Copy link
Contributor

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Copy link
Contributor

mmusich commented Jul 22, 2024

@cms-sw/pf-l2 FYI

@mmusich
Copy link
Contributor

mmusich commented Jul 22, 2024

Also seems prudent to tag @cms-sw/tracking-pog-l2 as well.

@swagata87
Copy link
Contributor

swagata87 commented Jul 22, 2024

Is it reproducible?
I tried the following (in intel), but there was no crash (maybe one should try AMD)

import FWCore.ParameterSet.Config as cms
from PSet import process
process.source.eventsToProcess = cms.untracked.VEventRange('383323:347254359-383323:347254359')

@mmusich
Copy link
Contributor

mmusich commented Jul 22, 2024

@swagata87

Is it reproducible?
I tried the following (in intel), but there was no crash (maybe one should try AMD)

Can you try on lxplus800, see #45190 (comment)

?

@swagata87
Copy link
Contributor

thank you Marco, yes the crash is reproducible in AMD.
we will check what's the problem

@germanfgv
Copy link
Contributor Author

thank you Marco, yes the crash is reproducible in AMD.

This job failed 4 times at Tier-0, all of them while running in AMD.

@swagata87
Copy link
Contributor

This [1] seems enough in this particular case. The crash is bypassed.
But if others have better ideas, let me know!

[1]

--- a/RecoParticleFlow/PFTracking/src/PFDisplacedVertexFinder.cc
+++ b/RecoParticleFlow/PFTracking/src/PFDisplacedVertexFinder.cc
@@ -313,10 +313,13 @@ bool PFDisplacedVertexFinder::fitVertexFromSeed(const PFDisplacedVertexSeed& dis
 
       if (debug_)
         cout << "First test with KFT" << endl;
-
-      KalmanVertexFitter theKalmanFitter(true);
-      theVertexAdaptiveRaw = theKalmanFitter.vertex(transTracksRaw, seedPoint);
-
+      try {
+       KalmanVertexFitter theKalmanFitter(true);
+       theVertexAdaptiveRaw = theKalmanFitter.vertex(transTracksRaw, seedPoint);
+      } catch (VertexException& e) {
+       edm::LogWarning(" PFDisplacedVertexFinder ") << " failure in KalmanVertexFitter! " << e.what();
+    }

@mmusich
Copy link
Contributor

mmusich commented Jul 22, 2024

@swagata87 @cms-sw/reconstruction-l2

seems enough in this particular case. The crash is bypassed.

would logging an error be more appropriate in this case?

But if others have better ideas, let me know!

I am curious about the properties of transTracksRaw on Intel vs AMD. In earlier similar issues IIRC there were extra tracks reconstructed in one case (and not the other) that lead to the crash. Can you check e.g. the collection multiplicity and perhaps some basic track quantities?

@mandrenguyen
Copy link
Contributor

If I print transTracksRaw for this event I see differences immediately, well before the code actually crashes.
For example the first print out gives transTracks.size() = 272 on intel, but transTracks.size() = 298
It's not a question of ordering as subsequent print outs all have a significantly lower value on both systems.

@swagata87
Copy link
Contributor

For the particular case where it fails, this is how it looks like;
there are 2 extra tracks in AMD.

AMD

transTracksRaw.size() = 6

pt / eta / phi=0.243945 / -1.05883 / -0.348677
normChi2=0.467958, valid hits=5

pt / eta / phi=0.874192 / -2.00627 / 2.43794 (not in intel)
normChi2=0.972991, valid hits=11

pt / eta / phi=1.6572 / -0.22933 / -3.08643
normChi2=0.528174, valid hits=19

pt / eta / phi=0.63309 / -0.239693 / -0.360681
normChi2=0.641863, valid hits=6

pt / eta / phi=2507.97 / -1.93988 / 2.55448 (not in intel)
normChi2=0.16946, valid hits=3

pt / eta / phi=4.86086 / -0.753577 / -3.10877
normChi2=1.86016, valid hits=14

Intel

 transTracksRaw.size() = 4

pt / eta / phi=1.6572 / -0.22933 / -3.08643
normChi2=0.528155, valid hits=19

pt / eta / phi=0.243945 / -1.05883 / -0.348677
normChi2=0.467958, valid hits=5

pt / eta / phi=0.63309 / -0.239693 / -0.360681
normChi2=0.641863, valid hits=6

pt / eta / phi=4.86086 / -0.753577 / -3.10877
normChi2=1.86009, valid hits=14

@mmusich
Copy link
Contributor

mmusich commented Jul 23, 2024

Thanks @swagata87 .
At least one of the extra tracks in AMD looks poorly measured, just 3 hits and very high pT (maybe fake). Wondering if those tracks should be filtered out somewhere in the pf codebase before the arriving to being fit into a vertex.
As for the other differences I am not sure it's established to have this level of disagreement between AMD and Intel. Perhaps @cms-sw/tracking-pog-l2 can chime in here.

@swagata87
Copy link
Contributor

Indeed the AMD-only track with valid hits=3 seems to trigger the crash. If we reject such tracks with low valid hit, then the crash is gone (I removed the try/catch to check this). And this is the change log[1].

However, for very displaced vertices, a low number of valid hits seems possible. Also, there are other tracks in the same event with valid hits=3 (both amd and intel), which does not cause any crash and basic track parameters also look ok. A few examples are given below (from amd)

pt / eta / phi=0.312894 / -0.0428623 / 2.82491
normChi2=0.00614393, valid hits=3

pt / eta / phi=1.25537 / 0.44921 / 1.13834
normChi2=1.41038, valid hits=3

pt / eta / phi=0.558604 / 1.67569 / 2.81214
normChi2=0.616471, valid hits=3

pt / eta / phi=1.17541 / 1.89087 / 2.63116
normChi2=0.650683, valid hits=3

[1]

--- a/RecoParticleFlow/PFTracking/src/PFDisplacedVertexFinder.cc
+++ b/RecoParticleFlow/PFTracking/src/PFDisplacedVertexFinder.cc
@@ -255,7 +255,10 @@ bool PFDisplacedVertexFinder::fitVertexFromSeed(const PFDisplacedVertexSeed& dis
   // Fill vectors of TransientTracks and TrackRefs after applying preselection cuts.
   for (auto const& ie : tracksToFit) {
     TransientTrack tmpTk(*(ie.get()), magField_, globTkGeomHandle_);
-    transTracksRaw.emplace_back(tmpTk);
+
+    if (tmpTk.numberOfValidHits() > 3) {
+      transTracksRaw.emplace_back(tmpTk);
+    }
     bool nonIt = PFTrackAlgoTools::nonIterative((ie)->algo());
     bool step45 = PFTrackAlgoTools::step45((ie)->algo());
     bool highQ = PFTrackAlgoTools::highQuality((ie)->algo());
@@ -281,7 +284,13 @@ bool PFDisplacedVertexFinder::fitVertexFromSeed(const PFDisplacedVertexSeed& dis
 
   // 1)If only two track are found do not prefit
 
-  if (transTracksRaw.size() == 2) {
+  if (transTracksRaw.size() < 2) {
+    if (debug_)
+      cout << "With <2 tracks, no fit is possible" << endl;
+    return true;
+  }
+  
+  else if (transTracksRaw.size() == 2) {
     if (debug_)
       cout << "No raw fit done" << endl;
     if (switchOff2TrackVertex_) {

@mmusich
Copy link
Contributor

mmusich commented Jul 23, 2024

However, for very displaced vertices, a low number of valid hits seems possible.

do you still have access to the tracking original algorithm at this level? also I would look at pTerr/pT or equivalently to the relative curvature (q/p) error.

@swagata87
Copy link
Contributor

The errors look really bad for the problematic track [1].

[1]

pt / eta / phi=2507.97 / -1.93988 / 2.55448
normChi2=0.16946, valid hits=3
algoName=duplicateMerge, originalAlgo=22
qoverpError = 1.1651e+08, ptError/pt = 1.03756e+12

The original algo is 22, which seems to be highPtTripletStep from

cosmicGlobalMuon = 21,
// Phase1
highPtTripletStep = 22,
lowPtQuadStep = 23,

The other tracks look fine [2].

[2]

 transTracksRaw.size() = 6

pt / eta / phi=0.243945 / -1.05883 / -0.348677
normChi2=0.467958, valid hits=5
algoName=detachedTripletStep, originalAlgo=7
qoverpError = 0.054121, ptError/pt = 0.0225846

pt / eta / phi=0.874192 / -2.00627 / 2.43794
normChi2=0.972991, valid hits=11
algoName=pixelLessStep, originalAlgo=9
qoverpError = 0.0066148, ptError/pt = 0.0229119

pt / eta / phi=1.6572 / -0.22933 / -3.08643
normChi2=0.528174, valid hits=19
algoName=initialStep, originalAlgo=4
qoverpError = 0.00469397, ptError/pt = 0.00798881

pt / eta / phi=0.63309 / -0.239693 / -0.360681
normChi2=0.641863, valid hits=6
algoName=mixedTripletStep, originalAlgo=8
qoverpError = 0.0269218, ptError/pt = 0.0175457

[problematic track goes here]

pt / eta / phi=4.86086 / -0.753577 / -3.10877
normChi2=1.86016, valid hits=14
algoName=initialStep, originalAlgo=4
qoverpError = 0.00207326, ptError/pt = 0.0130838

@mmusich
Copy link
Contributor

mmusich commented Jul 23, 2024

The errors look really bad for the problematic track [1].

thanks for all the checks @swagata87.
It's really not my call but I would consider capping the relative momentum error somewhere and reject the clearly non-physical tracks before attempting the vertex fit. I let the other involved parties to decide on what's the best course of action.
As for the AMD vs Intel we should probably start opening a new issue.

@swagata87
Copy link
Contributor

consider capping the relative momentum error somewhere and reject the clearly non-physical tracks before attempting the vertex fit.

I agree, I implemented the capping in this function[1], and it avoids the crash. I will try to make a PR soon.

[1]

// This tool allow to pre-select the track for the displaced vertex finder
bool PFDisplacedVertexCandidateFinder::goodPtResolution(const TrackBaseRef& trackref) const {

@slava77
Copy link
Contributor

slava77 commented Jul 23, 2024

pt / eta / phi=2507.97 / -1.93988 / 2.55448
normChi2=0.16946, valid hits=3
algoName=duplicateMerge, originalAlgo=22
qoverpError = 1.1651e+08, ptError/pt = 1.03756e+12

the track doesn't look too crazy (considering it's just 3 hits).
Where is the actual crash?
I'm thinking more in the direction of making the vertex fitter not throw.

@mmusich
Copy link
Contributor

mmusich commented Jul 23, 2024

the track doesn't look too crazy

Where's the boundary? 1E12 looks like a big number...

@slava77
Copy link
Contributor

slava77 commented Jul 23, 2024

pt / eta / phi=2507.97 / -1.93988 / 2.55448
normChi2=0.16946, valid hits=3
algoName=duplicateMerge, originalAlgo=22
qoverpError = 1.1651e+08, ptError/pt = 1.03756e+12

the track doesn't look too crazy (considering it's just 3 hits). Where is the actual crash? I'm thinking more in the direction of making the vertex fitter not throw.

on a second thought, this does look fairly crazy.
qoverpError should remain more finite.
ptError/pt is not a good measure: at high pt end ptError/pt^2 should stay finite for well-aligned hits.

@swagata87
Copy link
Contributor

on a second thought, this does look fairly crazy.

yes, even for a track with just 3 hits, it looks a bit off;
the same event has other tracks with only 3 hits, and their error are more meaningful.

ptError/pt^2 should stay finite for well-aligned hits.

For the problematic track, ptError/pt^2 is 4.13706e+08

Where is the actual crash?

here

if (ifail != 0)
throw VertexException("BasicSingleVertexState::could not invert weight matrix");
} else {

but if we remove those problematic tracks (by some upper cut on, say, qoverpError ) and then try vertex-fit, then the crash goes away. Does that sound like a good idea to you? And, if we cap, then at what value roughly? (thinking about not just to avoid this crash, but in general what value would be sensible)

@swagata87
Copy link
Contributor

here is a proposal #45536
qoverpError capped at max value of 1.0e+7

@swagata87
Copy link
Contributor

tagging PF reco L3 @stahlleiton

@mmusich
Copy link
Contributor

mmusich commented Jul 29, 2024

As for the AMD vs Intel we should probably start opening a new issue.

this is being followed up at #45576 (thanks @mandrenguyen )

@mandrenguyen
Copy link
Contributor

+1
Closing this specific issue, keeping track of more general AMD vs Intel difference at #45576

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 5, 2024

This issue is fully signed and ready to be closed.

@mandrenguyen
Copy link
Contributor

@cmsbuild, please close

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants