Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GsfTracking crash at HLT in run 359686 #39570

Closed
swagata87 opened this issue Oct 2, 2022 · 45 comments
Closed

GsfTracking crash at HLT in run 359686 #39570

swagata87 opened this issue Oct 2, 2022 · 45 comments

Comments

@swagata87
Copy link
Contributor

This issue is related to #39026 in the sense that similar crashes happened before and a fix for that was put in #39074. It is unclear to me why the crash came back despite the fix.

Some details of the crash is here

With the error stream file, I could reproduce the crash in GPU machines.

The error stream file(.raw), which is available in Hilton nodes, is also copied to this location: /eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw

In case it is useful, the file is also available in .root format: /eos/cms/store/group/phys_egamma/swmukher/outputFileGSF.root

Recipe to reproduce the crash in GPU machine is given below:

cmsrel CMSSW_12_4_9 
cd CMSSW_12_4_9/src 
cmsenv
hltConfigFromDB --runNumber 359686 > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = ['file:/store/error_stream/run359686/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']
@EOF
cmsRun hlt.py 

It crashes with the following messages:

%MSG-i CUDAService:  (NoModuleName) 01-Oct-2022 22:49:54 CEST pre-events
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)
CUDA device 1: Tesla T4 (sm_75)
%MSG
.....
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:hltEgammaGsfTracks  01-Oct-2022 22:50:29 CEST Run: 359686 Event: 165691107
KF updated state 3 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:hltEgammaGsfTracks  01-Oct-2022 22:50:29 CEST Run: 359686 Event: 165691107
KF updated state 4 is invalid. skipping.
.....
----- Begin Fatal Exception 01-Oct-2022 22:50:29 CEST-----------------------
An exception of category 'LogicError' occurred while
   [0] Processing  Event run: 359686 lumi: 263 event: 165691107 stream: 0
   [1] Running path 'DST_Run3_PFScoutingPixelTracking_v18'
   [2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------
@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 2, 2022

A new Issue was created by @swagata87 Swagata Mukherjee.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@swagata87
Copy link
Contributor Author

assign reconstruction, tracking-pog, egamma-pog

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 2, 2022

New categories assigned: tracking-pog,reconstruction,egamma-pog

@slava77,@mmusich,@mandrenguyen,@lfinco,@clacaputo,@swagata87 you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Copy link
Contributor

mmusich commented Oct 3, 2022

the recipe at #39570 (comment) when run on lxplus-gpu results in:

Can not stat file (2):-/store/error_stream/run359686/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw
%MSG
----- Begin Fatal Exception 03-Oct-2022 14:00:58 CEST-----------------------
An exception of category 'FedRawDataInputSource:threadError' occurred while
   [0] Calling InputSource::getNextItemType
Exception Message:
 file reader thread error 
----- End Fatal Exception -------------------------------------------------

please create a recipe that can run without having access to hilton nodes.

@bmjoshi
Copy link

bmjoshi commented Oct 3, 2022

@mmusich Please use
process.source.fileNames = ['file:/eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']

@bmjoshi
Copy link

bmjoshi commented Oct 3, 2022

This issue is not reproducible on lxplus-node. However, it is reproducible on GPU cluster at Milan. The difference that I found was the CUDA version. Please find the driver info and version below:
LXPLUS-GPU: CUDA runtime version 11.5, driver version 11.7, NVIDIA driver version 515.65.01
MILAN-GPU@P5: CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03

@mmusich
Copy link
Contributor

mmusich commented Oct 3, 2022

with this recipe:

(after ssh -X -Y <user>@lxplus-gpu.cern.ch)

cmsrel CMSSW_12_4_9 
cd CMSSW_12_4_9/src 
cmsenv
hltConfigFromDB --runNumber 359686 > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = ['file:/eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']
@EOF
cmsRun hlt.py > & out.out &

I am unable to reproduce the error.
You can find the resulting log file here: http://musich.web.cern.ch/musich/Issue-39570/out.out

@mmusich
Copy link
Contributor

mmusich commented Oct 3, 2022

-1

  • cannot be reproduced offline

@swagata87
Copy link
Contributor Author

Just to keep track, there was another similar crash in HLT in run 359806.
Elog: http://cmsonline.cern.ch/cms-elog/1158449

It was possible to reproduce the crash in GPU Hilton machine using the error stream file. The recipe is same as given in the issue description; but only change is the filename:
process.source.fileNames = ['file:/store/error_stream/run359806/run359806_ls0197_index000214_fu-c2b03-36-01_pid4120185.raw']
The crash reproduced in hilton can be viewed here: https://swmukher.web.cern.ch/swmukher/hlt_run359806_gsf_GPU_hilton_c2b01_44_01.log

As before, this crash was also NOT reproduced in lxplus-gpu.
To run on lxplus, one can use this file which I copied to /eos:
process.source.fileNames = ['file:/eos/cms/store/group/phys_egamma/swmukher/run359806_ls0197_index000214_fu-c2b03-36-01_pid4120185.raw']
The log file is here, which shows that there is no crash when ran in lxplus-gpu; https://swmukher.web.cern.ch/swmukher/hlt_run359806_gsf_lxplus_gpu_764.log

@swagata87
Copy link
Contributor Author

The crash goes away if we add the full Sylvester's criterion here:

double det = 0;
if (updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
(updatedTSOS.curvilinearError().matrix().Det2(det)) && det > 0) {

ie; with this patch the crash is gone:
(ignore the std::cout, which was just for check)

diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index 5e9e6b31389..e3035361a5c 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -32,8 +32,17 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
     TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);
 
     double det = 0;
-    if (updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
-        (updatedTSOS.curvilinearError().matrix().Det2(det)) && det > 0) {
+    double det22 = 0;
+    double det33 = 0;
+    double det44 = 0;
+
+    if ( updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
+        (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix22>(0, 0).Det(det22) && det22 > 0) &&
+        (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix33>(0, 0).Det(det33) && det33 > 0) &&
+        (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix44>(0, 0).Det(det44) && det44 > 0) &&
+        (updatedTSOS.curvilinearError().matrix().Det2(det)) && det > 0)  {
+      std::cout << "all TSOS condition passed so will do result.addState.. " << std::endl;
+      
       result.addState(TrajectoryStateOnSurface(weights[i],

@missirol
Copy link
Contributor

missirol commented Oct 6, 2022

FYI: @cms-sw/hlt-l2 @cms-sw/heterogeneous-l2

(HLT crashes, seemingly specific to reconstruction on GPUs)

@swagata87
Copy link
Contributor Author

There was another HLT crash due to GSF with similar error message, in run 359871.
Relevant Elog: http://cmsonline.cern.ch/cms-elog/1158552
The crash was reproducible in hilton machine; this was the error message in hilton; using error stream file /store/error_stream/run359871/run359871_ls0093_index000227_fu-c2b01-39-01_pid3885773.raw

%MSG
----- Begin Fatal Exception 06-Oct-2022 14:46:35 -----------------------
An exception of category 'LogicError' occurred while
   [0] Processing  Event run: 359871 lumi: 93 event: 193950756 stream: 0
   [1] Running path 'HLT_DoubleEle9_eta1p22_mMax6_v2'
   [2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracksForBParking'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------

With the fix mentioned above (#39570 (comment)), the crash goes away.

I will prepare a PR.

@swagata87
Copy link
Contributor Author

Given the fix PR is merged in master, I'll prepare backport to 12_5_X and 12_4_X.

Similar GSF crashes is expected to happen rarely in long collision runs until the fix is online. Another example crash is here, from Friday's overnight run: http://cmsonline.cern.ch/cms-elog/1159020. This one was also reproducible using error_stream file, and the crash goes away with the fix.

I guess we have enough evidence now that proves that the fix works. So I probably won't report the same crash anymore, as they are expected to happen until the fix is in.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 9, 2022

I guess we have enough evidence now that proves that the fix works. So I probably won't report the same crash anymore, as they are expected to happen until the fix is in.

I expect the HLT DOCs will anyway document them ?

@swagata87
Copy link
Contributor Author

+1

  • A rare GSF crash was happening in HLT, with the message MultiTrajectoryState mixes states with and without errors.
  • To the best of my knowledge, this crash happened in long collision runs 359686 (Oct 1), 359806 (Oct 5), 359871 (Oct 5), 359998 (Oct 7), 360019 (Oct 8), 360225 (Oct 12); all taken with 12_4_9.
  • HLT then switched to 12_4_10 which included a potential fix for this crash. The switch to 12_4_10 happened on Oct 12th, and the first run with this release was 360279.
  • After HLT switched to 12_4_10, this crash was not seen so far.

@alintulu
Copy link
Contributor

This crash happened again in two runs over the weekend (ELOG). Both were taken with 12_4_10_patch2 .

@mmusich
Copy link
Contributor

mmusich commented Oct 24, 2022

This crash happened again in two runs over the weekend (ELOG). Both were taken with 12_4_10_patch2 .

@alintulu you might want to report a recipe to reproduce (also offline).

@swagata87
Copy link
Contributor Author

thanks, I've now managed to check the prompt-reco crash also, and I confirm that your patch solves the crash. So I think it makes sense to put in the fix to avoid further crashes.

Your fix was originally meant for a crash with message MultiTrajectoryState mixes states with different signs of local p_z, which happened both offline and HLT, and as far as I know was not possible to reproduce. Thus the fix was not put in back then. The same crash did not happen at HLT recently. But now the other crash MultiTrajectoryState mixes states with and without errors is also fixed with the same patch. So, it looks like these 2 crashes are after all related to each other?

We surely want to fix the crashes urgently, so let's go ahead with the PR, but I've copied the raw file so that it remain accessible, in case we want to do any other check
/eos/cms/store/group/phys_egamma/ec/swmukher/gsfcrashfile/00266a70-1a8f-4101-8391-ca99df9b51ca.root

The HLT error_stream files are also there, but none of them are reproducible in lxplus-gpu or lxplus(cpu). So it becomes generally difficult for people to debug with HLT error files, as one needs access to one of the P5 GPU machines. The prompt reco crash raw file should be easier to work with, as anyone can use it for debugging purpose in any usual lxplus machine.

At some point it would be great for egamma to understand these strange warnings triggered from BasicTrajectoryState here. It's unclear to me right now from where we are having missing error or nan matrix. But given the complexity of the Gsf algo it might take a while to fully understand it.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 24, 2022

@mmusich can you explain to me what the suggested changes do ?

diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index c3958e7f0c4..39a8da622be 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -28,8 +28,15 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
   MultiTrajectoryStateAssembler result;
 
   int i = 0;
+  float pzSign = 1.;
   for (auto const& tsosI : predictedComponents) {
     TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);
+    if (i > 0 && pzSign * updatedTSOS.localParameters().pzSign() < 0) {
+      continue;
+    } else {
+      pzSign *= updatedTSOS.localParameters().pzSign();
+    }
+
     if (updatedTSOS.isValid() && updatedTSOS.localError().valid()) {
       result.addState(TrajectoryStateOnSurface(weights[i],
                                                updatedTSOS.localParameters(),

If updatedTSOS.localParameters().pzSign() is negative, won't this flip the sign of pzSign for each component ?

@mmusich
Copy link
Contributor

mmusich commented Oct 24, 2022

is negative, won't this flip the sign of pzSign for each component ?

pzSign is a "check" variable. It starts a positive and then it gets update for each iteration (after the first) by multiplying it with the sign of the next state. Since, unless the charges of the state do not share the same sign, the sign of the product will be +, it essentially catches the states for which the sign gets changed (and removes them, since they lead to the mismatched pZ sign that than leads to a crash).

@fwyzard
Copy link
Contributor

fwyzard commented Oct 24, 2022

Mhm, I'm not sure I follow.

During the first iteration (when i == 0) the if clause is skipped and the else clause is always processed, so effectively pzSign is set to the sign of of KFUpdator().update(predictedComponents.front(), aRecHit).

Then for the following interations

  • if pzSign has a different sign than updatedTSOS.localParameters().pzSign(), the component is skipped
  • otherwise, pzSign is multiplied by updatedTSOS.localParameters().pzSign(), which could flip it if the latter is negative.

So, first question: if there are multiple components, why use the sign of the first one? Wouldn't it be better to use the sign that appear more times ?

Then, once a sign is chosen, should it be fixed and not changed any more ?

For example to use the sign of the first component, this loop seems safer:

    if (i == 0) {
      // select the pz sign of the first component
      pzSign = updatedTSOS.localParameters().pzSign();
    } else {
      // skip components with the "wrong" pz sign
      if (pzSign * updatedTSOS.localParameters().pzSign() < 0) {
        continue;
      }
    }

@fwyzard
Copy link
Contributor

fwyzard commented Oct 24, 2022

(of course, maybe I misunderstood the whole logic...)

@fwyzard
Copy link
Contributor

fwyzard commented Oct 24, 2022

Now, all of these checks may prevent the crashes, but I'm not sure if actually there is an underlying issue that one should address, and these are just symptoms ?

@mmusich
Copy link
Contributor

mmusich commented Oct 24, 2022

Then for the following interations

  • if pzSign has a different sign than updatedTSOS.localParameters().pzSign(), the component is skipped
  • otherwise, pzSign is multiplied by updatedTSOS.localParameters().pzSign(), which could flip it if the latter is negative.

right, I see it can become a problem when there are many states to be added with multiple sign flips. I was wrongly assuming that it would not try to add more than two states...

So, first question: if there are multiple components, why use the sign of the first one? Wouldn't it be better to use the sign that appear more times ?

this is a valid point, though one would need one more loop to collect all the signs.

Now, all of these checks may prevent the crashes, but I'm not sure if actually there is an underlying issue that one should address, and these are just symptoms ?

that's precisely my comment at: #39234 (comment), see also #39234 (comment).

@mmusich
Copy link
Contributor

mmusich commented Oct 24, 2022

For example to use the sign of the first component, this loop seems safer:

   if (i == 0) {
     // select the pz sign of the first component
     pzSign = updatedTSOS.localParameters().pzSign();
   } else {
     // skip components with the "wrong" pz sign
     if (pzSign * updatedTSOS.localParameters().pzSign() < 0) {
       continue;
     }
   }

while logically correct, this doesn't solve the issue at https://cms-talk.web.cern.ch/t/logic-error-in-reco-job-for-run-360888-dataset-parkingdoublemuonlowmass2/16641 :(

@mmusich
Copy link
Contributor

mmusich commented Oct 25, 2022

So I have made a bit of archeology dig.
The crash reported in prompt reco (cmsTalk) already appeared once this year (see #39026) and I solved it by proposing #39056.
With the state of the code as of 2b81038 the newly reported crash doesn't happen either.
Then, the same file (TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc) was touched twice by @swagata87:

with the state of the code as of f3afc63 the crash in prompt reco doesn't happen, so I conclude that the issue has been re-introduce somehow by mistake in PR #39656.

@swagata87
Copy link
Contributor Author

#39656 is a rather simple PR. As I understand, all it does is to make sure that the states we are allowing to go further in the reco chain has a positive definite curvilinear error matrix, so that the states are more well behaved. This should not lead to the crash. We are probably missing something elsewhere.

Also, I noticed that the frequency of this crash at HLT has reduced after #39656, but given that the crash has not completely disappeared, it's clear that there is something more to it which we are missing currently. I am open to suggestions and happy to try out any suggestions people might have on this.

The reco chain of electron and lowpt-electron is a bit different. So I was wondering if that has something to do with it. The problem could also be in lowpt-electron specific code/config. I have to admit I have not figured it out yet.

@mmusich
Copy link
Contributor

mmusich commented Oct 25, 2022

This should not lead to the crash.

unfortunately it does (I invite you to cross-check)
After re-implementing the Sylvester criterion in a tidier way:

diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index f3d6d173c10..dddb731e0e4 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -28,15 +28,46 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
   MultiTrajectoryStateAssembler result;
 
   int i = 0;
   for (auto const& tsosI : predictedComponents) {
     TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);

+    // in order to check for pos-def of error matrix
+    std::function<bool(AlgebraicSymMatrix55, int)> sylvesterCriterion = [this](AlgebraicSymMatrix55 curvError,
+                                                                               int index) {
+      //Sylvester's criterion, start from the smaller submatrix size
+      double det = 0;
+      if ((!curvError.Sub<AlgebraicSymMatrix22>(0, 0).Det(det)) || det < 0) {
+        edm::LogWarning("GsfMultiStateUpdator") << "Fail pos-def check sub2.det for state " << index;
+        return false;
+      } else if ((!curvError.Sub<AlgebraicSymMatrix33>(0, 0).Det(det)) || det < 0) {
+        edm::LogWarning("GsfMultiStateUpdator") << "Fail pos-def check sub3.det for state " << index;
+        return false;
+      } else if ((!curvError.Sub<AlgebraicSymMatrix44>(0, 0).Det(det)) || det < 0) {
+        edm::LogWarning("GsfMultiStateUpdator") << "Fail pos-def check sub4.det for state " << index;
+        return false;
+      } else if ((!curvError.Det2(det)) || det < 0) {
+        edm::LogWarning("GsfMultiStateUpdator") << "Fail pos-def check det for state " << index;
+        return false;
+      }
+      // in case all the above was not realized
+      return true;
+    };
+
+    if (updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
+        sylvesterCriterion(updatedTSOS.curvilinearError().matrix(), i)) {
+      std::cout << " ###### keeping: " << i << " state" << std::endl;
 
-    if (double det;
-        updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
-        (det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix22>(0, 0).Det(det) && det > 0) &&
-        (det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix33>(0, 0).Det(det) && det > 0) &&
-        (det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix44>(0, 0).Det(det) && det > 0) &&
-        (det = 0., updatedTSOS.curvilinearError().matrix().Det2(det) && det > 0)) {
       result.addState(TrajectoryStateOnSurface(weights[i],
                                                updatedTSOS.localParameters(),
                                                updatedTSOS.localError(),

and running the reconstruction of the incriminated electron (see recipe at #35929 (comment)) with the full Sylvester criterion and only the check on the full error matrix determinant (commenting out the first three clauses) I see that in one case there is one state less used for latter (the one with index 2):

2c2
< %MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
---
> %MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
11,12c11,12
< %MSG-w GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
< Fail pos-def check det for state 1
---
> %MSG-w GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
> Fail pos-def check sub2.det for state 1
14c14
< %MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
---
> %MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
17c17
< %MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
---
> %MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
26c26,31
<  ###### keeping: 2 state
---
> %MSG-w GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
> Fail pos-def check sub2.det for state 2
> %MSG
> %MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
> KF updated state 2 is invalid. skipping.
> %MSG
32c37
< %MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
---
> %MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
41,42c46,47
< %MSG-w GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
< Fail pos-def check det for state 8
---
> %MSG-w GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
> Fail pos-def check sub2.det for state 8
44c49
< %MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:31 CEST Run: 360888 Event: 426362613
---
> %MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
49c54
<  ###### keeping: 11 state
---
>  ###### keeping: 11 state

and this leads to the crash (in Prompt).

@francescobrivio
Copy link
Contributor

@swagata87 do you have any update on this?
At the moment (luckily) only one job in Prompt is affected, but the sooner we fix this the better.

Cheers,
Francesco as ORM

@swagata87
Copy link
Contributor Author

Hello Francesco,

Apart from what Marco has already pointed out, that removing the sub-matrices' positive definite check done by Sylvester criterion cures this crash, I don't have other fix right now.

That particular prompt reco event is puzzling, as, despite all protections in GsfMultiStateUpdator to make sure that matrices are positive definite, the event still prints warning messages of this kind[1], and it is unclear to me from where these states are coming from as we are doing a strict check to remove such states. And it's also unclear to me why keeping a non pos-def state is actually helping to cure a crash.

About #39234 (comment) , it would be helpful to know when this alignment update is going to be used for datataking? I'm just curious to see if that would solve the crashes also.

If I manage to gain more insight on this crash, I will write here.

[1]

%MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  26-Oct-2022 15:52:07 CEST Run: 360888 Event: 426362613
curv error not pos-def
[      13.4648-7.12423e-06 -0.00225463 0.000274802    0.102126
  -7.12423e-06-2.03937e-08 1.01283e-06-1.23354e-07-4.58423e-05
   -0.00225463 1.01283e-06 0.000320388-3.90183e-05  -0.0145005
   0.000274802-1.23354e-07-3.90183e-05 4.75568e-06  0.00176737
      0.102126-4.58423e-05  -0.0145005  0.00176737    0.656814 ]
pos/mom/mf  (-84.5635,-15.529,71.1933)   (0.00323092,-0.122849,-0.0301315)   (-0.0121003,-0.00222207,3.80835) 

@mmusich
Copy link
Contributor

mmusich commented Oct 26, 2022

That particular prompt reco event is puzzling, as, despite all protections in GsfMultiStateUpdator to make sure that matrices are positive definite, the event still prints warning messages of this kind[1], and it is unclear to me from where these states are coming from as we are doing a strict check to remove such states. And

it might be worth noticing that messages as in [1] are triggered by this check:

if UNLIKELY (!err.posDef())
edm::LogWarning("BasicTrajectoryState") << "curv error not pos-def\n"
<< err.matrix() << "\npos/mom/mf " << state.position() << ' '
<< state.momentum() << ' ' << state.parameters().magneticFieldInTesla();
}

which doesn't implement the full Sylvester criterion , but a poor-man version implemented here:

// not really full check of posdef
bool posDef() const {
return (theCovarianceMatrix(0, 0) >= 0) && (theCovarianceMatrix(1, 1) >= 0) && (theCovarianceMatrix(2, 2) >= 0) &&
(theCovarianceMatrix(3, 3) >= 0) && (theCovarianceMatrix(4, 4) >= 0);
}

also I think that those messages are coming from an earlier stage of execution of the code than the filtering done here:

if (double det;
updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
(det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix22>(0, 0).Det(det) && det > 0) &&
(det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix33>(0, 0).Det(det) && det > 0) &&
(det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix44>(0, 0).Det(det) && det > 0) &&
(det = 0., updatedTSOS.curvilinearError().matrix().Det2(det) && det > 0)) {

@swagata87
Copy link
Contributor Author

I still need to understand it better, by doing more checks;
but looks like this patch[1] solve the prompt reco crash in that event in run 360888.
So it is checking if LocalTrajectoryError of the single states are positive definite or not. If not, they are discarded.
I checked it from 12_4_10_patch2.

Also, the warning messages that I was talking about are now gone, for this one event.

[1]

diff --git a/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc b/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
index d4c6e1b11d2..701180aaeba 100644
--- a/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
+++ b/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
@@ -40,12 +40,14 @@ namespace GaussianStateConversions {
     std::vector<TrajectoryStateOnSurface> components;
     components.reserve(singleStates.size());
     for (auto const& ic : singleStates) {
-      components.emplace_back((*ic).weight(),
-                              LocalTrajectoryParameters((*ic).mean(), pzSign, charged),
-                              LocalTrajectoryError((*ic).covariance()),
-                              surface,
-                              field,
-                              side);
+      if (double det=0; (*ic).covariance().Det2(det) && det>0 ) {
+       components.emplace_back((*ic).weight(),
+                               LocalTrajectoryParameters((*ic).mean(), pzSign, charged),
+                               LocalTrajectoryError((*ic).covariance()),
+                               surface,
+                               field,
+                               side);
+      }
     }
     return TrajectoryStateOnSurface((BasicTrajectoryState*)new BasicMultiTrajectoryState(components));
   }

@swagata87
Copy link
Contributor Author

I did some sanity checks by running on 200 events from EGamma RAW file /eos/cms/tier0/store/data/Run2022F/EGamma/RAW/v1/000/361/197/00000/76bd97fa-4ad6-4d85-b941-014e3ed27f9c.root , with and without the above patch. And then I compared pt of electron, lowpt-electron and photon(as converted photons also have gsf track). There were same number of these objects reco'ed with and w/o the patch, and the pt distribution was identical.

@elfontan
Copy link
Contributor

elfontan commented Oct 28, 2022

Dear all,
let me add for book-keeping the link to the e-log for a few additional crashes of this type happened during runs 361107, 361239, and 361297.
Error streams have been saved:

/store/lustre/error/run361107:

  • /store/error_stream/run361107/run361107_ls0085_index000135_fu-c2b03-36-01_pid1607962.raw
  • /store/error_stream/run361107/run361107_ls0110_index000149_fu-c2b02-31-01_pid1971160.raw

/store/lustre/error/run361239:

  • /store/error_stream/run361239/run361239_ls0763_index000224_fu-c2b01-39-01_pid1490575.raw
  • /store/error_stream/run361239/run361239_ls0899_index000131_fu-c2b02-14-01_pid2057688.raw

/store/lustre/error/run361297

I tried to reproduce them using the following recipe in GPU Hilton machines with the first two cases without success:

cmsrel CMSSW_12_4_10_patch3
cd CMSSW_12_4_10_patch3/src
cmsenv
https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 361107 > hlt_run361107.py
cat >>  hlt_run361107.py <<@EOF
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
EvFDaqDirector = _EvFDaqDirector.clone(
    buBaseDir = '/store/error_stream',
    runNumber = 361107
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
source = _source.clone(
    fileListMode = True,
    fileNames = (
        ' /store/error_stream/run361107/run361107_ls0085_index000135_fu-c2b03-36-01_pid1607962.raw',
    )
)
@EOF
cmsRun hlt_run361107.py 

Follow up needed checking the last occurence in run 361297.

Cheers,
--Elisa as secondary HLT DOC

@swagata87
Copy link
Contributor Author

Coming back to HLT crashes. Some of the recent Gsf crashes were not reproducible. However, now we have a HLT crash which I could reproduce in GPU machine gputest-milan-01 (as before, it is not reproducible in lxplus-gpu). The crash happened in run 361297 at 2022-10-28 23:49:21. Recipe to reproduce the crash:

cmsrel CMSSW_12_4_10_patch3
cd CMSSW_12_4_10_patch3/src/
cmsenv
hltConfigFromDB --runNumber 361297  > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = [
'file:/store/error_stream/run361297/run361297_ls0295_index000109_fu-c2b01-23-01_pid786825.raw',
'file:/store/error_stream/run361297/run361297_ls0295_index000135_fu-c2b01-23-01_pid786825.raw'
]
@EOF
cmsRun hlt.py 

Crashes with this message:

----- Begin Fatal Exception 30-Oct-2022 11:53:36 CET-----------------------
An exception of category 'LogicError' occurred while
   [0] Processing  Event run: 361297 lumi: 295 event: 541668964 stream: 0
   [1] Running path 'HLT_DoubleEle8p5_eta1p22_mMax6_v2'
   [2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracksForBParking'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------

Then, I reran with my fix (#39873) which was for a prompt-reco crash. It cures this HLT crash also. So, I believe that this fix will be useful for HLT as well.

What I learn from this whole exercise is that, apparently similar looking crashes can be triggered by different modules. So a fix for one crash does not necessarily fix other, although the error message will be same. With this fix integrated, the frequency of Gsf crash should reduce. Whether the crashes will completely go away or not, that is to be seen, and actions to be taken accordingly.

@francescobrivio
Copy link
Contributor

francescobrivio commented Nov 16, 2022

There was another instance of the same issue (MultiTrajectoryState mixes states with and without errors) in Run 361971.
I tried running the job in:

But I'd like the experts to test it as well.
A minimal recipe to reproduce the crash is:

cmsrel CMSSW_12_4_11
cd CMSSW_12_4_11/src
cmsenv
git cms-addpkg TrackingTools/GsfTracking
git remote add swagataRepo [email protected]:swagata87/cmssw.git
git fetch swagataRepo
git cherry-pick aec3325132ddb2d2423dc872561e6ebf35934786
scram b -j 8
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022F/LogicError/job_3520411/vocms014.cern.ch-3520411-3-log.tar.gz .
tar -zxvf vocms014.cern.ch-3520411-3-log.tar.gz
cd job/WMTaskSpace/cmsRun1/

edit the PSet.py to be:

import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
    process.options.numberOfThreads=cms.untracked.uint32(1)
    process.options.numberOfStreams=cms.untracked.uint32(1)
    process.source.eventsToProcess = cms.untracked.VEventRange('361971:435223126-361971:435223128')

run:

cmsRun PSet.py

@mmusich
Copy link
Contributor

mmusich commented Nov 16, 2022

@francescobrivio your post seems misplaced here.
#39987 is perhaps a better place to continue (even though the two issues are linked)

@francescobrivio
Copy link
Contributor

@francescobrivio your post seems misplaced here. #39987 is perhaps a better place to continue (even though the two issues are linked)

Hi Marco, the original crash reported by Tier0 is the same that is being tracked in this issue (the mixes states, not the segFault), but I agree this also post should be propagated in the other issue, I will copy paste my message there as well!

@mmusich
Copy link
Contributor

mmusich commented Nov 16, 2022

Hi Marco, the original crash reported by Tier0 is the same that is being tracked in this issue

No, this issue is about HLT, not sure how it got mixed with Tier0 issues

@mmusich
Copy link
Contributor

mmusich commented Nov 16, 2022

actually @swagata87 do you know if there were further HLT crashes beyond #39570 (comment)?
In case not, I think this issue could be closed to avoid scattering the discussion in too many places.

@swagata87
Copy link
Contributor Author

Hello Marco,
I have not seen any other GSF crash in HLT after that (the last one was in 361297 on 28th Oct). But given the crash is very rare and tend to come back after weeks, so I kept this issue open. But we can close it and if HLT Gsf problem happen again, we can open a new issue.

@swagata87
Copy link
Contributor Author

Closing this issue, after discussing with FOG convenor. If the crash comes back at HLT, a new issue will be opened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants