-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GsfTracking crash at HLT in run 359686 #39570
Comments
A new Issue was created by @swagata87 Swagata Mukherjee. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign reconstruction, tracking-pog, egamma-pog |
New categories assigned: tracking-pog,reconstruction,egamma-pog @slava77,@mmusich,@mandrenguyen,@lfinco,@clacaputo,@swagata87 you have been requested to review this Pull request/Issue and eventually sign? Thanks |
the recipe at #39570 (comment) when run on
please create a recipe that can run without having access to hilton nodes. |
@mmusich Please use |
This issue is not reproducible on lxplus-node. However, it is reproducible on GPU cluster at Milan. The difference that I found was the CUDA version. Please find the driver info and version below: |
with this recipe: (after cmsrel CMSSW_12_4_9
cd CMSSW_12_4_9/src
cmsenv
hltConfigFromDB --runNumber 359686 > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = ['file:/eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']
@EOF
cmsRun hlt.py > & out.out & I am unable to reproduce the error. |
-1
|
Just to keep track, there was another similar crash in HLT in run It was possible to reproduce the crash in GPU Hilton machine using the error stream file. The recipe is same as given in the issue description; but only change is the filename: As before, this crash was also NOT reproduced in lxplus-gpu. |
The crash goes away if we add the full Sylvester's criterion here: cmssw/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc Lines 34 to 36 in 5fb55fe
ie; with this patch the crash is gone: (ignore the std::cout, which was just for check) diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index 5e9e6b31389..e3035361a5c 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -32,8 +32,17 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);
double det = 0;
- if (updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
- (updatedTSOS.curvilinearError().matrix().Det2(det)) && det > 0) {
+ double det22 = 0;
+ double det33 = 0;
+ double det44 = 0;
+
+ if ( updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
+ (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix22>(0, 0).Det(det22) && det22 > 0) &&
+ (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix33>(0, 0).Det(det33) && det33 > 0) &&
+ (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix44>(0, 0).Det(det44) && det44 > 0) &&
+ (updatedTSOS.curvilinearError().matrix().Det2(det)) && det > 0) {
+ std::cout << "all TSOS condition passed so will do result.addState.. " << std::endl;
+
result.addState(TrajectoryStateOnSurface(weights[i],
|
FYI: @cms-sw/hlt-l2 @cms-sw/heterogeneous-l2 (HLT crashes, seemingly specific to reconstruction on GPUs) |
There was another HLT crash due to GSF with similar error message, in run 359871.
With the fix mentioned above (#39570 (comment)), the crash goes away. I will prepare a PR. |
Given the fix PR is merged in master, I'll prepare backport to 12_5_X and 12_4_X. Similar GSF crashes is expected to happen rarely in long collision runs until the fix is online. Another example crash is here, from Friday's overnight run: http://cmsonline.cern.ch/cms-elog/1159020. This one was also reproducible using error_stream file, and the crash goes away with the fix. I guess we have enough evidence now that proves that the fix works. So I probably won't report the same crash anymore, as they are expected to happen until the fix is in. |
I expect the HLT DOCs will anyway document them ? |
+1
|
This crash happened again in two runs over the weekend (ELOG). Both were taken with 12_4_10_patch2 . |
thanks, I've now managed to check the prompt-reco crash also, and I confirm that your patch solves the crash. So I think it makes sense to put in the fix to avoid further crashes. Your fix was originally meant for a crash with message We surely want to fix the crashes urgently, so let's go ahead with the PR, but I've copied the raw file so that it remain accessible, in case we want to do any other check The HLT error_stream files are also there, but none of them are reproducible in lxplus-gpu or lxplus(cpu). So it becomes generally difficult for people to debug with HLT error files, as one needs access to one of the P5 GPU machines. The prompt reco crash raw file should be easier to work with, as anyone can use it for debugging purpose in any usual lxplus machine. At some point it would be great for egamma to understand these strange warnings triggered from |
@mmusich can you explain to me what the suggested changes do ?
If |
|
Mhm, I'm not sure I follow. During the first iteration (when Then for the following interations
So, first question: if there are multiple components, why use the sign of the first one? Wouldn't it be better to use the sign that appear more times ? Then, once a sign is chosen, should it be fixed and not changed any more ? For example to use the sign of the first component, this loop seems safer: if (i == 0) {
// select the pz sign of the first component
pzSign = updatedTSOS.localParameters().pzSign();
} else {
// skip components with the "wrong" pz sign
if (pzSign * updatedTSOS.localParameters().pzSign() < 0) {
continue;
}
} |
(of course, maybe I misunderstood the whole logic...) |
Now, all of these checks may prevent the crashes, but I'm not sure if actually there is an underlying issue that one should address, and these are just symptoms ? |
right, I see it can become a problem when there are many states to be added with multiple sign flips. I was wrongly assuming that it would not try to add more than two states...
this is a valid point, though one would need one more loop to collect all the signs.
that's precisely my comment at: #39234 (comment), see also #39234 (comment). |
while logically correct, this doesn't solve the issue at https://cms-talk.web.cern.ch/t/logic-error-in-reco-job-for-run-360888-dataset-parkingdoublemuonlowmass2/16641 :( |
So I have made a bit of archeology dig.
with the state of the code as of f3afc63 the crash in prompt reco doesn't happen, so I conclude that the issue has been re-introduce somehow by mistake in PR #39656. |
#39656 is a rather simple PR. As I understand, all it does is to make sure that the states we are allowing to go further in the reco chain has a positive definite curvilinear error matrix, so that the states are more well behaved. This should not lead to the crash. We are probably missing something elsewhere. Also, I noticed that the frequency of this crash at HLT has reduced after #39656, but given that the crash has not completely disappeared, it's clear that there is something more to it which we are missing currently. I am open to suggestions and happy to try out any suggestions people might have on this. The reco chain of electron and lowpt-electron is a bit different. So I was wondering if that has something to do with it. The problem could also be in lowpt-electron specific code/config. I have to admit I have not figured it out yet. |
unfortunately it does (I invite you to cross-check) diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index f3d6d173c10..dddb731e0e4 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -28,15 +28,46 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
MultiTrajectoryStateAssembler result;
int i = 0;
for (auto const& tsosI : predictedComponents) {
TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);
+ // in order to check for pos-def of error matrix
+ std::function<bool(AlgebraicSymMatrix55, int)> sylvesterCriterion = [this](AlgebraicSymMatrix55 curvError,
+ int index) {
+ //Sylvester's criterion, start from the smaller submatrix size
+ double det = 0;
+ if ((!curvError.Sub<AlgebraicSymMatrix22>(0, 0).Det(det)) || det < 0) {
+ edm::LogWarning("GsfMultiStateUpdator") << "Fail pos-def check sub2.det for state " << index;
+ return false;
+ } else if ((!curvError.Sub<AlgebraicSymMatrix33>(0, 0).Det(det)) || det < 0) {
+ edm::LogWarning("GsfMultiStateUpdator") << "Fail pos-def check sub3.det for state " << index;
+ return false;
+ } else if ((!curvError.Sub<AlgebraicSymMatrix44>(0, 0).Det(det)) || det < 0) {
+ edm::LogWarning("GsfMultiStateUpdator") << "Fail pos-def check sub4.det for state " << index;
+ return false;
+ } else if ((!curvError.Det2(det)) || det < 0) {
+ edm::LogWarning("GsfMultiStateUpdator") << "Fail pos-def check det for state " << index;
+ return false;
+ }
+ // in case all the above was not realized
+ return true;
+ };
+
+ if (updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
+ sylvesterCriterion(updatedTSOS.curvilinearError().matrix(), i)) {
+ std::cout << " ###### keeping: " << i << " state" << std::endl;
- if (double det;
- updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
- (det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix22>(0, 0).Det(det) && det > 0) &&
- (det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix33>(0, 0).Det(det) && det > 0) &&
- (det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix44>(0, 0).Det(det) && det > 0) &&
- (det = 0., updatedTSOS.curvilinearError().matrix().Det2(det) && det > 0)) {
result.addState(TrajectoryStateOnSurface(weights[i],
updatedTSOS.localParameters(),
updatedTSOS.localError(),
and running the reconstruction of the incriminated electron (see recipe at #35929 (comment)) with the full Sylvester criterion and only the check on the full error matrix determinant (commenting out the first three clauses) I see that in one case there is one state less used for latter (the one with index 2): 2c2
< %MSG-w BasicTrajectoryState: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
---
> %MSG-w BasicTrajectoryState: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
11,12c11,12
< %MSG-w GsfMultiStateUpdator: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
< Fail pos-def check det for state 1
---
> %MSG-w GsfMultiStateUpdator: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
> Fail pos-def check sub2.det for state 1
14c14
< %MSG-e GsfMultiStateUpdator: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
---
> %MSG-e GsfMultiStateUpdator: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
17c17
< %MSG-w BasicTrajectoryState: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
---
> %MSG-w BasicTrajectoryState: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
26c26,31
< ###### keeping: 2 state
---
> %MSG-w GsfMultiStateUpdator: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
> Fail pos-def check sub2.det for state 2
> %MSG
> %MSG-e GsfMultiStateUpdator: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
> KF updated state 2 is invalid. skipping.
> %MSG
32c37
< %MSG-w BasicTrajectoryState: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
---
> %MSG-w BasicTrajectoryState: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
41,42c46,47
< %MSG-w GsfMultiStateUpdator: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
< Fail pos-def check det for state 8
---
> %MSG-w GsfMultiStateUpdator: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
> Fail pos-def check sub2.det for state 8
44c49
< %MSG-e GsfMultiStateUpdator: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:36:31 CEST Run: 360888 Event: 426362613
---
> %MSG-e GsfMultiStateUpdator: GsfTrackProducer:lowPtGsfEleGsfTracks 25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
49c54
< ###### keeping: 11 state
---
> ###### keeping: 11 state and this leads to the crash (in Prompt). |
@swagata87 do you have any update on this? Cheers, |
Hello Francesco, Apart from what Marco has already pointed out, that removing the sub-matrices' positive definite check done by Sylvester criterion cures this crash, I don't have other fix right now. That particular prompt reco event is puzzling, as, despite all protections in GsfMultiStateUpdator to make sure that matrices are positive definite, the event still prints warning messages of this kind[1], and it is unclear to me from where these states are coming from as we are doing a strict check to remove such states. And it's also unclear to me why keeping a non pos-def state is actually helping to cure a crash. About #39234 (comment) , it would be helpful to know when this alignment update is going to be used for datataking? I'm just curious to see if that would solve the crashes also. If I manage to gain more insight on this crash, I will write here. [1]
|
it might be worth noticing that messages as in [1] are triggered by this check: cmssw/TrackingTools/TrajectoryState/src/BasicTrajectoryState.cc Lines 94 to 98 in 2a2ca2e
which doesn't implement the full Sylvester criterion , but a poor-man version implemented here: cmssw/TrackingTools/TrajectoryParametrization/interface/CurvilinearTrajectoryError.h Lines 51 to 55 in 6d2f660
also I think that those messages are coming from an earlier stage of execution of the code than the filtering done here: cmssw/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc Lines 34 to 39 in 83e46c0
|
I still need to understand it better, by doing more checks; Also, the warning messages that I was talking about are now gone, for this one event. [1] diff --git a/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc b/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
index d4c6e1b11d2..701180aaeba 100644
--- a/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
+++ b/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
@@ -40,12 +40,14 @@ namespace GaussianStateConversions {
std::vector<TrajectoryStateOnSurface> components;
components.reserve(singleStates.size());
for (auto const& ic : singleStates) {
- components.emplace_back((*ic).weight(),
- LocalTrajectoryParameters((*ic).mean(), pzSign, charged),
- LocalTrajectoryError((*ic).covariance()),
- surface,
- field,
- side);
+ if (double det=0; (*ic).covariance().Det2(det) && det>0 ) {
+ components.emplace_back((*ic).weight(),
+ LocalTrajectoryParameters((*ic).mean(), pzSign, charged),
+ LocalTrajectoryError((*ic).covariance()),
+ surface,
+ field,
+ side);
+ }
}
return TrajectoryStateOnSurface((BasicTrajectoryState*)new BasicMultiTrajectoryState(components));
}
|
I did some sanity checks by running on 200 events from EGamma RAW file |
Dear all, /store/lustre/error/run361107:
/store/lustre/error/run361239:
/store/lustre/error/run361297 I tried to reproduce them using the following recipe in GPU Hilton machines with the first two cases without success:
Follow up needed checking the last occurence in run 361297. Cheers, |
Coming back to HLT crashes. Some of the recent Gsf crashes were not reproducible. However, now we have a HLT crash which I could reproduce in GPU machine
Crashes with this message:
Then, I reran with my fix (#39873) which was for a prompt-reco crash. It cures this HLT crash also. So, I believe that this fix will be useful for HLT as well. What I learn from this whole exercise is that, apparently similar looking crashes can be triggered by different modules. So a fix for one crash does not necessarily fix other, although the error message will be same. With this fix integrated, the frequency of Gsf crash should reduce. Whether the crashes will completely go away or not, that is to be seen, and actions to be taken accordingly. |
There was another instance of the same issue (
But I'd like the experts to test it as well.
edit the
run:
|
@francescobrivio your post seems misplaced here. |
Hi Marco, the original crash reported by Tier0 is the same that is being tracked in this issue (the |
No, this issue is about HLT, not sure how it got mixed with Tier0 issues |
actually @swagata87 do you know if there were further HLT crashes beyond #39570 (comment)? |
Hello Marco, |
Closing this issue, after discussing with FOG convenor. If the crash comes back at HLT, a new issue will be opened. |
This issue is related to #39026 in the sense that similar crashes happened before and a fix for that was put in #39074. It is unclear to me why the crash came back despite the fix.
Some details of the crash is here
With the error stream file, I could reproduce the crash in GPU machines.
The error stream file(.raw), which is available in Hilton nodes, is also copied to this location:
/eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw
In case it is useful, the file is also available in .root format:
/eos/cms/store/group/phys_egamma/swmukher/outputFileGSF.root
Recipe to reproduce the crash in GPU machine is given below:
It crashes with the following messages:
The text was updated successfully, but these errors were encountered: