HLT Crash -- Run 362720 -- Possibly related to tracking? #40174
Comments
A new Issue was created by @trtomei Thiago Tomei. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
@trtomei is there a full log available? |
assign reconstruction FYI @cms-sw/hlt-l2 @cms-sw/tracking-pog-l2 |
New categories assigned: reconstruction @mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Let me add some info, since I don't think this is the first occurrence of this HLT crash this year (should have opened an issue a while back).
FYI: @fwyzard (I was discussing this with Andrea at some point in the past) |
Thanks @missirol. So in all cases in the log file the crashing module is I guess the crash occurs inside this
with the HitLessPhi being cmssw/RecoTracker/TkHitPairs/interface/RecHitsSortedInPhi.h, lines 35 to 37 in 830c070
An easy way to get sort() to crash is to have a NaN, but I don't know how likely that would be here.
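To illustrate why a NaN is a problem for sort(): a plain `a < b` comparison stops being a strict weak ordering as soon as one element is NaN, and std::sort then has undefined behaviour. The sketch below is only an illustration; `LessPhi`, `equivalent`, `hasNaN` and `sortedWithoutNaN` are hypothetical names operating on bare floats, not the actual HitLessPhi code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical stand-in for a phi-ordering comparator (not the CMSSW type).
struct LessPhi {
  bool operator()(float a, float b) const { return a < b; }
};

// Two values are "equivalent" for the comparator if neither orders before the other.
inline bool equivalent(float a, float b) {
  LessPhi less;
  return !less(a, b) && !less(b, a);
}

// True if any element is NaN; in that case sorting with LessPhi is undefined behaviour.
inline bool hasNaN(const std::vector<float>& phis) {
  return std::any_of(phis.begin(), phis.end(), [](float p) { return std::isnan(p); });
}

// Defensive variant: drop the NaNs first, then sort what is left.
inline std::vector<float> sortedWithoutNaN(std::vector<float> phis) {
  phis.erase(std::remove_if(phis.begin(), phis.end(), [](float p) { return std::isnan(p); }),
             phis.end());
  std::sort(phis.begin(), phis.end(), LessPhi{});
  return phis;
}
```

With this comparator, NaN is "equivalent" to both 1 and 2 while 1 and 2 are not equivalent to each other, so equivalence is not transitive. That is the requirement std::sort relies on, and it may read out of bounds when it is violated.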
|
assign hlt (To make sure this remains on HLT's radar.) |
New categories assigned: hlt @missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Following up on this issue, as we still see HLT crashes similar to this one. These appeared again in run-367112, run-367337 and run-367553 (all pp collision runs). The CMSSW release at HLT then was The corresponding stack traces from DAQ are attached below, and the corresponding error-stream files are available on EOS. First attempts to reproduce the crashes offline failed (again). The recipe used in those failed attempts was adapted in [1] to be valid for FYI: @silviodonato @fwyzard @mzarucki
[1]
#!/bin/bash
# cmsrel CMSSW_13_0_5_patch1
# cd CMSSW_13_0_5_patch1
# cmsenv
# # save this file as test.sh
# chmod u+x test.sh
# ./test.sh 367553 4 # runNumber nThreads
[ $# -eq 2 ] || exit 1
RUNNUM="${1}"
NUMTHREADS="${2}"
ERRDIR=/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream
RUNDIR="${ERRDIR}"/run"${RUNNUM}"
for dirPath in $(ls -d "${RUNDIR}"*); do
  # require at least one non-empty FRD file
  [ $(cd "${dirPath}" ; find -maxdepth 1 -size +0 | grep .raw | wc -l) -gt 0 ] || continue
  runNumber="${dirPath: -6}"
  JOBTAG=test_run"${runNumber}"
  HLTMENU="--runNumber ${runNumber}"
  hltConfigFromDB ${HLTMENU} > "${JOBTAG}".py
  cat <<EOF >> "${JOBTAG}".py
process.options.numberOfThreads = ${NUMTHREADS}
process.options.numberOfStreams = 0
process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)
del process.PrescaleService
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
import os
import glob
process.source.fileListMode = True
process.source.fileNames = sorted([foo for foo in glob.glob("${dirPath}/*raw") if os.path.getsize(foo) > 0])
process.EvFDaqDirector.buBaseDir = "${ERRDIR}"
process.EvFDaqDirector.runNumber = ${runNumber}
process.hltDQMFileSaverPB.runNumber = ${runNumber}
# remove paths containing OutputModules
streamPaths = [pathName for pathName in process.finalpaths_()]
for foo in streamPaths:
    process.__delattr__(foo)
EOF
  rm -rf run"${runNumber}"
  mkdir run"${runNumber}"
  echo "run${runNumber} .."
  cmsRun "${JOBTAG}".py &> "${JOBTAG}".log
  echo "run${runNumber} .. done (exit code: $?)"
  unset runNumber
done
unset dirPath |
Reporting another instance of these HLT crashes in run-368489 (3 crashes). What is interesting about this kind of crash is that it often comes in bursts (multiple crashes within a few seconds on different HLT nodes). The full stack trace of the latest crashes is attached (includes metadata). |
Reporting another HLT crash of this kind. This time, there was only one crash of this kind in this run.
|
Reporting another HLT crash of this kind. This time, three crashes of this kind within a few seconds of each other, on different HLT nodes, in the same run.
|
There is a reproducer [*] which might be related to this issue. The reproducer was tested with Some things are still unclear to me, though. The reproducer comes from run-368685, which had 4 crashes (see previous comment). Three of the four crashes mentioned
[*]
#!/bin/bash
# cmsrel CMSSW_13_0_7
# cd CMSSW_13_0_7/src
# cmsenv
OUTFILE=hltTest368685
ACCELER=gpu-nvidia
hltGetConfiguration run:368685 \
  --data \
  --no-prescale \
  --no-output \
  --globaltag 130X_dataRun3_HLT_v2 \
  --max-events 1 \
  --paths HLT_DoubleMediumChargedIsoDisplacedPFTauHPS32_Trk1_eta2p1_v* \
  --input file:/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/root/run368685/run368685_ls1009_index000027_fu-c2b02-41-01_pid1894024.root \
  > "${OUTFILE}".py
cat <<@EOF >> "${OUTFILE}".py
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
process.options.numberOfThreads = 1
process.options.accelerators = ['${ACCELER}']
process.source.skipEvents = cms.untracked.uint32(64)
process.maxEvents.input = 1
@EOF
cmsRun "${OUTFILE}".py &> "${OUTFILE}".log |
(I edited the reproducer in #40174 (comment) to simplify it slightly.) Based on the reproducer, the first NaN I can find appears in
because there is one case where ldir.z() == 0 [1] [2]. The crash occurs only with GPU offloading (no crash with CPU only) [3].
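For reference, the arithmetic behind the `ldir.z() == 0` case can be reproduced with plain floats. This is a sketch under assumptions: `Vec3`, `project` and `isFinite` are hypothetical stand-ins for the CMSSW vector types and simply mirror the expression `lPos - ldir * delta / ldir.z()` component by component, assuming IEEE 754 floating point.

```cpp
#include <cassert>
#include <cmath>

// Minimal 3-vector; a hypothetical stand-in for LocalVector/LocalPoint.
struct Vec3 {
  float x, y, z;
};

// Mirrors lPos - ldir * delta / ldir.z(), evaluated per component, left to right.
inline Vec3 project(const Vec3& lPos, const Vec3& ldir, float delta) {
  return {lPos.x - ldir.x * delta / ldir.z,
          lPos.y - ldir.y * delta / ldir.z,
          lPos.z - ldir.z * delta / ldir.z};
}

// True only if all three components are finite (neither infinite nor NaN).
inline bool isFinite(const Vec3& v) {
  return std::isfinite(v.x) && std::isfinite(v.y) && std::isfinite(v.z);
}
```

With `ldir.z == 0` and a non-zero `delta`, the x and y components become infinite and the z component becomes NaN (0 * delta / 0), so whatever consumes the projected position downstream sees non-finite values.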
The module that's crashing is Could experts please have a look? @cms-sw/reconstruction-l2 @cms-sw/tracking-pog-l2
[1]
diff --git a/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h b/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h
index f740faff365..529d4ac1d10 100644
--- a/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h
+++ b/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h
@@ -2,6 +2,7 @@
#define RecHitPropagator_H
#include "TrackingTools/TrajectoryState/interface/TrajectoryStateOnSurface.h"
+#include "FWCore/MessageLogger/interface/MessageLogger.h"
class TrackingRecHit;
class MagneticField;
@@ -18,12 +19,17 @@ public:
// propagate from glued to mono/stereo
inline TrajectoryStateOnSurface fastProp(const TrajectoryStateOnSurface& ts, const Plane& oPlane, const Plane& tPlane) {
+edm::LogPrint("RecHitPropagator") << " RecHitPropagator::fastProp " << __LINE__ << " ts.isValid=" << ts.isValid() << " ts.globalPosition=" << ts.globalPosition();
+edm::LogPrint("RecHitPropagator") << " RecHitPropagator::fastProp " << __LINE__ << " oPlane.position=" << oPlane.position() << " tPlane.position=" << tPlane.position();
GlobalVector gdir = ts.globalMomentum();
-
+edm::LogPrint("RecHitPropagator") << " RecHitPropagator::fastProp " << __LINE__ << " gdir=" << gdir;
double delta = tPlane.localZ(oPlane.position());
+edm::LogPrint("RecHitPropagator") << " RecHitPropagator::fastProp " << __LINE__ << " delta=" << delta;
LocalVector ldir = tPlane.toLocal(gdir); // fast prop!
+edm::LogPrint("RecHitPropagator") << " RecHitPropagator::fastProp " << __LINE__ << " ldir=" << ldir;
LocalPoint lPos = tPlane.toLocal(ts.globalPosition());
LocalPoint projectedPos = lPos - ldir * delta / ldir.z();
+edm::LogPrint("RecHitPropagator") << " RecHitPropagator::fastProp " << __LINE__ << " lPos=" << lPos << " projectedPos=" << projectedPos;
// we can also patch it up as only the position-errors are used...
GlobalTrajectoryParameters gp(
tPlane.toGlobal(projectedPos), gdir, ts.charge(), &ts.globalParameters().magneticField());
[2] Output on GPU:
[3] Output on CPU:
|
It may be more practical to start upstream at the input position to be at least in the CMS cavern, or perhaps a bit more local. |
Digging a bit deeper (for reference, missirol@bd3a869 contains some untidy printouts used on top of #40174 (comment)), I see that
because p.z() == -9.53674e-07 . The use of StraightLinePropagator in this case comes from
After discussing offline with @mmusich, the simplest fix I could come up with is [*] (this avoids the crash in the reproducer), but obviously this should be reviewed by experts. It might also be useful to review this particular trigger Path, and understand why it is prone to this issue.
[*]
diff --git a/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h b/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h
index f740faff365..ad4411b1374 100644
--- a/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h
+++ b/RecoTracker/MeasurementDet/plugins/RecHitPropagator.h
@@ -19,9 +19,14 @@ public:
// propagate from glued to mono/stereo
inline TrajectoryStateOnSurface fastProp(const TrajectoryStateOnSurface& ts, const Plane& oPlane, const Plane& tPlane) {
GlobalVector gdir = ts.globalMomentum();
+ LocalVector ldir = tPlane.toLocal(gdir); // fast prop!
+
+ // if ldir.z() == 0, return an invalid TrajectoryStateOnSurface
+ if (ldir.z() == 0) {
+ return TrajectoryStateOnSurface();
+ }
double delta = tPlane.localZ(oPlane.position());
- LocalVector ldir = tPlane.toLocal(gdir); // fast prop!
LocalPoint lPos = tPlane.toLocal(ts.globalPosition());
LocalPoint projectedPos = lPos - ldir * delta / ldir.z();
// we can also patch it up as only the position-errors are used...
diff --git a/RecoTracker/MeasurementDet/plugins/doubleMatch.icc b/RecoTracker/MeasurementDet/plugins/doubleMatch.icc
index ab45ef8d6c6..ff1a5efb34e 100644
--- a/RecoTracker/MeasurementDet/plugins/doubleMatch.icc
+++ b/RecoTracker/MeasurementDet/plugins/doubleMatch.icc
@@ -82,7 +82,8 @@ void TkGluedMeasurementDet::doubleMatch(const TrajectoryStateOnSurface& ts,
if LIKELY (!emptyMono) {
// mono does require "projection" for precise estimate
TrajectoryStateOnSurface mts = fastProp(ts, geomDet().surface(), theMonoDet->geomDet().surface());
- theMonoDet->simpleRecHits(mts, collector.estimator(), data, monoHits);
+ if LIKELY (mts.isValid())
+ theMonoDet->simpleRecHits(mts, collector.estimator(), data, monoHits);
}
// print("mono", mts,ts);
mf = monoHits.size();
@@ -96,7 +97,8 @@ void TkGluedMeasurementDet::doubleMatch(const TrajectoryStateOnSurface& ts,
emptyStereo = theStereoDet->empty(data);
if LIKELY (!emptyStereo) {
TrajectoryStateOnSurface pts = fastProp(ts, geomDet().surface(), theStereoDet->geomDet().surface());
- theStereoDet->simpleRecHits(pts, collector.estimator(), data, stereoHits);
+ if LIKELY (pts.isValid())
+ theStereoDet->simpleRecHits(pts, collector.estimator(), data, stereoHits);
// print("stereo", pts,ts);
}
sf = stereoHits.size(); |
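The pattern used in the diff above (return an invalid state from the projection, and check validity at the call site) can be sketched in isolation as follows. Everything here is hypothetical and simplified: `State` only imitates the "default-constructed means invalid" behaviour, and `guardedProject` and `collect` are illustrative names, not CMSSW functions.

```cpp
#include <cassert>
#include <vector>

struct Vec3 {
  float x, y, z;
};

// Hypothetical stand-in for a trajectory state: default-constructed means invalid.
class State {
public:
  State() : valid_(false), pos_{0.f, 0.f, 0.f} {}
  explicit State(const Vec3& pos) : valid_(true), pos_(pos) {}
  bool isValid() const { return valid_; }
  const Vec3& position() const { return pos_; }

private:
  bool valid_;
  Vec3 pos_;
};

// Guarded projection: refuse to divide when the local direction has no z component.
inline State guardedProject(const Vec3& lPos, const Vec3& ldir, float delta) {
  if (ldir.z == 0.f)
    return State();
  const float s = delta / ldir.z;
  return State(Vec3{lPos.x - ldir.x * s, lPos.y - ldir.y * s, lPos.z - ldir.z * s});
}

// Caller side: only use the state when it is valid.
inline void collect(const State& s, std::vector<Vec3>& out) {
  if (s.isValid())
    out.push_back(s.position());
}
```

Note that an exact comparison with zero only covers this specific case; a tiny but non-zero `ldir.z` still yields very large (though finite) coordinates, so any broader protection would need a separate decision by the tracking experts.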
It seems OK to apply your changes to avoid the crash specifically here with However, it seems like the problem is more generic and calls to Without it I wouldn't be surprised that some other call taking a |
@slava77 , thanks for the suggestions in #40174 (comment), I can try to improve the patch. |
Just reporting one more crash of this kind in run-370093.
The log from DAQ mentions [1]
[2]
|
Thanks to clarifications received offline from @slava77, I tried to address his comment in missirol@6a18b81 (this fixes the two reproducers found thus far). Some details in [*]. To summarise, missirol@269dbd9 and missirol@6a18b81 are what I could come up with, as a layman of tracking. My goal would be to fix the crashes seen at HLT, so it would be important to be able to backport the fix (whatever it will be). How to proceed ? [*] The rationale of missirol@6a18b81 is that |
both look reasonable to me. Perhaps the next step is to make a PR to the master branch and see if any differences show up in the PR tests. |
Just for the record, I tested the reproducer in #40174 (comment) using a more recent HLT menu (i.e. |
type tracking |
+hlt
|
I summarised my understanding in #40174 (comment). I think we can close this issue. If similar crashes reappear, we can re-open it, or create a new one. I'll let RECO conveners comment/sign. |
please close
Going ahead with closing this issue. |
Dearest,
HLT crashes were observed in Run 362720. The messages seemed related to Tracking:
but I couldn't reproduce them.
Instructions to run on Hilton (hilton_c2b02_44_01) as hltpro
Instructions to run on a regular lxplus machine.
@cms-sw/hlt-l2 FYI
@cms-sw/tracking-dpg-l2 FYI