HLT crashes in run 359297 from module EcalRecHitProducer:hltEcalRecHitWithoutTPs
#39568
A new Issue was created by @trocino Daniele Trocino. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign ecal-dpg
I think the crash occurs only if the ECAL unpacker runs on GPU. One can check this by appending
del process.hltEcalUncalibRecHit.cuda
del process.hltEcalRecHit.cuda
which, I think, runs the RecHit producers on CPU while still running the unpacker on GPU (adding also …).
Adding some printouts shows that behind the crash there is a RecHit with an invalid detId (the value of the invalid detId is not always the same for the different crashes seen during run 359297; the one below is only one example):
DetId::subdetId() = EcalBarrel
DetId::rawId() = 838860888
EBDetId::hashedIndex() = 30687
EBDetId::ieta() = 0
EBDetId::iphi() = 88
EBDetId::zside() = -1
EBDetId::validDetId(ieta, iphi) = false
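For reference, a minimal Python sketch (not CMSSW code) of the EBDetId validity condition, which requires |ieta| between 1 and 85 and iphi between 1 and 360; it only mirrors the C++ logic to show why the values printed above are invalid:

def eb_detid_is_valid(ieta: int, iphi: int) -> bool:
    # EcalBarrel crystals span ieta = -85..-1 and 1..85 (there is no ring 0)
    # and iphi = 1..360; anything else is an invalid detector id
    return 0 < abs(ieta) <= 85 and 1 <= iphi <= 360

print(eb_detid_is_valid(0, 88))   # False: ieta = 0, as in the printout above
print(eb_detid_is_valid(-1, 88))  # True: the nearest valid ring on the EB- side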
New categories assigned: ecal-dpg @simonepigazzini, @jainshilpi, @thomreis you have been requested to review this Pull request/Issue and eventually sign? Thanks
This example is indeed an invalid detector id. We will take a look at the digis from the GPU unpacker to see why this happens. The RecHit producer itself always runs on CPU at the moment, but with different input collections on machines with a GPU. Therefore, an issue in the GPU unpacker can lead to a crash in the RecHit producer.
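A hedged sketch of how one could force the CPU input collections when replaying such a run, generalising the two del statements quoted above; the helper name is illustrative, and it assumes hltEcalUncalibRecHit and hltEcalRecHit are SwitchProducers with a cuda branch, as in the HLT menu discussed here:

def force_ecal_local_reco_on_cpu(process):
    # Dropping the 'cuda' branch of these SwitchProducers makes the framework
    # fall back to the 'cpu' branch, i.e. to the CPU-produced input collections
    for name in ('hltEcalUncalibRecHit', 'hltEcalRecHit'):
        module = getattr(process, name, None)
        if module is not None and hasattr(module, 'cuda'):
            delattr(module, 'cuda')
    return process

# Usage, appended to an HLT configuration dump:
# process = force_ecal_local_reco_on_cpu(process)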
FYI: @cms-sw/hlt-l2 @cms-sw/heterogeneous-l2 (HLT crashes, seemingly specific to reconstruction on GPUs)
urgent
assign hlt (To make sure this remains on HLT's radar.) |
New categories assigned: hlt @missirol, @Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks
@cms-sw/ecal-dpg-l2 Today, during collisions, there was a crash at HLT which looks similar to the one described here, see
Hi @missirol, this last instance is likely caused by a tower in EB-01 that has data-integrity problems. The problem is mostly contained in one tower, which could be masked as a short-term solution if needed. See also slide 7 of last week's ECAL PFG shifter report: https://indico.cern.ch/event/1288622/contributions/5414918/attachments/2650937/4590074/PFG_week_20_report_Orlandi.pdf
FYI @grasph
@thomreis, I reproduced the latest crash on
using the script copied in [1]. As for the first crash described in this issue, it does not occur if the GPU reconstruction is disabled.
[1]
#!/bin/bash
# cmsrel CMSSW_13_0_6
# cd CMSSW_13_0_6/src
# cmsenv
# # save this file as test.sh
# chmod u+x test.sh
# ./test.sh 367771 4 # runNumber nThreads
[ $# -eq 2 ] || exit 1
RUNNUM="${1}"
NUMTHREADS="${2}"
ERRDIR=/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream
RUNDIR="${ERRDIR}"/run"${RUNNUM}"
for dirPath in $(ls -d "${RUNDIR}"*); do
# require at least one non-empty FRD file
  [ $(cd "${dirPath}"; find . -maxdepth 1 -size +0 -name '*.raw' | wc -l) -gt 0 ] || continue
runNumber="${dirPath: -6}"
JOBTAG=test_run"${runNumber}"
HLTMENU="--runNumber ${runNumber}"
hltConfigFromDB ${HLTMENU} > "${JOBTAG}".py
cat <<EOF >> "${JOBTAG}".py
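# Overrides appended to the HLT menu dump:
# - set the number of threads; numberOfStreams = 0 means one stream per thread
# - raise the online beam spot time threshold so that the payloads of the
#   replayed run are not discarded as too old
# - drop the PrescaleService and fall back to the default MessageLogger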
process.options.numberOfThreads = ${NUMTHREADS}
process.options.numberOfStreams = 0
process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)
del process.PrescaleService
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
import os
import glob
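# Run directly on the non-empty error-stream .raw files of this run via
# fileListMode, and point EvFDaqDirector and the DQM file saver to the
# corresponding base directory and run number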
process.source.fileListMode = True
process.source.fileNames = sorted([foo for foo in glob.glob("${dirPath}/*raw") if os.path.getsize(foo) > 0])
process.EvFDaqDirector.buBaseDir = "${ERRDIR}"
process.EvFDaqDirector.runNumber = ${runNumber}
process.hltDQMFileSaverPB.runNumber = ${runNumber}
# remove paths containing OutputModules
streamPaths = [pathName for pathName in process.finalpaths_()]
for foo in streamPaths:
process.__delattr__(foo)
EOF
rm -rf run"${runNumber}"
mkdir run"${runNumber}"
echo "run${runNumber} .."
cmsRun "${JOBTAG}".py &> "${JOBTAG}".log
echo "run${runNumber} .. done (exit code: $?)"
unset runNumber
done
unset dirPath
There was another instance of this crash in run 368547 (1 crash).
Is this the first new crash in the last two weeks, i.e. since the one in run 367771?
Yes, as far as I know (we monitor the crashes semi-automatically, so it's possible we miss one, but I don't think we missed any in this case).
Reporting another HLT crash of this kind.
Reporting another HLT crash of this kind.
@thomreis, these crashes are not frequent, but they continue to happen.
Hi @missirol, there is no ETA yet. I have just started to look into this today and managed to reproduce the crash with your recipe.
Hi @missirol, are the error_stream files for the crashes in runs 368547 and 368724 available somewhere? I would like to check if the fix for 367771 also avoids the other two crashes.
Hi @thomreis, we have the files for run-368724 [1]. I will request the files for run-368547 and share them here if they are still available.
[1]
Hi @thomreis, below is the path to the error-stream files of run-368547.
PR #41977 should avoid crashes like these in the future. Backports will follow.
Just for the record, the last crash of this kind was seen in run-370293.
The corresponding input files can be found in
+hlt
@thomreis provided a fix for these crashes in #41977, which was then backported and integrated in
I ran some tests: this crash would not have happened with the fix already in 13_0_10, nor with the improved fix in #42301.
Thanks for checking, @thomreis. Do you want to sign off this issue for ECAL before I close it?
+ecal-dpg
This issue is fully signed and ready to be closed.
please close
Several HLT jobs crashed during run 359297, all due to module EcalRecHitProducer:hltEcalRecHitWithoutTPs. Before crashing, the following error message appears:
cmsRun: /.../cmssw/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_9/src/CalibCalorimetry/EcalLaserAnalyzer/src/MEEBGeom.cc:31: static int MEEBGeom::sm(MEEBGeom::EBGlobalCoord, MEEBGeom::EBGlobalCoord): Assertion `ieta > 0 && ieta <= 85' failed.
The full log output for several such cases, including the stack trace, can be found on EOS:
/eos/cms/store/user/trocino/HLT_ECAL_Debug/LogOutput/
RAW data files in EDM ROOT format containing all the affected events can be found at
/eos/cms/store/user/trocino/HLT_ECAL_Debug/EdmRawRoot/
Please note that the error does not seem to be reproducible on LXPLUS (probably because it runs on CPUs), while it's reproducible on machines with GPUs, e.g. Hilton machines.
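As a cross-check on a machine with a GPU, one way to force the CPU-only reconstruction is to restrict the allowed accelerators in the configuration; this is a hedged suggestion, assuming the accelerators option of recent CMSSW releases is available in the release used here:

# Append to the HLT configuration before running; restricting the accelerators
# to 'cpu' makes the SwitchProducers pick their CPU branches, so a GPU-specific
# problem should disappear (assumes the option exists in this release)
process.options.accelerators = ['cpu']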
A recipe to reproduce the errors: