HLT crashes in run 359297 from module EcalRecHitProducer:hltEcalRecHitWithoutTPs
#39568
A new Issue was created by @trocino Daniele Trocino. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign ecal-dpg
I think the crash occurs only if the ECAL unpacker runs on GPU. One can check this by appending
del process.hltEcalUncalibRecHit.cuda
del process.hltEcalRecHit.cuda
which, I think, runs the RecHit producers on CPU while still running the unpacker on GPU (adding also …).
Adding some printouts shows that behind the crash there is a RecHit with an invalid detId (the value of the invalid detId is not always the same for the different crashes seen during run 359297; the one below is only one example):
DetId::subdetId() = EcalBarrel
DetId::rawId() = 838860888
EBDetId::hashedIndex() = 30687
EBDetId::ieta() = 0
EBDetId::iphi() = 88
EBDetId::zside() = -1
EBDetId::validDetId(ieta, iphi) = false
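For reference, a minimal Python sketch (not CMSSW code) of the EBDetId validity condition, which requires |ieta| between 1 and 85 and iphi between 1 and 360; it only mirrors the C++ logic to show why the values printed above are invalid:

def eb_detid_is_valid(ieta: int, iphi: int) -> bool:
    # EcalBarrel crystals span ieta = -85..-1 and 1..85 (there is no ring 0)
    # and iphi = 1..360; anything else is an invalid detector id
    return 0 < abs(ieta) <= 85 and 1 <= iphi <= 360

print(eb_detid_is_valid(0, 88))   # False: ieta = 0, as in the printout above
print(eb_detid_is_valid(-1, 88))  # True: the nearest valid ring on the EB- side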
New categories assigned: ecal-dpg @simonepigazzini, @jainshilpi, @thomreis you have been requested to review this Pull request/Issue and eventually sign? Thanks
This example is indeed an invalid detector id. We will take a look at the digis from the GPU unpacker to see why this happens. The RecHit producer itself always runs on CPU at the moment, but with different input collections on machines with a GPU. Therefore, an issue in the GPU unpacker can lead to a crash in the RecHit producer.
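A hedged sketch of how one could force the CPU input collections when replaying such a run, generalising the two del statements quoted above; the helper name is illustrative, and it assumes hltEcalUncalibRecHit and hltEcalRecHit are SwitchProducers with a cuda branch, as in the HLT menu discussed here:

def force_ecal_local_reco_on_cpu(process):
    # Dropping the 'cuda' branch of these SwitchProducers makes the framework
    # fall back to the 'cpu' branch, i.e. to the CPU-produced input collections
    for name in ('hltEcalUncalibRecHit', 'hltEcalRecHit'):
        module = getattr(process, name, None)
        if module is not None and hasattr(module, 'cuda'):
            delattr(module, 'cuda')
    return process

# Usage, appended to an HLT configuration dump:
# process = force_ecal_local_reco_on_cpu(process)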
FYI: @cms-sw/hlt-l2 @cms-sw/heterogeneous-l2 (HLT crashes, seemingly specific to reconstruction on GPUs)
urgent
assign hlt (To make sure this remains on HLT's radar.) |
New categories assigned: hlt @missirol, @Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks
@cms-sw/ecal-dpg-l2 Today, during collisions, there was a crash at HLT which looks similar to the one described here, see
Hi @missirol, this last instance is likely caused by a tower in EB-01 that has data-integrity problems. The problem is mostly contained in one tower, which could be masked as a short-term solution if needed. See also slide 7 of last week's ECAL PFG shifter report: https://indico.cern.ch/event/1288622/contributions/5414918/attachments/2650937/4590074/PFG_week_20_report_Orlandi.pdf
FYI @grasph
@thomreis, I reproduced the latest crash on
using the script copied in [1]. As for the first crash described in this issue, it does not occur if the GPU reconstruction is disabled.
[1]
#!/bin/bash
# cmsrel CMSSW_13_0_6
# cd CMSSW_13_0_6/src
# cmsenv
# # save this file as test.sh
# chmod u+x test.sh
# ./test.sh 367771 4 # runNumber nThreads
[ $# -eq 2 ] || exit 1
RUNNUM="${1}"
NUMTHREADS="${2}"
ERRDIR=/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream
RUNDIR="${ERRDIR}"/run"${RUNNUM}"
for dirPath in $(ls -d "${RUNDIR}"*); do
# require at least one non-empty FRD file
  [ $(cd "${dirPath}"; find . -maxdepth 1 -size +0 -name '*.raw' | wc -l) -gt 0 ] || continue
runNumber="${dirPath: -6}"
JOBTAG=test_run"${runNumber}"
HLTMENU="--runNumber ${runNumber}"
hltConfigFromDB ${HLTMENU} > "${JOBTAG}".py
cat <<EOF >> "${JOBTAG}".py
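# Overrides appended to the HLT menu dump:
# - set the number of threads; numberOfStreams = 0 means one stream per thread
# - raise the online beam spot time threshold so that the payloads of the
#   replayed run are not discarded as too old
# - drop the PrescaleService and fall back to the default MessageLogger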
process.options.numberOfThreads = ${NUMTHREADS}
process.options.numberOfStreams = 0
process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)
del process.PrescaleService
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
import os
import glob
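# Run directly on the non-empty error-stream .raw files of this run via
# fileListMode, and point EvFDaqDirector and the DQM file saver to the
# corresponding base directory and run number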
process.source.fileListMode = True
process.source.fileNames = sorted([foo for foo in glob.glob("${dirPath}/*raw") if os.path.getsize(foo) > 0])
process.EvFDaqDirector.buBaseDir = "${ERRDIR}"
process.EvFDaqDirector.runNumber = ${runNumber}
process.hltDQMFileSaverPB.runNumber = ${runNumber}
# remove paths containing OutputModules
streamPaths = [pathName for pathName in process.finalpaths_()]
for foo in streamPaths:
process.__delattr__(foo)
EOF
rm -rf run"${runNumber}"
mkdir run"${runNumber}"
echo "run${runNumber} .."
cmsRun "${JOBTAG}".py &> "${JOBTAG}".log
echo "run${runNumber} .. done (exit code: $?)"
unset runNumber
done
unset dirPath
There was another instance of this crash in run 368547 (1 crash).
Is this the first new crash in the last two weeks, i.e. since the one in run 367771?
Yes, as far as I know (we monitor the crashes semi-automatically, so it's possible we miss one, but I don't think we missed any in this case).
Reporting another HLT crash of this kind.
Reporting another HLT crash of this kind.
@thomreis, these crashes are not frequent, but they continue to happen.
Hi @missirol, there is no ETA yet. I have just started to look into this today and managed to reproduce the crash with your recipe.
Hi @missirol, are the error_stream files for the crashes in runs 368547 and 368724 available somewhere? I would like to check if the fix for 367771 also avoids the other two crashes.
Hi @thomreis, we have the files for run-368724 [1]. I will request the files for run-368547 and share them here if they are still available.
[1]
Hi @thomreis, below is the path to the error-stream files of run-368547.
PR #41977 should avoid crashes like these in the future. Backports will follow.
Just for the record, the last crash of this kind was seen in run-370293.
The corresponding input files can be found in
+hlt
@thomreis provided a fix for these crashes in #41977, which was then backported and integrated in
I ran some tests: this crash would not have happened with the fix already in 13_0_10, nor with the improved fix in #42301.
Thanks for checking, @thomreis. Do you want to sign off this issue for ECAL before I close it?
+ecal-dpg
This issue is fully signed and ready to be closed.
please close
Several HLT jobs crashed during run 359297, all due to module EcalRecHitProducer:hltEcalRecHitWithoutTPs. Before crashing, the following error message appears:
cmsRun: /.../cmssw/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_9/src/CalibCalorimetry/EcalLaserAnalyzer/src/MEEBGeom.cc:31: static int MEEBGeom::sm(MEEBGeom::EBGlobalCoord, MEEBGeom::EBGlobalCoord): Assertion `ieta > 0 && ieta <= 85' failed.
The full log output for several such cases, including the stack trace, can be found on EOS:
/eos/cms/store/user/trocino/HLT_ECAL_Debug/LogOutput/
RAW data files in EDM ROOT format containing all the affected events can be found at
/eos/cms/store/user/trocino/HLT_ECAL_Debug/EdmRawRoot/
Please note that the error does not seem to be reproducible on LXPLUS (probably because it runs on CPUs), while it's reproducible on machines with GPUs, e.g. Hilton machines.
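As a cross-check on a machine with a GPU, one way to force the CPU-only reconstruction is to restrict the allowed accelerators in the configuration; this is a hedged suggestion, assuming the accelerators option of recent CMSSW releases is available in the release used here:

# Append to the HLT configuration before running; restricting the accelerators
# to 'cpu' makes the SwitchProducers pick their CPU branches, so a GPU-specific
# problem should disappear (assumes the option exists in this release)
process.options.accelerators = ['cpu']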
A recipe to reproduce the errors: