HLT crash in run-367615 #41741

Open
missirol opened this issue May 20, 2023 · 6 comments

@missirol
Contributor

missirol commented May 20, 2023

In run-367615 (pp collisions), DAQ reported 1 CMSSW crash at HLT (release: CMSSW_13_0_5_patch1). The link to the corresponding HLT elog is here.

The available stack trace is attached (f3mon_run367615.txt). A possibly relevant excerpt of it is in [1].

The corresponding error-stream files are available, but first attempts to reproduce the crash offline (on a Hilton machine) failed. The recipe used for those attempts is adapted in [2] so that it also works on lxplus and lxplus-gpu.

FYI: @cms-sw/hlt-l2 @silviodonato @fwyzard @mzarucki @trtomei

[1]

Thread 32 (Thread 0x7f2b3bbfe700 (LWP 3719073) "cmsRun"):
#0  0x00007f2c8605fa71 in poll () from /lib64/libc.so.6
#1  0x00007f2c7da7746f in full_read.constprop () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2  0x00007f2c7da42b6c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  0x00007f2c7da4333b in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_5/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000000000000000 in ?? ()
#6  0xc27000c2c27000c2 in ?? ()
#7  0x0000000000000000 in ?? ()

(..)

Current Modules:
Module: CAHitQuadrupletEDProducer:hltFastPVPixelTracksHitQuadruplets (crashed)
Module: RecoTauProducer:hltHpsCombinatoricRecoTausDispl
Module: PrimaryVertexProducer:hltVerticesPF
Module: DeepTauId:hltHpsPFTauDeepTauProducerForVBFIsoTau
Module: GlobalEvFOutputModule:hltOutputHLTMonitor
Module: CkfTrackCandidateMaker:hltDisplacedhltIter4PFlowCkfTrackCandidatesForTau
Module: PrimaryVertexProducer:hltVerticesPF
Module: GlobalEvFOutputModule:hltOutputPhysicsScoutingPFMonitor
Module: InclusiveCandidateVertexFinder:hltDeepInclusiveVertexFinderPF
Module: TriggerSummaryProducerAOD:hltTriggerSummaryAOD
Module: PFClusterProducer:hltParticleFlowClusterHBHE
Module: SeedCreatorFromRegionConsecutiveHitsTripletOnlyEDProducer:hltDisplacedhltIter4PFlowPixelLessSeedsForTau
Module: IsolatedPixelTrackCandidateL1TProducer:hltIsolPixelTrackProdHB
Module: InclusiveCandidateVertexFinder:hltDeepInclusiveVertexFinderPF
Module: SeedCreatorFromRegionConsecutiveHitsTripletOnlyEDProducer:hltDisplacedhltIter4PFlowPixelLessSeedsForTau
Module: CaloTowersCreator:hltTowerMakerForAll
Module: CkfTrackCandidateMaker:hltDisplacedhltIter4PFlowCkfTrackCandidatesForTau
Module: L2TauNNProducer:hltL2TauTagNNProducer
Module: BoostedJetONNXJetTagsProducer:hltParticleNetONNXJetTagsAK8
Module: GlobalEvFOutputModule:hltOutputDQM
Module: PFRecHitProducer:hltParticleFlowRecHitPSUnseeded
Module: CkfTrackCandidateMaker:hltDisplacedhltIter4PFlowCkfTrackCandidatesForTau
Module: HcalRawToDigi:hltHcalDigis
Module: CkfTrackCandidateMaker:hltDisplacedhltIter4PFlowCkfTrackCandidatesForTau
Module: CkfTrackCandidateMaker:hltDisplacedhltIter4PFlowCkfTrackCandidatesForTau
Module: CkfTrackCandidateMaker:hltDisplacedhltIter4PFlowCkfTrackCandidatesForTau
Module: HcalHaloDataProducer:hltHcalHaloData
Module: TrackProducer:hltIter0PFlowCtfWithMaterialTracks
Module: MultiHitFromChi2EDProducer:hltDisplacedhltIter4PFlowPixelLessHitTripletsForTau
Module: CkfTrackCandidateMaker:hltDisplacedhltIter4PFlowCkfTrackCandidatesForTau
Module: FastjetJetProducer:hltAK4CaloJets
Module: IsolatedPixelTrackCandidateL1TProducer:hltIsolPixelTrackProdHE
A fatal system signal has occurred: segmentation violation
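
For longer dumps like the one in [1], a small helper can pull out the module flagged `(crashed)` and count how often each module appears in the "Current Modules" list. This is only a sketch: the function name `summarize_modules` and the regular expression are illustrative assumptions based on the line format shown above, not part of CMSSW.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt above:
#   Module: <C++ type>:<module label> [(crashed)]
MODULE_RE = re.compile(
    r"^Module:\s*(?P<type>[^:\s]+):(?P<label>\S+)(?P<crashed>\s+\(crashed\))?\s*$"
)

def summarize_modules(dump):
    """Return (crashed, counts) for a 'Current Modules' dump.

    crashed: list of (type, label) pairs flagged with '(crashed)'
    counts:  Counter of how often each (type, label) pair is listed
    """
    crashed = []
    counts = Counter()
    for line in dump.splitlines():
        match = MODULE_RE.match(line.strip())
        if not match:
            continue  # skip headers, stack frames, and other text
        key = (match["type"], match["label"])
        counts[key] += 1
        if match["crashed"]:
            crashed.append(key)
    return crashed, counts

# Short sample copied from the excerpt in [1]
example = """\
Current Modules:
Module: CAHitQuadrupletEDProducer:hltFastPVPixelTracksHitQuadruplets (crashed)
Module: PrimaryVertexProducer:hltVerticesPF
Module: PrimaryVertexProducer:hltVerticesPF
A fatal system signal has occurred: segmentation violation
"""
crashed, counts = summarize_modules(example)
print(crashed)
print(counts.most_common(1))
```

Run on the full attachment, this would show at a glance which module was flagged and which modules were active more than once at the time of the crash.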

[2]

#!/bin/bash

# cmsrel CMSSW_13_0_5_patch1
# cd CMSSW_13_0_5_patch1/src
# cmsenv
# # save this file as test.sh
# chmod u+x test.sh
# ./test.sh 367615 4 # runNumber nThreads

[ $# -eq 2 ] || exit 1

RUNNUM="${1}"
NUMTHREADS="${2}"

ERRDIR=/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream
RUNDIR="${ERRDIR}"/run"${RUNNUM}"

for dirPath in $(ls -d "${RUNDIR}"*); do
  # require at least one non-empty FRD file
  [ $(cd "${dirPath}" ; find -maxdepth 1 -size +0 | grep .raw | wc -l) -gt 0 ] || continue
  runNumber="${dirPath: -6}"
  JOBTAG=test_run"${runNumber}"
  HLTMENU="--runNumber ${runNumber}"
  hltConfigFromDB ${HLTMENU} > "${JOBTAG}".py
  cat <<EOF >> "${JOBTAG}".py
process.options.numberOfThreads = ${NUMTHREADS}
process.options.numberOfStreams = 0
process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)
del process.PrescaleService
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
import os
import glob
process.source.fileListMode = True
process.source.fileNames = sorted([foo for foo in glob.glob("${dirPath}/*raw") if os.path.getsize(foo) > 0])
process.EvFDaqDirector.buBaseDir = "${ERRDIR}"
process.EvFDaqDirector.runNumber = ${runNumber}
process.hltDQMFileSaverPB.runNumber = ${runNumber}
# remove paths containing OutputModules
streamPaths = [pathName for pathName in process.finalpaths_()]
for foo in streamPaths:
    process.__delattr__(foo)
EOF
  rm -rf run"${runNumber}"
  mkdir run"${runNumber}"
  echo "run${runNumber} .."
  cmsRun "${JOBTAG}".py &> "${JOBTAG}".log
  echo "run${runNumber} .. done (exit code: $?)"
  unset runNumber
done
unset dirPath
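
After running the recipe in [2], one way to check whether any job reproduced the crash is to scan the resulting `test_run<runNumber>.log` files for the fatal-signal message seen in [1]. The sketch below assumes that an offline crash prints the same "A fatal system signal has occurred" line; the helper names and the `test_run000000.log` file in the demo are hypothetical.

```python
import tempfile
from pathlib import Path

# Marker copied from the excerpt in [1]; assumed to appear in offline logs too
FATAL_MARKER = "A fatal system signal has occurred"

def fatal_lines(text):
    """Return the lines of a log that contain the fatal-signal marker."""
    return [line.strip() for line in text.splitlines() if FATAL_MARKER in line]

def crashed_logs(log_dir="."):
    """Map log file name -> matching lines, for logs written by the recipe in [2]."""
    hits = {}
    for log in sorted(Path(log_dir).glob("test_run*.log")):
        found = fatal_lines(log.read_text(errors="replace"))
        if found:
            hits[log.name] = found
    return hits

# Self-contained demo with two fake logs (contents are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "test_run367615.log").write_text(
        "Begin processing\nA fatal system signal has occurred: segmentation violation\n"
    )
    Path(tmp, "test_run000000.log").write_text("Begin processing\n")
    demo = crashed_logs(tmp)
print(demo)
```

An empty result from `crashed_logs()` in the working directory would match the "not reproduced" outcome described above.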

@cmsbuild
Contributor

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@missirol
Contributor Author

assign hlt

I'll let others assign to other groups, if needed.

@cmsbuild
Contributor

New categories assigned: hlt

@missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

assign reconstruction

FYI @cms-sw/tracking-pog-l2

@cmsbuild
Contributor

New categories assigned: reconstruction

@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

I'm tempted to interpret the stack trace as a sign that the stack got corrupted. Maybe valgrind would reveal something?
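
A minimal sketch of what such a valgrind run could look like, written as a small Python helper that only builds the command line. The valgrind options used are standard memcheck options; the config name `test_run367615.py` follows the naming in the recipe above, while the helper name, the log file name, and the choice to wrap plain `cmsRun` are assumptions (a CMSSW-specific setup may need additional options or suppression files).

```python
import shlex

def valgrind_command(config, log_file="valgrind.log", executable="cmsRun"):
    """Build a memcheck command line for a given cmsRun configuration.

    Nothing is executed here; the list can be passed to subprocess.run().
    """
    return [
        "valgrind",
        "--tool=memcheck",        # memory-error detector
        "--track-origins=yes",    # report where uninitialised values came from
        "--num-callers=50",       # deeper stack traces in the report
        f"--log-file={log_file}", # write the report to a file
        executable,
        config,
    ]

cmd = valgrind_command("test_run367615.py")
print(shlex.join(cmd))
```

Since valgrind serializes threads and slows execution considerably, it is probably worth setting `process.options.numberOfThreads = 1` and restricting the input to the error-stream files of the crashed process.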
