HLT crash in run-367906 (sistrip::FEDBuffer::findChannels()) #41786
A new Issue was created by @missirol (Marino Missiroli). @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign hlt (I'll let others assign to other groups, if needed.) |
New categories assigned: hlt @missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks |
This is another instance of recent HLT crashes that I can't reproduce offline (see for example #40174, #41741 and #41742). This time I can also include the full log of the CMSSW job that crashed (see [1]), but I don't know if that helps.
@smorovic, is it possible to draw any conclusions comparing the log of the CMSSW job [1] and the content of the error-stream files [2]? [1] old_hlt_run367906_pid2793118.log [2] |
Event IDs in two raw files:
Last message in the log is from one of the previous events (file):
Timestamps of the last few files appearing locally at hltd for that process (last 3):
However, this looks OK. The last two files opened by the process were also saved; older ones were already handled and closed. For the crash, there is no information about the event ID (it is known only for an Exception). |
assign reconstruction FYI @cms-sw/tracking-pog-l2 |
New categories assigned: reconstruction @mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Possibly incidental, but there are two other threads in [...] |
So threads 36 and 6 (the crashing one) are operating on the same object: cmssw/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h Lines 207 to 211 in 1bce7ad
cmssw/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h Lines 230 to 241 in 1bce7ad
cmssw/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h Lines 250 to 256 in 1bce7ad
I'm assuming the [...]. On a cursory look the [...]. Another possible thread-safety problem is in cmssw/DataFormats/Common/interface/DetSetVectorNew.h Lines 634 to 649 in 1bce7ad
Here m_getter is defined as [...], but in practice it is used as a pointer to Getter, which is defined as [...], and LazyGetter<T>::fill() is not declared const! cmssw/DataFormats/Common/interface/DetSetVectorNew.h Lines 608 to 614 in 1bce7ad
So if the concrete LazyGetter<T>::fill() is not thread-safe, it could cause problems. In this case the concrete LazyGetter<T> is ClusterFiller (which I haven't digested yet). Note that despite all I wrote above, I can't tell from the stack trace whether the problem is really in thread safety or something else. |
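To make the suspected failure mode concrete, here is a minimal, self-contained sketch of the pattern described above, with hypothetical names (LazyGetter, Holder, fill() are stand-ins, not the actual CMSSW classes): an unsynchronized, non-const fill() reached through a const-looking interface.

```cpp
#include <thread>
#include <vector>

// Hypothetical stand-in for the lazy getter discussed above; the real
// CMSSW types (dslv::LazyGetter, ClusterFiller) are more involved.
struct LazyGetter {
  std::vector<int> cache;  // lazily filled, shared across threads
  bool filled = false;     // plain bool: no synchronization at all

  // Non-const and not thread-safe: two threads can both observe
  // filled == false and then mutate 'cache' concurrently.
  void fill() {
    if (!filled) {
      cache.assign(100, 42);  // data race if entered by two threads
      filled = true;
    }
  }
};

struct Holder {
  LazyGetter* m_getter = nullptr;  // set once, then shared by all threads

  // Const member function, so callers may assume concurrent calls are
  // safe; but in a const method only the *pointer* is const, not the
  // pointee, so the non-const fill() is still reachable.
  const std::vector<int>& data() const {
    m_getter->fill();  // mutation hiding behind a const interface
    return m_getter->cache;
  }
};

int main() {
  LazyGetter g;
  Holder h{&g};
  std::thread t1([&] { h.data(); });  // ThreadSanitizer would flag this
  std::thread t2([&] { h.data(); });
  t1.join();
  t2.join();
}
```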
This part is now addressed in #41853. It helped me reach the conclusion that the [...] looks like it would be thread safe. |
The race condition mentioned above is fixed in #41872. I'm not convinced, though, that it is the full cause of the crash. Ideally the race condition would only lead to [...] |
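For illustration, the textbook way to close this kind of race is to serialize the lazy fill, e.g. with std::call_once; a minimal sketch with hypothetical names follows (the actual change in #41872 may be structured quite differently).

```cpp
#include <mutex>
#include <thread>
#include <vector>

struct LazyGetter {
  std::vector<int> cache;
  std::once_flag onceFlag;  // coordinates the one-time fill

  const std::vector<int>& get() {
    // Exactly one thread runs the lambda; the others block until it
    // finishes, and its writes are guaranteed visible to them afterwards.
    std::call_once(onceFlag, [this] { cache.assign(100, 42); });
    return cache;
  }
};

int main() {
  LazyGetter g;
  std::thread t1([&] { g.get(); });  // safe: the fill happens once
  std::thread t2([&] { g.get(); });
  t1.join();
  t2.join();
}
```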
Thanks for the suggested fix, @makortel ! |
@missirol Do you want it backported to 13_0_X? (since it is unclear whether it plays a role in the crash) |
If it's clear that it is a fix (even partial), I would be in favor of backporting it, since we will still use 13_0_X online for a while. If it helps, I can prepare the backports. |
Thanks, I'll prepare the backports after the review of #41872 completes (in the current form it is easily cherry-pickable). |
As long as we're looking at DetSetNew: we've been getting DetSetNew assertion failures on aarch64 with some frequency,
from here: cmssw/DataFormats/Common/interface/DetSetNew.h Lines 84 to 88 in 1bce7ad
The test at line 85 looks to be wrong: it uses a bitwise OR instead of a logical one, and m_offset is initialized to -1. There's probably also a race condition, but I haven't stared at it long enough yet. Stack trace:
|
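For illustration, here is a hypothetical snippet (not the actual DetSetNew.h code) of why mixing a bitwise `|` with a member initialized to -1 is treacherous: `>=` binds tighter than `|`, and -1 has all bits set, so the OR can be unconditionally true.

```cpp
#include <iostream>

struct Span {
  int m_offset = -1;  // -1 is the "not yet assigned" sentinel
  int m_size = 0;

  // Buggy: '>=' binds tighter than '|', so this parses as
  // m_offset | (m_size >= 0). With m_offset == -1 (all bits set in
  // two's complement) the bitwise OR is nonzero, hence always "true".
  bool okBuggy() const { return m_offset | m_size >= 0; }

  // A logical test of the same two conditions, stating the intent.
  bool okLogical() const { return m_offset >= 0 && m_size >= 0; }
};

int main() {
  Span s;  // freshly constructed: offset still -1, clearly not valid
  std::cout << s.okBuggy() << '\n';    // prints 1: spurious pass
  std::cout << s.okLogical() << '\n';  // prints 0: correctly fails
}
```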
I agree (especially on the [...]).
At least the code has cmssw/RecoTracker/MeasurementDet/plugins/TkStripMeasurementDet.cc Lines 39 to 43 in 8617c80
which ends up calling [...]
which is part of the race condition I'm trying to fix in #41872 (assuming the stack trace is from an HLT job that does the on-demand strip unpacking and clustering; if not, the cause is likely something else) |
I think this is the case, as the config had:
process.hltSiStripRawToClustersFacility = cms.EDProducer( "SiStripClusterizerFromRaw",
    onDemand = cms.bool( True ),
    [..] |
I meant Dan's stack trace on the assertion failure on aarch64 (sorry for being unclear). |
Reporting another HLT crash which may be related to this issue. |
Reporting another HLT crash which may be related to this issue. |
Extracting more of the stack trace from #41786 (comment): |
Only one thread was in [...]. On the other hand, this observation supports my earlier hunch about the race condition in [...] |
@Dr15Jones pointed out that after #41872 the [...] cmssw/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h Lines 240 to 241 in e18c96d |
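As an aside on the general pattern at stake here, a hedged sketch (hypothetical names, simplified to a single writer) of why a "data first, flag second" publication needs acquire/release ordering on the flag rather than a plain bool:

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct DetSet {
  std::vector<int> clusters;       // lazily produced payload
  std::atomic<bool> ready{false};  // publication flag

  void fillOnce() {
    clusters.assign(16, 1);                        // write the data first...
    ready.store(true, std::memory_order_release);  // ...then publish it
  }

  const std::vector<int>* tryGet() const {
    // The acquire load pairs with the release store above: a reader that
    // observes ready == true is guaranteed to also see the cluster writes.
    if (ready.load(std::memory_order_acquire))
      return &clusters;
    return nullptr;  // not filled yet; caller must fall back or retry
  }
};

int main() {
  DetSet ds;
  std::thread writer([&] { ds.fillOnce(); });
  std::thread reader([&] { ds.tryGet(); });  // no torn or unordered reads
  writer.join();
  reader.join();
}
```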
Fix proposed in #41936 (to be backported to 13_0_X as well) |
The fixes in #41872 and #41936 were integrated and backported, and after HLT deployed [...] |
type tracking |
Thanks @missirol for reporting the new stack trace. I didn't see any obviously related activity in the other threads. I suppose further investigation should focus on the contents of the [...] |
@dan131riley, would it be useful to backport #42194 to 13_0_X? |
That PR is entirely about reducing false positives; it wouldn't help with the HLT crashes. |
Naive question: are there circumstances where the FEDRawDataCollection could get released while the event is still in progress? Currently the on-demand getter holds a reference to the FEDRawDataCollection--should it be keeping a Handle to the FEDRawDataCollection instead? |
@dan131riley it is possible to tell the framework to delete a data product early. See [...]. If FEDRawDataCollection is marked for delete early, one must also specify any data products which reference it (say by holding pointers to it, or even [...]) |
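To illustrate the lifetime question in generic C++ terms (hypothetical types; RawCollection stands in for FEDRawDataCollection, and this is not the framework's actual mechanism): caching a plain reference is only safe while the owner keeps the product alive.

```cpp
#include <iostream>
#include <memory>
#include <vector>

using RawCollection = std::vector<int>;  // stand-in for FEDRawDataCollection

struct OnDemandGetter {
  const RawCollection& raw;  // non-owning: valid only while the owner lives
  explicit OnDemandGetter(const RawCollection& r) : raw(r) {}
  int first() const { return raw.at(0); }  // UB if 'raw' was destroyed
};

int main() {
  auto owner = std::make_unique<RawCollection>(RawCollection{7, 8, 9});
  OnDemandGetter getter(*owner);

  std::cout << getter.first() << '\n';  // fine: the product is still alive

  owner.reset();  // simulate the product being deleted early
  // getter.first();  // would now read through a dangling reference
}
```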
As far as I can see from a recent configuration (attached: hlt.py.gz), HLT does not perform any early deletion. |
Thanks, that all makes sense. I'm having trouble constructing scenarios that could account for the crashes in [...] |
Adding a belated summary of recent online crashes which might be related to this issue. All the runs below are 2023 pp-collision runs after run-369870. The CMSSW release used in these runs was [...]. Legend: run number, [total number of online crashes], number of crashes possibly related to this issue (based on my naive reading of the attached stack traces).
[*] Recipe tested on [...] |
If all these crashes have only appeared since CMSSW_13_0_9, maybe #42033 is related? |
I doubt it, since the first report is from May 28th ([...]). |
Ah OK, thanks for pointing this out. |
This type of crash didn't happen at all in 2024. Should we consider closing this issue? |
In run-367906 (pp collisions), DAQ reported 1 CMSSW crash at HLT (release: CMSSW_13_0_6) [link to HLT elog]. The stack trace is attached (f3mon_run367906.txt). A piece of the stack trace which is possibly relevant is in [1].
The corresponding error-stream files are available, but first attempts to reproduce the crashes offline failed (tried on a "Hilton" HLT node).
The recipe used for those failed attempts is adapted in [2] to be valid for lxplus and lxplus-gpu.
FYI: @cms-sw/hlt-l2 @silviodonato @fwyzard @mzarucki @trtomei
[1]
[2]