HLT crashes in GPU and CPU in collision runs #38453
Comments
A new Issue was created by @swagata87 Swagata Mukherjee. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign hlt, reconstruction |
New categories assigned: hlt,reconstruction @jpata,@missirol,@clacaputo,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@swagata87 could you provide the full stack traces for the job that failed with the segmentation violations? |
Three examples are pasted below:
The full list is here:
|
Dear tracker DPG (@cms-sw/trk-dpg-l2), general instructions to set up a CMSSW area on the GPU nodes online are here. The HLT configuration file is: https://swmukher.web.cern.ch/swmukher/hlt_v5.py. I have copied one, in case it is useful. Then, at the end, the following block was added:
Let me know if something was unclear. |
@swagata87 thank you for providing these instructions! @tsusa you can use the online GPU machines to reproduce the issue:
ssh gpu-c2a02-39-01.cms
mkdir -p /data/$USER
cd /data/$USER
source /data/cmssw/cmsset_default.sh
cmsrel CMSSW_12_3_5
cd CMSSW_12_3_5
mkdir run
cd run
cp ~hltpro/error/hlt_error_run353941.py .
cmsRun hlt_error_run353941.py
In my test the problem did not happen every time; I had to run the job a few times before it crashed:
It eventually crashed, though I'm not 100% sure if it was due to the same problem :-/ |
Yes, looks like the same crash:
|
As a guess, I think the problem is that an extremely large amount of data is being requested to be copied, which leads to a memory overwrite into a protected memory space. This is just based on what edm::Event::emplaceImpl is doing, which is basically calling cmssw/DataFormats/SiPixelRawData/interface/SiPixelErrorsSoA.h, lines 13 to 14 at 6d2f660
|
So cms::cuda::SimpleVector does not initialize any of its member data in its constructor.
If the first call to SiPixelDigiErrorsSoAFromCUDA::acquire hits this condition (cmssw/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc, lines 54 to 55 at d573dd2),
then this call in produce
will just copy a random number of bytes from a random memory address. |
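As an editorial illustration of the failure mode described above, here is a hedged sketch using simplified stand-in types (ErrorVectorSketch and copyErrorsOut are invented for this example; they are not the real cms::cuda::SimpleVector or the plugin code):

// Simplified stand-in, not the CMSSW classes: a default-constructed object whose
// constructor leaves its members untouched must not be trusted before it has been
// explicitly (re)initialized.
#include <cstdio>
#include <cstring>
#include <vector>

template <typename T>
struct ErrorVectorSketch {
  ErrorVectorSketch() = default;  // members stay indeterminate, as described above
  int m_size;
  int m_capacity;
  T* m_data;
  int size() const { return m_size; }
  T const* data() const { return m_data; }
};

// Stand-in for the copy done in produce(): it trusts size() and data() blindly.
template <typename T>
void copyErrorsOut(ErrorVectorSketch<T> const& errors, std::vector<T>& out) {
  out.resize(errors.size());                                          // random size if uninitialized
  std::memcpy(out.data(), errors.data(), errors.size() * sizeof(T));  // random source address
}

int main() {
  ErrorVectorSketch<unsigned int> error_;  // never filled, e.g. because acquire() bailed out early
  std::vector<unsigned int> host;
  // copyErrorsOut(error_, host);  // undefined behavior, left commented out on purpose
  std::printf("an uninitialized error_ would make the copy above read garbage\n");
  return 0;
}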
@Dr15Jones thanks for investigating the issue.
This is intended, because a
A minimal fix could be:
diff --git a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc b/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
index 4037b4d5061..554f1425cef 100644
--- a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
+++ b/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
@@ -28,7 +28,7 @@ private:
edm::EDPutTokenT<SiPixelErrorsSoA> digiErrorPutToken_;
cms::cuda::host::unique_ptr<SiPixelErrorCompact[]> data_;
- cms::cuda::SimpleVector<SiPixelErrorCompact> error_;
+ cms::cuda::SimpleVector<SiPixelErrorCompact> error_ = cms::cuda::make_SimpleVector<SiPixelErrorCompact>(0, nullptr);
const SiPixelFormatterErrors* formatterErrors_ = nullptr;
};
With it I have been able to run over 20 times on the same input as before without triggering any errors. |
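As a side note on the pattern this fix uses (a sketch under assumptions, with simplified stand-in names, not the actual CMSSW code): a default member initializer built from a make_*-style factory gives the member a well-defined empty state, so a consumer that copies size() elements copies nothing when acquire() returned early.

#include <cassert>
#include <cstddef>

// Simplified stand-ins, not the real cms::cuda types.
template <typename T>
struct ErrorVectorSketch {
  int m_size;
  int m_capacity;
  T* m_data;
  constexpr int size() const { return m_size; }
};

// Analogous in spirit to cms::cuda::make_SimpleVector(capacity, data).
template <typename T>
constexpr ErrorVectorSketch<T> makeErrorVectorSketch(int capacity, T* data) {
  return {0, capacity, data};
}

class ProducerSketch {
  // The pattern of the fix: error_ is empty and non-dangling from construction on,
  // even if no event ever fills it.
  ErrorVectorSketch<unsigned int> error_ = makeErrorVectorSketch<unsigned int>(0, nullptr);

public:
  std::size_t bytesToCopy() const { return error_.size() * sizeof(unsigned int); }
};

int main() {
  ProducerSketch p;
  assert(p.bytesToCopy() == 0);  // a later copy would transfer zero bytes, which is harmless
  return 0;
}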
PRs with this fix: |
Hm, looks like I am late to the party... but, if it's any help, here are instructions for the error seen in Run 353744 (AFAICT you have been testing with Run 353941). Running in Hilton this time:
I also see the same problem, it crashes only every once in a while. It's probably the same bug, but I add it here for completeness. |
I also have here the other crash, this one is fully reproducible:
It will always crash on the 52nd event,
PS: it's not needed to run on Hilton at all, I was running in offline-like mode. |
@trtomei could you clarify
Running online, I have not been able to reproduce the error using the |
@fwyzard To clarify:
Maybe sit together with me tomorrow and we solve this. |
Is this issue still relevant? |
Actually, yesterday we had a crash which looks like the one discussed here.
Run number: 360224
|
The files in ROOT format and the HLT configuration are in:
|
@cms-sw/tracking-pog-l2 In this issue, one HLT crash is not yet solved, and I would say we need help from tracking experts in order to find a fix. The crash is reproducible offline (see #38453 (comment)); it comes from the (HLT) pixel reconstruction, and it only happens on CPU, not on GPU (from what we have seen so far). Removing some |
I have a vague recollection of a comment from @VinInn saying that we should simply remove the assert. I think now it's OK to have ntuplets with 5 hits, so an alternative could be to change the condition to allow up to 5 hits. |
At least, removing the assert [1] avoids the crash. And just for my understanding: is it expected that, for the same event, we do not see an ntuplet with size=5 on GPU? [1] https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_10_patch2/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h#L293 |
It does not happen on GPU because asserts are removed. |
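For context, this is standard C++ behavior rather than anything CMSSW-specific: assert() expands to nothing when NDEBUG is defined, so (assuming the GPU code path is built that way, as the comment above suggests) only the build with active asserts aborts. A minimal sketch:

// Compile once with `g++ check.cc` and once with `g++ -DNDEBUG check.cc` to see the
// same out-of-range value abort in one case and pass silently in the other.
#include <cassert>
#include <cstdio>

int main() {
  int ntupletSize = 5;  // hypothetical value violating the old bound of 4
#ifdef NDEBUG
  std::printf("NDEBUG defined: assert() is compiled out, execution continues\n");
#else
  std::printf("NDEBUG not defined: the next line aborts the program\n");
#endif
  assert(ntupletSize <= 4);
  std::printf("reached only when the assert is disabled (or the condition holds)\n");
  return 0;
}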
Okay, thanks, but still: I tried to just print the ntuplet size while running on GPU, and I didn't see a size=5. |
Thanks for having a look. I checked that (unsurprisingly) the HLT runs fine on these 'error events', for both CPU and GPU, after changing the 4 to a 5 in the asserts, so in the meantime I'll open PRs with that change to gain time. |
@cms-sw/hlt-l2 (now speaking with the ORM hat, in order to better coordinate the creation of the next patch releases):
|
Yes, that is my understanding.
There are two more issues, but those crashes have been rare: #39568 , which ECAL has promised to look into, and #38651, which might somehow have been a glitch (seen only once). FOG (@trtomei) can tell us if there are any new online crashes without a CMSSW issue. |
+hlt |
So the tuplet in question is joining layer-pairs 0,3,10,7,12, i.e. all 6 layers: BPIX1,2,3 and FPIX1,2,3. How can I run hlt_for_debug.py on GPU and NOT on CPU?
Anyhow, if we "observe" sextuplets we need to allow sextuplets in the code... so the fix of the asserts is OK (the arrays were already over-dimensioned). |
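To make the "over-dimensioned arrays" remark concrete, here is a hedged sketch with a simplified stand-in (FixedVecSketch is not the real container holding tmpNtuplet, and the capacity of 6 is only an assumption for illustration): the asserts guard a logical bound, so raising it from 4 to 5 is safe as long as it stays within the fixed physical capacity.

#include <cassert>

// Simplified fixed-capacity container in the spirit of the one used for tmpNtuplet.
template <typename T, int MAXSIZE>
struct FixedVecSketch {
  T m_data[MAXSIZE];
  int m_size = 0;
  int push_back_unsafe(T const& value) {
    m_data[m_size] = value;  // no bounds check, mirroring push_back_unsafe in the real code
    return m_size++;
  }
  int size() const { return m_size; }
  static constexpr int capacity() { return MAXSIZE; }
};

int main() {
  using Ntuplet = FixedVecSketch<unsigned int, 6>;  // assumed capacity, for illustration only
  static_assert(Ntuplet::capacity() >= 5, "relaxed bound still fits in the storage");
  Ntuplet tmpNtuplet;
  for (unsigned int doubletId = 0; doubletId < 5; ++doubletId)
    tmpNtuplet.push_back_unsafe(doubletId);
  assert(tmpNtuplet.size() <= 5);  // the relaxed logical check
  return 0;
}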
The sextuplet is on GPU as well |
In case you are interested, here are the coordinates of the hits:
|
Looks like this was already solved. I add one comment for documentation purposes. The complication comes from the fact that the HLT menu includes 2 prescaled triggers that run the pixel CPU-only reco (which is why we saw the crash online). To ensure that only the pixel GPU reco is running, one solution is to remove them, but that's tricky to do starting from the full menu [1]; alternatively, one can just run 1 appropriate Path instead of the full menu (most times, this is enough for a reproducer) [2]. In the future, we/HLT should maybe try to build 'minimal' reproducers, e.g. not using the full menu if that's not needed.
[1] Add at the end of the configuration:
del process.DQM_PixelReconstruction_v4
del process.AlCa_PFJet40_CPUOnly_v1
del process.HLT_PFJet40_GPUvsCPU_v1
process.hltMuonTriggerResultsFilter.triggerConditions = ['FALSE']
del process.PrescaleService
del process.DQMHistograms
dpaths = [foo for foo in process.paths_() if foo.startswith('Dataset_')]
for foo in dpaths: process.__delattr__(foo)
fpaths = [foo for foo in process.finalpaths_()]
for foo in fpaths: process.__delattr__(foo)
[2] In this case, it could have been:
hltGetConfiguration run:360224 \
--data \
--no-prescale \
--no-output \
--globaltag 124X_dataRun3_HLT_v4 \
--paths AlCa_PFJet40_v* \
--max-events -1 \
--input file:run360224_ls0081_file1.root \
> hlt.py
cat <<@EOF >> hlt.py
process.MessageLogger.cerr.FwkReport.limit = 1000
process.options.numberOfThreads = 1
process.options.accelerators = ['cpu']
@EOF
cmsRun hlt.py &> hlt.log |
If it's not too much trouble to explain, I would be interested to know how to extract the information on layer pairs and r-z coordinates for a given candidate. I see the crash at |
|
I saw the printout twice, so I added the ifdef part. |
btw the method |
Thanks a lot for the info. |
(This issue is solved; the rest below is just me trying to learn things.) With Vincenzo's diff, I get what he wrote: same sextuplet on CPU and GPU. In my previous attempts, I had additional printouts in GPUCACell::find_ntuplets. At least now I see what I was doing differently.
[*] (yes, most of these printouts are pointless)
diff --git a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
index 4ec7069ac8e..bfefdf7ccd6 100644
--- a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
+++ b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
@@ -290,15 +290,37 @@ public:
auto doubletId = this - cells;
tmpNtuplet.push_back_unsafe(doubletId);
- assert(tmpNtuplet.size() <= 4);
+ assert(tmpNtuplet.size() <= 5);
+ if (tmpNtuplet.size()>4) {
+#ifdef __CUDACC__
+ printf("GPU ");
+#else
+ printf("CPU ");
+#endif
+ for (auto c : tmpNtuplet) printf("%d,",cells[c].theLayerPairId_);
+ printf(" r/z: ");
+ for (auto c : tmpNtuplet) printf("%f/%f,", cells[c].theInnerR,cells[c].theInnerZ);
+ auto c = tmpNtuplet[tmpNtuplet.size()-1]; printf("%f/%f,",cells[c].outer_r(hh),cells[c].outer_z(hh));
+ printf("\n");
+ }
bool last = true;
for (unsigned int otherCell : outerNeighbors()) {
if (cells[otherCell].isKilled())
continue; // killed by earlyFishbone
last = false;
+#ifdef __CUDACC__
+ printf("GPU1 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#else
+ printf("CPU1 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#endif
cells[otherCell].find_ntuplets<DEPTH - 1>(
hh, cells, cellTracks, foundNtuplets, apc, quality, tmpNtuplet, minHitsPerNtuplet, startAt0);
+#ifdef __CUDACC__
+ printf("GPU2 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#else
+ printf("CPU2 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#endif
}
if (last) { // if long enough save...
if ((unsigned int)(tmpNtuplet.size()) >= minHitsPerNtuplet - 1) {
@@ -331,7 +353,12 @@ public:
}
}
tmpNtuplet.pop_back();
- assert(tmpNtuplet.size() < 4);
+ assert(tmpNtuplet.size() < 5);
+#ifdef __CUDACC__
+ printf("GPU3 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#else
+ printf("CPU3 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#endif
}
// Cell status management |
In my previous attempts, I had additional printouts in GPUCACell::find_ntuplets, and in that case I couldn't see the sextuplet on GPU. I think this is somewhat reproducible: I ran 30 times with this diff [*] and I could see the sextuplet on GPU in the printouts only 2 times (on CPU, I saw it 10 times out of 10).
This is surprising, as we do not expect GPU vs CPU differences at this point of processing.
Will try to investigate more.
|
@missirol
|
btw: printf from GPU is not guaranteed to appear if there are too many. |
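A small aside on this caveat (standard CUDA runtime behavior, not CMSSW-specific): device-side printf writes into a fixed-size FIFO that is only flushed at synchronization points, and output is silently dropped once the buffer is full, so a heavily instrumented kernel can appear to "miss" printouts. A minimal sketch of enlarging that buffer before the first kernel launch:

// chatty.cu -- compile with `nvcc chatty.cu`
#include <cstdio>
#include <cuda_runtime.h>

__global__ void chatty() {
  printf("thread %d says hello\n", threadIdx.x);
}

int main() {
  // Grow the device printf FIFO (the default is about 1 MB) so that large amounts
  // of in-kernel output are less likely to be dropped.
  cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 64ull * 1024 * 1024);

  chatty<<<1, 32>>>();
  // Device printf output is flushed at points like this synchronization.
  cudaDeviceSynchronize();
  return 0;
}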
Sorry for the trouble, then. I tested on
Thanks, I didn't know that; it might explain what I (didn't) see.
[*]
https_proxy=http://cmsproxy.cms:3128 \
hltGetConfiguration run:360224 \
--data \
--no-prescale \
--no-output \
--globaltag 124X_dataRun3_HLT_v4 \
--paths AlCa_PFJet40_v* \
--max-events -1 \
--input file:run360224_ls0081_file1.root \
> hlt.py
cat <<@EOF >> hlt.py
process.MessageLogger.cerr.FwkReport.limit = 1000
process.options.numberOfThreads = 1
#process.options.accelerators = ['cpu']
@EOF
cmsRun hlt.py &> hlt.log |
I think this is indeed the explanation [*]. Case closed, and sorry again for the noise.
[*] I checked this by keeping the large number of printouts, but also adding
#ifdef __CUDACC__
if (tmpNtuplet.size() > 4) {
__trap();
}
#endif
and the program crashed 10/10 times on GPU (running only on the event in question), meaning each time there was a sextuplet on GPU. |
@swagata87 @missirol can this issue be considered concluded, and therefore closed? |
In my understanding, yes (I signed it). Swagata can confirm and close. |
yes, I am closing this issue. Thanks everyone! |
Dear experts,
During the week of June 13-20, the following 3 types of HLT crashes happened in collision runs. HLT was using CMSSW_12_3_5.
type 1
This crash happened on June 13th, during stable beams, collision at 900 GeV. Run number: 353709. The crash happened in a CPU(fu-c2a05-35-01). Elog: http://cmsonline.cern.ch/cms-elog/1143438. Full crash report: https://swmukher.web.cern.ch/swmukher/hltcrash_June13_StableBeam.txt
type 2
This type of crash happened on GPUs (for example, fu-c2a02-35-01). It happened during collision runs when no real collisions were ongoing: on June 14th (run 353744, Pixel subdetector was out) and on June 18th (runs 353932, 353935, 353941, Pixel and tracker subdetectors were out).
type 3
This crash happened on fu-c2a02-39-01 (GPU), in collision run 353941 (Pixel and tracker subdetectors were out); no real collision was ongoing.
The reasons for crashes (2) and (3) might even be related.
Relevant elog on (2) and (3): http://cmsonline.cern.ch/cms-elog/1143515
Regards,
Swagata, as HLT DOC during June 13-20.