Pixel track fitting and SiPixelRecHitFromCUDA crashes during online GPU tests (CMSSW_11_3_4) #34831

mzarucki · 2021-08-10T16:53:18Z

Dear all,

During our online GPU tests in CMSSW_11_3_4 (e.g. run 344555 with Pixel, ECAL and HCAL in global, and with 2 GPU + 2 CPU FUs in the DAQ configuration) with a GPU menu that includes pixel reconstruction (CMSHLT-2157, [1]), we saw the following crashes (e-log):

cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_3_4-slc7_amd64_gcc900/build/CMSSW_11_3_4-build/tmp/BUILDROOT/58d469321d6ecb0427dd1e9f6c3703d5/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/src/RecoPixelVertexing/PixelTrackFitting/interface/BrokenLine.h:381: 
void brokenline::circleFit(const M3xN&, const M6xN&, const V4&, double, brokenline::PreparedBrokenLineData&, brokenline::karimaki_circle_fit&) 
[with M3xN = Eigen::Map, 0, Eigen::Stride<73728, 24576> >; M6xN = Eigen::Map, 0, Eigen::Stride<147456, 24576> >; V4 = Eigen::Map, 0, Eigen::InnerStride<24576> >; 
int n = 4; brokenline::karimaki_circle_fit = riemannFit::CircleFit]: 
Assertion `circle_results.qCharge * circle_results.par(1) <= 0' failed.


A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

Tue Aug 10 17:16:19 CEST 2021
Thread 8 (Thread 0x7fac784f7700 (LWP 230802)):
#0 0x00007faccefa8e2d in nanosleep () from /lib64/libc.so.6
#1 0x00007faccefd9704 in usleep () from /lib64/libc.so.6
#2 0x00007facc97c15ea in FedRawDataInputSource::readSupervisor() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libEventFilterUtilities.so
#3 0x00007faccf8b8af0 in std::execute_native_thread_routine (__p=0x7faca7152620) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4 0x00007faccf2b8dd5 in start_thread () from /lib64/libpthread.so.0
[...]
Thread 7 (Thread 0x7fac84bff700 (LWP 230774)):
#0 0x00007faccefa8e2d in nanosleep () from /lib64/libc.so.6
#1 0x00007faccefa8cc4 in sleep () from /lib64/libc.so.6
#2 0x00007facca071720 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x00007faca3a4052d in HFPreRecAlgo::reconstruct(QIE10DataFrame const&, int, HcalCoder const&, HcalChannelProperties const&) const () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libRecoLocalCaloHcalRecAlgos.so
#5 0x00007faca3071fef in HFPreReconstructor::fillInfos(edm::Event const&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoLocalCaloHcalRecProducers.so
#6 0x00007faca3073acd in HFPreReconstructor::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoLocalCaloHcalRecProducers.so
#7 0x00007facd1ba2b8c in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#8 0x00007facd1b7e4bd in edm::WorkerT::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#9 0x00007facd1adca05 in decltype ({parm#1}()) edm::convertException::wrap >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#10 0x00007facd1adcbbd in bool edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#11 0x00007facd1adcec6 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00007facd1ade7c6 in edm::Worker::RunModuleTask >::execute() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
Thread 6 (Thread 0x7fac85dfe700 (LWP 230770)):
#0 0x00007faccefa8e2d in nanosleep () from /lib64/libc.so.6
#1 0x00007faccefa8cc4 in sleep () from /lib64/libc.so.6
#2 0x00007facca071720 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x00007faca3e5d356 in Eigen::internal::copy_using_evaluator_innervec_CompleteUnrolling >, Eigen::internal::evaluator >, Eigen::internal::assign_op, 0>, 0, 100>::run(Eigen::internal::generic_dense_assignment_kernel >, Eigen::internal::evaluator >, Eigen::internal::assign_op, 0>&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libRecoLocalCaloEcalRecAlgos.so
#5 0x00007faca3e52031 in PulseChiSqSNNLS::updateCov(Eigen::Matrix const&, Eigen::Matrix const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libRecoLocalCaloEcalRecAlgos.so
#6 0x00007faca3e54098 in PulseChiSqSNNLS::Minimize(Eigen::Matrix const&, Eigen::Matrix const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libRecoLocalCaloEcalRecAlgos.so
#7 0x00007faca3e54712 in PulseChiSqSNNLS::DoFit(Eigen::Matrix const&, Eigen::Matrix const&, Eigen::Matrix const&, Eigen::Matrix const&, Eigen::Matrix const&, Eigen::Matrix const&, Eigen::Matrix const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libRecoLocalCaloEcalRecAlgos.so
#8 0x00007faca3e4febc in EcalUncalibRecHitMultiFitAlgo::makeRecHit(EcalDataFrame const&, EcalPedestal const*, EcalMGPAGainRatio const*, std::array, 3ul> const&, Eigen::Matrix const&, Eigen::Matrix const&, Eigen::Matrix const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libRecoLocalCaloEcalRecAlgos.so
#9 0x00007faca4166a15 in EcalUncalibRecHitWorkerMultiFit::run(edm::Event const&, EcalDigiCollection const&, edm::SortedCollection >&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoLocalCaloEcalRecProducersPlugins.so
#10 0x00007faca41394cf in EcalUncalibRecHitProducer::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoLocalCaloEcalRecProducersPlugins.so
#11 0x00007facd1ba2b8c in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00007facd1b7e4bd in edm::WorkerT::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
Thread 5 (Thread 0x7fac867ff700 (LWP 230769)):
#0 0x00007faccefa8e2d in nanosleep () from /lib64/libc.so.6
#1 0x00007faccefa8cc4 in sleep () from /lib64/libc.so.6
#2 0x00007facca071720 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x00007fac8b1f4a89 in CaloTowersCreationAlgo::finish(edm::SortedCollection >&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoLocalCaloCaloTowersCreator.so
#5 0x00007fac8b1fe1a0 in CaloTowersCreator::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoLocalCaloCaloTowersCreator.so
#6 0x00007facd1ba2b8c in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#7 0x00007facd1b7e4bd in edm::WorkerT::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#8 0x00007facd1adca05 in decltype ({parm#1}()) edm::convertException::wrap >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#9 0x00007facd1adcbbd in bool edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#10 0x00007facd1adcec6 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#11 0x00007facd1ade7c6 in edm::Worker::RunModuleTask >::execute() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00007facd1d1fb25 in tbb::internal::function_task::execute() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreConcurrency.so
Thread 4 (Thread 0x7fac874f8700 (LWP 230762)):
#0 0x00007faccefa8e2d in nanosleep () from /lib64/libc.so.6
#1 0x00007faccefa8cc4 in sleep () from /lib64/libc.so.6
#2 0x00007facc97a957f in evf::FastMonitoringService::snapshotRunner() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libEventFilterUtilities.so
#3 0x00007faccf8b8af0 in std::execute_native_thread_routine (__p=0x7fac89e6c800) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4 0x00007faccf2b8dd5 in start_thread () from /lib64/libpthread.so.0
[...]
Thread 3 (Thread 0x7fac8ddff700 (LWP 230719)):
#0 0x00007faccf2bc965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007faccf8b385c in __gthread_cond_wait (__mutex=, __cond=) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_1_0_pre6-slc7_amd64_gcc900/build/CMSSW_11_1_0_pre6-build/BUILD/slc7_amd64_gcc900/external/gcc/9.3.0/gcc-9.3.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:865
#2 std::condition_variable::wait (this=, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:53
#3 0x00007facc97bc589 in FedRawDataInputSource::readWorker(unsigned int) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libEventFilterUtilities.so
#4 0x00007faccf8b8af0 in std::execute_native_thread_routine (__p=0x7faca8050af0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
[...]
Thread 2 (Thread 0x7facb055b700 (LWP 230601)):
#0 0x00007faccf2c0179 in waitpid () from /lib64/libpthread.so.0
#1 0x00007facca0718d7 in edm::service::cmssw_stacktrace_fork() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2 0x00007facca07249a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3 0x00007faccf8b8af0 in std::execute_native_thread_routine (__p=0x7facb08c6bf0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4 0x00007faccf2b8dd5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007faccefe1ead in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7faccd5d6540 (LWP 228946)):
#0 0x00007faccefd720d in poll () from /lib64/libc.so.6
#1 0x00007facca071cd7 in full_read.constprop () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2 0x00007facca07256c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3 0x00007facca073922 in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007faccef1a207 in raise () from /lib64/libc.so.6
#6 0x00007faccef1b8f8 in abort () from /lib64/libc.so.6
#7 0x00007faccef13026 in __assert_fail_base () from /lib64/libc.so.6
#8 0x00007faccef130d2 in __assert_fail () from /lib64/libc.so.6
#9 0x00007fac87a1f48e in void brokenline::circleFit, 0, Eigen::Stride<73728, 24576> >, Eigen::Map, 0, Eigen::Stride<147456, 24576> >, Eigen::Map, 0, Eigen::InnerStride<24576> >, 4>(Eigen::Map, 0, Eigen::Stride<73728, 24576> > const&, Eigen::Map, 0, Eigen::Stride<147456, 24576> > const&, Eigen::Map, 0, Eigen::InnerStride<24576> > const&, double, brokenline::PreparedBrokenLineData<4>&, riemannFit::CircleFit&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#10 0x00007fac87a153bc in HelixFitOnGPU::launchBrokenLineKernelsOnCPU(TrackingRecHit2DSOAView const*, unsigned int, unsigned int) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#11 0x00007fac87a32635 in CAHitNtupletGeneratorOnGPU::makeTuples(TrackingRecHit2DHeterogeneous const&, float) const () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#12 0x00007fac87a21c64 in CAHitNtupletCUDA::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#13 0x00007facd1b88e57 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so

We also saw these errors late (after 47 LS) into run 344558 (2 GPU + 2 CPU FUs) and relatively early into run 344560 (standard DAQ configuration with all FUs in).

What is rather strange is that these crashes were seen on standard CPU FUs (e.g. fu-c2a02-27-02, fu-c2a02-45-02, fu-c2a01-07-03).

We did not see any crashes during our previous test (e-log) in CMSSW_11_3_2 (run 343991). We also did not see any issues when testing on Hilton in CMSSW_11_3_4.

Best regards,
Mateusz on behalf of TSG FOG

[1] /cdaq/cosmic/commissioning2021/CRUZET/Cosmics_GPU/V2


cmsbuild · 2021-08-10T16:53:34Z

A new Issue was created by @mzarucki Mateusz Zarucki.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

fwyzard · 2021-08-10T17:22:24Z

Did any of the crashes happen running with GPUs, or did they all occur without any GPUs involved?

Is the data from the corresponding runs, lumisections and possible events available?

Could you prepare some instructions for reproducing the problem?

Dr15Jones · 2021-08-10T17:22:34Z

assign reconstruction, heterogeneous

cmsbuild · 2021-08-10T17:22:41Z

New categories assigned: heterogeneous,reconstruction

@slava77,@fwyzard,@perrotta,@makortel,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

fwyzard · 2021-08-10T17:32:30Z

The first ELOG reports that

cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_3_4-slc7_amd64_gcc900/build/CMSSW_11_3_4-build/tmp/BUILDROOT/58d469321d6ecb0427dd1e9f6c3703d5/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/src/RecoPixelVertexing/PixelTrackFitting/interface/BrokenLine.h:381: void brokenline::circleFit(const M3xN&, const M6xN&, const V4&, double, brokenline::PreparedBrokenLineData&, brokenline::karimaki_circle_fit&) [with M3xN = Eigen::Map, 0, Eigen::Stride<73728, 24576> >; M6xN = Eigen::Map, 0, Eigen::Stride<147456, 24576> >; V4 = Eigen::Map, 0, Eigen::InnerStride<24576> >; int n = 4; brokenline::karimaki_circle_fit = riemannFit::CircleFit]: Assertion `circle_results.qCharge * circle_results.par(1) <= 0' failed. 

A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

...

#9 0x00007fac87a1f48e in void brokenline::circleFit, 0, Eigen::Stride<73728, 24576> >, Eigen::Map, 0, Eigen::Stride<147456, 24576> >, Eigen::Map, 0, Eigen::InnerStride<24576> >, 4>(Eigen::Map, 0, Eigen::Stride<73728, 24576> > const&, Eigen::Map, 0, Eigen::Stride<147456, 24576> > const&, Eigen::Map, 0, Eigen::InnerStride<24576> > const&, double, brokenline::PreparedBrokenLineData<4>&, riemannFit::CircleFit&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#10 0x00007fac87a153bc in HelixFitOnGPU::launchBrokenLineKernelsOnCPU(TrackingRecHit2DSOAView const*, unsigned int, unsigned int) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#11 0x00007fac87a32635 in CAHitNtupletGeneratorOnGPU::makeTuples(TrackingRecHit2DHeterogeneous const&, float) const () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#12 0x00007fac87a21c64 in CAHitNtupletCUDA::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#13 0x00007facd1b88e57 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so



The FUs that crashed with these errors are:


fu-c2a02-27-02 - ERRORrun - RUN:344555 - process 228942 on resource(s) ['core11', 'core6', 'core47', 'core0'] exited with signal -6, retries left: 1 - 2021-08-10 17:24:22
fu-c2a02-35-02 - ERRORrun - RUN:344555 - process 99268 on resource(s) ['core13', 'core33', 'core25', 'core16'] exited with signal -6, retries left: 1 - 2021-08-10 17:23:33
fu-c2a02-45-02 - ERRORrun - RUN:344555 - process 453875 on resource(s) ['core59', 'core8', 'core46', 'core52'] exited with signal -6, retries left: 1 - 2021-08-10 17:23:02
fu-c2a02-27-02 - ERRORrun - RUN:344555 - process 228946 on resource(s) ['core43', 'core7', 'core35', 'core33'] exited with signal -6, retries left: 1 - 2021-08-10 17:16:52

The call to HelixFitOnGPU::launchBrokenLineKernelsOnCPU should happen only on the nodes without a GPU (unless the menu tries to run both CPU and GPU versions?). Could you confirm that the error is always the same on each of these nodes:

fu-c2a02-27-02: no GPUs, two crashes
fu-c2a02-35-02: has a GPU
fu-c2a02-45-02: no GPUs
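
To help answer that question, the crash lines can be tallied per host. The following is only a sketch: the regular expression is inferred from the four lines quoted above (not from any F3Mon specification), the trailing "retries left" and timestamp fields are omitted, and the GPU flags are copied from the list above. The names `PATTERN`, `HAS_GPU` and `crashes_per_host` are hypothetical.

```python
import re
from collections import Counter

# Crash lines as quoted above, truncated after the signal number.
LOG = """\
fu-c2a02-27-02 - ERRORrun - RUN:344555 - process 228942 on resource(s) ['core11', 'core6', 'core47', 'core0'] exited with signal -6
fu-c2a02-35-02 - ERRORrun - RUN:344555 - process 99268 on resource(s) ['core13', 'core33', 'core25', 'core16'] exited with signal -6
fu-c2a02-45-02 - ERRORrun - RUN:344555 - process 453875 on resource(s) ['core59', 'core8', 'core46', 'core52'] exited with signal -6
fu-c2a02-27-02 - ERRORrun - RUN:344555 - process 228946 on resource(s) ['core43', 'core7', 'core35', 'core33'] exited with signal -6
"""

# GPU availability per node, as listed in this comment.
HAS_GPU = {
    "fu-c2a02-27-02": False,
    "fu-c2a02-35-02": True,
    "fu-c2a02-45-02": False,
}

# Assumed line layout: "<host> - ERRORrun - RUN:<run> - process <pid> ... exited with signal <n>"
PATTERN = re.compile(
    r"^(?P<host>\S+) - ERRORrun - RUN:(?P<run>\d+) - process (?P<pid>\d+) "
    r".*exited with signal (?P<signal>-?\d+)"
)


def crashes_per_host(text):
    """Count crash lines per host name; lines that do not match are ignored."""
    counts = Counter()
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


counts = crashes_per_host(LOG)
for host in sorted(counts):
    kind = "GPU" if HAS_GPU[host] else "CPU-only"
    print(f"{host}: {counts[host]} crash(es), {kind} node")
```

Grouping by host this way only shows where the processes died; whether the assertion was the same in each case still has to be checked in the individual logs.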

To investigate the crash, as usual we would need the full instructions to reproduce it:

release CMSSW_11_3_4
what HLT menu? even better if you could provide a full python dump, including the DAQ customisations
what input data? did the crashes cause the data to be sent to the error stream? is it available?

mzarucki · 2021-08-10T17:57:25Z

Hi @fwyzard,

Did any of the crashes happen running with GPUs, or did they all occur without any GPUs involved?

Correction: the Pixel track fitting crashes occurred on CPUs only (fu-c2a02-27-02, fu-c2a02-45-02, fu-c2a01-07-03).

The call to HelixFitOnGPU::launchBrokenLineKernelsOnCPU should happen only on the nodes without a GPU (unless the menu tries to run both CPU and GPU versions?).

There might be an issue with the HLT_Pixel path, which was added to run Pixel reconstruction (CMSHLT-2157) but does not filter on anything:

process.HLT_Pixel_v1 = cms.Path( process.HLTBeginSequenceHT + process.hltPrePixelGPU + process.HLTDoLocalPixelSequence + process.HLTRecopixelvertexingSequence + process.HLTEndSequence )
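
For context on why the lack of a filter matters: modules on a path run in order, and the path stops at the first filter that rejects the event. The toy model below is plain Python, not the CMSSW API, and every name in it (`run_path`, `producer`, `reject_all`, the module labels) is hypothetical. It only illustrates that a path with no rejecting filter runs its reconstruction steps on every event it sees.

```python
def run_path(modules, event):
    """Run modules in order; stop at the first filter that rejects the event.

    Returns (accepted, names_of_modules_that_ran).
    """
    executed = []
    for name, func in modules:
        executed.append(name)
        # Producers return None; filters return True or False.
        if func(event) is False:
            return False, executed
    return True, executed


def producer(event):
    """Stand-in for a reconstruction step; it never rejects an event."""
    return None


def reject_all(event):
    """Stand-in for a filter that rejects every event."""
    return False


# A path made only of producers, and the same path behind a rejecting filter.
unfiltered = [("localPixel", producer), ("pixelVertexing", producer)]
filtered = [("someFilter", reject_all)] + unfiltered

events = range(5)
n_reco_unfiltered = sum("pixelVertexing" in run_path(unfiltered, e)[1] for e in events)
n_reco_filtered = sum("pixelVertexing" in run_path(filtered, e)[1] for e in events)
print(n_reco_unfiltered, n_reco_filtered)  # 5 0
```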

what HLT menu? even better if you could provide a full python dump, including the DAQ customisations

The HLT menu name is listed under [1] above. I will attach the full python config to this ticket.

Is the data from the corresponding runs, lumisections and possible events available?

what input data? did the crashes cause the data to be sent to the error stream? is it available?

Could you prepare some instructions for reproducing the problem?

We will work on a recipe to reproduce the errors, over the same input data (we contacted our DAQ colleagues to save it locally).

We will keep you updated.

Best,
Mateusz

mzarucki · 2021-08-10T20:34:11Z

The menu /cdaq/cosmic/commissioning2021/CRUZET/Cosmics_GPU/V2 python config file is attached here as CosmicsGPU-V2.txt. You can also find it under /nfshome0/mzarucki/GPUTests/CosmicsGPU-V2.py.

Note: The only difference with respect to V1 (and CosmicsGPUFixed-V1.py) is the change in the GT configuration, relevant to the BeamSpot workflow updates.

Best,
Mateusz

mzarucki · 2021-08-11T00:46:57Z

Hi again,

The .raw data could not be copied locally (it seems this is only possible during the run), and the repacked RAW .root files would need a bit of time to become accessible via Rucio.

Nevertheless, I tried to recreate the crashes using an older set of data from run 343387 (stored on EOS via our Rucio rule):
/store/data/Commissioning2021/Cosmics/RAW/v1/000/343/387/00000/05c79a06-0011-4a30-8762-02b7708ecaaa.root

with the following simplified recipe (which includes the input file config) that is to be run on a GPU machine (see TriggerDevelopmentWithGPUs):

cmsrel CMSSW_11_3_4
cd CMSSW_11_3_4/src
cmsenv
cp /nfshome0/mzarucki/GPUTests/CosmicsGPU-V2_modified.py .
cmsRun CosmicsGPU-V2_modified.py

Note: Here the number of threads/streams is unset.

After running for a while, I do see crashes, but of a different nature:

cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_3_4-slc7_amd64_gcc900/build/CMSSW_11_3_4-build/tmp/BUILDROOT/58d469321d6ecb0427dd1e9f6c3703d5/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/src/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc:143: 
virtual void SiPixelRecHitFromCUDA::produce(edm::Event&, const edm::EventSetup&): 
Assertion `nhits <= dsv.size()' failed.


A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

...

Module: SiPixelRecHitFromCUDA:hltSiPixelRecHits@cuda (crashed)

Revisiting F3Mon, I do see this crash on the GPU nodes (added e-log). The initial report, which described only one type of crash, was therefore incorrect. Apologies for overlooking this.

To summarise, we saw this SiPixelRecHitFromCUDA crash only on the GPU nodes and the initial Pixel track fitting crash on the CPU nodes. I have updated the above responses to avoid confusion.

Best,
Mateusz

fwyzard · 2021-08-11T06:19:43Z

As far as I know, the pixel track reconstruction has not been designed to run at 0 tesla.

My first suggestion would be to switch it off, and turn it on only once the magnetic field is nominal.

fwyzard · 2021-08-11T06:24:20Z

@mtosi @vmariani @mmusich FYI

fwyzard · 2021-08-11T08:17:18Z

After copying the input file locally¹ and updating the configuration to use it, the GPU error is consistently reproducible:

cmsRun: .../src/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc:143:
virtual void SiPixelRecHitFromCUDA::produce(edm::Event&, const edm::EventSetup&): Assertion `nhits <= dsv.size()' failed.

Instead, I have not been able to reproduce the CPU-only crash.

¹ Copying the input file locally:
kinit ${USER}@CERN.CH
xrdcp root://eoscms.cern.ch//eos/cms/store/data/Commissioning2021/Cosmics/RAW/v1/000/343/387/00000/05c79a06-0011-4a30-8762-02b7708ecaaa.root .

mzarucki · 2021-08-12T19:27:40Z

Dear all,

Today we repeated the test of the HLT GPU menu (e-log), with the intent of reproducing the above crashes. During the test, in run 344676, we saw only the GPU crash (e-log). Considering that the GPU menu is exactly the same as the cosmic menu in its content, the decision was made to keep running with the cosmics GPU menu. Roughly an hour into run 344679, after the formal GPU tests, we saw the CPU crash (e-log) that we were trying to reproduce.

Since DAQ enabled the error stream this morning, we were able to access the data from the crashes. The .raw files from the GPU crash (recipe above) have been saved in /nfshome0/mzarucki/GPUTests/PixelCrashGPU, whereas the data from the CPU crash have been saved as /nfshome0/mzarucki/GPUTests/PixelCrashCPU/run344679_ls0151_index000010_fu-c2a02-33-02_pid48404.raw.

Here is a recipe to reproduce the CPU crash (which can be done on a GPU node):

export CUDA_VISIBLE_DEVICES=
cmsrel CMSSW_11_3_4
cd CMSSW_11_3_4/src
cmsenv
cp /nfshome0/mzarucki/GPUTests/PixelCrashCPU/CosmicsGPU-V2_modified.py .
cmsRun CosmicsGPU-V2_modified.py

where the HLT config file has been modified to take the error stream data as input.
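
The first line of the recipe is what makes this work on a GPU node: with CUDA_VISIBLE_DEVICES set to an empty string, the CUDA runtime sees no devices, so the job takes the CPU code path. A minimal sketch of doing the same from a Python wrapper follows, assuming one wants to launch the job from a script. The helper name `cpu_only_env` is hypothetical, and the child command is a stand-in that only echoes the variable it receives.

```python
import os
import subprocess
import sys


def cpu_only_env(base=None):
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base is None else base)
    # Same effect as `export CUDA_VISIBLE_DEVICES=` in the shell recipe.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


# Stand-in child process: prints the value of the variable as it sees it.
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(repr(os.environ.get('CUDA_VISIBLE_DEVICES')))"],
    env=cpu_only_env(),
    capture_output=True,
    text=True,
    check=True,
)
seen = child.stdout.strip()
print(seen)
```

Inside a CMSSW environment, replacing the stand-in command with `["cmsRun", "CosmicsGPU-V2_modified.py"]` would mirror the recipe above; the parent shell's own environment is left untouched.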

Best regards,
Mateusz on behalf of FOG

fwyzard · 2021-08-12T19:56:28Z

Note that the two crashes seem unrelated:

running over the events from run 344679 crashes the CPU reconstruction, but seems to work fine with the GPU reconstruction (though it could just be that we do not enable assertions on the GPU)
running over the events from run 344676 crashes on the GPU (actually, in the module that copies the data from the GPU to the CPU) but works fine on the CPU

jlawrenc · 2021-08-13T07:20:35Z

As mentioned in this elog: http://cmsonline.cern.ch/cms-elog/1122928, we attempted to reproduce the error during run 344675, but nothing occurred during the 24-minute run. When we included DT in the next run (344676), the error (GPU error only) occurred within 11 minutes.

slava77 · 2021-08-13T12:11:34Z

@cms-sw/trk-dpg-l2
please take a note (I'm not sure if you received this already)

@fwyzard
do you know, perhaps, if someone is looking/debugging this already?

fwyzard · 2021-08-13T14:03:50Z

@fwyzard
do you know, perhaps, if someone is looking/debugging this already?

No, not that I know of.
I've just started looking at them myself.

fwyzard · 2021-08-13T15:16:53Z

The error from run 344676 looks like a real mismatch between the legacy and GPU pixel cluster / rechit reconstruction.

I think we need @mmusich or @VinInn to have a look.

slava77 · 2021-09-03T19:19:55Z

The case with "Assertion `circle_results.qCharge * circle_results.par(1) <= 0' failed." is apparently "fixed" (the assert is removed) in #35128.

Will this need to be backported to 11_3_X and 12_0_X, or is 12_0_X enough?

fwyzard · 2021-09-03T19:57:42Z

12_0_X is enough

makortel · 2021-10-14T13:10:29Z

This one has been fixed, right?

VinInn · 2021-10-14T13:26:02Z

"fixed" yes.

makortel · 2021-10-14T13:29:53Z

+heterogneous

jpata · 2022-05-17T07:45:02Z

+reconstruction

addressed in Fix for SiPixelRecHitFromCUDA crash during online GPU tests #35229
backported to 12_0_X: [Backport 12_0_X] Fix for SiPixelRecHitFromCUDA crash during online GPU tests #35317

jpata · 2022-05-17T07:49:50Z

@cmsbuild please close

(typo in the heterogeneous comment prevented a full sig)

fwyzard · 2022-05-17T09:26:08Z

+heterogeneous

Just for the record... we should have picked a word easier to spell :-/

cmsbuild · 2022-05-17T09:26:22Z

This issue is fully signed and ready to be closed.

cmsbuild added the pending-assignment label Aug 10, 2021

cmsbuild added heterogeneous-pending pending-signatures reconstruction-pending and removed pending-assignment labels Aug 10, 2021

mzarucki changed the title from "Pixel track fitting crash during online GPU tests (CMSSW_11_3_4)" to "Pixel track fitting and SiPixelRecHitFromCUDA crashes during online GPU tests (CMSSW_11_3_4)" Aug 11, 2021

VinInn mentioned this issue Sep 2, 2021

improve math in broken-line fit #35128

Merged

This was referenced Sep 10, 2021

Fix for SiPixelRecHitFromCUDA crash during online GPU tests #35229

Merged

[Backport 12_0_X] Fix for SiPixelRecHitFromCUDA crash during online GPU tests #35317

Merged

cmsbuild added reconstruction-approved and removed reconstruction-pending labels May 17, 2022

cmsbuild closed this as completed May 17, 2022

cmsbuild added fully-signed heterogeneous-approved and removed pending-signatures heterogeneous-pending labels May 17, 2022
