Pixel track fitting and SiPixelRecHitFromCUDA crashes during online GPU tests (CMSSW_11_3_4) #34831

Closed
mzarucki opened this issue Aug 10, 2021 · 26 comments

Comments

@mzarucki
Contributor

mzarucki commented Aug 10, 2021

Dear all,

During our online GPU tests in CMSSW_11_3_4 (e.g. run 344555 with Pixel, ECAL and HCAL in global, and with 2 GPU + 2 CPU FUs in the DAQ configuration) with a GPU menu that includes pixel reconstruction (CMSHLT-2157, [1]), we saw the following crashes (e-log):

cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_3_4-slc7_amd64_gcc900/build/CMSSW_11_3_4-build/tmp/BUILDROOT/58d469321d6ecb0427dd1e9f6c3703d5/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/src/RecoPixelVertexing/PixelTrackFitting/interface/BrokenLine.h:381: 
void brokenline::circleFit(const M3xN&, const M6xN&, const V4&, double, brokenline::PreparedBrokenLineData&, brokenline::karimaki_circle_fit&) 
[with M3xN = Eigen::Map, 0, Eigen::Stride<73728, 24576> >; M6xN = Eigen::Map, 0, Eigen::Stride<147456, 24576> >; V4 = Eigen::Map, 0, Eigen::InnerStride<24576> >; 
int n = 4; brokenline::karimaki_circle_fit = riemannFit::CircleFit]: 
Assertion `circle_results.qCharge * circle_results.par(1) <= 0' failed.


A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

Tue Aug 10 17:16:19 CEST 2021
Thread 8 (Thread 0x7fac784f7700 (LWP 230802)):
#0 0x00007faccefa8e2d in nanosleep () from /lib64/libc.so.6
#1 0x00007faccefd9704 in usleep () from /lib64/libc.so.6
#2 0x00007facc97c15ea in FedRawDataInputSource::readSupervisor() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libEventFilterUtilities.so
#3 0x00007faccf8b8af0 in std::execute_native_thread_routine (__p=0x7faca7152620) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4 0x00007faccf2b8dd5 in start_thread () from /lib64/libpthread.so.0
[...]
Thread 7 (Thread 0x7fac84bff700 (LWP 230774)):
#0 0x00007faccefa8e2d in nanosleep () from /lib64/libc.so.6
#1 0x00007faccefa8cc4 in sleep () from /lib64/libc.so.6
#2 0x00007facca071720 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3
#4 0x00007faca3a4052d in HFPreRecAlgo::reconstruct(QIE10DataFrame const&, int, HcalCoder const&, HcalChannelProperties const&) const () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libRecoLocalCaloHcalRecAlgos.so
#5 0x00007faca3071fef in HFPreReconstructor::fillInfos(edm::Event const&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoLocalCaloHcalRecProducers.so
#6 0x00007faca3073acd in HFPreReconstructor::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoLocalCaloHcalRecProducers.so
#7 0x00007facd1ba2b8c in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#8 0x00007facd1b7e4bd in edm::WorkerT::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#9 0x00007facd1adca05 in decltype ({parm#1}()) edm::convertException::wrap >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#10 0x00007facd1adcbbd in bool edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#11 0x00007facd1adcec6 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00007facd1ade7c6 in edm::Worker::RunModuleTask >::execute() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
Thread 6 (Thread 0x7fac85dfe700 (LWP 230770)):
#0 0x00007faccefa8e2d in nanosleep () from /lib64/libc.so.6
#1 0x00007faccefa8cc4 in sleep () from /lib64/libc.so.6
#2 0x00007facca071720 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3
#4 0x00007faca3e5d356 in Eigen::internal::copy_using_evaluator_innervec_CompleteUnrolling >, Eigen::internal::evaluator >, Eigen::internal::assign_op, 0>, 0, 100>::run(Eigen::internal::generic_dense_assignment_kernel >, Eigen::internal::evaluator >, Eigen::internal::assign_op, 0>&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libRecoLocalCaloEcalRecAlgos.so
#5 0x00007faca3e52031 in PulseChiSqSNNLS::updateCov(Eigen::Matrix const&, Eigen::Matrix const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libRecoLocalCaloEcalRecAlgos.so
#6 0x00007faca3e54098 in PulseChiSqSNNLS::Minimize(Eigen::Matrix const&, Eigen::Matrix const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libRecoLocalCaloEcalRecAlgos.so
#7 0x00007faca3e54712 in PulseChiSqSNNLS::DoFit(Eigen::Matrix const&, Eigen::Matrix const&, Eigen::Matrix const&, Eigen::Matrix const&, Eigen::Matrix const&, Eigen::Matrix const&, Eigen::Matrix const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libRecoLocalCaloEcalRecAlgos.so
#8 0x00007faca3e4febc in EcalUncalibRecHitMultiFitAlgo::makeRecHit(EcalDataFrame const&, EcalPedestal const*, EcalMGPAGainRatio const*, std::array, 3ul> const&, Eigen::Matrix const&, Eigen::Matrix const&, Eigen::Matrix const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libRecoLocalCaloEcalRecAlgos.so
#9 0x00007faca4166a15 in EcalUncalibRecHitWorkerMultiFit::run(edm::Event const&, EcalDigiCollection const&, edm::SortedCollection >&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoLocalCaloEcalRecProducersPlugins.so
#10 0x00007faca41394cf in EcalUncalibRecHitProducer::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoLocalCaloEcalRecProducersPlugins.so
#11 0x00007facd1ba2b8c in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00007facd1b7e4bd in edm::WorkerT::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
Thread 5 (Thread 0x7fac867ff700 (LWP 230769)):
#0 0x00007faccefa8e2d in nanosleep () from /lib64/libc.so.6
#1 0x00007faccefa8cc4 in sleep () from /lib64/libc.so.6
#2 0x00007facca071720 in sig_pause_for_stacktrace () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3
#4 0x00007fac8b1f4a89 in CaloTowersCreationAlgo::finish(edm::SortedCollection >&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoLocalCaloCaloTowersCreator.so
#5 0x00007fac8b1fe1a0 in CaloTowersCreator::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoLocalCaloCaloTowersCreator.so
#6 0x00007facd1ba2b8c in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#7 0x00007facd1b7e4bd in edm::WorkerT::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#8 0x00007facd1adca05 in decltype ({parm#1}()) edm::convertException::wrap >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#9 0x00007facd1adcbbd in bool edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#10 0x00007facd1adcec6 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#11 0x00007facd1ade7c6 in edm::Worker::RunModuleTask >::execute() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#12 0x00007facd1d1fb25 in tbb::internal::function_task::execute() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreConcurrency.so
Thread 4 (Thread 0x7fac874f8700 (LWP 230762)):
#0 0x00007faccefa8e2d in nanosleep () from /lib64/libc.so.6
#1 0x00007faccefa8cc4 in sleep () from /lib64/libc.so.6
#2 0x00007facc97a957f in evf::FastMonitoringService::snapshotRunner() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libEventFilterUtilities.so
#3 0x00007faccf8b8af0 in std::execute_native_thread_routine (__p=0x7fac89e6c800) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4 0x00007faccf2b8dd5 in start_thread () from /lib64/libpthread.so.0
[...]
Thread 3 (Thread 0x7fac8ddff700 (LWP 230719)):
#0 0x00007faccf2bc965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007faccf8b385c in __gthread_cond_wait (__mutex=, __cond=) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_1_0_pre6-slc7_amd64_gcc900/build/CMSSW_11_1_0_pre6-build/BUILD/slc7_amd64_gcc900/external/gcc/9.3.0/gcc-9.3.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:865
#2 std::condition_variable::wait (this=, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:53
#3 0x00007facc97bc589 in FedRawDataInputSource::readWorker(unsigned int) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libEventFilterUtilities.so
#4 0x00007faccf8b8af0 in std::execute_native_thread_routine (__p=0x7faca8050af0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
[...]
Thread 2 (Thread 0x7facb055b700 (LWP 230601)):
#0 0x00007faccf2c0179 in waitpid () from /lib64/libpthread.so.0
#1 0x00007facca0718d7 in edm::service::cmssw_stacktrace_fork() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2 0x00007facca07249a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3 0x00007faccf8b8af0 in std::execute_native_thread_routine (__p=0x7facb08c6bf0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4 0x00007faccf2b8dd5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007faccefe1ead in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7faccd5d6540 (LWP 228946)):
#0 0x00007faccefd720d in poll () from /lib64/libc.so.6
#1 0x00007facca071cd7 in full_read.constprop () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2 0x00007facca07256c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3 0x00007facca073922 in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#4
#5 0x00007faccef1a207 in raise () from /lib64/libc.so.6
#6 0x00007faccef1b8f8 in abort () from /lib64/libc.so.6
#7 0x00007faccef13026 in __assert_fail_base () from /lib64/libc.so.6
#8 0x00007faccef130d2 in __assert_fail () from /lib64/libc.so.6
#9 0x00007fac87a1f48e in void brokenline::circleFit, 0, Eigen::Stride<73728, 24576> >, Eigen::Map, 0, Eigen::Stride<147456, 24576> >, Eigen::Map, 0, Eigen::InnerStride<24576> >, 4>(Eigen::Map, 0, Eigen::Stride<73728, 24576> > const&, Eigen::Map, 0, Eigen::Stride<147456, 24576> > const&, Eigen::Map, 0, Eigen::InnerStride<24576> > const&, double, brokenline::PreparedBrokenLineData<4>&, riemannFit::CircleFit&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#10 0x00007fac87a153bc in HelixFitOnGPU::launchBrokenLineKernelsOnCPU(TrackingRecHit2DSOAView const*, unsigned int, unsigned int) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#11 0x00007fac87a32635 in CAHitNtupletGeneratorOnGPU::makeTuples(TrackingRecHit2DHeterogeneous const&, float) const () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#12 0x00007fac87a21c64 in CAHitNtupletCUDA::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#13 0x00007facd1b88e57 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so

We also saw these errors late (after 47 LS) into run 344558 (2 GPU + 2 CPU FUs) and relatively early into run 344560 (standard DAQ configuration with all FUs in).

What is rather strange is that these crashes were seen on standard CPU FUs (e.g. fu-c2a02-27-02, fu-c2a02-45-02, fu-c2a01-07-03).

We did not see any crashes during our previous test (e-log) in CMSSW_11_3_2 (run 343991). We also did not see any issues when testing on Hilton in CMSSW_11_3_4.

Best regards,
Mateusz on behalf of TSG FOG

[1] /cdaq/cosmic/commissioning2021/CRUZET/Cosmics_GPU/V2

@cmsbuild
Contributor

A new Issue was created by @mzarucki Mateusz Zarucki.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor

fwyzard commented Aug 10, 2021

Did any of the crashes happen while running with GPUs, or did they all occur without any GPUs involved?

Is the data from the corresponding runs, lumisections and possible events available ?

Could you prepare some instructions for reproducing the problem ?

@Dr15Jones
Contributor

assign reconstruction, heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous,reconstruction

@slava77,@fwyzard,@perrotta,@makortel,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

@fwyzard
Contributor

fwyzard commented Aug 10, 2021

The first ELOG reports that

cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_3_4-slc7_amd64_gcc900/build/CMSSW_11_3_4-build/tmp/BUILDROOT/58d469321d6ecb0427dd1e9f6c3703d5/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/src/RecoPixelVertexing/PixelTrackFitting/interface/BrokenLine.h:381: void brokenline::circleFit(const M3xN&, const M6xN&, const V4&, double, brokenline::PreparedBrokenLineData&, brokenline::karimaki_circle_fit&) [with M3xN = Eigen::Map, 0, Eigen::Stride<73728, 24576> >; M6xN = Eigen::Map, 0, Eigen::Stride<147456, 24576> >; V4 = Eigen::Map, 0, Eigen::InnerStride<24576> >; int n = 4; brokenline::karimaki_circle_fit = riemannFit::CircleFit]: Assertion `circle_results.qCharge * circle_results.par(1) <= 0' failed. 

A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

...

#9 0x00007fac87a1f48e in void brokenline::circleFit, 0, Eigen::Stride<73728, 24576> >, Eigen::Map, 0, Eigen::Stride<147456, 24576> >, Eigen::Map, 0, Eigen::InnerStride<24576> >, 4>(Eigen::Map, 0, Eigen::Stride<73728, 24576> > const&, Eigen::Map, 0, Eigen::Stride<147456, 24576> > const&, Eigen::Map, 0, Eigen::InnerStride<24576> > const&, double, brokenline::PreparedBrokenLineData<4>&, riemannFit::CircleFit&) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#10 0x00007fac87a153bc in HelixFitOnGPU::launchBrokenLineKernelsOnCPU(TrackingRecHit2DSOAView const*, unsigned int, unsigned int) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#11 0x00007fac87a32635 in CAHitNtupletGeneratorOnGPU::makeTuples(TrackingRecHit2DHeterogeneous const&, float) const () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#12 0x00007fac87a21c64 in CAHitNtupletCUDA::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/pluginRecoPixelVertexingPixelTripletsPlugins.so
#13 0x00007facd1b88e57 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/lib/slc7_amd64_gcc900/libFWCoreFramework.so



The FUs that crashed with these errors are:


fu-c2a02-27-02 - ERRORrun - RUN:344555 - process 228942 on resource(s) ['core11', 'core6', 'core47', 'core0'] exited with signal -6, retries left: 1 2021-08-10 17:24:22
fu-c2a02-35-02 - ERRORrun - RUN:344555 - process 99268 on resource(s) ['core13', 'core33', 'core25', 'core16'] exited with signal -6, retries left: 1 2021-08-10 17:23:33
fu-c2a02-45-02 - ERRORrun - RUN:344555 - process 453875 on resource(s) ['core59', 'core8', 'core46', 'core52'] exited with signal -6, retries left: 1 2021-08-10 17:23:02
fu-c2a02-27-02 - ERRORrun - RUN:344555 - process 228946 on resource(s) ['core43', 'core7', 'core35', 'core33'] exited with signal -6, retries left: 1 2021-08-10 17:16:52

The call to HelixFitOnGPU::launchBrokenLineKernelsOnCPU should happen only on the nodes without a GPU (unless the menu tries to run both CPU and GPU versions ?). Could you confirm that the error is always the same:

  • fu-c2a02-27-02: no GPUs, two crashes
  • fu-c2a02-35-02: has a GPU
  • fu-c2a02-45-02: no GPUs
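
For anyone tallying these by hand: a minimal Python sketch that counts the crash reports per FU from lines like the ones quoted above. The regular expression is inferred from those four lines only (it is not a documented format), and it assumes the retry count is a single digit followed by the timestamp, with or without a separating space.

```python
import re
from collections import Counter

# Pattern inferred from the quoted FU error lines; an assumption, not a documented format.
# The retry count is taken to be one digit, optionally glued to the timestamp that follows.
FU_ERROR = re.compile(
    r"(?P<host>fu-\S+) - ERRORrun - RUN:(?P<run>\d+) - process (?P<pid>\d+) "
    r"on resource\(s\) \[(?P<cores>[^\]]*)\] exited with signal (?P<signal>-?\d+), "
    r"retries left: (?P<retries>\d)\s*(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)


def crashes_per_host(lines):
    """Count the matching crash reports for each filter unit."""
    counts = Counter()
    for line in lines:
        match = FU_ERROR.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


LOG_LINES = [
    "fu-c2a02-27-02 - ERRORrun - RUN:344555 - process 228942 on resource(s) ['core11', 'core6', 'core47', 'core0'] exited with signal -6, retries left: 1 2021-08-10 17:24:22",
    "fu-c2a02-35-02 - ERRORrun - RUN:344555 - process 99268 on resource(s) ['core13', 'core33', 'core25', 'core16'] exited with signal -6, retries left: 1 2021-08-10 17:23:33",
    "fu-c2a02-45-02 - ERRORrun - RUN:344555 - process 453875 on resource(s) ['core59', 'core8', 'core46', 'core52'] exited with signal -6, retries left: 1 2021-08-10 17:23:02",
    "fu-c2a02-27-02 - ERRORrun - RUN:344555 - process 228946 on resource(s) ['core43', 'core7', 'core35', 'core33'] exited with signal -6, retries left: 1 2021-08-10 17:16:52",
]

if __name__ == "__main__":
    for host, count in sorted(crashes_per_host(LOG_LINES).items()):
        print(host, count)
```

This only counts reports per host; it cannot tell which assertion fired, so the per-node error type still has to be checked in the individual logs.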

To investigate the crash, as usual we would need the full instructions to reproduce it:

  • release CMSSW_11_3_4
  • what HLT menu? even better if you could provide a full python dump, including the DAQ customisations
  • what input data? did the crashes cause the data to be sent to the error stream? is it available?

@mzarucki
Contributor Author

mzarucki commented Aug 10, 2021

Hi @fwyzard,

Did any of the crashes happen while running with GPUs, or did they all occur without any GPUs involved?

Correction: the Pixel track fitting crashes occurred on CPUs only (fu-c2a02-27-02, fu-c2a02-45-02, fu-c2a01-07-03).

The call to HelixFitOnGPU::launchBrokenLineKernelsOnCPU should happen only on the nodes without a GPU (unless the menu tries to run both CPU and GPU versions ?).

There might be an issue with the HLT_Pixel path, which was added to run Pixel reconstruction (CMSHLT-2157) but does not filter on anything:

process.HLT_Pixel_v1 = cms.Path( process.HLTBeginSequenceHT + process.hltPrePixelGPU + process.HLTDoLocalPixelSequence + process.HLTRecopixelvertexingSequence + process.HLTEndSequence )

what HLT menu? even better if you could provide a full python dump, including the DAQ customisations

The HLT menu name is listed under [1] above. I will attach the full python config to this ticket.

Is the data from the corresponding runs, lumisections and possible events available ?

what input data? did the crashes cause the data to be sent to the error stream? is it available?

Could you prepare some instructions for reproducing the problem ?

We will work on a recipe to reproduce the errors, over the same input data (we contacted our DAQ colleagues to save it locally).

We will keep you updated.

Best,
Mateusz

@mzarucki
Contributor Author

The menu /cdaq/cosmic/commissioning2021/CRUZET/Cosmics_GPU/V2 python config file is attached here as CosmicsGPU-V2.txt. You can also find it under /nfshome0/mzarucki/GPUTests/CosmicsGPU-V2.py.

Note: The only difference wrt. V1 (and CosmicsGPUFixed-V1.py) is the change in the GT configuration, relevant to the BeamSpot workflow updates.

Best,
Mateusz

@mzarucki
Contributor Author

mzarucki commented Aug 11, 2021

Hi again,

The .raw data could not be copied locally (this seems to be possible only during the run) and the repacked RAW .root files would need a bit of time to become accessible via Rucio.

Nevertheless, I tried to recreate the crashes using an older set of data from run 343387 (stored on EOS via our Rucio rule):
/store/data/Commissioning2021/Cosmics/RAW/v1/000/343/387/00000/05c79a06-0011-4a30-8762-02b7708ecaaa.root

with the following simplified recipe (which includes the input file config) that is to be run on a GPU machine (see TriggerDevelopmentWithGPUs):

cmsrel CMSSW_11_3_4
cd CMSSW_11_3_4/src
cmsenv
cp /nfshome0/mzarucki/GPUTests/CosmicsGPU-V2_modified.py .
cmsRun CosmicsGPU-V2_modified.py

Note: Here the number of threads/streams is left unset.

After running for a while, I do see crashes, however, of a different nature:

cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_3_4-slc7_amd64_gcc900/build/CMSSW_11_3_4-build/tmp/BUILDROOT/58d469321d6ecb0427dd1e9f6c3703d5/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/src/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc:143: 
virtual void SiPixelRecHitFromCUDA::produce(edm::Event&, const edm::EventSetup&): 
Assertion `nhits <= dsv.size()' failed.


A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

...

Module: SiPixelRecHitFromCUDA:hltSiPixelRecHits@cuda (crashed)

Revisiting F3Mon, I do see this crash on the GPU nodes (added e-log). The initial report, which implied that there was only one type of crash, was therefore incorrect. Apologies for overlooking this.

To summarise, we saw this SiPixelRecHitFromCUDA crash only on the GPU nodes and the initial Pixel track fitting crash only on the CPU nodes. I have updated the above responses to avoid confusion.

Best,
Mateusz

@mzarucki mzarucki changed the title from "Pixel track fitting crash during online GPU tests (CMSSW_11_3_4)" to "Pixel track fitting and SiPixelRecHitFromCUDA crashes during online GPU tests (CMSSW_11_3_4)" on Aug 11, 2021
@fwyzard
Contributor

fwyzard commented Aug 11, 2021

As far as I know, the pixel track reconstruction has not been designed to run at 0 tesla.

My first suggestion would be to switch it off, and turn it on only once the magnetic field is nominal.

@fwyzard
Contributor

fwyzard commented Aug 11, 2021

@mtosi @vmariani @mmusich FYI

@fwyzard
Contributor

fwyzard commented Aug 11, 2021

After copying the input file locally¹ and updating the configuration to use it, the GPU error is consistently reproducible:

cmsRun: .../src/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc:143:
virtual void SiPixelRecHitFromCUDA::produce(edm::Event&, const edm::EventSetup&): Assertion `nhits <= dsv.size()' failed.

However, I have not been able to reproduce the CPU-only crash.


  1. kinit ${USER}@CERN.CH
    xrdcp root://eoscms.cern.ch//eos/cms/store/data/Commissioning2021/Cosmics/RAW/v1/000/343/387/00000/05c79a06-0011-4a30-8762-02b7708ecaaa.root .
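
The exact configuration change is not shown in this thread. One possible way to do it, sketched here as a CMSSW-style customisation function: `use_local_input` is a hypothetical helper name, and the sketch assumes that the modified configuration reads ROOT files through a source with a `fileNames` parameter (as a `PoolSource` has). It only works inside a CMSSW environment, after `cmsenv`.

```python
def use_local_input(process, local_path, max_events=-1):
    """Point an already-built cms.Process at a local ROOT file.

    Hypothetical helper; assumes process.source has a `fileNames`
    parameter, as a PoolSource does.
    """
    # FWCore is only importable inside a CMSSW environment (after `cmsenv`),
    # so the import is kept local to the function.
    import FWCore.ParameterSet.Config as cms

    # The "file:" prefix tells the source to read from the local filesystem.
    process.source.fileNames = cms.untracked.vstring("file:" + local_path)

    # A negative value means "process all events in the input".
    process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(max_events))
    return process
```

It would be called at the end of the configuration file, for example `process = use_local_input(process, "05c79a06-0011-4a30-8762-02b7708ecaaa.root")`.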
    

@mzarucki
Contributor Author

Dear all,

Today we performed a repeated test of the HLT GPU menu (e-log), with the intent of reproducing the above crashes. During the test, in run 344676 we were able to see the GPU crash (e-log) only. Considering the GPU menu is exactly the same as the cosmic menu in its content, the decision was made to keep running with the cosmics GPU menu. Roughly an hour into the run 344679 after the formal GPU tests, we saw the CPU crash (e-log) that we were trying to reproduce.

Since DAQ enabled the error stream this morning, we were able to access the data from the crashes. The .raw files from the GPU crash (recipe above) have been saved in /nfshome0/mzarucki/GPUTests/PixelCrashGPU, whereas the data from the CPU crash have been saved as /nfshome0/mzarucki/GPUTests/PixelCrashCPU/run344679_ls0151_index000010_fu-c2a02-33-02_pid48404.raw.
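
When sorting through several error stream files, the run, lumisection, FU and process ID can be read back from the file name. A minimal sketch follows; the naming pattern is inferred from the single file name quoted above, so it is an assumption rather than a documented DAQ convention, and `parse_error_stream_name` is a hypothetical helper.

```python
import re

# Pattern inferred from one example file name; treat it as an assumption.
ERROR_STREAM_NAME = re.compile(
    r"run(?P<run>\d+)_ls(?P<ls>\d+)_index(?P<index>\d+)"
    r"_(?P<host>fu-[A-Za-z0-9-]+)_pid(?P<pid>\d+)\.raw$"
)


def parse_error_stream_name(path):
    """Extract run, lumisection, index, FU host and PID from an error stream file name."""
    match = ERROR_STREAM_NAME.search(path)
    if match is None:
        raise ValueError("unrecognised error stream file name: " + path)
    fields = match.groupdict()
    return {
        "run": int(fields["run"]),
        "ls": int(fields["ls"]),        # zero-padded in the file name
        "index": int(fields["index"]),  # zero-padded in the file name
        "host": fields["host"],
        "pid": int(fields["pid"]),
    }


if __name__ == "__main__":
    example = (
        "/nfshome0/mzarucki/GPUTests/PixelCrashCPU/"
        "run344679_ls0151_index000010_fu-c2a02-33-02_pid48404.raw"
    )
    print(parse_error_stream_name(example))
```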

Here is a recipe to reproduce the CPU crash (which can be done on a GPU node):

export CUDA_VISIBLE_DEVICES=
cmsrel CMSSW_11_3_4
cd CMSSW_11_3_4/src
cmsenv
cp /nfshome0/mzarucki/GPUTests/PixelCrashCPU/CosmicsGPU-V2_modified.py .
cmsRun CosmicsGPU-V2_modified.py

where the HLT config file has been modified to take the error stream data as input.
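
The first line of the recipe is what makes it possible to reproduce the CPU crash on a GPU node: an empty `CUDA_VISIBLE_DEVICES` hides all devices from the CUDA runtime. If the job is launched from a script rather than an interactive shell, the same effect can be obtained by passing a modified environment to the child process. A minimal sketch, where `cpu_only_env` and `run_cpu_only` are hypothetical helper names:

```python
import os
import subprocess
import sys


def cpu_only_env(base_env=None):
    """Return a copy of the environment in which no CUDA device is visible."""
    env = dict(os.environ if base_env is None else base_env)
    # Equivalent to `export CUDA_VISIBLE_DEVICES=` in the shell recipe.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def run_cpu_only(command):
    """Run a command with all GPUs hidden; the parent environment is left untouched."""
    return subprocess.run(command, env=cpu_only_env(), check=False)


if __name__ == "__main__":
    # Stand-in for ["cmsRun", "CosmicsGPU-V2_modified.py"]: the child process
    # simply reports the value of the variable it sees.
    run_cpu_only(
        [sys.executable, "-c",
         "import os; print(repr(os.environ.get('CUDA_VISIBLE_DEVICES')))"]
    )
```

Setting the variable only for the child process avoids accidentally hiding the GPUs from later commands in the same session.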

Best regards,
Mateusz on behalf of FOG

@fwyzard
Contributor

fwyzard commented Aug 12, 2021

Note that the two crashes seem unrelated:

  • running over the events from run 344679 crashes the CPU reconstruction, but seems to work fine with the GPU reconstruction (though it could just be that we do not enable assertions on the GPU)
  • running over the events from run 344676 crashes on the GPU (actually, in the module that copies the data from the GPU to the CPU) but works fine on the CPU

@jlawrenc

jlawrenc commented Aug 13, 2021

As mentioned in this elog: http://cmsonline.cern.ch/cms-elog/1122928, we attempted to reproduce the error during run 344675, but nothing occurred during the 24-minute run. When we included DT in the next run (344676), the error (GPU error only) occurred within 11 minutes.

@slava77
Contributor

slava77 commented Aug 13, 2021

@cms-sw/trk-dpg-l2
please take a note (I'm not sure if you received this already)

@fwyzard
do you know, perhaps, if someone is looking/debugging this already?

@fwyzard
Contributor

fwyzard commented Aug 13, 2021

@fwyzard
do you know, perhaps, if someone is looking/debugging this already?

No, not that I know of.
I've just started looking at them myself.

@fwyzard
Contributor

fwyzard commented Aug 13, 2021

The error from run 344676 looks like a real mismatch between the legacy and GPU pixel cluster / rechit reconstruction.

I think we need @mmusich or @VinInn to have a look.

@slava77
Contributor

slava77 commented Sep 3, 2021

the case with "Assertion `circle_results.qCharge * circle_results.par(1) <= 0' failed." is apparently "fixed" (the assert is removed) in #35128.

Will this need to be backported to 11_3_X and 12_0_X, or is 12_0_X enough?
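
To make the difference concrete: with the assert in place, a single fit that violates the condition aborts the whole job, whereas without it the job carries on. The sketch below is not the change made in #35128 (described here only as removing the assert); it is a generic Python illustration of aborting versus reporting a failed consistency check. The function names are hypothetical, and only the condition `q_charge * par1 <= 0` is taken from the assertion message.

```python
def charge_curvature_consistent(q_charge, par1):
    """Evaluate the condition from the assertion message: q_charge * par1 <= 0.

    A NaN in either input makes the comparison False.
    """
    return q_charge * par1 <= 0


def check_fit(q_charge, par1, strict):
    """Return whether the condition holds; in strict mode, abort on failure instead."""
    ok = charge_curvature_consistent(q_charge, par1)
    if strict and not ok:
        # Hard failure: comparable to the assert that aborted cmsRun.
        raise AssertionError("q_charge * par1 <= 0 violated")
    # Soft failure: the caller decides what to do with an inconsistent fit.
    return ok


if __name__ == "__main__":
    print(check_fit(1.0, -0.5, strict=True))
    print(check_fit(1.0, 0.5, strict=False))
    print(check_fit(1.0, float("nan"), strict=False))
```

Whether an inconsistent fit should then be dropped, flagged or kept is a physics decision that this sketch does not address.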

@fwyzard
Contributor

fwyzard commented Sep 3, 2021 via email

@makortel
Contributor

This one has been fixed, right?

@VinInn
Contributor

VinInn commented Oct 14, 2021

"fixed" yes.

@makortel
Contributor

+heterogneous

@jpata
Contributor

jpata commented May 17, 2022

@cmsbuild please close

(typo in the heterogeneous comment prevented a full sig)

@fwyzard
Contributor

fwyzard commented May 17, 2022

+heterogeneous

Just for the record... we should have picked a word easier to spell :-/

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.


9 participants