HLT GPU crash observed during the Heavy Ion test Run #40623

denerslemos · 2023-01-26T16:49:20Z

Hi all,

During the HI test run at the end of 2022 we observed some crashes (about 1 in every ~2000 events) that appear to come from the HLT GPUs.

I have reproduced some of the errors using Run 362317, which is still available in the error stream (/store/error_stream/run362317). We followed the instructions in https://twiki.cern.ch/twiki/bin/view/CMS/HLTReportingFarmCrashes, but used the following settings for EvFDaqDirector and FedRawDataInputSource:

process.EvFDaqDirector.buBaseDir = '/store/error_stream'
process.EvFDaqDirector.runNumber = 362317

process.source.fileListMode = True
process.source.fileNames = cms.untracked.vstring('/store/error_stream/run362317/run362317_ls0011_index000031_fu-c2b02-21-01_pid3489238.raw')

I paste the relevant part of the error here:

%MSG-w EvFDaqDirector:  DQMFileSaverPB:hltDQMFileSaverPB@beginRun  18-Jan-2023 14:50:51  Run: 362317
Transfer system mode definitions missing for -: streamDQMHistograms (permissive mode)
%MSG
too many pixels in module 44: 6932 > 6000
too many pixels in module 47: 6990 > 6000
too many pixels in module 45: 7330 > 6000
too many pixels in module 52: 6538 > 6000
too many pixels in module 54: 6788 > 6000
too many pixels in module 53: 6316 > 6000
too many pixels in module 40: 7670 > 6000
too many pixels in module 43: 7204 > 6000
too many pixels in module 42: 7392 > 6000
too many pixels in module 41: 7682 > 6000
too many pixels in module 51: 7362 > 6000
too many pixels in module 49: 7182 > 6000
too many pixels in module 48: 7334 > 6000
----- Begin Fatal Exception 18-Jan-2023 14:51:02 -----------------------
An exception of category 'CUDAError' occurred while
   [0] Processing  Event run: 362317 lumi: 11 event: 5176582 stream: 0
   [1] Running path 'AlCa_LumiPixelsCounts_ZeroBias_v4'
   [2] Calling method for module SiPixelRawToClusterCUDA/'hltSiPixelClustersGPU'
Exception Message:
Callback of CUDA stream 0x7fb8d7015670 in device 0 error cudaErrorIllegalAddress: an illegal memory access was encountered
----- End Fatal Exception -------------------------------------------------
%MSG-w FastMonitoringService:  PostProcessPath 18-Jan-2023 14:51:02   Run: 362317 Event: 5176582
 STREAM 0 earlyTermination -: ID:run: 362317 lumi: 11 event: 5176582 LS:11  FromThisContext
%MSG
terminate called after throwing an instance of 'std::runtime_error'
  what():
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_2-el8_amd64_gcc10/build/CMSSW_12_5_2-build/tmp/BUILDROOT/031342ca2bb2896e4fe0fed19213b336/opt/cmssw/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/src/CalibTracker/SiPixelESProducers/src/SiPixelGainCalibrationForHLTGPU.cc, line 77:
cudaCheck(cudaFreeHost(gainForHLTonHost_));
cudaErrorIllegalAddress: an illegal memory access was encountered

A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

Wed Jan 18 14:51:03  2023
Thread 8 (Thread 0x7fb8d7fff700 (LWP 1076578) "cmsRun"):
#0  0x00007fb90fa64d98 in nanosleep () from /lib64/libc.so.6
#1  0x00007fb90fa64c9e in sleep () from /lib64/libc.so.6
#2  0x00007fb905e9d3a0 in sig_pause_for_stacktrace () from /opt/offline/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_2/lib/el8_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fb90fa9355d in syscall () from /lib64/libc.so.6
#5  0x00007fb910bcd2c7 in tbb::detail::r1::futex_wait (comparand=2, futex=0x7fb90a8e912c) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-el8_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/semaphore.h:103
#6  tbb::detail::r1::binary_semaphore::P (this=0x7fb90a8e912c) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-el8_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/semaphore.h:290
#7  0x00007fb910bdf8f1 in tbb::detail::r1::rml::internal::thread_monitor::commit_wait (c=..., this=0x7fb90a8e9120) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-el8_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/rml_thread_monitor.h:243
#8  tbb::detail::r1::rml::private_worker::run (this=0x7fb90a8e9100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-el8_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/private_server.cpp:274
#9  tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fb90a8e9100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_5_0_pre4-el8_amd64_gcc10/build/CMSSW_12_5_0_pre4-build/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-3cd580209e999b2fb4f8344347204353/tbb-v2021.5.0/src/tbb/private_server.cpp:221
#10 0x00007fb90fd6b17a in start_thread () from /lib64/libpthread.so.0
#11 0x00007fb90fa98df3 in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7fb843ef0700 (LWP 1076577) "cmsRun"):
#0  0x00007fb90fd73cd6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007fb90fd73dc8 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fb90286e0a2 in ?? () from /lib64/libcuda.so.1

.
.
.

Current Modules:

Module: none (crashed)
Module: none

A fatal system signal has occurred: abort signal

Since these are HI collisions, a lot of tracks are produced, and it looks like the number of pixels per module exceeds a threshold (6000 in the log above), which I think crashes hltSiPixelClustersGPU. It would be good if we could solve this issue.
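The overflow warnings in the log above can be summarised with a short script. This is a minimal sketch in Python, not part of CMSSW: the names `PATTERN` and `parse_overflows` are hypothetical, and the log text is copied from the excerpt above.

```python
import re

# Warning lines copied from the log excerpt in this report.
LOG = """\
too many pixels in module 44: 6932 > 6000
too many pixels in module 47: 6990 > 6000
too many pixels in module 45: 7330 > 6000
too many pixels in module 52: 6538 > 6000
too many pixels in module 54: 6788 > 6000
too many pixels in module 53: 6316 > 6000
too many pixels in module 40: 7670 > 6000
too many pixels in module 43: 7204 > 6000
too many pixels in module 42: 7392 > 6000
too many pixels in module 41: 7682 > 6000
too many pixels in module 51: 7362 > 6000
too many pixels in module 49: 7182 > 6000
too many pixels in module 48: 7334 > 6000
"""

# Hypothetical pattern matching the warning format seen in the log.
PATTERN = re.compile(r"too many pixels in module (\d+): (\d+) > (\d+)")


def parse_overflows(text):
    """Return a list of (module, n_pixels, limit) tuples found in the log text."""
    return [(int(mod), int(n), int(lim)) for mod, n, lim in PATTERN.findall(text)]


overflows = parse_overflows(LOG)
worst = max(overflows, key=lambda entry: entry[1])
print(len(overflows), worst)  # prints: 13 (41, 7682, 6000)
```

In this excerpt, 13 modules exceed the limit of 6000 pixels, and the largest count is 7682 (module 41).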

Thank you in advance,

Best regards,
Dener Lemos

@FHead
@missirol
@fwyzard
@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI

cmsbuild · 2023-01-26T16:49:37Z

A new Issue was created by @denerslemos Dener Lemos.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel · 2023-01-26T17:00:37Z

assign heterogeneous, hlt, reconstruction

FYI @cms-sw/trk-dpg-l2, @cms-sw/tracking-pog-l2

cmsbuild · 2023-01-26T17:01:01Z

New categories assigned: heterogeneous,hlt,reconstruction

@mandrenguyen,@missirol,@fwyzard,@clacaputo,@makortel,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel · 2023-01-26T17:08:29Z

Does the job use multiple threads/streams? If it does, the actual error could happen somewhere other than hltSiPixelClustersGPU, and that module just happens to be the first one to catch and report it (because in CUDA, errors in asynchronous processing are reported by all CUDA API calls issued after the error).

denerslemos · 2023-01-26T19:28:44Z

I have run using the default, which I think is multi-threaded. Should I test it again using a single thread?
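For reference, a single-threaded, single-stream rerun could be set up along these lines. This is only a sketch: `process.options.numberOfThreads` and `process.options.numberOfStreams` are the usual cmsRun options, but the helper name `force_single_thread` is hypothetical, and the stand-in object below exists only so the snippet runs outside CMSSW.

```python
from types import SimpleNamespace


def force_single_thread(process):
    """Restrict a process to one thread and one stream (hypothetical helper)."""
    process.options.numberOfThreads = 1
    process.options.numberOfStreams = 1
    return process


# Stand-in for the cms.Process object; in a real job `process` would come
# from the HLT configuration used to reproduce the crash.
demo = SimpleNamespace(options=SimpleNamespace(numberOfThreads=4, numberOfStreams=4))
force_single_thread(demo)
print(demo.options.numberOfThreads, demo.options.numberOfStreams)  # prints: 1 1
```

With a single stream, the module that reports the CUDA error is more likely to be the one that caused it, since no other stream is issuing GPU work concurrently.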

AdrianoDee · 2023-02-01T15:04:31Z

This goes away with something like

diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 48dfa98839d..6e3d50e6d5b 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -656,7 +656,7 @@ namespace pixelgpudetails {
           digis_d.view().moduleInd(), clusters_d.moduleStart(), digis_d.view().clus(), wordCounter);
       cudaCheck(cudaGetLastError());
 
-      threadsPerBlock = 256 + 128;  /// should be larger than 6000/16 aka (maxPixInModule/maxiter in the kernel)
+      threadsPerBlock = 256 + 128 + 128;  /// should be larger than 6000/16 aka (maxPixInModule/maxiter in the kernel)
       blocks = phase1PixelTopology::numberOfModules;
 #ifdef GPU_DEBUG
       std::cout << "CUDA findClus kernel launch with " << blocks << " blocks of " << threadsPerBlock << " threads\n";
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h b/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
index ed3510e4918..fe36d22ab46 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
@@ -141,7 +141,7 @@ namespace gpuClustering {
 
       //init hist  (ymax=416 < 512 : 9bits)
       //6000 max pixels required for HI operations with no measurable impact on pp performance
-      constexpr uint32_t maxPixInModule = 6000;
+      constexpr uint32_t maxPixInModule = 10000;
       constexpr auto nbins = isPhase2 ? 1024 : phase1PixelTopology::numColsInModule + 2;  //2+2;
       constexpr auto nbits = isPhase2 ? 10 : 9;                                           //2+2;
       using Hist = cms::cuda::HistoContainer<uint16_t, nbins, maxPixInModule, nbits, uint16_t>;
[adiflori@fu-c2a02-37-02 (gpu-c2a02-37-02) src]$ vi RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu 
[adiflori@fu-c2a02-37-02 (gpu-c2a02-37-02) src]$ git diff
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 48dfa98839d..e5d59b1540b 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -656,7 +656,7 @@ namespace pixelgpudetails {
           digis_d.view().moduleInd(), clusters_d.moduleStart(), digis_d.view().clus(), wordCounter);
       cudaCheck(cudaGetLastError());
 
-      threadsPerBlock = 256 + 128;  /// should be larger than 6000/16 aka (maxPixInModule/maxiter in the kernel)
+      threadsPerBlock = 256 + 128 + 128 + 128;  /// should be larger than 10000/16 aka (maxPixInModule/maxiter in the kernel)
       blocks = phase1PixelTopology::numberOfModules;
 #ifdef GPU_DEBUG
       std::cout << "CUDA findClus kernel launch with " << blocks << " blocks of " << threadsPerBlock << " threads\n";
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h b/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
index ed3510e4918..fe36d22ab46 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
@@ -141,7 +141,7 @@ namespace gpuClustering {
 
       //init hist  (ymax=416 < 512 : 9bits)
       //6000 max pixels required for HI operations with no measurable impact on pp performance
-      constexpr uint32_t maxPixInModule = 6000;
+      constexpr uint32_t maxPixInModule = 10000;
       constexpr auto nbins = isPhase2 ? 1024 : phase1PixelTopology::numColsInModule + 2;  //2+2;
       constexpr auto nbits = isPhase2 ? 10 : 9;                                           //2+2;
       using Hist = cms::cuda::HistoContainer<uint16_t, nbins, maxPixInModule, nbits, uint16_t>;

in 12_5_3. It would be useful to have a distribution of the number of pixels per module. When checking with, for example, RelValHydjetQ_B12_5020GeV_2021, I see that the distribution is well below 6000 (see below for 5000 events). Is there a more representative sample?
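The constraint quoted in the diff comment (threadsPerBlock should be larger than maxPixInModule/maxiter, with maxiter = 16) can be checked by hand. Below is a minimal sketch in Python, assuming that this reading of the comment is the whole constraint; `min_threads` and `covers` are hypothetical helper names, not CMSSW functions.

```python
MAX_ITER = 16  # per-thread iteration count quoted in the diff comment (assumed)


def min_threads(max_pix_in_module, max_iter=MAX_ITER):
    """Smallest thread count with threads * max_iter >= max_pix_in_module."""
    return -(-max_pix_in_module // max_iter)  # ceiling division


def covers(threads_per_block, max_pix_in_module, max_iter=MAX_ITER):
    """True if the block can visit every pixel within max_iter iterations."""
    return threads_per_block * max_iter >= max_pix_in_module


# (threadsPerBlock, maxPixInModule) pairs appearing in the diffs above.
cases = [
    (256 + 128, 6000),               # original setting
    (256 + 128 + 128, 10000),        # first diff
    (256 + 128 + 128 + 128, 10000),  # second diff
]
for threads, max_pix in cases:
    print(threads, max_pix, min_threads(max_pix), covers(threads, max_pix))
# prints:
# 384 6000 375 True
# 512 10000 625 False
# 640 10000 625 True
```

Under this reading, 512 threads would not be enough for maxPixInModule = 10000, while the 640 threads of the second diff would be. The largest count in the reported log (7682 pixels) would need at least 481 threads.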

mandrenguyen · 2023-02-01T15:12:45Z

The "B12" in the sample name means it's peripheral events, so pretty light.
A MinBias Hydjet sample is here:
https://cmsweb.cern.ch/das/request?input=/MinBias_Hydjet_Drum5F_5p02TeV/Run3Winter22PbPbNoMixRECOMiniAOD-122X_mcRun3_2021_realistic_HI_v10-v3/MINIAODSIM

There's also real data from the test run, but the MB trigger was pretty noisy.
Maybe best to stick with MC.

AdrianoDee · 2023-02-01T15:13:55Z

@mandrenguyen thanks! I'd check that then.

missirol · 2023-02-24T20:02:36Z

Any chance that a fix for this could converge in time for 13_0_0 ?

mmusich · 2023-02-24T20:06:57Z

> Any chance that a fix for this could converge in time for 13_0_0 ?

The fix seems to be pretty HI-dependent (I don't think we can attain that occupancy in pp).
Then, why the urgency? My understanding is that the next HI run will be processed in 13_2_x (as per this).
Does HLT plan to stick to 13_0_X instead?

missirol · 2023-02-24T20:21:10Z

> (I don't think we can attain that occupancy in pp)

I didn't know that. I agree it's not urgent, I asked as a way to understand what the status of the fix is.

AdrianoDee · 2023-02-25T07:56:40Z

In principle this could be disentangled from pp (and that's my plan for the final fix). I have a PR basically ready for this but with the whole Alpaka migration happening in the background I would wait for that to happen to have this fix on top.

missirol · 2023-08-07T16:20:51Z

@AdrianoDee, was this issue resolved by #41632?

AdrianoDee · 2023-08-08T07:49:11Z

@missirol yes, using the HIonPhase1 modules in place of the standard "pp" ones.

mmusich · 2023-10-14T14:16:19Z

+hlt

See HLT GPU crash observed during the Heavy Ion test Run #40623 (comment).
Technically, I think this particular issue is fixed by More Configurable Pixel Tracks and HIon Template #41632, even though in the 2023 production HIon run we weren't yet able to run the full pixel tracking on GPUs (only local reco).

mandrenguyen · 2023-10-14T14:42:22Z

+1

cmsbuild added the pending-assignment label Jan 26, 2023

cmsbuild added heterogeneous-pending hlt-pending pending-signatures reconstruction-pending and removed pending-assignment labels Jan 26, 2023

AdrianoDee mentioned this issue May 11, 2023

More Configurable Pixel Tracks and HIon Template #41632

Merged

cmsbuild added hlt-approved and removed hlt-pending labels Oct 14, 2023

cmsbuild added reconstruction-approved and removed reconstruction-pending labels Oct 14, 2023
