HLT GPU crash observed during the Heavy Ion test Run #40623
Comments
A new Issue was created by @denerslemos Dener Lemos. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign heterogeneous, hlt, reconstruction FYI @cms-sw/trk-dpg-l2, @cms-sw/tracking-pog-l2
New categories assigned: heterogeneous,hlt,reconstruction @mandrenguyen,@missirol,@fwyzard,@clacaputo,@makortel,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks
Does the job use multiple threads/streams? If it does, the actual error could happen elsewhere than |
I ran it with the default settings, which I think are multi-threaded. Should I test it again using a single thread?
This goes away with something like
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 48dfa98839d..6e3d50e6d5b 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -656,7 +656,7 @@ namespace pixelgpudetails {
digis_d.view().moduleInd(), clusters_d.moduleStart(), digis_d.view().clus(), wordCounter);
cudaCheck(cudaGetLastError());
- threadsPerBlock = 256 + 128; /// should be larger than 6000/16 aka (maxPixInModule/maxiter in the kernel)
+ threadsPerBlock = 256 + 128 + 128; /// should be larger than 6000/16 aka (maxPixInModule/maxiter in the kernel)
blocks = phase1PixelTopology::numberOfModules;
#ifdef GPU_DEBUG
std::cout << "CUDA findClus kernel launch with " << blocks << " blocks of " << threadsPerBlock << " threads\n";
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h b/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
index ed3510e4918..fe36d22ab46 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
@@ -141,7 +141,7 @@ namespace gpuClustering {
//init hist (ymax=416 < 512 : 9bits)
//6000 max pixels required for HI operations with no measurable impact on pp performance
- constexpr uint32_t maxPixInModule = 6000;
+ constexpr uint32_t maxPixInModule = 10000;
constexpr auto nbins = isPhase2 ? 1024 : phase1PixelTopology::numColsInModule + 2; //2+2;
constexpr auto nbits = isPhase2 ? 10 : 9; //2+2;
using Hist = cms::cuda::HistoContainer<uint16_t, nbins, maxPixInModule, nbits, uint16_t>;
[adiflori@fu-c2a02-37-02 (gpu-c2a02-37-02) src]$ vi RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
[adiflori@fu-c2a02-37-02 (gpu-c2a02-37-02) src]$ git diff
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 48dfa98839d..e5d59b1540b 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -656,7 +656,7 @@ namespace pixelgpudetails {
digis_d.view().moduleInd(), clusters_d.moduleStart(), digis_d.view().clus(), wordCounter);
cudaCheck(cudaGetLastError());
- threadsPerBlock = 256 + 128; /// should be larger than 6000/16 aka (maxPixInModule/maxiter in the kernel)
+ threadsPerBlock = 256 + 128 + 128 + 128; /// should be larger than 10000/16 aka (maxPixInModule/maxiter in the kernel)
blocks = phase1PixelTopology::numberOfModules;
#ifdef GPU_DEBUG
std::cout << "CUDA findClus kernel launch with " << blocks << " blocks of " << threadsPerBlock << " threads\n";
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h b/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
index ed3510e4918..fe36d22ab46 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h
@@ -141,7 +141,7 @@ namespace gpuClustering {
//init hist (ymax=416 < 512 : 9bits)
//6000 max pixels required for HI operations with no measurable impact on pp performance
- constexpr uint32_t maxPixInModule = 6000;
+ constexpr uint32_t maxPixInModule = 10000;
constexpr auto nbins = isPhase2 ? 1024 : phase1PixelTopology::numColsInModule + 2; //2+2;
constexpr auto nbits = isPhase2 ? 10 : 9; //2+2;
using Hist = cms::cuda::HistoContainer<uint16_t, nbins, maxPixInModule, nbits, uint16_t>; in |
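The code comment in both diffs says that threadsPerBlock should be larger than maxPixInModule/maxiter. As a quick sanity check of the numbers quoted above, here is a minimal sketch of that arithmetic. It assumes the condition amounts to threadsPerBlock * maxiter >= maxPixInModule and that maxiter is 16, as the "/16" in the comment suggests. The helper names `minThreadsPerBlock` and `coversModule` are hypothetical and are not part of CMSSW.

```cpp
#include <cstdint>

// Hypothetical helper (not CMSSW code): smallest block size such that
// threadsPerBlock * maxIter >= maxPixInModule, using ceiling division.
constexpr std::uint32_t minThreadsPerBlock(std::uint32_t maxPixInModule, std::uint32_t maxIter) {
  return (maxPixInModule + maxIter - 1) / maxIter;
}

// Hypothetical helper: true if a block of this size can cover a full module
// when each thread handles at most maxIter pixels.
constexpr bool coversModule(std::uint32_t threadsPerBlock, std::uint32_t maxPixInModule, std::uint32_t maxIter) {
  return threadsPerBlock >= minThreadsPerBlock(maxPixInModule, maxIter);
}

// Assumption: taken from the "/16" in the source comment.
constexpr std::uint32_t kMaxIter = 16;

// Original settings: 384 threads for 6000 pixels (needs 375).
static_assert(coversModule(256 + 128, 6000, kMaxIter), "384 threads cover 6000 pixels");
// First diff: 512 threads for 10000 pixels (needs 625).
static_assert(!coversModule(256 + 128 + 128, 10000, kMaxIter), "512 threads do not cover 10000 pixels");
// Second diff: 640 threads for 10000 pixels (needs 625).
static_assert(coversModule(256 + 128 + 128 + 128, 10000, kMaxIter), "640 threads cover 10000 pixels");
```

Under these assumptions, only the second diff (640 threads) satisfies the condition stated in the comment for maxPixInModule = 10000. The first diff's 512 threads fall short of the 625 required.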
The "B12" in the sample name means it contains peripheral events, so they are pretty light. There's also real data from the test run, but the MB trigger was pretty noisy.
@mandrenguyen thanks! I'll check that then.
Any chance that a fix for this could converge in time for |
The fix seems to be pretty HI-dependent (I don't think we can attain that occupancy in pp).
I didn't know that. I agree it's not urgent; I asked as a way to understand what the status of the fix is.
In principle this could be disentangled from pp (and that's my plan for the final fix). I have a PR basically ready for this, but with the whole Alpaka migration happening in the background, I would wait for that to be completed and then put this fix on top.
@AdrianoDee, was this issue resolved by #41632?
@missirol yes, using the HIonPhase1 modules in place of the standard "pp" ones.
+hlt
+1
Hi all,
During the HI test run at the end of 2022 we observed crashes (about 1 in every ~2000 events) that appear to come from the HLT GPUs.
I have reproduced some of the errors using Run 362317, which is still available in the error stream (/store/error_stream/run362317). We followed the instructions in https://twiki.cern.ch/twiki/bin/view/CMS/HLTReportingFarmCrashes, but used the following for EvFDaqDirector and FedRawDataInputSource:
I copy and paste part of the error here:
Since these are HI collisions, a lot of tracks are produced, and it looks like the number of pixels is higher than some threshold, which I think crashes hltSiPixelClustersGPU. It would be good if we could solve this issue.
Thank you in advance,
Best regards,
Dener Lemos
@FHead
@missirol
@fwyzard
@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI