Large GPU/CPU difference in soft electron reconstruction related to pixel unpacker #41715
Comments
A new Issue was created by @silviodonato Silvio Donato. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Note: the release is |
assign hlt, reconstruction, heterogeneous |
New categories assigned: heterogeneous,hlt,reconstruction @mandrenguyen,@missirol,@fwyzard,@clacaputo,@makortel,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks |
if one removes |
Yes, indeed I wanted to say that you get different results even if you run everything on CPU except process.hltSiPixelDigis on GPU vs everything on CPU.
That's very, very strange. The unpacker is supposed to be identical. Nothing can be different (unless the cabling map is wrong on GPU)
|
Hello Silvio, |
Hello,
vs
not sure if this can be the cause of the large difference that appears in
|
ping @cms-sw/trk-dpg-l2 |
But those are "user" errors. Should not affect reco |
I've taken an event where we have differences (the 290th in the file provided). There is no difference in the local reco objects (clusters, digis and hits). The problem seems to be that on GPU we get extra
On GPU, adding in
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 293d4422e84..ab06086d150 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -168,11 +169,11 @@ namespace pixelgpudetails {
template <bool debug = false>
__device__ uint8_t
checkROC(uint32_t errorWord, uint8_t fedId, uint32_t link, const SiPixelROCsStatusAndMapping *cablingMap) {
uint8_t errorType = (errorWord >> sipixelconstants::ROC_shift) & sipixelconstants::ERROR_mask;
if (errorType < 25)
return 0;
+ printf("errorType;%d;%d;%d;%d;%d;%d\n",errorType,errorWord,fedId+1200,link,sipixelconstants::ROC_shift,sipixelconstants::ERROR_mask);
bool errorFound = false;
and for the CPU side, adding in
diff --git a/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc b/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc
index 9bde98bef92..10cd0154d6d 100644
--- a/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc
+++ b/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc
@@ -27,8 +28,11 @@ bool ErrorChecker::checkROC(bool& errorsInEvent,
Word32& errorWord,
SiPixelFormatterErrors& errors) const {
int errorType = (errorWord >> ROC_shift) & ERROR_mask;
+
if LIKELY (errorType < 25)
return true;
+
+ printf("errorType;%d;%d;%d;%d;%d;%d\n",errorType,errorWord,fedId,(errorWord >> LINK_shift) & LINK_mask,ROC_shift,ERROR_mask);
the GPU output has a couple of extra lines:
3358a3359
> errorType;29;1201668113;1208;17
3359a3361
> errorType;29;329252868;1227;4
Why this is happening I still don't know. But the difference is there and it's causing, in this event, modules |
here there is an early return maybe also for other cases |
this change alone (which, if I am not mistaken, would put the FED error zoology treatment on par between GPU and CPU)
diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 293d4422e84..2f728fe024e 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -189,13 +189,13 @@ namespace pixelgpudetails {
case (26): {
if constexpr (debug)
printf("Gap word found (errorType = 26)\n");
- errorFound = true;
+ errorFound = false;
break;
}
case (27): {
if constexpr (debug)
printf("Dummy word found (errorType = 27)\n");
- errorFound = true;
+ errorFound = false;
break;
}
case (28): {
@@ -208,8 +208,10 @@ namespace pixelgpudetails {
if constexpr (debug)
printf("Timeout on a channel (errorType = 29)\n");
if ((errorWord >> sipixelconstants::OMIT_ERR_shift) & sipixelconstants::OMIT_ERR_mask) {
+ errorFound = false;
if constexpr (debug)
printf("...first errorType=29 error, this gets masked out\n");
+ break;
}
errorFound = true;
      break;
only brings the GPU count down by 1 (from 82 to 81 passing events). |
Hi everyone, following up on the comments and using this commit from Marco - I reran the to check the differences between CPU and GPU for For example:
Now with the change:
Similarly for the other The details are in this spreadsheet |
Hi everyone, Before :
Now with the PR:
Similarly for the other |
For the record there were two additional fixes: |
PR #42977 was included in CMSSW_13_2_6_patch1, that went online on October 13th 2023 for HI collision runs starting from run 375083. For more information about the effect on the offline pp menu, see #42978 (comment). |
Dear all,
@gparida recently made a new CPU vs CPU+GPU comparison of the trigger results of the 2023 HLT menu.
The results showed a very large difference in the soft di-electron parking.
Basically
HLT_DoubleEleXX_eta1p22_mMax6_v3
show a +25% rate increase when running on GPU. The good news is that almost all events triggered by CPU are also triggered by GPU. This means that we are not losing signal events at P5.
The di-electron paths were recently updated in CMSHLT-2635 with the usage of triplets, instead of doublets, in the electron reconstruction.
Minor differences were already visible in the old version of the path (based on doublets) here , but in that case the average rate of CPU and GPU was compatible.
I investigated the problem a bit, and I see that the differences are already visible in the pixel matching module (
hltDoubleEle4eta1p22PixelMatchFilter
) before the GSF tracking.
How to reproduce the problem
I copied in
/afs/cern.ch/work/s/sdonato/public/GPU_May23/hlt_onlypixelmatching_dump.py
a Python config containing a fake HLT path running the pixel matching filter, and in
/afs/cern.ch/work/s/sdonato/public/GPU_May23/skim.root
740 events passing all the filters before it.
Running
you can see that:
The differences are still visible if you remove the following cuda branches
but they disappear if you remove
This means that the origin of the difference is somehow in the pixel unpacker:
@cms-sw/hlt-l2 @cms-sw/egamma-pog-l2 @cms-sw/trk-dpg-l2 @cms-sw/heterogeneous-l2