Large GPU/CPU difference in soft electron reconstruction related to pixel unpacker #41715

silviodonato opened this issue May 18, 2023 · 19 comments

@silviodonato

Dear all,

@gparida recently made a new CPU vs CPU+GPU comparison of the trigger results of the 2023 HLT menu.
The results showed a very large difference in the soft di-electron parking paths.
Basically, HLT_DoubleEleXX_eta1p22_mMax6_v3 has about 25% higher rate when running on GPU. The good news is that almost all events triggered by the CPU are also triggered by the GPU, which means we are not losing signal events at P5.

The di-electron paths were recently updated in CMSHLT-2635 with the usage of triplets, instead of doublets, in the electron reconstruction.

Minor differences were already visible in the old version of the path (based on doublets) here, but in that case the average rates of CPU and GPU were compatible.

I investigated the problem a bit, and I see that the differences are already visible in the pixel matching module (hltDoubleEle4eta1p22PixelMatchFilter), before the GSF tracking.

How to reproduce the problem

I copied to /afs/cern.ch/work/s/sdonato/public/GPU_May23/hlt_onlypixelmatching_dump.py a Python config containing a fake HLT path running the pixel matching filter, and to /afs/cern.ch/work/s/sdonato/public/GPU_May23/skim.root 740 events passing all the filters before it.

Running

CUDA_VISIBLE_DEVICES=0 cmsRun hlt_onlypixelmatching_dump.py >& logGPU &
CUDA_VISIBLE_DEVICES= cmsRun hlt_onlypixelmatching_dump.py >& logCPU &

you can see that:

  • with GPU 82 events are accepted
  • without GPU 73 events are accepted.
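
As a quick cross-check, the accept counts can be read off the framework TrigReport summaries at the end of the two logs; a minimal sketch in Python, assuming the jobs print the standard TrigReport summary at end of job ("logGPU"/"logCPU" as produced by the commands above):

# Print the TrigReport summary lines of both jobs side by side.
for log in ("logGPU", "logCPU"):
    print(f"--- {log} ---")
    with open(log) as f:
        for line in f:
            if line.startswith("TrigReport"):
                print(line.rstrip())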

The differences are still visible if you remove the following cuda branches:

del process.hltEcalDigis.cuda
del process.hltEcalUncalibRecHit.cuda
del process.hltHbhereco.cuda
del process.hltPixelTracksSoA.cuda
del process.hltPixelVerticesSoA.cuda
del process.hltSiPixelRecHits.cuda
del process.hltSiPixelRecHitsSoA.cuda
del process.hltSiPixelClusters.cuda

but they disappear if you remove

del process.hltSiPixelDigis.cuda
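
Such a one-module-at-a-time scan is easy to script; a minimal sketch, assuming it is appended to the dumped config (where process is defined) and that MODULE_ON_CPU is a hypothetical environment variable naming the module whose cuda branch to drop in a given run:

import os

# Hypothetical bisection helper: drop the "cuda" branch of a single
# SwitchProducerCUDA per job, so only that module falls back to its CPU
# implementation; rerun and compare the accept counts.
name = os.environ.get("MODULE_ON_CPU", "")
if name:
    module = getattr(process, name)
    del module.cuda  # the switch then resolves to its "cpu" branch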

This means that the origin of the difference lies somewhere in the pixel unpacker:

process.hltSiPixelClustersGPU = cms.EDProducer("SiPixelRawToClusterCUDA",
    CablingMapLabel = cms.string(''),
    IncludeErrors = cms.bool(True),
    InputLabel = cms.InputTag("rawDataCollector"),
    Regions = cms.PSet(

    ),
    UseQualityInfo = cms.bool(False),
    clusterThreshold_layer1 = cms.int32(4000),
    clusterThreshold_otherLayers = cms.int32(4000),
    isRun2 = cms.bool(False)
)

process.hltSiPixelDigiErrorsSoA = cms.EDProducer("SiPixelDigiErrorsSoAFromCUDA",
    src = cms.InputTag("hltSiPixelClustersGPU")
)

process.hltSiPixelDigisFromSoA = cms.EDProducer("SiPixelDigiErrorsFromSoA",
    CablingMapLabel = cms.string(''),
    ErrorList = cms.vint32(29),
    UsePhase1 = cms.bool(True),
    UserErrorList = cms.vint32(40),
    digiErrorSoASrc = cms.InputTag("hltSiPixelDigiErrorsSoA")
)

process.hltSiPixelDigisLegacy = cms.EDProducer("SiPixelRawToDigi",
    CablingMapLabel = cms.string(''),
    ErrorList = cms.vint32(29),
    IncludeErrors = cms.bool(True),
    InputLabel = cms.InputTag("rawDataCollector"),
    Regions = cms.PSet(

    ),
    SiPixelQualityLabel = cms.string(''),
    UsePhase1 = cms.bool(True),
    UsePilotBlade = cms.bool(False),
    UseQualityInfo = cms.bool(False),
    UserErrorList = cms.vint32()
)

process.hltSiPixelDigis = SwitchProducerCUDA(
    cpu = cms.EDAlias(
        hltSiPixelDigisLegacy = cms.VPSet(
            cms.PSet(
                type = cms.string('DetIdedmEDCollection')
            ),
            cms.PSet(
                type = cms.string('SiPixelRawDataErroredmDetSetVector')
            ),
            cms.PSet(
                type = cms.string('PixelFEDChanneledmNewDetSetVector')
            )
        )
    ),
    cuda = cms.EDAlias(
        hltSiPixelDigisFromSoA = cms.VPSet(cms.PSet(
            type = cms.string('*')
        ))
    )
)
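
Per the config above, the two branches of the switch resolve to quite different producer chains: the cpu branch is the full legacy unpacker, while the cuda branch gets its FED errors from the CUDA raw-to-cluster kernel via two conversion steps. A small sketch of the mapping, read off the InputTag/src references above, just to make the comparison explicit:

# Producer chains behind each branch of hltSiPixelDigis, as read off the
# configuration above.
branches = {
    "cpu": ["hltSiPixelDigisLegacy (SiPixelRawToDigi)"],
    "cuda": [
        "hltSiPixelClustersGPU (SiPixelRawToClusterCUDA)",
        "hltSiPixelDigiErrorsSoA (SiPixelDigiErrorsSoAFromCUDA)",
        "hltSiPixelDigisFromSoA (SiPixelDigiErrorsFromSoA)",
    ],
}
for branch, chain in branches.items():
    print(branch, "->", " -> ".join(chain))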

@cms-sw/hlt-l2 @cms-sw/egamma-pog-l2 @cms-sw/trk-dpg-l2 @cms-sw/heterogeneous-l2

@cmsbuild

A new Issue was created by @silviodonato Silvio Donato.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@AdrianoDee

Note: the release is 13_0_5_patch1

@makortel

assign hlt, reconstruction, heterogeneous

@cmsbuild

New categories assigned: heterogeneous,hlt,reconstruction

@mandrenguyen,@missirol,@fwyzard,@clacaputo,@makortel,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

@VinInn

VinInn commented May 18, 2023

If one removes process.hltSiPixelDigis.cuda, how is the rest of the pixel workflow supposed to run on GPU?

@silviodonato

Yes, indeed. What I meant is that you get different results even when running everything on CPU except process.hltSiPixelDigis on GPU, compared to running everything on CPU.

@VinInn

VinInn commented May 20, 2023 via email

@swagata87

Hello Silvio,
Thank you for reporting this.
While electron HLT paths are affected, it looks like the real issue is somewhere upstream.
Let me tag the EGM HLT convenors @RSalvatico and @ravindkv and the EGM Reco convenors @sameasy and @Prasant1993, who can follow up on it.

@silviodonato

silviodonato commented Jun 1, 2023

Hello,
running the pixel local sequence on CPU and on GPU, I noticed that DetIdedmEDCollection_hltSiPixelDigis__HLTX. (edm::EDCollection<DetId> "hltSiPixelDigis") differs.
(CPU)

root [1] Events->Scan("DetIdedmEDCollection_hltSiPixelDigis__HLTX.obj.size()")
************************
*    Row   * DetIdedmE *
************************
*        0 *         0 *
************************

vs
(GPU)

root [1] Events->Scan("DetIdedmEDCollection_hltSiPixelDigis__HLTX.obj.size()")
************************
*    Row   * DetIdedmE *
************************
*        0 *         2 *
************************

Not sure if this can be the cause of the large difference that appears in hltEgammaElectronPixelSeeds:

the hltEgammaElectronPixelSeeds collection size (obj@.size() in the same kind of scan) is 4 on CPU and 36 on GPU.
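
To see which modules are being flagged, and not just how many, one can dump the DetIds with FWLite; a sketch, where outputGPU.root is a hypothetical name for the job output and the process name HLTX matches the branch names in the scans above:

# List the raw DetIds in the hltSiPixelDigis error collection per event.
from DataFormats.FWLite import Events, Handle

events = Events("outputGPU.root")
handle = Handle("edm::EDCollection<DetId>")
label = ("hltSiPixelDigis", "", "HLTX")

for i, event in enumerate(events):
    event.getByLabel(label, handle)
    ids = handle.product()
    print(i, [ids[j].rawId() for j in range(ids.size())])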

@silviodonato

ping @cms-sw/trk-dpg-l2

@VinInn

VinInn commented Jun 6, 2023

UserErrorList = cms.vint32(40)
vs
UserErrorList = cms.vint32()
?

But those are "user" errors, so they should not affect reco
(still, better to make the two configs consistent...).

@AdrianoDee

AdrianoDee commented Jun 7, 2023

I've taken an event where we have differences (the 290th in the file provided). There is no difference in the local reco objects (clusters, digis and hits). The problem seems to be that on GPU we get extra SiPixelDigiErrors. These are stored in hltSiPixelDigis and then passed to process.hltSiStripClusters = cms.EDProducer("MeasurementTrackerEventProducer" ..., where they are used to flag the "bad detectors". This then messes with the hltEgammaElectronPixelSeeds producer because the trajectories are different (the trajectory builder checks whether a module is active, and a bad one is set to be inactive, if I understood correctly).

On the GPU side, adding in RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu something like:

diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 293d4422e84..ab06086d150 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -168,11 +169,11 @@ namespace pixelgpudetails {
   template <bool debug = false>
   __device__ uint8_t
   checkROC(uint32_t errorWord, uint8_t fedId, uint32_t link, const SiPixelROCsStatusAndMapping *cablingMap) {
     uint8_t errorType = (errorWord >> sipixelconstants::ROC_shift) & sipixelconstants::ERROR_mask;
     if (errorType < 25)
       return 0;
+    printf("errorType;%d;%d;%d;%d;%d;%d\n",errorType,errorWord,fedId+1200,link,sipixelconstants::ROC_shift,sipixelconstants::ERROR_mask);
     bool errorFound = false;

and on the CPU side, adding in EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc something like:

diff --git a/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc b/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc
index 9bde98bef92..10cd0154d6d 100644
--- a/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc
+++ b/EventFilter/SiPixelRawToDigi/src/ErrorChecker.cc
@@ -27,8 +28,11 @@ bool ErrorChecker::checkROC(bool& errorsInEvent,
                             Word32& errorWord,
                             SiPixelFormatterErrors& errors) const {
   int errorType = (errorWord >> ROC_shift) & ERROR_mask;
+
   if LIKELY (errorType < 25)
     return true;
+  
+  printf("errorType;%d;%d;%d;%d;%d;%d\n",errorType,errorWord,fedId,(errorWord >> LINK_shift) & LINK_mask,ROC_shift,ERROR_mask);

the GPU output has a couple of extra lines:

3358a3359
> errorType;29;1201668113;1208;17
3359a3361
> errorType;29;329252868;1227;4

Why this is happening I still don't know. But the difference is there, and in this event it causes module 303083540 to be inactive when running on GPU, while it is active when running on CPU.
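
For reference, those two extra words can be decoded with the same shifts and masks used in the printouts above; the following sketch reproduces the errorType and link columns of the GPU output (shift/mask values as in sipixelconstants):

# Decode the two extra GPU error words: both come out as errorType 29
# (timeout), on links 17 and 4, matching the printed lines.
LINK_shift, LINK_mask = 26, 0x3F
ROC_shift, ERROR_mask = 21, 0x1F

for word in (1201668113, 329252868):
    error_type = (word >> ROC_shift) & ERROR_mask
    link = (word >> LINK_shift) & LINK_mask
    print(f"word={word:#010x}  errorType={error_type}  link={link}")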

@mmusich

mmusich commented Jun 8, 2023

This change alone (which, if I am not mistaken, would put the FED error zoology treatment on par between GPU and CPU)

diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 293d4422e84..2f728fe024e 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -189,13 +189,13 @@ namespace pixelgpudetails {
       case (26): {
         if constexpr (debug)
           printf("Gap word found (errorType = 26)\n");
-        errorFound = true;
+        errorFound = false;
         break;
       }
       case (27): {
         if constexpr (debug)
           printf("Dummy word found (errorType = 27)\n");
-        errorFound = true;
+        errorFound = false;
         break;
       }
       case (28): {
@@ -208,8 +208,10 @@ namespace pixelgpudetails {
         if constexpr (debug)
           printf("Timeout on a channel (errorType = 29)\n");
         if ((errorWord >> sipixelconstants::OMIT_ERR_shift) & sipixelconstants::OMIT_ERR_mask) {
+          errorFound = false;
           if constexpr (debug)
             printf("...first errorType=29 error, this gets masked out\n");
+          break;
         }
         errorFound = true;
         break;

only brings the GPU count down by 1 (from 82 to 81 passing events).
Incidentally, I realized we're not monitoring the CPU vs GPU differences for the SiPixelDigiErrors in the DQM (which we should, as demonstrated here).
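
For the error types visible in the diff above, the patched decision logic boils down to the following sketch (only types 26, 27 and 29 are shown; the other types keep their existing handling):

# Mirror of the patched GPU decision for the error types shown above: gap (26)
# and dummy (27) words are no longer flagged, and a timeout (29) with the
# OMIT_ERR bit set is masked out, as in the legacy CPU unpacker.
def error_flagged(error_type: int, omit_err_bit_set: bool) -> bool:
    if error_type in (26, 27):
        return False
    if error_type == 29:
        return not omit_err_bit_set
    return True  # simplified: other error types are handled elsewhere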

@gparida

gparida commented Jun 16, 2023

Hi everyone, following up on the comments and using this commit from Marco, I reran the comparison to check the differences between CPU and GPU for the DoubleEle* paths. They are back to the 2-3% level seen before.

For example:
Before implementing the commit (GPU was accepting 20% more events):

1135985 298 360 64 -2 66 22.15 HLT_DoubleEle6p5_eta1p22_mMax6_v3

Now with the change:

1224276 708 702 5 -11 16 2.26 HLT_DoubleEle6p5_eta1p22_mMax6_v4

Similarly for the other DoubleEle* paths.

The details are in this spreadsheet.

@mmusich

mmusich commented Jun 20, 2023

@gparida can you please check the proposed fix at #42010 as well?

@gparida

gparida commented Jun 22, 2023

Hi everyone,
I checked PR #42010 on the same data and release. With the PR, the differences are reduced by 8-9 percentage points.

Before:

1135985 154 190 36 0 36 23.38 HLT_DoubleEle9p5_eta1p22_mMax6_v3

Now with the PR:

915722 221 246 28 -3 31 14.0271 HLT_DoubleEle9p5_eta1p22_mMax6_v4

Similarly for the other DoubleEle* paths. Details are in this spreadsheet.

@mmusich

mmusich commented Oct 10, 2023

For the record, there were two additional fixes:

@mmusich

mmusich commented Oct 14, 2023

PR #42977 was included in CMSSW_13_2_6_patch1, which went online on October 13th, 2023 for HI collision runs starting from run 375083.
In the subsequent fill no. 9254, no further CPU vs GPU pixel FED error mismatches were observed in the online DQM; see e.g. run 375110.

For more information about the effect on the offline pp menu, see #42978 (comment).
