Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ECAL Phase 2 weights method amplitude reconstruction on GPU #37695

Merged
merged 23 commits into from
Aug 4, 2022

Conversation

ChrisSandever
Copy link
Contributor

PR description:

This PR contains two extra modules in RecoLocalCalo/EcalRecProducers for the ECAL reconstruction on GPU for the Phase 2 upgrade. It modifies 2 other modules with no effect on phase 1 operation.
The new modules are producers, the first of which is a data format converter, which converts phase 2 digis to SoA digis copied onto the GPU device. The other modules is a producer that produces UncalibratedRecHits by reconstructing the amplitudes using the weights method of reconstruction. It is Phase 2 compatible and does the reconstruction on a GPU.
There is an existing module that produces UncalibratedRecHits with the same method for Phase 2 executing the reconstruction on CPU. The UncalibratedRecHits from both modules are virtually identical. Furthermore, the CUDADataFormats/EcalRecHitSoA/interface/EcalUncalibratedRecHit.h data format has been modified to include the AmplitudeError variable to be compatible with the CPU data format.
A switch producer has been added to the phase 2 configuration to be able to run the CPU and GPU algorithms.

PR validation:

The PR passed scram b -tests and runTheMatrix.py -l limited -i all --ibeos.
The configuration was tested with ECAL Phase 2 WF 28234.61.
By adding the gpu modifier to the reconstruction step of WF 28234.61 the GPU branch of the switch producer has been tested.
The UncalibratedRecHits are equivalent for all variables between the CPU and GPU reconstructions.
The timings of the CPU and GPU algorithms have been compared and are similar.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37695/29516

  • This PR adds an extra 32KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @ChrisSandever (Christopher Rhys Sandever) for master.

It involves the following packages:

  • CUDADataFormats/EcalRecHitSoA (heterogeneous, reconstruction)
  • RecoLocalCalo/EcalRecProducers (reconstruction)

@makortel, @slava77, @clacaputo, @cmsbuild, @fwyzard, @jpata can you please review it and eventually sign? Thanks.
@rchatter, @rovere, @argiro, @apsallid, @thomreis, @simonepigazzini this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@jpata
Copy link
Contributor

jpata commented Apr 27, 2022

type ecal

@cmsbuild cmsbuild added the ecal label Apr 27, 2022
@jpata
Copy link
Contributor

jpata commented Apr 27, 2022

enable gpu

@jpata
Copy link
Contributor

jpata commented Apr 27, 2022

@cmsbuild please test

@jpata
Copy link
Contributor

jpata commented Apr 27, 2022

We have some GPU workflows for Run3 in the matrix, are you planning to perhaps include also one for Phase2, so in the future this can be tested with the bot / in the IB-s?

@ChrisSandever
Copy link
Contributor Author

We also have ECAL Phase 2 development workflows (.61 suffix). We are currently considering to enable the gpu modifier for those. The other option would be to have dedicated WFs like for run3.

@cmsbuild
Copy link
Contributor

-1

Failed Tests: RelVals-GPU
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-19274f/24264/summary.html
COMMIT: e4e0aea
CMSSW: CMSSW_12_4_X_2022-04-27-1100/slc7_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37695/24264/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-GPU

  • 11634.52211634.522_TTbar_14TeV+2021_Patatrack_HCALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_HCALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log
  • 11634.50611634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log
  • 11634.51211634.512_TTbar_14TeV+2021_Patatrack_ECALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_ECALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log

Comparison Summary

There are some workflows for which there are errors in the baseline:
1001.2 step 2
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals.The error does NOT come from this pull request

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3695434
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3695404
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 205 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@fwyzard
Copy link
Contributor

fwyzard commented Apr 28, 2022

The changes seem to break the HLT running on GPU:

----- Begin Fatal Exception 27-Apr-2022 14:44:24 UTC-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'MC_PFBTagDeepJet_v1'
   [2] Prefetching for module CaloTowersCreator/'hltTowerMakerForAll'
   [3] Prefetching for module EcalRecHitProducer/'hltEcalRecHit@cuda'
   [4] Prefetching for module EcalUncalibRecHitConvertGPU2CPUFormat/'hltEcalUncalibRecHit@cuda'
   [5] Calling method for module EcalCPUUncalibRecHitProducer/'hltEcalUncalibRecHitSoA'
Exception Message:
A std::exception was thrown.

/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_12_4_X_2022-04-27-1100/src/RecoLocalCalo/EcalRecProducers/plugins/EcalCPUUncalibRecHitProducer.cc, line 84:
cudaCheck(cudaMemcpyAsync(dest.data(), src, dest.size() * sizeof(type), cudaMemcpyDeviceToHost, ctx.stream()));
cudaErrorInvalidValue: invalid argument
----- End Fatal Exception -------------------------------------------------

Can you look into it ?

@ChrisSandever
Copy link
Contributor Author

ChrisSandever commented Apr 28, 2022

The reason for the error is the inclusion of the amplitudeError variable in the converters. We have multiple solutions to this at the moment, one is to include the amplitudeError in the phase 1 GPU multifit module where it should be initialized. The other option is to use the boolean introduced in the converters for turning off EE uncalib rechit production to switch between phase 1 and phase 2 behavior to ignore the amplitudeError if in a Phase 1 WF. Both options have been tested and pass these WFs, we are currently timing them.

@thomreis
Copy link
Contributor

@fwyzard is the reason that this ECAL changes also break the PixelOnly and HcalOnly WFs because at the HLT all GPU reconstructions are running and the Hcal/PixelOnly in the WF name just corresponds to the reco step?

@jpata
Copy link
Contributor

jpata commented Jul 7, 2022

@cmsbuild please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 7, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-19274f/26038/summary.html
COMMIT: 0820f19
CMSSW: CMSSW_12_5_X_2022-07-06-2300/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37695/26038/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • Reco comparison had 3 failed jobs
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19869
  • DQMHistoTests: Total failures: 9
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19860
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3654771
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3654741
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@fwyzard
Copy link
Contributor

fwyzard commented Jul 13, 2022

+heterogeneous

@fwyzard
Copy link
Contributor

fwyzard commented Jul 13, 2022

@ChrisSandever did you open an issue to keep track of the pending requests ?

@ChrisSandever
Copy link
Contributor Author

Yes, it can be found here: #38619 (comment)

@jpata
Copy link
Contributor

jpata commented Aug 1, 2022

@cmsbuild please test

refreshing the tests before signing (I was off)

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 1, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-19274f/26560/summary.html
COMMIT: 0820f19
CMSSW: CMSSW_12_5_X_2022-08-01-1100/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37695/26560/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19876
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19868
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3669004
  • DQMHistoTests: Total failures: 14
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3668968
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 210 log files, 47 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

Comment on lines +39 to +40
isPhase2 = cms.bool(True),
recHitsLabelGPUEB = cms.InputTag('ecalUncalibRecHitSoA', 'EcalUncalibRecHitsEB'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just note here that

Suggested change
isPhase2 = cms.bool(True),
recHitsLabelGPUEB = cms.InputTag('ecalUncalibRecHitSoA', 'EcalUncalibRecHitsEB'),
isPhase2 = True,
recHitsLabelGPUEB = ('ecalUncalibRecHitSoA', 'EcalUncalibRecHitsEB'),

would be preferred, but this can be done as a quick follow-up too.

@jpata
Copy link
Contributor

jpata commented Aug 2, 2022

+reconstruction

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 2, 2022

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Copy link
Contributor

qliphy commented Aug 4, 2022

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants