Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUDA implementation of RecoLocalTracker/SiStripCluster ClustersFromRawProducer #34618

Merged
merged 11 commits into from
Mar 27, 2023

Conversation

dan131riley
Copy link

@dan131riley dan131riley commented Jul 25, 2021

PR description:

Elements are:

Structure of Arrays (SoA) version of SiStripClusters. The number of array elements for the cluster ADCs is is configurable at runtime via the PSET:

DataFormats/SiStripCluster/interface/SiStripClustersSOABase.h
DataFormats/SiStripCluster/interface/SiStripClustersSOA.h
DataFormats/SiStripCluster/src/SiStripClustersSOA.cc
CUDADataFormats/SiStripCluster/interface/SiStripClustersCUDA.h
CUDADataFormats/SiStripCluster/src/SiStripClustersCUDA.cc

Some associated typedefs:

DataFormats/SiStripCluster/interface/SiStripTypes.h

SoA version of SiStripClusterizerConditions plus an auxiliary class to map FedID & channel to DetID & APV pair:

CalibFormats/SiStripObjects/interface/SiStripClusterizerConditionsGPU.h
CalibFormats/SiStripObjects/src/SiStripClusterizerConditionsGPU.cc

Event setup dependency records:

RecoLocalTracker/Records/interface/SiStripClusterizerConditionsGPURcd.h
RecoLocalTracker/Records/src/SiStripClusterizerConditionsGPURcd.cc

Data structure for mapping raw data into DetID/strip ordering:

RecoLocalTracker/SiStripClusterizer/plugins/ChannelLocsGPU.h
RecoLocalTracker/SiStripClusterizer/plugins/ChannelLocsGPU.cc

Producer that consumes raw data and produces SiStripClustersCUDA:

RecoLocalTracker/SiStripClusterizer/plugins/ClustersFromRawProducerGPU.cc

Top level algorithm implementation that sequences kernels:

RecoLocalTracker/SiStripClusterizer/plugins/SiStripRawToClusterGPUKernel.h
RecoLocalTracker/SiStripClusterizer/plugins/SiStripRawToClusterGPUKernel.cc

CUDA implementation of the ThreeThresholdAlgorithm. Includes kernels that identify candidate seed strips, finds the left and right boundaries of cluster candidates, and checks the cluster cuts and calculates the charge and barycenter, and CUDA kernel to reorganize raw data on the GPU into DetID/strip order:

RecoLocalTracker/SiStripClusterizer/plugins/SiStripRawToClusterGPUKernel.cu

Producer that copies SiStripClustersSOA from device to host, and producer that consumes SiStripClustersCUDAHost and produces edmNew::DetSetVector:

RecoLocalTracker/SiStripClusterizer/plugins/SiStripClustersSOAtoHost.cc
RecoLocalTracker/SiStripClusterizer/plugins/SiStripClustersFromSOA.cc

ESProducer for SiStripClusterizerConditionsGPU:

RecoLocalTracker/SiStripClusterizer/plugins/SiStripClusterizerConditionsGPUESProducer.cc

Customize function to replace the RECO siStripClusters from DIGIs producer with the siStripClusters from RawData producer. Intended to only be used for validation:

RecoLocalTracker/SiStripClusterizer/python/customizeStripClustersFromRaw.py

Modified:

Config for siStripClusters from RawData producer modified to use the gpu process modifier and the SwitchProducerCUDA to switch between SiStripClusterizerFromRaw and SiStripClusterizerFromRawGPU. As far as I know this config is not used in production:

RecoLocalTracker/SiStripClusterizer/python/SiStripClusterizerOnDemand_cfi.py

ThreeThresholdAlgorithm and associated factory modified to take a cluster size limit cut to match the limit in the SOA/CUDA version:

RecoLocalTracker/SiStripClusterizer/interface/ThreeThresholdAlgorithm.h
RecoLocalTracker/SiStripClusterizer/src/ThreeThresholdAlgorithm.cc

Various BuildFile.xml and event setup registration.

PR validation:

Validation plots from TTBar with pu flat55-75 comparing the standard clusterizer with the CPU and GPU versions discarding clusters larger than the max (16 strips) can be found here:

https://www.classe.cornell.edu/~dsr/mic-track/validation/PR34618/

CPU and GPU versions are close enough to identical that the differences could be ascribed to platform and implementation differences. For this data sample, there is no significant loss of efficiency and a 10% reduction in fakes in the barrel.

Comparison of various cluster size cuts for the same data sample:

https://www.classe.cornell.edu/~dsr/mic-track/validation/PR34618-cluster-size/

and similarly for a QCD Pt1800-2400 with pu flat55-75:

https://www.classe.cornell.edu/~dsr/mic-track/validation/PR34618-QCD-HighPt/

This sample does show a loss of efficiency for cluster size cuts < 24, primarily in the jetCore iteration.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34618/24197

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @dan131riley (Dan Riley) for master.

It involves the following packages:

  • CUDADataFormats/SiStripCluster (****)
  • CalibFormats/SiStripObjects (alca)
  • RecoLocalTracker/Records (reconstruction)
  • RecoLocalTracker/SiStripClusterizer (reconstruction)

The following packages do not have a category, yet:

CUDADataFormats/SiStripCluster
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@perrotta, @malbouis, @yuanchao, @tlampen, @cmsbuild, @slava77, @jpata, @pohsun, @francescobrivio, @tvami can you please review it and eventually sign? Thanks.
@echabert, @pieterdavid, @gbenelli, @makortel, @robervalwalsh, @yduhm, @GiacomoSguazzoni, @JanFSchulte, @tocheng, @VinInn, @alesaggio, @rovere, @felicepantaleo, @mtosi, @gpetruc, @mmusich, @threus this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy, @perrotta you are the release manager for this.

cms-bot commands are listed here

@perrotta
Copy link
Contributor

assign heterogeneous

@cmsbuild
Copy link
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@perrotta
Copy link
Contributor

@dan131riley as far as I can read from the description, this PR is still a work in progress, as there are things to be understood and bugs to be fixed to allow it run in step4.
Please let us know when it can be considered ready to be tested

@perrotta
Copy link
Contributor

enable gpu

@mmusich
Copy link
Contributor

mmusich commented Jul 26, 2021

@dan131riley please update the PR title with something more descriptive

@mmusich
Copy link
Contributor

mmusich commented Jul 26, 2021

please test

@cmsbuild
Copy link
Contributor

-1

Failed Tests: ClangBuild
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0a267a/17196/summary.html
COMMIT: 320209c
CMSSW: CMSSW_12_0_X_2021-07-25-2300/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/34618/17196/install.sh to create a dev area with all the needed externals and cmssw changes.

Clang Build

I found compilation warning while trying to compile with clang. Command used:

USER_CUDA_FLAGS='--expt-relaxed-constexpr' USER_CXXFLAGS='-Wno-register -fsyntax-only' scram build -k -j 32 COMPILER='llvm compile'

See details on the summary page.

@slava77
Copy link
Contributor

slava77 commented Jul 26, 2021

assign @cms-sw/trk-dpg-l2

Thank you.

@tvami
Copy link
Contributor

tvami commented Jul 26, 2021

@dan131riley please update the PR title with something more descriptive

Seconding this + @dan131riley would you please clean the commit history with a git squash? "clean ups", "formatting" and "rebase" come up quite frequently.

@mmusich
Copy link
Contributor

mmusich commented Jul 27, 2021

assign @cms-sw/trk-dpg-l2

just to adjust the expectations, since this is declared for 12.1.X we (@pieterdavid @robervalwalsh myself et al) will go through it in detail and try to post a coherent review in the coming days.

@dan131riley dan131riley changed the title Gpu sistripclusterizer CUDA implementation of RecoLocalTracker/SiStripCluster ClustersFromRawProducer Jul 27, 2021
@dan131riley dan131riley force-pushed the gpu-sistripclusterizer branch from 320209c to fbe2b59 Compare July 29, 2021 13:56
@fwyzard
Copy link
Contributor

fwyzard commented Mar 20, 2023

please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0a267a/31446/summary.html
COMMIT: 44b2092
CMSSW: CMSSW_13_1_X_2023-03-19-2300/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/34618/31446/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 20 lines from the logs
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3550882
  • DQMHistoTests: Total failures: 4
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3550856
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 213 log files, 164 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19862
  • DQMHistoTests: Total failures: 10
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19852
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

@tvami
Copy link
Contributor

tvami commented Mar 20, 2023

+alca

  • resign

@clacaputo
Copy link
Contributor

+reconstruction

  • resign

@fwyzard
Copy link
Contributor

fwyzard commented Mar 27, 2023

+heterogeneous

I did not go through the changes again after the last rebase and fix - I assume no other changes were introduced other than adapting to the new CUDAService interface.

@perrotta
Copy link
Contributor

ping bot

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 1805fa8 into cms-sw:master Mar 27, 2023
@dan131riley
Copy link
Author

I did not go through the changes again after the last rebase and fix - I assume no other changes were introduced other than adapting to the new CUDAService interface.

Correct, the only change was removing the unnecessary CUDAService include.

@dan131riley
Copy link
Author

The trivial comments were addressed, and of the general comments, I did greatly reduce calls to cudaMemcpyAsync(), so I believe there's now only 1 call outside of the debugging code. Adapting for PortableHostCollection and other preparations for migration to Alpaka is next to be addressed now that this is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.