CUDA implementation of RecoLocalTracker/SiStripCluster ClustersFromRawProducer #34618
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34618/24197
A new Pull Request was created by @dan131riley (Dan Riley) for master. It involves the following packages:
The following packages do not have a category, yet: CUDADataFormats/SiStripCluster
@perrotta, @malbouis, @yuanchao, @tlampen, @cmsbuild, @slava77, @jpata, @pohsun, @francescobrivio, @tvami can you please review it and eventually sign? Thanks. cms-bot commands are listed here
assign heterogeneous
@dan131riley as far as I can read from the description, this PR is still a work in progress, as there are things to be understood and bugs to be fixed to allow it to run in step4.
enable gpu
@dan131riley please update the PR title with something more descriptive
please test
-1 Failed Tests: ClangBuild
Clang Build: I found compilation warnings while trying to compile with clang. Command used:
See details on the summary page.
assign @cms-sw/trk-dpg-l2 Thank you.
Seconding this + @dan131riley would you please clean the commit history with a git squash? "clean ups", "formatting" and "rebase" come up quite frequently.
just to adjust the expectations, since this is declared for 12.1.X we (@pieterdavid @robervalwalsh myself et al) will go through it in detail and try to post a coherent review in the coming days.
Branch force-pushed from 320209c to fbe2b59.
please test
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0a267a/31446/summary.html Comparison Summary:
GPU Comparison Summary:
+alca
+reconstruction
+heterogeneous I did not go through the changes again after the last rebase and fix; I assume no other changes were introduced other than adapting to the new
ping bot
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)
+1
Correct, the only change was removing the unnecessary CUDAService include.
The trivial comments were addressed. Of the general comments, I greatly reduced the calls to cudaMemcpyAsync(), so I believe there is now only one call outside of the debugging code. Adapting to PortableHostCollection and other preparations for the migration to Alpaka are next, now that this is merged.
PR description:
Elements are:
Structure of Arrays (SoA) version of SiStripClusters. The number of array elements for the cluster ADCs is configurable at runtime via the PSET:
DataFormats/SiStripCluster/interface/SiStripClustersSOABase.h
DataFormats/SiStripCluster/interface/SiStripClustersSOA.h
DataFormats/SiStripCluster/src/SiStripClustersSOA.cc
CUDADataFormats/SiStripCluster/interface/SiStripClustersCUDA.h
CUDADataFormats/SiStripCluster/src/SiStripClustersCUDA.cc
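To illustrate why the number of ADC elements per cluster matters for the SoA layout, here is a minimal host-side sketch. It assumes a flat ADC array whose per-cluster stride is the maximum cluster size chosen at construction time; the struct and its field names are hypothetical and are not the ones in SiStripClustersSOA.h or SiStripClustersCUDA.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side SoA for strip clusters (names are illustrative only).
// Each per-cluster quantity is its own array; the ADCs of all clusters share
// one flat array whose stride is the runtime-configured maximum cluster size.
struct StripClustersSoASketch {
  StripClustersSoASketch(uint32_t maxClusters, uint32_t maxClusterSize)
      : maxSize(maxClusterSize),
        detId(maxClusters),
        firstStrip(maxClusters),
        size(maxClusters),
        charge(maxClusters),
        barycenter(maxClusters),
        adc(static_cast<std::size_t>(maxClusters) * maxClusterSize) {}

  // ADC of strip j (0-based within the cluster) of cluster i.
  uint8_t& adcAt(uint32_t i, uint32_t j) {
    assert(j < maxSize);
    return adc[static_cast<std::size_t>(i) * maxSize + j];
  }

  uint32_t maxSize;                  // max strips per cluster, fixed at construction
  std::vector<uint32_t> detId;       // module DetID of each cluster
  std::vector<uint16_t> firstStrip;  // first strip of each cluster
  std::vector<uint32_t> size;        // number of strips actually used
  std::vector<float> charge;         // summed ADC
  std::vector<float> barycenter;     // charge-weighted position
  std::vector<uint8_t> adc;          // maxClusters * maxSize entries
};
```

A fixed stride keeps every cluster's ADCs at a computable offset, which suits one-thread-per-cluster access on a GPU, at the cost of a hard cap on cluster size; the PR matches that cap with a cluster size limit in the CPU ThreeThresholdAlgorithm.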
Some associated typedefs:
DataFormats/SiStripCluster/interface/SiStripTypes.h
SoA version of SiStripClusterizerConditions plus an auxiliary class to map FedID & channel to DetID & APV pair:
CalibFormats/SiStripObjects/interface/SiStripClusterizerConditionsGPU.h
CalibFormats/SiStripObjects/src/SiStripClusterizerConditionsGPU.cc
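A FedID & channel to DetID & APV pair map can be laid out as a flat table, so that each GPU thread resolves its channel with a single array read. The minimal sketch below is illustrative only: the class name, the 96-channels-per-FED constant, passing the FED-ID range in explicitly, and using DetID 0 for unconnected channels are all assumptions, not the contents of SiStripClusterizerConditionsGPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 96 readout channels per strip FED.
constexpr uint32_t kChannelsPerFed = 96;

struct DetApv {
  uint32_t detId = 0;    // 0 marks an unconnected channel (assumed convention)
  uint16_t apvPair = 0;  // APV pair index within the module
};

// Hypothetical flat lookup table indexed by (fedId - minFedId) * 96 + fedCh.
class FedChannelMapSketch {
public:
  FedChannelMapSketch(uint16_t minFedId, uint16_t maxFedId)
      : minFedId_(minFedId),
        table_(static_cast<std::size_t>(maxFedId - minFedId + 1) * kChannelsPerFed) {}

  void set(uint16_t fedId, uint16_t fedCh, DetApv value) { table_[index(fedId, fedCh)] = value; }
  const DetApv& get(uint16_t fedId, uint16_t fedCh) const { return table_[index(fedId, fedCh)]; }

private:
  std::size_t index(uint16_t fedId, uint16_t fedCh) const {
    assert(fedId >= minFedId_ && fedCh < kChannelsPerFed);
    const std::size_t i = static_cast<std::size_t>(fedId - minFedId_) * kChannelsPerFed + fedCh;
    assert(i < table_.size());
    return i;
  }

  uint16_t minFedId_;
  std::vector<DetApv> table_;
};
```

The dense table wastes a few entries for unconnected channels but avoids any search on the device.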
Event setup dependency records:
RecoLocalTracker/Records/interface/SiStripClusterizerConditionsGPURcd.h
RecoLocalTracker/Records/src/SiStripClusterizerConditionsGPURcd.cc
Data structure for mapping raw data into DetID/strip ordering:
RecoLocalTracker/SiStripClusterizer/plugins/ChannelLocsGPU.h
RecoLocalTracker/SiStripClusterizer/plugins/ChannelLocsGPU.cc
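One way to map raw data into DetID/strip ordering is to keep, per FED channel, its location in the raw buffer and its length, and derive each channel's output offset with an exclusive prefix sum, an operation that also parallelizes well on a GPU. The sketch below assumes the channels are already sorted in DetID/APV order; ChannelLocsSketch and computeOutputOffsets are hypothetical names, not the ChannelLocsGPU interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical per-channel bookkeeping: where a channel's data sits in the raw
// buffer and where its strips land in the DetID/strip-ordered output.
struct ChannelLocsSketch {
  std::vector<std::size_t> inputOffset;   // offset into the raw FED buffer
  std::vector<uint16_t> length;           // number of strips in this channel
  std::vector<std::size_t> outputOffset;  // filled by computeOutputOffsets()
};

// Assumes channels are sorted in output order. Returns the total strip count.
std::size_t computeOutputOffsets(ChannelLocsSketch& locs) {
  assert(locs.inputOffset.size() == locs.length.size());
  locs.outputOffset.resize(locs.length.size());
  // Exclusive scan: each channel starts where the previous ones end.
  std::exclusive_scan(locs.length.begin(), locs.length.end(),
                      locs.outputOffset.begin(), std::size_t{0});
  return locs.length.empty() ? 0 : locs.outputOffset.back() + locs.length.back();
}
```

For lengths {3, 0, 5} this gives output offsets {0, 3, 3} and a total of 8 strips; a copy kernel can then move each channel's data independently. Requires C++17 for std::exclusive_scan.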
Producer that consumes raw data and produces SiStripClustersCUDA:
RecoLocalTracker/SiStripClusterizer/plugins/ClustersFromRawProducerGPU.cc
Top level algorithm implementation that sequences kernels:
RecoLocalTracker/SiStripClusterizer/plugins/SiStripRawToClusterGPUKernel.h
RecoLocalTracker/SiStripClusterizer/plugins/SiStripRawToClusterGPUKernel.cc
CUDA implementation of the ThreeThresholdAlgorithm. Includes kernels that identify candidate seed strips, find the left and right boundaries of cluster candidates, check the cluster cuts, and calculate the charge and barycenter, plus a CUDA kernel that reorganizes the raw data on the GPU into DetID/strip order:
RecoLocalTracker/SiStripClusterizer/plugins/SiStripRawToClusterGPUKernel.cu
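The per-seed logic of these kernels can be emulated on the CPU. This is a simplified sketch, not the code in SiStripRawToClusterGPUKernel.cu: it assumes a single module, no holes or bad strips, no gain correction, and illustrative thresholds in units of strip noise (channel 2, seed 3, cluster 5) with a 16-strip size cap. Duplicate candidates from neighbouring seeds are removed by letting only the leftmost seed of each candidate emit it; that deduplication choice and all names are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed thresholds, in units of strip noise.
struct Thresholds {
  float channel = 2.0f;
  float seed = 3.0f;
  float cluster = 5.0f;
  uint32_t maxClusterSize = 16;
};

struct ClusterSketch {
  uint32_t first;    // first strip index
  uint32_t size;     // number of strips
  float charge;      // sum of ADCs
  float barycenter;  // charge-weighted mean strip index
};

// Loop body over s corresponds to "one GPU thread per seed candidate".
std::vector<ClusterSketch> threeThresholdSketch(const std::vector<float>& adc,
                                                const std::vector<float>& noise,
                                                const Thresholds& t) {
  assert(adc.size() == noise.size());
  const std::size_t n = adc.size();
  auto aboveChannel = [&](std::size_t i) { return adc[i] >= t.channel * noise[i]; };
  auto isSeed = [&](std::size_t i) { return adc[i] >= t.seed * noise[i]; };

  std::vector<ClusterSketch> out;
  for (std::size_t s = 0; s < n; ++s) {
    if (!isSeed(s)) continue;
    // Find left and right boundaries over strips above the channel threshold.
    std::size_t left = s, right = s;
    while (left > 0 && aboveChannel(left - 1)) --left;
    while (right + 1 < n && aboveChannel(right + 1)) ++right;
    // Only the leftmost seed inside [left, right] emits the candidate.
    bool leftmost = true;
    for (std::size_t k = left; k < s; ++k)
      if (isSeed(k)) { leftmost = false; break; }
    if (!leftmost) continue;

    float charge = 0.f, weighted = 0.f, noise2 = 0.f;
    for (std::size_t k = left; k <= right; ++k) {
      charge += adc[k];
      weighted += adc[k] * static_cast<float>(k);
      noise2 += noise[k] * noise[k];
    }
    const uint32_t size = static_cast<uint32_t>(right - left + 1);
    if (size > t.maxClusterSize) continue;                 // cluster size cut
    if (charge <= 0.f) continue;                           // guard against zero noise/ADC
    if (charge < t.cluster * std::sqrt(noise2)) continue;  // cluster threshold
    out.push_back({static_cast<uint32_t>(left), size, charge, weighted / charge});
  }
  return out;
}
```

With ADCs {0, 3, 10, 4, 0, 0, 8, 0} and unit noise, this yields two clusters: strips 1 to 3 (charge 17) and strip 6 (charge 8). Because each seed's work is independent, the loop maps naturally onto one thread per seed.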
Producer that copies SiStripClustersSOA from device to host, and a producer that consumes SiStripClustersCUDAHost and produces edmNew::DetSetVector:
RecoLocalTracker/SiStripClusterizer/plugins/SiStripClustersSOAtoHost.cc
RecoLocalTracker/SiStripClusterizer/plugins/SiStripClustersFromSOA.cc
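The conversion from the host SoA to a per-module container essentially groups the accepted clusters by DetID. A minimal sketch, assuming a host-side SoA with a per-cluster accept flag; it uses std::map as a stand-in and does not use the edmNew::DetSetVector API, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical host-side SoA, already copied back from the device.
struct HostClustersSketch {
  std::vector<uint32_t> detId;
  std::vector<uint16_t> firstStrip;
  std::vector<uint8_t> trueCluster;  // 0 = rejected by the cuts (assumed flag)
};

struct OutCluster {
  uint16_t firstStrip;
};

// Groups accepted clusters by DetID; std::map stands in for edmNew::DetSetVector.
std::map<uint32_t, std::vector<OutCluster>> groupByDetId(const HostClustersSketch& soa) {
  assert(soa.detId.size() == soa.firstStrip.size());
  assert(soa.detId.size() == soa.trueCluster.size());
  std::map<uint32_t, std::vector<OutCluster>> out;
  for (std::size_t i = 0; i < soa.detId.size(); ++i) {
    if (!soa.trueCluster[i]) continue;  // skip clusters that failed the cuts
    out[soa.detId[i]].push_back({soa.firstStrip[i]});
  }
  return out;
}
```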
ESProducer for SiStripClusterizerConditionsGPU:
RecoLocalTracker/SiStripClusterizer/plugins/SiStripClusterizerConditionsGPUESProducer.cc
Customize function that replaces the RECO siStripClusters-from-DIGIs producer with the siStripClusters-from-RawData producer. Intended only for validation:
RecoLocalTracker/SiStripClusterizer/python/customizeStripClustersFromRaw.py
Modified:
Config for siStripClusters from RawData producer modified to use the gpu process modifier and the SwitchProducerCUDA to switch between SiStripClusterizerFromRaw and SiStripClusterizerFromRawGPU. As far as I know this config is not used in production:
RecoLocalTracker/SiStripClusterizer/python/SiStripClusterizerOnDemand_cfi.py
ThreeThresholdAlgorithm and associated factory modified to take a cluster size limit cut to match the limit in the SOA/CUDA version:
RecoLocalTracker/SiStripClusterizer/interface/ThreeThresholdAlgorithm.h
RecoLocalTracker/SiStripClusterizer/src/ThreeThresholdAlgorithm.cc
Various BuildFile.xml and event setup registration.
PR validation:
Validation plots from TTBar with pu flat55-75, comparing the standard clusterizer with the CPU and GPU versions (which discard clusters larger than the maximum of 16 strips), can be found here:
https://www.classe.cornell.edu/~dsr/mic-track/validation/PR34618/
CPU and GPU versions are close enough to identical that the differences could be ascribed to platform and implementation differences. For this data sample, there is no significant loss of efficiency and a 10% reduction in fakes in the barrel.
Comparison of various cluster size cuts for the same data sample:
https://www.classe.cornell.edu/~dsr/mic-track/validation/PR34618-cluster-size/
and similarly for a QCD Pt1800-2400 with pu flat55-75:
https://www.classe.cornell.edu/~dsr/mic-track/validation/PR34618-QCD-HighPt/
This sample does show a loss of efficiency for cluster size cuts < 24, primarily in the jetCore iteration.