
Use 2D grid for inner loop parallelization #260

Merged (3 commits) on Jan 24, 2019

Conversation

VinInn commented Jan 23, 2019

Introduce the inner loop parallelization in the doublet finder using the stride pattern already used in the "fishbone", and make use of a 2D grid instead of a hand-made stride.

w.r.t. #242: here we keep the same number of threads per block as in the baseline
(this was already the case for the doublet finder)

Introduce the inner loop parallelization in the doublet finder using the
stride pattern already used in the "fishbone", and make use of a 2D grid
instead of a hand-made stride.
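
To make the two patterns concrete, here is a minimal, hypothetical sketch; `innerSize()`, `process()`, and the kernel names are placeholders, not the actual `getDoubletsFromHisto` or fishbone code. The hand-made stride carves the inner and outer indices out of a 1D thread index, while the 2D grid lets the launch geometry provide them directly:

```cuda
// Hypothetical sketch, not the actual CMSSW code: innerSize() and process()
// stand in for "number of inner hits for outer element i" and the per-pair work.
__device__ int innerSize(int i) { return 32; }   // placeholder extent
__device__ void process(int i, int j) {}         // placeholder work

// (a) hand-made stride: inner/outer indices derived from a 1D thread index
__global__ void kernelHandMadeStride(int nOuter, int stride) {
  int first = threadIdx.x % stride;                         // inner-loop offset
  int idx   = (blockIdx.x * blockDim.x + threadIdx.x) / stride;
  int step  = (gridDim.x * blockDim.x) / stride;
  for (int i = idx; i < nOuter; i += step)                  // outer elements
    for (int j = first; j < innerSize(i); j += stride)      // inner loop
      process(i, j);
}

// (b) 2D grid: x covers the inner loop, y the outer one; no index arithmetic
__global__ void kernelGrid2D(int nOuter) {
  int first = threadIdx.x;                                  // inner-loop offset
  int idx   = blockIdx.y * blockDim.y + threadIdx.y;
  int step  = gridDim.y * blockDim.y;
  for (int i = idx; i < nOuter; i += step)                  // outer elements
    for (int j = first; j < innerSize(i); j += blockDim.x)  // inner loop
      process(i, j);
}

// launched e.g. as:
// kernelGrid2D<<<dim3(1, blocks, 1), dim3(stride, threadsPerBlock, 1)>>>(nOuter);
```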
VinInn commented Jan 23, 2019

Performance

HEAD

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1418.2 ±   0.8 ev/s (4000 events)
  1415.4 ±   1.0 ev/s (4000 events)
  1417.7 ±   1.0 ev/s (4000 events)
  1411.6 ±   0.7 ev/s (4000 events)
 --------------------
  1415.7 ±   3.0 ev/s

#242

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1402.7 ±   0.9 ev/s (4000 events)
  1397.4 ±   0.9 ev/s (4000 events)
  1406.1 ±   0.8 ev/s (4000 events)
  1404.2 ±   1.1 ev/s (4000 events)
 --------------------
  1402.6 ±   3.7 ev/s

#260 - this PR

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1346.9 ±   0.9 ev/s (4000 events)
  1353.8 ±   0.8 ev/s (4000 events)
  1354.1 ±   0.9 ev/s (4000 events)
  1355.6 ±   0.9 ev/s (4000 events)
 --------------------
  1352.6 ±   3.9 ev/s

VinInn commented Jan 23, 2019

If we look at nvprof instead:
HEAD

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   14.14%  3.69037s     20000  184.52us  16.768us  813.53us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPU::VecArray<unsigned int, int=128>*)
                   13.10%  3.41894s    100000  34.189us     960ns  497.76us  kernelCircleFitAllHits(HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1> const *, int, double, double*, float*, double*, Rfit::circle_fit*, unsigned int)
                   10.12%  2.64223s     20000  132.11us  7.0080us  695.74us  kernel_connect(AtomicPairCounter*, AtomicPairCounter*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *)
                    9.96%  2.59988s     20000  129.99us  6.7840us  501.47us  kernel_find_ntuplets(GPUCACell*, unsigned int const *, HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1>*, AtomicPairCounter*, unsigned int)
                    8.00%  2.08762s     20000  104.38us  24.287us  878.68us  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, int)
                    7.95%  2.07452s     20000  103.73us  5.8560us  672.67us  gpuVertexFinder::clusterTracks(gpuVertexFinder::OnGPU*, int, float, float, float)
                    6.41%  1.67444s     20000  83.722us  4.8320us  356.80us  gpuPixelDoublets::fishbone(siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *, unsigned int, unsigned int, bool)
                    4.63%  1.20861s     20000  60.430us  1.3120us  271.10us  gpuVertexFinder::sortByPt2(gpuVertexFinder::OnGPU*)

#242

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   13.81%  3.65639s     20000  182.82us  11.008us  572.63us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPU::VecArray<unsigned int, int=128>*)
                   13.36%  3.53540s    100000  35.354us     960ns  557.75us  kernelCircleFitAllHits(HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1> const *, int, double, double*, float*, double*,Rfit::circle_fit*, unsigned int)
                   10.60%  2.80627s     20000  140.31us  7.0070us  735.80us  kernel_find_ntuplets(GPUCACell*, unsigned int const *, HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1>*, AtomicPairCounter*, unsigned int)
                    8.38%  2.21769s     20000  110.88us  5.2800us  902.20us  gpuVertexFinder::clusterTracks(gpuVertexFinder::OnGPU*, int, float, float, float)
                    8.06%  2.13352s     20000  106.68us  23.840us  1.3569ms  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, int)
                    6.84%  1.80937s     20000  90.468us  5.2160us  446.87us  gpuPixelDoublets::fishbone(siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *, unsigned int, bool)
                    6.76%  1.78889s     20000  89.444us  7.0400us  482.33us  kernel_connect(AtomicPairCounter*, AtomicPairCounter*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *)
                    4.62%  1.22299s     20000  61.149us  1.3440us  262.88us  gpuVertexFinder::sortByPt2(gpuVertexFinder::OnGPU*)

#260 - this PR

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   13.77%  3.65572s     20000  182.79us  10.944us  558.46us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPU::VecArray<unsigned int, int=128>*)
                   13.24%  3.51483s    100000  35.148us     928ns  581.88us  kernelCircleFitAllHits(HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1> const *, int, double, double*, float*, double*,Rfit::circle_fit*, unsigned int)
                   11.24%  2.98521s     20000  149.26us  24.416us  670.74us  kernel_find_ntuplets(GPUCACell*, unsigned int const *, HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1>*, AtomicPairCounter*, unsigned int)
                    8.54%  2.26732s     20000  113.37us  4.2880us  1.3603ms  gpuVertexFinder::clusterTracks(gpuVertexFinder::OnGPU*, int, float, float, float)
                    8.03%  2.13146s     20000  106.57us  22.720us  915.80us  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, int)
                    6.61%  1.75482s     20000  87.740us  5.2480us  441.50us  gpuPixelDoublets::fishbone(siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *, unsigned int, bool)
                    6.22%  1.65153s     20000  82.576us  24.351us  451.10us  kernel_connect(AtomicPairCounter*, AtomicPairCounter*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *)
                    4.55%  1.20717s     20000  60.358us  1.1200us  234.17us  gpuVertexFinder::sortByPt2(gpuVertexFinder::OnGPU*)

What is difficult to understand is:

  1. why the fishbone is faster with the hand-made stride, even with an identical block size;
  2. why the throughput is worse, despite kernel_connect being much faster.

fwyzard commented Jan 23, 2019

The total time spent in the kernels decreases (both with #242 and #260), so something else must be taking longer:

| branch | time    |
|--------|---------|
| HEAD   | 19.40 s |
| #242   | 19.17 s |
| #260   | 19.17 s |

For example, I do not know whether the time spent in creating and launching the blocks is accounted for, nor what happens to the overall GPU occupancy.
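
As a side note, one way to get a rough handle on the host-side launch cost, which the "GPU activities" table above does not include, is to time the launch call separately from the synchronization. A minimal sketch, with `someKernel` as a stand-in for any of the kernels above (this is not how the numbers in this thread were produced):

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void someKernel() {}  // stand-in for any of the kernels above

int main() {
  using clock = std::chrono::steady_clock;

  auto t0 = clock::now();
  someKernel<<<1024, 256>>>();   // returns as soon as the launch is queued
  auto t1 = clock::now();
  cudaDeviceSynchronize();       // waits for the actual execution
  auto t2 = clock::now();

  auto us = [](clock::time_point a, clock::time_point b) {
    return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
  };
  std::printf("launch overhead: %lld us, execution: %lld us\n",
              (long long)us(t0, t1), (long long)us(t1, t2));
  return 0;
}
```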

fwyzard commented Jan 23, 2019

Here is a summary of the comparison of the time spent in the kernels, with the modified ones in bold:

| HEAD    | #242    | #260    | kernel |
|---------|---------|---------|--------|
| 2.09 s  | 2.13 s  | 2.13 s  | gpuClustering::findClus |
| 1.67 s  | 1.81 s  | 1.75 s  | **gpuPixelDoublets::fishbone** |
| 3.69 s  | 3.66 s  | 3.66 s  | **gpuPixelDoublets::getDoubletsFromHisto** |
| 2.07 s  | 2.22 s  | 2.27 s  | gpuVertexFinder::clusterTracks |
| 1.21 s  | 1.22 s  | 1.21 s  | gpuVertexFinder::sortByPt2 |
| 2.64 s  | 1.79 s  | 1.65 s  | **kernel_connect** |
| 2.60 s  | 2.81 s  | 2.99 s  | kernel_find_ntuplets |
| 3.42 s  | 3.54 s  | 3.51 s  | kernelCircleFitAllHits |
| 19.40 s | 19.17 s | 19.17 s | total |

It seems that kernel_connect is significantly improved, while fishbone and getDoubletsFromHisto are slightly worse.

VinInn commented Jan 23, 2019

We need to run the regression on MC to verify that the MTV results are identical.
Will do.

fwyzard commented Jan 23, 2019

Here are the improvements I see if I apply only the changes to kernel_connect, and if I apply the whole PR #260:

On a pair of P100:

[plot: p100]

On a pair of V100:

[plot: v100]

VinInn commented Jan 23, 2019

MTV: http://innocent.home.cern.ch/innocent/RelVal/pixOnlyPU50_gpuPR260/plots_summary.html

|   | A | B | C |
|---|---|---|---|
| Efficiency | 0.7717 | 0.7717 | 0.7717 |
| Number of TrackingParticles (after cuts) | 55709 | 55709 | 55709 |
| Number of matched TrackingParticles | 42992 | 42992 | 42992 |
| Fake rate | 0.0289 | 0.0289 | 0.0289 |
| Duplicate rate | 0.0003 | 0.0003 | 0.0003 |
| Number of tracks | 574001 | 574001 | 574001 |
| Number of true tracks | 557413 | 557413 | 557416 |
| Number of fake tracks | 16588 | 16588 | 16585 |
| Number of pileup tracks | 494027 | 494027 | 494030 |
| Number of duplicate tracks | 164 | 164 | 164 |

A: HEAD
B: PR242
C: PR260

VinInn commented Jan 23, 2019

OK, with

--- a/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorKernels.cu
+++ b/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorKernels.cu
@@ -325,7 +325,7 @@ void CAHitQuadrupletGeneratorKernels::launchKernels( // here goes algoparms....
 void CAHitQuadrupletGeneratorKernels::buildDoublets(HitsOnCPU const & hh, cudaStream_t stream) {
   auto nhits = hh.nHits;

-  int stride=4;
+  int stride=2;
   int threadsPerBlock = gpuPixelDoublets::getDoubletsFromHistoMaxBlockSize/stride;
   int blocks = (3 * nhits + threadsPerBlock - 1) / threadsPerBlock;
   dim3 blks(1,blocks,1);
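
For reference, my reading of how that `stride` shapes the launch; the block shape `thrs` is an assumption based on the `blks` line in the hunk, not something shown in the diff:

```cuda
// Assumed companion to the hunk above: the block is `stride` wide in x
// (inner loop) and `threadsPerBlock` tall in y (outer loop), so the total
// threads per block stays at getDoubletsFromHistoMaxBlockSize whatever
// the stride is.
int stride = 2;
int threadsPerBlock = gpuPixelDoublets::getDoubletsFromHistoMaxBlockSize / stride;
int blocks = (3 * nhits + threadsPerBlock - 1) / threadsPerBlock;
dim3 blks(1, blocks, 1);
dim3 thrs(stride, threadsPerBlock, 1);
// gpuPixelDoublets::getDoubletsFromHisto<<<blks, thrs, 0, stream>>>(...);
```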

I get from this PR:

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1406.2 ±   0.8 ev/s (4000 events)
  1417.0 ±   1.0 ev/s (4000 events)
  1398.9 ±   0.9 ev/s (4000 events)
  1407.2 ±   0.7 ev/s (4000 events)
 --------------------
  1407.3 ±   7.4 ev/s

and

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   14.12%  3.89220s    100000  38.921us  1.0560us  546.27us  kernelCircleFitAllHits(HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1> const *, int, double, double*, float*, double*, Rfit::circle_fit*, unsigned int)
                   12.61%  3.47615s     20000  173.81us  15.423us  702.36us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPU::VecArray<unsigned int, int=128>*)
                   11.31%  3.11798s     20000  155.90us  27.328us  759.83us  kernel_find_ntuplets(GPUCACell*, unsigned int const *, HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1>*, AtomicPairCounter*, unsigned int)
                    9.35%  2.57776s     20000  128.89us  5.0240us  742.52us  gpuVertexFinder::clusterTracks(gpuVertexFinder::OnGPU*, int, float, float, float)
                    8.88%  2.44788s     20000  122.39us  27.487us  990.10us  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, int)
                    5.87%  1.61928s     20000  80.964us  5.8560us  410.43us  gpuPixelDoublets::fishbone(siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *, unsigned int, bool)
                    5.63%  1.55245s     20000  77.622us  26.944us  428.67us kernel_connect(AtomicPairCounter*, AtomicPairCounter*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *)
                    4.86%  1.34084s     20000  67.042us  1.1520us  265.02us  gpuVertexFinder::sortByPt2(gpuVertexFinder::OnGPU*)

So it seems that, at least for this workflow on a V100, 2D parallelization is not worthwhile for the doublet building.

with "stride=1" we reach "1420.1 ± 5.3 ev/s"

with stride=8" in fishbone is "1426.3 ± 4.8 ev/s"
at this point I think we are in overtuning regime

fwyzard commented Jan 23, 2019

OK... does #261 look good?

VinInn commented Jan 23, 2019

#261 "looks" good , maybe I should test it....

VinInn commented Jan 23, 2019

#261 gives me "1419.6 ± 3.7 ev/s", equivalent to this PR with stride=1 in the doublet finder.

VinInn commented Jan 23, 2019

This version should be equivalent to #261, with the added advantage that the 2D grid parameters can be modified for the fishbone and the doublet finder as well.

VinInn commented Jan 23, 2019

For completeness, the nvprof report for this version:

 GPU activities:   14.39%  3.96637s     20000  198.32us  19.456us  735.73us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPU::VecArray<unsigned int, int=128>*)
                   13.92%  3.83733s    100000  38.373us  1.0560us  525.53us  kernelCircleFitAllHits(HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1> const *, int, double, double*, float*, double*, Rfit::circle_fit*, unsigned int)
                   11.13%  3.06847s     20000  153.42us  27.007us  709.11us  kernel_find_ntuplets(GPUCACell*, unsigned int const *, HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1>*, AtomicPairCounter*, unsigned int)
                    9.66%  2.66226s     20000  133.11us  3.9680us  718.96us  gpuVertexFinder::clusterTracks(gpuVertexFinder::OnGPU*, int, float, float, float)
                    8.83%  2.43328s     20000  121.66us  27.328us  1.0604ms  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, int)
                    5.49%  1.51340s     20000  75.669us  26.912us  373.50us  kernel_connect(AtomicPairCounter*, AtomicPairCounter*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *)
                    4.90%  1.35110s     20000  67.555us  5.6000us  340.89us  gpuPixelDoublets::fishbone(siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *, unsigned int, bool)
                    4.81%  1.32533s     20000  66.266us  1.1840us  285.05us  gpuVertexFinder::sortByPt2(gpuVertexFinder::OnGPU*)

VinInn commented Jan 24, 2019

@fwyzard, will you have time to update your curves with the latest version of this PR?

fwyzard commented Jan 24, 2019

sure

fwyzard commented Jan 24, 2019

Updated plots on V100:

[plot: v100]

Updated plots on P100:

[plot: p100]

fwyzard commented Jan 24, 2019

Validation summary

Reference release CMSSW_10_4_0 at b8365c6
Development branch CMSSW_10_4_X_Patatrack at fb0dbd9
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/77e988860822d0345820155725670b95d43c2554/log .

VinInn commented Jan 24, 2019

`cuda-memcheck --tool synccheck` successful? How come?

fwyzard commented Jan 24, 2019

I'm running the validation on the P100 to avoid synccheck false positives.

fwyzard commented Jan 24, 2019

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW summary

|   | reference-10824.5 | development-10824.5 | development-10824.8 | testing-10824.8 |
|---|---|---|---|---|
| Efficiency | 0.4818 | 0.4824 | 0.5727 | 0.5727 |
| Number of TrackingParticles (after cuts) | 5556 | 5556 | 5556 | 5556 |
| Number of matched TrackingParticles | 2677 | 2680 | 3182 | 3182 |
| Fake rate | 0.0519 | 0.0517 | 0.0344 | 0.0344 |
| Duplicate rate | 0.0168 | 0.0175 | 0.0002 | 0.0003 |
| Number of tracks | 32452 | 32480 | 43906 | 43907 |
| Number of true tracks | 30769 | 30801 | 42394 | 42395 |
| Number of fake tracks | 1683 | 1679 | 1512 | 1512 |
| Number of pileup tracks | 27093 | 27118 | 37688 | 37689 |
| Number of duplicate tracks | 546 | 567 | 10 | 12 |

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW summary

|   | reference-10824.5 | development-10824.5 | development-10824.8 | testing-10824.8 |
|---|---|---|---|---|
| Efficiency | 0.5594 | 0.5591 | 0.6302 | 0.6302 |
| Number of TrackingParticles (after cuts) | 3899 | 3899 | 3899 | 3899 |
| Number of matched TrackingParticles | 2181 | 2180 | 2457 | 2457 |
| Fake rate | 0.0076 | 0.0073 | 0.0065 | 0.0065 |
| Duplicate rate | 0.0136 | 0.0114 | 0.0000 | 0.0000 |
| Number of tracks | 3679 | 3682 | 4593 | 4593 |
| Number of true tracks | 3651 | 3655 | 4563 | 4563 |
| Number of fake tracks | 28 | 27 | 30 | 30 |
| Number of pileup tracks | 0 | 0 | 0 | 0 |
| Number of duplicate tracks | 50 | 42 | 0 | 0 |

fwyzard merged commit 00e8cf4 into cms-patatrack:CMSSW_10_4_X_Patatrack on Jan 24, 2019