
Use 2D grid for inner loop parallelization #260

Merged (3 commits) on Jan 24, 2019

Conversation

VinInn commented Jan 23, 2019

Introduce the inner loop parallelization in the doublet finder using the stride pattern already used in the "fishbone", and make use of a 2D grid instead of a hand-made stride.

w.r.t. #242: here we keep the same number of threads per block as in the baseline
(this was already the case for the doublet finder)

Introduce the inner loop parallelization in the doublet finder using the
stride pattern already used in the "fishbone", and make use of a 2D grid
instead of a hand-made stride.
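
To make the two patterns concrete, here is a minimal, hypothetical sketch; `innerSize()`, `process()`, and the kernel names are placeholders, not the actual `getDoubletsFromHisto` or fishbone code. The hand-made stride carves the inner and outer indices out of a 1D thread index, while the 2D grid lets the launch geometry provide them directly:

```cuda
// Hypothetical sketch, not the actual CMSSW code: innerSize() and process()
// stand in for "number of inner hits for outer element i" and the per-pair work.
__device__ int innerSize(int i) { return 32; }   // placeholder extent
__device__ void process(int i, int j) {}         // placeholder work

// (a) hand-made stride: inner/outer indices derived from a 1D thread index
__global__ void kernelHandMadeStride(int nOuter, int stride) {
  int first = threadIdx.x % stride;                         // inner-loop offset
  int idx   = (blockIdx.x * blockDim.x + threadIdx.x) / stride;
  int step  = (gridDim.x * blockDim.x) / stride;
  for (int i = idx; i < nOuter; i += step)                  // outer elements
    for (int j = first; j < innerSize(i); j += stride)      // inner loop
      process(i, j);
}

// (b) 2D grid: x covers the inner loop, y the outer one; no index arithmetic
__global__ void kernelGrid2D(int nOuter) {
  int first = threadIdx.x;                                  // inner-loop offset
  int idx   = blockIdx.y * blockDim.y + threadIdx.y;
  int step  = gridDim.y * blockDim.y;
  for (int i = idx; i < nOuter; i += step)                  // outer elements
    for (int j = first; j < innerSize(i); j += blockDim.x)  // inner loop
      process(i, j);
}

// launched e.g. as:
// kernelGrid2D<<<dim3(1, blocks, 1), dim3(stride, threadsPerBlock, 1)>>>(nOuter);
```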
VinInn commented Jan 23, 2019

Performance

HEAD

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1418.2 ±   0.8 ev/s (4000 events)
  1415.4 ±   1.0 ev/s (4000 events)
  1417.7 ±   1.0 ev/s (4000 events)
  1411.6 ±   0.7 ev/s (4000 events)
 --------------------
  1415.7 ±   3.0 ev/s

#242

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1402.7 ±   0.9 ev/s (4000 events)
  1397.4 ±   0.9 ev/s (4000 events)
  1406.1 ±   0.8 ev/s (4000 events)
  1404.2 ±   1.1 ev/s (4000 events)
 --------------------
  1402.6 ±   3.7 ev/s

#260 - this PR

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1346.9 ±   0.9 ev/s (4000 events)
  1353.8 ±   0.8 ev/s (4000 events)
  1354.1 ±   0.9 ev/s (4000 events)
  1355.6 ±   0.9 ev/s (4000 events)
 --------------------
  1352.6 ±   3.9 ev/s

VinInn commented Jan 23, 2019

If we look at nvprof instead:
HEAD

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   14.14%  3.69037s     20000  184.52us  16.768us  813.53us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPU::VecArray<unsigned int, int=128>*)
                   13.10%  3.41894s    100000  34.189us     960ns  497.76us  kernelCircleFitAllHits(HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1> const *, int, double, double*, float*, double*, Rfit::circle_fit*, unsigned int)
                   10.12%  2.64223s     20000  132.11us  7.0080us  695.74us  kernel_connect(AtomicPairCounter*, AtomicPairCounter*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *)
                    9.96%  2.59988s     20000  129.99us  6.7840us  501.47us  kernel_find_ntuplets(GPUCACell*, unsigned int const *, HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1>*, AtomicPairCounter*, unsigned int)
                    8.00%  2.08762s     20000  104.38us  24.287us  878.68us  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, int)
                    7.95%  2.07452s     20000  103.73us  5.8560us  672.67us  gpuVertexFinder::clusterTracks(gpuVertexFinder::OnGPU*, int, float, float, float)
                    6.41%  1.67444s     20000  83.722us  4.8320us  356.80us  gpuPixelDoublets::fishbone(siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *, unsigned int, unsigned int, bool)
                    4.63%  1.20861s     20000  60.430us  1.3120us  271.10us  gpuVertexFinder::sortByPt2(gpuVertexFinder::OnGPU*)

#242

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   13.81%  3.65639s     20000  182.82us  11.008us  572.63us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPU::VecArray<unsigned int, int=128>*)
                   13.36%  3.53540s    100000  35.354us     960ns  557.75us  kernelCircleFitAllHits(HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1> const *, int, double, double*, float*, double*,Rfit::circle_fit*, unsigned int)
                   10.60%  2.80627s     20000  140.31us  7.0070us  735.80us  kernel_find_ntuplets(GPUCACell*, unsigned int const *, HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1>*, AtomicPairCounter*, unsigned int)
                    8.38%  2.21769s     20000  110.88us  5.2800us  902.20us  gpuVertexFinder::clusterTracks(gpuVertexFinder::OnGPU*, int, float, float, float)
                    8.06%  2.13352s     20000  106.68us  23.840us  1.3569ms  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, int)
                    6.84%  1.80937s     20000  90.468us  5.2160us  446.87us  gpuPixelDoublets::fishbone(siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *, unsigned int, bool)
                    6.76%  1.78889s     20000  89.444us  7.0400us  482.33us  kernel_connect(AtomicPairCounter*, AtomicPairCounter*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *)
                    4.62%  1.22299s     20000  61.149us  1.3440us  262.88us  gpuVertexFinder::sortByPt2(gpuVertexFinder::OnGPU*)

#260 - this PR

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   13.77%  3.65572s     20000  182.79us  10.944us  558.46us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPU::VecArray<unsigned int, int=128>*)
                   13.24%  3.51483s    100000  35.148us     928ns  581.88us  kernelCircleFitAllHits(HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1> const *, int, double, double*, float*, double*,Rfit::circle_fit*, unsigned int)
                   11.24%  2.98521s     20000  149.26us  24.416us  670.74us  kernel_find_ntuplets(GPUCACell*, unsigned int const *, HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1>*, AtomicPairCounter*, unsigned int)
                    8.54%  2.26732s     20000  113.37us  4.2880us  1.3603ms  gpuVertexFinder::clusterTracks(gpuVertexFinder::OnGPU*, int, float, float, float)
                    8.03%  2.13146s     20000  106.57us  22.720us  915.80us  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, int)
                    6.61%  1.75482s     20000  87.740us  5.2480us  441.50us  gpuPixelDoublets::fishbone(siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *, unsigned int, bool)
                    6.22%  1.65153s     20000  82.576us  24.351us  451.10us  kernel_connect(AtomicPairCounter*, AtomicPairCounter*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *)
                    4.55%  1.20717s     20000  60.358us  1.1200us  234.17us  gpuVertexFinder::sortByPt2(gpuVertexFinder::OnGPU*)

What is difficult to understand is:

  1. why the fishbone is faster with the hand-made stride, even with an identical block size;
  2. why the throughput is worse, despite kernel_connect being much faster.

fwyzard commented Jan 23, 2019

The total time spent in the kernels decreases (both with #242 and #260), so something else must be taking longer:

| branch | time    |
|--------|---------|
| HEAD   | 19.40 s |
| #242   | 19.17 s |
| #260   | 19.17 s |

For example, I do not know whether the time spent in creating and launching the blocks is accounted for, nor what happens to the overall GPU occupancy.
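
As a side note, one way to get a rough handle on the host-side launch cost, which the "GPU activities" table above does not include, is to time the launch call separately from the synchronization. A minimal sketch, with `someKernel` as a stand-in for any of the kernels above (this is not how the numbers in this thread were produced):

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void someKernel() {}  // stand-in for any of the kernels above

int main() {
  using clock = std::chrono::steady_clock;

  auto t0 = clock::now();
  someKernel<<<1024, 256>>>();   // returns as soon as the launch is queued
  auto t1 = clock::now();
  cudaDeviceSynchronize();       // waits for the actual execution
  auto t2 = clock::now();

  auto us = [](clock::time_point a, clock::time_point b) {
    return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
  };
  std::printf("launch overhead: %lld us, execution: %lld us\n",
              (long long)us(t0, t1), (long long)us(t1, t2));
  return 0;
}
```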

fwyzard commented Jan 23, 2019

Here is a summary of the comparison of the time spent in the kernels, with the modified ones in bold:

| HEAD    | #242    | #260    | kernel |
|---------|---------|---------|--------|
| 2.09 s  | 2.13 s  | 2.13 s  | gpuClustering::findClus |
| 1.67 s  | 1.81 s  | 1.75 s  | **gpuPixelDoublets::fishbone** |
| 3.69 s  | 3.66 s  | 3.66 s  | **gpuPixelDoublets::getDoubletsFromHisto** |
| 2.07 s  | 2.22 s  | 2.27 s  | gpuVertexFinder::clusterTracks |
| 1.21 s  | 1.22 s  | 1.21 s  | gpuVertexFinder::sortByPt2 |
| 2.64 s  | 1.79 s  | 1.65 s  | **kernel_connect** |
| 2.60 s  | 2.81 s  | 2.99 s  | kernel_find_ntuplets |
| 3.42 s  | 3.54 s  | 3.51 s  | kernelCircleFitAllHits |
| 19.40 s | 19.17 s | 19.17 s | total |

It seems that kernel_connect is significantly improved, while fishbone and getDoubletsFromHisto are slightly worse.

VinInn commented Jan 23, 2019

We need to run the regression on MC to verify that the MTV results are identical.
Will do.

fwyzard commented Jan 23, 2019

Here are the improvements I see if I apply only the changes to kernel_connect, and if I apply the whole PR #260:

On a pair of P100:

[plot: p100]

On a pair of V100:

[plot: v100]

VinInn commented Jan 23, 2019

MTV: http://innocent.home.cern.ch/innocent/RelVal/pixOnlyPU50_gpuPR260/plots_summary.html

|   | A | B | C |
|---|---|---|---|
| Efficiency | 0.7717 | 0.7717 | 0.7717 |
| Number of TrackingParticles (after cuts) | 55709 | 55709 | 55709 |
| Number of matched TrackingParticles | 42992 | 42992 | 42992 |
| Fake rate | 0.0289 | 0.0289 | 0.0289 |
| Duplicate rate | 0.0003 | 0.0003 | 0.0003 |
| Number of tracks | 574001 | 574001 | 574001 |
| Number of true tracks | 557413 | 557413 | 557416 |
| Number of fake tracks | 16588 | 16588 | 16585 |
| Number of pileup tracks | 494027 | 494027 | 494030 |
| Number of duplicate tracks | 164 | 164 | 164 |

A: HEAD
B: PR242
C: PR260

VinInn commented Jan 23, 2019

OK, with

--- a/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorKernels.cu
+++ b/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorKernels.cu
@@ -325,7 +325,7 @@ void CAHitQuadrupletGeneratorKernels::launchKernels( // here goes algoparms....
 void CAHitQuadrupletGeneratorKernels::buildDoublets(HitsOnCPU const & hh, cudaStream_t stream) {
   auto nhits = hh.nHits;

-  int stride=4;
+  int stride=2;
   int threadsPerBlock = gpuPixelDoublets::getDoubletsFromHistoMaxBlockSize/stride;
   int blocks = (3 * nhits + threadsPerBlock - 1) / threadsPerBlock;
   dim3 blks(1,blocks,1);
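
For reference, my reading of how that `stride` shapes the launch; the block shape `thrs` is an assumption based on the `blks` line in the hunk, not something shown in the diff:

```cuda
// Assumed companion to the hunk above: the block is `stride` wide in x
// (inner loop) and `threadsPerBlock` tall in y (outer loop), so the total
// threads per block stays at getDoubletsFromHistoMaxBlockSize whatever
// the stride is.
int stride = 2;
int threadsPerBlock = gpuPixelDoublets::getDoubletsFromHistoMaxBlockSize / stride;
int blocks = (3 * nhits + threadsPerBlock - 1) / threadsPerBlock;
dim3 blks(1, blocks, 1);
dim3 thrs(stride, threadsPerBlock, 1);
// gpuPixelDoublets::getDoubletsFromHisto<<<blks, thrs, 0, stream>>>(...);
```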

I get from this PR:

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1406.2 ±   0.8 ev/s (4000 events)
  1417.0 ±   1.0 ev/s (4000 events)
  1398.9 ±   0.9 ev/s (4000 events)
  1407.2 ±   0.7 ev/s (4000 events)
 --------------------
  1407.3 ±   7.4 ev/s

and

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   14.12%  3.89220s    100000  38.921us  1.0560us  546.27us  kernelCircleFitAllHits(HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1> const *, int, double, double*, float*, double*, Rfit::circle_fit*, unsigned int)
                   12.61%  3.47615s     20000  173.81us  15.423us  702.36us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPU::VecArray<unsigned int, int=128>*)
                   11.31%  3.11798s     20000  155.90us  27.328us  759.83us  kernel_find_ntuplets(GPUCACell*, unsigned int const *, HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1>*, AtomicPairCounter*, unsigned int)
                    9.35%  2.57776s     20000  128.89us  5.0240us  742.52us  gpuVertexFinder::clusterTracks(gpuVertexFinder::OnGPU*, int, float, float, float)
                    8.88%  2.44788s     20000  122.39us  27.487us  990.10us  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, int)
                    5.87%  1.61928s     20000  80.964us  5.8560us  410.43us  gpuPixelDoublets::fishbone(siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *, unsigned int, bool)
                    5.63%  1.55245s     20000  77.622us  26.944us  428.67us kernel_connect(AtomicPairCounter*, AtomicPairCounter*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *)
                    4.86%  1.34084s     20000  67.042us  1.1520us  265.02us  gpuVertexFinder::sortByPt2(gpuVertexFinder::OnGPU*)

So it seems that, at least for this workflow on a V100, 2D parallelization is not worthwhile for the doublet building.

with "stride=1" we reach "1420.1 ± 5.3 ev/s"

with stride=8" in fishbone is "1426.3 ± 4.8 ev/s"
at this point I think we are in overtuning regime

fwyzard commented Jan 23, 2019

OK... does #261 look good?

VinInn commented Jan 23, 2019

#261 "looks" good , maybe I should test it....

VinInn commented Jan 23, 2019

#261 gives me "1419.6 ± 3.7 ev/s", equivalent to this PR with stride=1 in the doublet finder.

VinInn commented Jan 23, 2019

This version should be equivalent to #261, with the added advantage that the 2D grid parameters can be modified for the fishbone and the doublet finder as well.

VinInn commented Jan 23, 2019

For completeness, the nvprof report for this version:

 GPU activities:   14.39%  3.96637s     20000  198.32us  19.456us  735.73us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPU::VecArray<unsigned int, int=128>*)
                   13.92%  3.83733s    100000  38.373us  1.0560us  525.53us  kernelCircleFitAllHits(HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1> const *, int, double, double*, float*, double*, Rfit::circle_fit*, unsigned int)
                   11.13%  3.06847s     20000  153.42us  27.007us  709.11us  kernel_find_ntuplets(GPUCACell*, unsigned int const *, HistoContainer<unsigned int, unsigned int=10000, unsigned int=50000, unsigned int=32, unsigned short, unsigned int=1>*, AtomicPairCounter*, unsigned int)
                    9.66%  2.66226s     20000  133.11us  3.9680us  718.96us  gpuVertexFinder::clusterTracks(gpuVertexFinder::OnGPU*, int, float, float, float)
                    8.83%  2.43328s     20000  121.66us  27.328us  1.0604ms  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const *, unsigned int*, unsigned int*, int*, int)
                    5.49%  1.51340s     20000  75.669us  26.912us  373.50us  kernel_connect(AtomicPairCounter*, AtomicPairCounter*, siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *)
                    4.90%  1.35110s     20000  67.555us  5.6000us  340.89us  gpuPixelDoublets::fishbone(siPixelRecHitsHeterogeneousProduct::HitsOnGPU const *, GPUCACell*, unsigned int const *, GPU::VecArray<unsigned int, int=128> const *, unsigned int, bool)
                    4.81%  1.32533s     20000  66.266us  1.1840us  285.05us  gpuVertexFinder::sortByPt2(gpuVertexFinder::OnGPU*)

VinInn commented Jan 24, 2019

@fwyzard, will you have time to update your curves with the latest version of this PR?

fwyzard commented Jan 24, 2019

sure

fwyzard commented Jan 24, 2019

Updated plots on V100:

[plot: v100]

Updated plots on P100:

[plot: p100]

fwyzard commented Jan 24, 2019

Validation summary

Reference release CMSSW_10_4_0 at b8365c6
Development branch CMSSW_10_4_X_Patatrack at fb0dbd9
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/77e988860822d0345820155725670b95d43c2554/log .

VinInn commented Jan 24, 2019

`cuda-memcheck --tool synccheck` successful? How come?

fwyzard commented Jan 24, 2019

I'm running the validation on the P100 to avoid synccheck false positives.

fwyzard commented Jan 24, 2019

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW summary

|   | reference-10824.5 | development-10824.5 | development-10824.8 | testing-10824.8 |
|---|---|---|---|---|
| Efficiency | 0.4818 | 0.4824 | 0.5727 | 0.5727 |
| Number of TrackingParticles (after cuts) | 5556 | 5556 | 5556 | 5556 |
| Number of matched TrackingParticles | 2677 | 2680 | 3182 | 3182 |
| Fake rate | 0.0519 | 0.0517 | 0.0344 | 0.0344 |
| Duplicate rate | 0.0168 | 0.0175 | 0.0002 | 0.0003 |
| Number of tracks | 32452 | 32480 | 43906 | 43907 |
| Number of true tracks | 30769 | 30801 | 42394 | 42395 |
| Number of fake tracks | 1683 | 1679 | 1512 | 1512 |
| Number of pileup tracks | 27093 | 27118 | 37688 | 37689 |
| Number of duplicate tracks | 546 | 567 | 10 | 12 |

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW summary

|   | reference-10824.5 | development-10824.5 | development-10824.8 | testing-10824.8 |
|---|---|---|---|---|
| Efficiency | 0.5594 | 0.5591 | 0.6302 | 0.6302 |
| Number of TrackingParticles (after cuts) | 3899 | 3899 | 3899 | 3899 |
| Number of matched TrackingParticles | 2181 | 2180 | 2457 | 2457 |
| Fake rate | 0.0076 | 0.0073 | 0.0065 | 0.0065 |
| Duplicate rate | 0.0136 | 0.0114 | 0.0000 | 0.0000 |
| Number of tracks | 3679 | 3682 | 4593 | 4593 |
| Number of true tracks | 3651 | 3655 | 4563 | 4563 |
| Number of fake tracks | 28 | 27 | 30 | 30 |
| Number of pileup tracks | 0 | 0 | 0 | 0 |
| Number of duplicate tracks | 50 | 42 | 0 | 0 |

fwyzard merged commit 00e8cf4 into cms-patatrack:CMSSW_10_4_X_Patatrack on Jan 24, 2019