
Speed up in clusterizer and doubletFinder #238

Closed

Conversation

@VinInn commented Dec 23, 2018

This PR (on top of #216 and #236) improves two combinatorial algorithms:

  1. the clusterizer is now limited to nearest neighbors: it is faster at large occupancy and/or with many isolated pixels;
  2. it introduces inner-loop parallelization in the doubletFinder, using the stride pattern already tried out in the "fishbone" (see the sketch below).

Physics performance (MTV) is identical to #197.
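
For readers unfamiliar with the pattern, below is a minimal, hypothetical sketch of such a strided inner loop (kernel, names and sizes are illustrative only, not the actual doubletFinder code): the y dimension enumerates the outer loop, while groups of blockDim.x threads stride over the inner loop of each outer element.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical sketch, not the actual doubletFinder code: the y dimension
// enumerates the outer loop, while threadIdx.x strides over the inner loop,
// so blockDim.x threads cooperate on each outer element.
__global__ void stridedInnerLoop(const int* innerSize, int* workDone, int nOuter) {
  int outer = blockIdx.y * blockDim.y + threadIdx.y;
  if (outer >= nOuter) return;
  for (int inner = threadIdx.x; inner < innerSize[outer]; inner += blockDim.x)
    atomicAdd(&workDone[outer], 1);  // stand-in for processing the (outer, inner) pair
}

int main() {
  const int nOuter = 1024, stride = 4, blockSize = 64;
  int *innerSize, *workDone;
  cudaMallocManaged(&innerSize, nOuter * sizeof(int));
  cudaMallocManaged(&workDone, nOuter * sizeof(int));
  for (int i = 0; i < nOuter; ++i) { innerSize[i] = i % 100; workDone[i] = 0; }

  // x dimension = stride over the inner loop, y dimension = outer elements
  dim3 thrs(stride, blockSize, 1);
  dim3 blks(1, (nOuter + blockSize - 1) / blockSize, 1);
  stridedInnerLoop<<<blks, thrs>>>(innerSize, workDone, nOuter);
  cudaDeviceSynchronize();

  printf("outer element 99 had %d inner iterations\n", workDone[99]);
  cudaFree(innerSize);
  cudaFree(workDone);
  return 0;
}
```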

@VinInn changed the title from "Speed up in clusterizer and doubletFilder" to "Speed up in clusterizer and doubletFinder" on Dec 26, 2018
@fwyzard commented Jan 8, 2019

Validation summary

Reference release CMSSW_10_4_0_pre4 at d74dd18
Development branch CMSSW_10_4_X_Patatrack at 68f320f
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/ab9d7780c201225dc4f7573ddda91c816c793cb3/log .

@fwyzard commented Jan 8, 2019

Here is a summary of the throughput from #197, #216 and #238, running on

2 CPUs:
  0: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (18 cores, 18 threads)
  1: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (18 cores, 18 threads)

4 NVIDIA GPUs:
  0: Tesla V100-SXM2-32GB
  1: Tesla V100-SXM2-32GB
  2: Tesla V100-SXM2-32GB
  3: Tesla V100-SXM2-32GB

development branch

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1723.5 ±   1.3 ev/s (4000 events)
  1728.2 ±   2.2 ev/s (4000 events)
  1734.0 ±   1.6 ev/s (4000 events)
  1726.4 ±   1.7 ev/s (4000 events)

#197

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1605.8 ±   1.2 ev/s (4000 events)
  1615.6 ±   1.5 ev/s (4000 events)
  1616.2 ±   1.8 ev/s (4000 events)
  1617.3 ±   1.3 ev/s (4000 events)

#216

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1738.5 ±   2.3 ev/s (4000 events)
  1732.6 ±   1.4 ev/s (4000 events)
  1743.1 ±   1.8 ev/s (4000 events)
  1746.6 ±   1.2 ev/s (4000 events)

#238

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1822.6 ±   1.5 ev/s (4000 events)
  1822.8 ±   1.2 ev/s (4000 events)
  1837.4 ±   1.2 ev/s (4000 events)
  1823.7 ±   1.2 ev/s (4000 events)

only I/O, for reference

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  5397.7 ±  51.1 ev/s (4000 events)
  6235.2 ±   4.1 ev/s (4000 events)
  6035.6 ±   2.9 ev/s (4000 events)
  6031.7 ±   5.1 ev/s (4000 events)

@fwyzard commented Jan 8, 2019

Performance and quality-wise:

So, I'd rather not merge #197 and #216 as they are: either we backport the initcheck fix from #238 to #216 and merge that, and merge #238 separately; or we merge #238 directly.

@fwyzard commented Jan 8, 2019

@rovere running with 9 threads/streams seems to give the highest throughput, with 8 or 10 performing only marginally worse:

Running 1 times over 4200 events with 2 jobs, each with 8 threads, 8 streams and 1 GPUs
  3616.6             2.0        4000    99.9%
Running 1 times over 4200 events with 2 jobs, each with 9 threads, 9 streams and 1 GPUs
  3636.8             2.3        4000    99.8%
Running 1 times over 4200 events with 2 jobs, each with 10 threads, 10 streams and 1 GPUs
  3618.0             2.3        4000    99.5%
Running 1 times over 4200 events with 2 jobs, each with 11 threads, 11 streams and 1 GPUs
  3575.3             2.5        4000    99.7%
Running 1 times over 4200 events with 2 jobs, each with 12 threads, 12 streams and 1 GPUs
  3562.5             2.0        4000    99.9%

@felicepantaleo
I propose we merge #238 and "promise" not to make any other changes to physics/speedup, before a PR dedicated only to cleanup is submitted and merged.

…-sw#216)

Port and optimise the full workflow from pixel raw data to pixel tracks and vertices to GPUs.
Clean the pixel n-tuplets with the "fishbone" algorithm (only on GPUs).

Other changes:
  - recover the Riemann fit updates lost during the merge with CMSSW 10.4.x;
  - speed up clustering and track fitting;
  - minor bug fix to avoid trivial regression with the optimized fit.
@fwyzard added this to the CMSSW_10_4_X_Patatrack milestone on Jan 8, 2019
@fwyzard commented Jan 8, 2019

Validation summary

Reference release CMSSW_10_4_0_pre4 at d74dd18
Development branch CMSSW_10_4_X_Patatrack at 7067416
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/5db9b0d7a411df2a665fd83643b0449940eab28a/log .

@fwyzard commented Jan 8, 2019

Physics results are unchanged, as expected:

| RelValTTbar_13 | reference-10824.5 | development-10824.5 | development-10824.8 | testing-10824.8 |
|---|---|---|---|---|
| Efficiency | 0.4818 | 0.4824 | 0.5727 | 0.5727 |
| Number of TrackingParticles (after cuts) | 5556 | 5556 | 5556 | 5556 |
| Number of matched TrackingParticles | 2677 | 2680 | 3182 | 3182 |
| Fake rate | 0.0519 | 0.0517 | 0.0344 | 0.0344 |
| Duplicate rate | 0.0168 | 0.0175 | 0.0003 | 0.0002 |
| Number of tracks | 32452 | 32480 | 43907 | 43906 |
| Number of true tracks | 30769 | 30801 | 42395 | 42394 |
| Number of fake tracks | 1683 | 1679 | 1512 | 1512 |
| Number of pileup tracks | 27093 | 27118 | 37689 | 37688 |
| Number of duplicate tracks | 546 | 567 | 12 | 10 |

Throughput (on data) improves by 5%.

numberOfBlocks *=stride;

fishbone<<<numberOfBlocks, blockSize, 0, cudaStream>>>(
dim3 blks(1,numberOfBlocks,1);

here and later, and in the kernel code: do we expect any differences using

dim3 blks(1,numberOfBlocks,1);
dim3 thrs(stride,blockSize,1);

or

dim3 blks(numberOfBlocks,1,1);
dim3 thrs(blockSize,stride,1);

assuming the .x and .y are swapped accordingly inside the kernels?


In fact, do we expect any performance difference using

kernel<<<(1, blocks, 1), (stride, size,  1)>>>(...);

or

kernel<<<blocks, size*stride>>>(..., stride);

?

@VinInn (Author) replied:

Thanks for splitting the PR.

Answer to the first question:
According to the CUDA documentation and examples, "x" runs faster than "y", so swapping "x" with "y" would NOT achieve the desired result of having the inner loop run on contiguous CUDA threads.
The current implementation should, in my intention, be equivalent to the hand-made one in terms of thread assignment.

Answer to the second question:
IN PRINCIPLE the two approaches should be fully equivalent: the use of a 2D grid is clearly more CUDA-style, and does not require percolating the stride down to the kernel.
I should have coded it directly using the 2D grid.
IN PRACTICE, I cannot exclude a different overhead between the two implementations.
I have simple unit tests/examples:

https://github.com/VinInn/ctest/blob/master/cuda/combiHM.cu

https://github.com/VinInn/ctest/blob/master/cuda/combiXY.cu

The hand-made one seems a bit faster.

My opinion is that the 2D grid is the way to code it in CUDA: it is surely easier to understand and maintain (it is like, in C, using 1D arrays and computing the offsets by hand instead of using a 2D array...).
We could investigate with CUDA/nvcc experts, but I am not sure we would get anywhere.
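
For illustration, here is a minimal, self-contained sketch of the two formulations discussed above (hypothetical kernels and names, not the doubletFinder code nor the code in the linked unit tests); in both cases threadIdx.x is the fastest-varying index, so the inner loop runs on contiguous CUDA threads.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// (a) 2D grid: the stride is implicit in blockDim.x; threadIdx.x (the
//     fastest-varying index) strides over the inner loop.
__global__ void doublets2D(const int* innerSize, int* count, int nOuter) {
  int outer = blockIdx.y * blockDim.y + threadIdx.y;
  if (outer >= nOuter) return;
  for (int inner = threadIdx.x; inner < innerSize[outer]; inner += blockDim.x)
    atomicAdd(&count[outer], 1);  // stand-in for processing the (outer, inner) pair
}

// (b) hand-made 1D flattening: the stride has to be passed ("percolated") in.
__global__ void doubletsFlat(const int* innerSize, int* count, int nOuter, int stride) {
  int tid   = blockIdx.x * blockDim.x + threadIdx.x;
  int outer = tid / stride;   // plays the role of the .y index above
  int first = tid % stride;   // plays the role of the .x index above
  if (outer >= nOuter) return;
  for (int inner = first; inner < innerSize[outer]; inner += stride)
    atomicAdd(&count[outer], 1);
}

int main() {
  const int nOuter = 512, stride = 4, blockSize = 64;
  const int numberOfBlocks = (nOuter + blockSize - 1) / blockSize;
  int *innerSize, *a, *b;
  cudaMallocManaged(&innerSize, nOuter * sizeof(int));
  cudaMallocManaged(&a, nOuter * sizeof(int));
  cudaMallocManaged(&b, nOuter * sizeof(int));
  for (int i = 0; i < nOuter; ++i) { innerSize[i] = i % 50; a[i] = b[i] = 0; }

  doublets2D<<<dim3(1, numberOfBlocks), dim3(stride, blockSize)>>>(innerSize, a, nOuter);
  doubletsFlat<<<numberOfBlocks, blockSize * stride>>>(innerSize, b, nOuter, stride);
  cudaDeviceSynchronize();

  bool same = true;
  for (int i = 0; i < nOuter; ++i) same = same && (a[i] == b[i]);
  printf("same per-outer work count: %s\n", same ? "yes" : "no");
  cudaFree(innerSize); cudaFree(a); cudaFree(b);
  return 0;
}
```

The small check at the end only verifies that the two launches assign the same amount of work per outer element; any throughput difference between them would have to be measured, as noted above.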

@fwyzard force-pushed the CMSSW_10_4_X_Patatrack branch from 59fe318 to db3e6f8 on January 9, 2019 at 14:14
@fwyzard commented Jan 9, 2019

Here is a breakdown of the performance changes with respect to #216, on a P100 and on a V100.

The measurements were done running 4 times over 4200 events with 1 job, with 8 threads, 8 streams and 1 GPU, measuring the throughput between the 101st and the 4101st event, and taking the average.

| Changes considered | P100 throughput | V100 throughput |
|---|---|---|
| doublet finder | +5.5% | -0.5% |
| clusteriser | +5.7% | +5.0% |
| both | +12.0% | +4.5% |

So, the doublet finder changes seem to have a small negative impact on the V100.

@fwyzard commented Jan 9, 2019

I have split the changes to the clusteriser (which are an improvement both on Pascal and on Volta) into #241, and the changes to the doublet finder and the fishbone into #242.

@fwyzard commented Jan 9, 2019

Replaced by #241 and #242.

@fwyzard closed this on Jan 9, 2019