
Speed up in clusterizer and doubletFinder #238

Closed

Conversation

@VinInn commented Dec 23, 2018

This PR (on top of #216 and #236) improves two combinatorial algorithms:

  1. the clusterizer is now limited to nearest neighbors: it is faster at large occupancy and/or with many isolated pixels;
  2. it introduces inner-loop parallelization in the doubletFinder, using the stride pattern already tried out in the "fishbone" (see the sketch below).

Physics performance (MTV) is identical to #197.
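
For readers unfamiliar with the pattern, below is a minimal, hypothetical sketch of such a strided inner loop (kernel, names and sizes are illustrative only, not the actual doubletFinder code): the y dimension enumerates the outer loop, while groups of blockDim.x threads stride over the inner loop of each outer element.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical sketch, not the actual doubletFinder code: the y dimension
// enumerates the outer loop, while threadIdx.x strides over the inner loop,
// so blockDim.x threads cooperate on each outer element.
__global__ void stridedInnerLoop(const int* innerSize, int* workDone, int nOuter) {
  int outer = blockIdx.y * blockDim.y + threadIdx.y;
  if (outer >= nOuter) return;
  for (int inner = threadIdx.x; inner < innerSize[outer]; inner += blockDim.x)
    atomicAdd(&workDone[outer], 1);  // stand-in for processing the (outer, inner) pair
}

int main() {
  const int nOuter = 1024, stride = 4, blockSize = 64;
  int *innerSize, *workDone;
  cudaMallocManaged(&innerSize, nOuter * sizeof(int));
  cudaMallocManaged(&workDone, nOuter * sizeof(int));
  for (int i = 0; i < nOuter; ++i) { innerSize[i] = i % 100; workDone[i] = 0; }

  // x dimension = stride over the inner loop, y dimension = outer elements
  dim3 thrs(stride, blockSize, 1);
  dim3 blks(1, (nOuter + blockSize - 1) / blockSize, 1);
  stridedInnerLoop<<<blks, thrs>>>(innerSize, workDone, nOuter);
  cudaDeviceSynchronize();

  printf("outer element 99 had %d inner iterations\n", workDone[99]);
  cudaFree(innerSize);
  cudaFree(workDone);
  return 0;
}
```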

@VinInn changed the title from "Speed up in clusterizer and doubletFilder" to "Speed up in clusterizer and doubletFinder" on Dec 26, 2018
@fwyzard commented Jan 8, 2019

Validation summary

Reference release CMSSW_10_4_0_pre4 at d74dd18
Development branch CMSSW_10_4_X_Patatrack at 68f320f
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/ab9d7780c201225dc4f7573ddda91c816c793cb3/log .

@fwyzard commented Jan 8, 2019

Here is a summary of the throughput from #197, #216 and #238, running on

2 CPUs:
  0: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (18 cores, 18 threads)
  1: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (18 cores, 18 threads)

4 NVIDIA GPUs:
  0: Tesla V100-SXM2-32GB
  1: Tesla V100-SXM2-32GB
  2: Tesla V100-SXM2-32GB
  3: Tesla V100-SXM2-32GB

development branch

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1723.5 ±   1.3 ev/s (4000 events)
  1728.2 ±   2.2 ev/s (4000 events)
  1734.0 ±   1.6 ev/s (4000 events)
  1726.4 ±   1.7 ev/s (4000 events)

#197

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1605.8 ±   1.2 ev/s (4000 events)
  1615.6 ±   1.5 ev/s (4000 events)
  1616.2 ±   1.8 ev/s (4000 events)
  1617.3 ±   1.3 ev/s (4000 events)

#216

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1738.5 ±   2.3 ev/s (4000 events)
  1732.6 ±   1.4 ev/s (4000 events)
  1743.1 ±   1.8 ev/s (4000 events)
  1746.6 ±   1.2 ev/s (4000 events)

#238

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  1822.6 ±   1.5 ev/s (4000 events)
  1822.8 ±   1.2 ev/s (4000 events)
  1837.4 ±   1.2 ev/s (4000 events)
  1823.7 ±   1.2 ev/s (4000 events)

only I/O, for reference

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
  5397.7 ±  51.1 ev/s (4000 events)
  6235.2 ±   4.1 ev/s (4000 events)
  6035.6 ±   2.9 ev/s (4000 events)
  6031.7 ±   5.1 ev/s (4000 events)

@fwyzard commented Jan 8, 2019

Performance and quality-wise:

So, I'd rather not merge #197 and #216 as they are: either we backport the initcheck fix from #238 to #216 and merge that, and merge #238 separately; or we merge #238 directly.

@fwyzard commented Jan 8, 2019

@rovere running with 9 threads/streams seems to give the highest throughput, with 8 or 10 performing only marginally worse:

Running 1 times over 4200 events with 2 jobs, each with 8 threads, 8 streams and 1 GPUs
  3616.6             2.0        4000    99.9%
Running 1 times over 4200 events with 2 jobs, each with 9 threads, 9 streams and 1 GPUs
  3636.8             2.3        4000    99.8%
Running 1 times over 4200 events with 2 jobs, each with 10 threads, 10 streams and 1 GPUs
  3618.0             2.3        4000    99.5%
Running 1 times over 4200 events with 2 jobs, each with 11 threads, 11 streams and 1 GPUs
  3575.3             2.5        4000    99.7%
Running 1 times over 4200 events with 2 jobs, each with 12 threads, 12 streams and 1 GPUs
  3562.5             2.0        4000    99.9%

@felicepantaleo
I propose we merge #238 and "promise" not to make any other changes to physics/speedup, before a PR dedicated only to cleanup is submitted and merged.

…-sw#216)

Port and optimise the full workflow from pixel raw data to pixel tracks and vertices to GPUs.
Clean the pixel n-tuplets with the "fishbone" algorithm (only on GPUs).

Other changes:
  - recover the Riemann fit updates lost during the merge with CMSSW 10.4.x;
  - speed up clustering and track fitting;
  - minor bug fix to avoid trivial regression with the optimized fit.
@fwyzard added this to the CMSSW_10_4_X_Patatrack milestone on Jan 8, 2019
@fwyzard commented Jan 8, 2019

Validation summary

Reference release CMSSW_10_4_0_pre4 at d74dd18
Development branch CMSSW_10_4_X_Patatrack at 7067416
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_4_0_pre3-PU25ns_103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_4_0_pre3-103X_upgrade2018_realistic_v8-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/5db9b0d7a411df2a665fd83643b0449940eab28a/log .

@fwyzard commented Jan 8, 2019

Physics results are unchanged, as expected:

| RelValTTbar_13 | reference-10824.5 | development-10824.5 | development-10824.8 | testing-10824.8 |
|---|---|---|---|---|
| Efficiency | 0.4818 | 0.4824 | 0.5727 | 0.5727 |
| Number of TrackingParticles (after cuts) | 5556 | 5556 | 5556 | 5556 |
| Number of matched TrackingParticles | 2677 | 2680 | 3182 | 3182 |
| Fake rate | 0.0519 | 0.0517 | 0.0344 | 0.0344 |
| Duplicate rate | 0.0168 | 0.0175 | 0.0003 | 0.0002 |
| Number of tracks | 32452 | 32480 | 43907 | 43906 |
| Number of true tracks | 30769 | 30801 | 42395 | 42394 |
| Number of fake tracks | 1683 | 1679 | 1512 | 1512 |
| Number of pileup tracks | 27093 | 27118 | 37689 | 37688 |
| Number of duplicate tracks | 546 | 567 | 12 | 10 |

Throughput (on data) improves by 5%.

numberOfBlocks *=stride;

fishbone<<<numberOfBlocks, blockSize, 0, cudaStream>>>(
dim3 blks(1,numberOfBlocks,1);

here and later, and in the kernel code: do we expect any differences using

dim3 blks(1,numberOfBlocks,1);
dim3 thrs(stride,blockSize,1);

or

dim3 blks(numberOfBlocks,1,1);
dim3 thrs(blockSize,stride,1);

assuming the .x and .y are swapped accordingly inside the kernels?


In fact, do we expect any performance difference using

kernel<<<(1, blocks, 1), (stride, size,  1)>>>(...);

or

kernel<<<blocks, size*stride>>>(..., stride);

?

@VinInn (Author) replied:

Thanks for splitting the PR.

Answer to the first question:
According to the CUDA documentation and examples, "x" runs faster than "y", so swapping "x" with "y" would NOT achieve the desired result of having the inner loop run on contiguous CUDA threads.
The current implementation should, in my intention, be equivalent to the hand-made one in terms of thread assignment.

Answer to the second question:
IN PRINCIPLE the two approaches should be fully equivalent: the use of a 2D grid is clearly more CUDA-style, and does not require percolating the stride down to the kernel.
I should have coded it directly using the 2D grid.
IN PRACTICE, I cannot exclude a different overhead between the two implementations.
I have simple unit tests/examples:

https://github.com/VinInn/ctest/blob/master/cuda/combiHM.cu

https://github.com/VinInn/ctest/blob/master/cuda/combiXY.cu

The hand-made one seems a bit faster.

My opinion is that the 2D grid is the way to code it in CUDA: it is surely easier to understand and maintain (it is like, in C, using 1D arrays and computing the offsets by hand instead of using a 2D array...).
We could investigate with CUDA/nvcc experts, but I am not sure we would get anywhere.
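
For illustration, here is a minimal, self-contained sketch of the two formulations discussed above (hypothetical kernels and names, not the doubletFinder code nor the code in the linked unit tests); in both cases threadIdx.x is the fastest-varying index, so the inner loop runs on contiguous CUDA threads.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// (a) 2D grid: the stride is implicit in blockDim.x; threadIdx.x (the
//     fastest-varying index) strides over the inner loop.
__global__ void doublets2D(const int* innerSize, int* count, int nOuter) {
  int outer = blockIdx.y * blockDim.y + threadIdx.y;
  if (outer >= nOuter) return;
  for (int inner = threadIdx.x; inner < innerSize[outer]; inner += blockDim.x)
    atomicAdd(&count[outer], 1);  // stand-in for processing the (outer, inner) pair
}

// (b) hand-made 1D flattening: the stride has to be passed ("percolated") in.
__global__ void doubletsFlat(const int* innerSize, int* count, int nOuter, int stride) {
  int tid   = blockIdx.x * blockDim.x + threadIdx.x;
  int outer = tid / stride;   // plays the role of the .y index above
  int first = tid % stride;   // plays the role of the .x index above
  if (outer >= nOuter) return;
  for (int inner = first; inner < innerSize[outer]; inner += stride)
    atomicAdd(&count[outer], 1);
}

int main() {
  const int nOuter = 512, stride = 4, blockSize = 64;
  const int numberOfBlocks = (nOuter + blockSize - 1) / blockSize;
  int *innerSize, *a, *b;
  cudaMallocManaged(&innerSize, nOuter * sizeof(int));
  cudaMallocManaged(&a, nOuter * sizeof(int));
  cudaMallocManaged(&b, nOuter * sizeof(int));
  for (int i = 0; i < nOuter; ++i) { innerSize[i] = i % 50; a[i] = b[i] = 0; }

  doublets2D<<<dim3(1, numberOfBlocks), dim3(stride, blockSize)>>>(innerSize, a, nOuter);
  doubletsFlat<<<numberOfBlocks, blockSize * stride>>>(innerSize, b, nOuter, stride);
  cudaDeviceSynchronize();

  bool same = true;
  for (int i = 0; i < nOuter; ++i) same = same && (a[i] == b[i]);
  printf("same per-outer work count: %s\n", same ? "yes" : "no");
  cudaFree(innerSize); cudaFree(a); cudaFree(b);
  return 0;
}
```

The small check at the end only verifies that the two launches assign the same amount of work per outer element; any throughput difference between them would have to be measured, as noted above.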

@fwyzard force-pushed the CMSSW_10_4_X_Patatrack branch from 59fe318 to db3e6f8 on January 9, 2019 at 14:14
@fwyzard commented Jan 9, 2019

Here is a breakdown of the performance changes with respect to #216, on a P100 and on a V100.

The measurements were done running 4 times over 4200 events with 1 job, with 8 threads, 8 streams and 1 GPU, measuring the throughput between the 101st and the 4101st event, and taking the average.

| Changes considered | P100 throughput | V100 throughput |
|---|---|---|
| doublet finder | +5.5% | -0.5% |
| clusteriser | +5.7% | +5.0% |
| both | +12.0% | +4.5% |

So, the doublet finder changes seem to have a small negative impact on the V100.

@fwyzard commented Jan 9, 2019

I have split the changes to the clusteriser (which are an improvement both on Pascal and on Volta) into #241, and the changes to the doublet finder and the fishbone into #242.

@fwyzard commented Jan 9, 2019

Replaced by #241 and #242.

@fwyzard closed this on Jan 9, 2019