Speed up in clusterizer and doubletFinder #238
Conversation
Validation summary: reference release CMSSW_10_4_0_pre4 at d74dd18
Here is a summary of the throughput from #197, #216 and #238, running on:
- the development branch
- #197
- #216
- #238
- only I/O, for reference
Performance and quality-wise:
So, I'd rather not merge #197 and #216 as they are: either we backport the
@rovere running with 9 threads/streams seems to give the highest throughput, with 8 or 10 performing only marginally worse:
I propose we merge #238 and "promise" not to make any other changes to physics/speedup, before a PR dedicated only to cleanup is submitted and merged.
…-sw#216) Port and optimise the full workflow from pixel raw data to pixel tracks and vertices to GPUs. Clean the pixel n-tuplets with the "fishbone" algorithm (only on GPUs). Other changes:
- recover the Riemann fit updates lost during the merge with CMSSW 10.4.x;
- speed up clustering and track fitting;
- minor bug fix to avoid trivial regression with the optimized fit.
Validation summary: reference release CMSSW_10_4_0_pre4 at d74dd18
Physics results are unchanged, as expected:
Throughput (on data) improves by 5%.
Diff context under review:

```cuda
numberOfBlocks *= stride;
fishbone<<<numberOfBlocks, blockSize, 0, cudaStream>>>(
dim3 blks(1, numberOfBlocks, 1);
```
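For reference, a minimal sketch of the 2D-grid launch under discussion; the kernel body and argument list are placeholders (the real fishbone kernel takes the cell/doublet data structures), so this illustrates the launch shape rather than the actual patch:

```cuda
#include <cuda_runtime.h>

// Placeholder standing in for the real fishbone kernel, only so the launch compiles.
__global__ void fishbone(/* cell and hit data omitted */) {}

void launchFishbone(int numberOfBlocks, int blockSize, int stride, cudaStream_t cudaStream) {
  // 2D grid instead of folding the stride into the block count
  // (numberOfBlocks *= stride) and launching a flat 1D grid.
  dim3 blks(1, numberOfBlocks, 1);   // one block in x, numberOfBlocks in y
  dim3 thrs(stride, blockSize, 1);   // stride threads in x (the fast axis), blockSize in y
  fishbone<<<blks, thrs, 0, cudaStream>>>();
}
```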
here and later, and in the kernel code: do we expect any differences using

```cuda
dim3 blks(1, numberOfBlocks, 1);
dim3 thrs(stride, blockSize, 1);
```

or

```cuda
dim3 blks(numberOfBlocks, 1, 1);
dim3 thrs(blockSize, stride, 1);
```

assuming the `.x` and `.y` are swapped accordingly inside the kernels?
In fact, do we expect any performance difference using

```cuda
kernel<<<dim3(1, blocks, 1), dim3(stride, size, 1)>>>(...);
```

or

```cuda
kernel<<<blocks, size * stride>>>(..., stride);
```

?
Thanks for splitting the PR.
Answer to the first question:
According to the CUDA documentation and examples, "x" runs faster than "y", so swapping "x" with "y" will NOT achieve the desired result of having the inner loop run on contiguous CUDA threads.
The current implementation is intended to be equivalent to the hand-made one in terms of thread assignment.
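To make the thread-assignment argument concrete, here is a small hypothetical example (not the fishbone code): threads in a block are linearised as threadIdx.x + threadIdx.y * blockDim.x, so only an inner index taken from .x lands on consecutive lanes of a warp:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: print the linear (warp-lane) position of the first few
// inner iterations of one outer element.
__global__ void showLanes() {
  int linear = threadIdx.x + threadIdx.y * blockDim.x;  // hardware thread ordering
  // Layout A, thrs(stride, blockSize): inner = threadIdx.x -> linear 0, 1, 2, ... (same warp)
  // Layout B, thrs(blockSize, stride): inner = threadIdx.y -> linear 0, blockDim.x, 2*blockDim.x, ...
  if (threadIdx.y == 0 && threadIdx.x < 4)
    printf("inner %d sits at linear thread %d\n", threadIdx.x, linear);
}

int main() {
  showLanes<<<1, dim3(8, 32, 1)>>>();  // layout A: stride = 8 in x, blockSize = 32 in y
  cudaDeviceSynchronize();
  return 0;
}
```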
Answer to the second question:
IN PRINCIPLE the two approaches should be fully equivalent: the use of a 2D grid is clearly more CUDA-style, and does not require percolating the stride down to the kernel.
I should have coded directly using the 2D grid.
IN PRACTICE: I cannot exclude a different overhead between the two implementations.
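As an illustration of the two formulations being compared, here is a sketch of a pair of kernels that do the same (invented) work with the same thread assignment; all names and sizes are made up, so this is an equivalence sketch, not the production code:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// 2D-grid version: the stride is simply blockDim.x, nothing extra is passed.
__global__ void combine2D(float* out, int nOuter) {
  int outer = blockIdx.y * blockDim.y + threadIdx.y;
  int inner = threadIdx.x;
  if (outer < nOuter) out[outer * blockDim.x + inner] = outer + 0.1f * inner;
}

// Hand-made flat version: size*stride threads per block, stride passed explicitly.
__global__ void combine1D(float* out, int nOuter, int stride) {
  int outer = blockIdx.x * (blockDim.x / stride) + threadIdx.x / stride;
  int inner = threadIdx.x % stride;
  if (outer < nOuter) out[outer * stride + inner] = outer + 0.1f * inner;
}

int main() {
  const int blocks = 16, size = 32, stride = 8, nOuter = blocks * size;
  float *a, *b;
  cudaMallocManaged(&a, nOuter * stride * sizeof(float));
  cudaMallocManaged(&b, nOuter * stride * sizeof(float));

  combine2D<<<dim3(1, blocks, 1), dim3(stride, size, 1)>>>(a, nOuter);
  combine1D<<<blocks, size * stride>>>(b, nOuter, stride);
  cudaDeviceSynchronize();

  for (int i = 0; i < nOuter * stride; ++i) assert(a[i] == b[i]);  // identical results
  cudaFree(a);
  cudaFree(b);
  return 0;
}
```

In both versions consecutive inner values map to consecutive linear thread IDs, so the memory-access pattern and warp layout should match; any residual difference would come from launch overhead and index arithmetic.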
I have simple unit tests/examples
https://github.com/VinInn/ctest/blob/master/cuda/combiHM.cu
https://github.com/VinInn/ctest/blob/master/cuda/combiXY.cu
The hand-made version seems a bit faster.
My opinion is that the 2D grid is the way to code it in CUDA: it is surely easier to understand and maintain. (The hand-made version is like using 1D arrays in C and computing the offsets by hand instead of using a 2D array...)
We could investigate with CUDA/nvcc experts; I am not sure we would get anywhere.
59fe318 to db3e6f8
Here is a breakdown of the performance changes with respect to #216, on a P100 and on a V100. The measurements were done running 4 times over 4200 events with 1 job, each with 8 threads, 8 streams and 1 GPU, measuring the throughput between the 101st and the 4101st event, and taking the average.
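For reference, the averaging described above boils down to something like this sketch (the per-run timings are invented; only the 4000-event window and the average over 4 runs come from the description):

```c++
#include <cstdio>
#include <vector>

int main() {
  // Hypothetical wall-clock time (seconds) of each of the 4 runs, measured
  // between the 101st and the 4101st event, i.e. over 4000 events per run.
  const std::vector<double> seconds{5.1, 5.0, 5.2, 5.1};
  const double events = 4101 - 101;

  double sum = 0.;
  for (double s : seconds) sum += events / s;  // per-run throughput in ev/s
  std::printf("average throughput: %.1f ev/s\n", sum / seconds.size());
  return 0;
}
```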
So, the doublet finder changes seem to have a small negative impact on the V100.
This PR (on top of #216 and #236) improves two combinatorial algorithms: the clusterizer and the doublet finder.
Physics performance (MTV) is identical to #197.