Speed up the doublet finder #242
Introduce the inner loop parallelization in the doublet finder using the stride pattern already used in the "fishbone", and make use of a 2D grid instead of a hand-made stride.
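To make the two index mappings concrete, here is a minimal host-side C++ sketch that emulates them on the CPU. It is not the code from this PR: the function names (`coverHandMade`, `cover2D`), the parameter names, and the exact form of the "hand-made" stride (a flat 1D thread index split into outer and inner parts by division and modulo) are assumptions made for illustration. The variables named after CUDA built-ins (`blockIdxX`, `threadIdxY`, and so on) are plain loop counters standing in for the values a kernel would read.

```cpp
#include <cassert>
#include <vector>

// Number of times each (outer, inner) pair is visited, stored row-major.
using Counts = std::vector<int>;

// Assumed "hand-made" stride: 1D launch, each thread derives its outer and
// inner starting points from its flat index.
Counts coverHandMade(int gridDimX, int blockDimX, int innerStride,
                     int nOuter, int nInner) {
  Counts counts(nOuter * nInner, 0);
  const int total = gridDimX * blockDimX;
  // The sketch assumes the thread count is a multiple of the inner stride.
  assert(total % innerStride == 0);
  const int outerStride = total / innerStride;
  for (int blockIdxX = 0; blockIdxX < gridDimX; ++blockIdxX) {
    for (int threadIdxX = 0; threadIdxX < blockDimX; ++threadIdxX) {
      const int flat = blockIdxX * blockDimX + threadIdxX;
      const int outerFirst = flat / innerStride;
      const int innerFirst = flat % innerStride;
      for (int i = outerFirst; i < nOuter; i += outerStride)
        for (int j = innerFirst; j < nInner; j += innerStride)
          ++counts[i * nInner + j];
    }
  }
  return counts;
}

// 2D launch: the y dimension strides over the outer loop, and the x
// dimension of each block strides over the inner loop (gridDim.x == 1).
Counts cover2D(int gridDimY, int blockDimX, int blockDimY,
               int nOuter, int nInner) {
  Counts counts(nOuter * nInner, 0);
  const int outerStride = gridDimY * blockDimY;
  for (int blockIdxY = 0; blockIdxY < gridDimY; ++blockIdxY) {
    for (int threadIdxY = 0; threadIdxY < blockDimY; ++threadIdxY) {
      for (int threadIdxX = 0; threadIdxX < blockDimX; ++threadIdxX) {
        const int outerFirst = blockIdxY * blockDimY + threadIdxY;
        for (int i = outerFirst; i < nOuter; i += outerStride)
          for (int j = threadIdxX; j < nInner; j += blockDimX)
            ++counts[i * nInner + j];
      }
    }
  }
  return counts;
}
```

With matching launch shapes, for example `coverHandMade(4, 64, 8, 10, 37)` and `cover2D(4, 8, 8, 10, 37)` (4 blocks of 64 threads in both cases), the two functions visit every (outer, inner) pair exactly once. The sketch only checks that the work is partitioned identically; it says nothing about which variant runs faster on a given GPU.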
I confirm that on Volta the hand-made stride is definitely faster than the 2D grid (see https://github.com/VinInn/ctest/blob/master/cuda/combiXY.cu). Maybe we should investigate with Nvidia.
Is it possible that the cause is the different number of blocks vs. threads in the two cases?
Updating the block size in
==411836== Profiling application: ./combiHM
==411836== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 47.85% 21.660ms 16 1.3537ms 1.3499ms 1.3588ms void nn<int=1>(unsigned int*, float const *, float const *, unsigned int*, int, float)
26.52% 12.007ms 16 750.41us 747.86us 753.27us void nn<int=2>(unsigned int*, float const *, float const *, unsigned int*, int, float)
13.71% 6.2061ms 16 387.88us 386.14us 389.82us void nn<int=4>(unsigned int*, float const *, float const *, unsigned int*, int, float)
7.33% 3.3163ms 16 207.27us 206.17us 208.54us void nn<int=8>(unsigned int*, float const *, float const *, unsigned int*, int, float)
4.21% 1.9058ms 16 119.11us 118.40us 121.15us void nn<int=16>(unsigned int*, float const *, float const *, unsigned int*, int, float)
==411864== Profiling application: ./combiXY
==411864== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 46.86% 23.330ms 16 1.4581ms 1.4540ms 1.4618ms void nn<int=1>(unsigned int*, float const *, float const *, unsigned int*, int, float)
24.62% 12.255ms 16 765.95us 765.46us 766.81us void nn<int=2>(unsigned int*, float const *, float const *, unsigned int*, int, float)
14.22% 7.0786ms 16 442.41us 441.72us 443.16us void nn<int=4>(unsigned int*, float const *, float const *, unsigned int*, int, float)
8.77% 4.3651ms 16 272.82us 272.57us 273.69us void nn<int=8>(unsigned int*, float const *, float const *, unsigned int*, int, float)
5.19% 2.5830ms 16 161.44us 159.26us 163.13us void nn<int=16>(unsigned int*, float const *, float const *, unsigned int*, int, float)
==411890== Profiling application: ./combiXY2
==411890== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 49.75% 23.314ms 16 1.4571ms 1.4538ms 1.4621ms void nn<int=1>(unsigned int*, float const *, float const *, unsigned int*, int, float)
25.47% 11.936ms 16 745.98us 743.38us 748.34us void nn<int=2>(unsigned int*, float const *, float const *, unsigned int*, int, float)
13.26% 6.2147ms 16 388.42us 386.97us 389.69us void nn<int=4>(unsigned int*, float const *, float const *, unsigned int*, int, float)
7.08% 3.3200ms 16 207.50us 206.56us 208.51us void nn<int=8>(unsigned int*, float const *, float const *, unsigned int*, int, float)
4.07% 1.9070ms 16 119.19us 118.62us 120.99us void nn<int=16>(unsigned int*, float const *, float const *, unsigned int*, int, float)
OK, I think you are fully right.
I updated the simple test and added more nt, nb combinations. Of course the test is small, so there is no register pressure.
Second part of @VinInn's #238.