
Speed up the doublet finder #242

Closed

Conversation

@fwyzard fwyzard commented Jan 9, 2019

Second part of @VinInn's #238.

Introduce inner-loop parallelization in the doublet finder using the stride pattern already used in the "fishbone", and make use of a 2D grid instead of a hand-made stride.

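To make the two schemes concrete, here is a minimal sketch, not the PR code: the kernel names mirror the profiled nn<...> kernels below, but the pair check is a hypothetical stand-in that just counts pairs within a cut.

// Minimal sketch, not the PR code: the pair check is a hypothetical
// stand-in that counts pairs within a distance cut.

// Hand-made stride: blockIdx.x encodes both the outer-loop index and a
// slice of the inner loop, so the kernel is launched with n*STRIDE
// blocks of 64 threads.
template <int STRIDE>
__global__ void nnHM(unsigned int* count, const float* x, const float* y,
                     int n, float cut) {
  int i = blockIdx.x / STRIDE;      // outer-loop index
  int slice = blockIdx.x % STRIDE;  // slice of the inner loop
  if (i >= n)
    return;
  for (int j = threadIdx.x + slice * blockDim.x; j < n;
       j += STRIDE * blockDim.x) {
    if (fabsf(x[i] - y[j]) < cut)
      atomicAdd(count, 1u);
  }
}

// 2D indexing (a 2D thread block here, matching combiXY's launch shape):
// threadIdx.y provides the inner-loop parallelism, so the kernel is
// launched with n blocks of dim3(64, STRIDE) threads.
template <int STRIDE>
__global__ void nnXY(unsigned int* count, const float* x, const float* y,
                     int n, float cut) {
  int i = blockIdx.x;  // outer-loop index
  if (i >= n)
    return;
  for (int j = threadIdx.x + threadIdx.y * blockDim.x; j < n;
       j += blockDim.x * STRIDE) {  // assumes blockDim.y == STRIDE
    if (fabsf(x[i] - y[j]) < cut)
      atomicAdd(count, 1u);
  }
}
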
VinInn commented Jan 22, 2019

I confirm that on Volta the hand-made stride is definitely faster than the 2D grid, using my simple tests:

https://github.com/VinInn/ctest/blob/master/cuda/combiHM.cu
https://github.com/VinInn/ctest/blob/master/cuda/combiXY.cu

Maybe we should investigate with NVIDIA. (By the way, using clang does not change anything.)

HM

 nvprof hm
==2778312== NVPROF is profiling process 2778312, command: hm
==2778312== Profiling application: hm
==2778312== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   47.67%  20.940ms        16  1.3087ms  1.3046ms  1.3153ms  void nn<int=1>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   26.58%  11.676ms        16  729.76us  726.97us  732.41us  void nn<int=2>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   13.83%  6.0769ms        16  379.81us  378.46us  381.12us  void nn<int=4>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    7.31%  3.2094ms        16  200.59us  200.06us  201.05us  void nn<int=8>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    4.19%  1.8410ms        16  115.06us  114.24us  116.19us  void nn<int=16>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    0.37%  161.25us        32  5.0380us  5.0230us  5.0560us  [CUDA memcpy HtoD]
                    0.05%  20.736us        16  1.2960us  1.2480us  1.6320us  [CUDA memset]

2D

[innocent@workergpu16 cuda]$ nvprof xy
==2778260== NVPROF is profiling process 2778260, command: xy
==2778260== Profiling application: xy
==2778260== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   46.97%  22.544ms        16  1.4090ms  1.4035ms  1.4128ms  void nn<int=1>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   24.59%  11.805ms        16  737.84us  736.89us  738.78us  void nn<int=2>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   14.11%  6.7709ms        16  423.18us  422.59us  423.77us  void nn<int=4>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    8.77%  4.2076ms        16  262.97us  262.14us  263.81us  void nn<int=8>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    5.18%  2.4878ms        16  155.49us  154.21us  156.58us  void nn<int=16>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    0.34%  162.59us        32  5.0800us  5.0240us  5.3760us  [CUDA memcpy HtoD]
                    0.04%  20.864us        16  1.3040us  1.2480us  1.6320us  [CUDA memset]
      API calls:   86.70%  322.71ms         4  80.678ms  3.3450us  322.50ms  cudaMalloc
                   11.94%  44.463ms        32  1.3895ms  13.364us  2.9524ms  cudaMemcpy

fwyzard commented Jan 22, 2019

Is it possible that the cause is the different split between blocks and threads in the two cases (see the launch sketch after the list)?

  • combiHM.cu has 1024*STRIDE blocks, each with 64 threads
  • combiXY.cu has 1024 blocks, each with STRIDE*64 threads
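
For illustration only, the two launch shapes would look roughly like this: a hypothetical driver assuming the nnHM / nnXY sketches above (the buffer contents are irrelevant here).

// Hypothetical driver showing the two launch shapes side by side; it
// assumes the nnHM / nnXY kernel sketches from the first comment.
#include <cuda_runtime.h>

int main() {
  constexpr int N = 1024;    // outer-loop size used in the tests
  constexpr int STRIDE = 4;  // one of the tested values: 1, 2, 4, 8, 16
  float *x, *y;
  unsigned int* count;
  cudaMalloc(&x, N * sizeof(float));
  cudaMalloc(&y, N * sizeof(float));
  cudaMalloc(&count, sizeof(unsigned int));
  cudaMemset(x, 0, N * sizeof(float));
  cudaMemset(y, 0, N * sizeof(float));
  cudaMemset(count, 0, sizeof(unsigned int));

  // combiHM-style launch: 1024*STRIDE blocks, each with 64 threads.
  nnHM<STRIDE><<<N * STRIDE, 64>>>(count, x, y, N, 0.1f);

  // combiXY-style launch: 1024 blocks, each a 2D block of STRIDE*64 threads.
  nnXY<STRIDE><<<N, dim3(64, STRIDE)>>>(count, x, y, N, 0.1f);

  cudaDeviceSynchronize();
  cudaFree(x);
  cudaFree(y);
  cudaFree(count);
  return 0;
}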

fwyzard commented Jan 22, 2019

Updating the block size in combiXY to match that of combiHM, I get more similar performance on a V100 (except for the case STRIDE=1):

==411836== Profiling application: ./combiHM
==411836== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   47.85%  21.660ms        16  1.3537ms  1.3499ms  1.3588ms  void nn<int=1>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   26.52%  12.007ms        16  750.41us  747.86us  753.27us  void nn<int=2>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   13.71%  6.2061ms        16  387.88us  386.14us  389.82us  void nn<int=4>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    7.33%  3.3163ms        16  207.27us  206.17us  208.54us  void nn<int=8>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    4.21%  1.9058ms        16  119.11us  118.40us  121.15us  void nn<int=16>(unsigned int*, float const *, float const *, unsigned int*, int, float)
==411864== Profiling application: ./combiXY
==411864== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   46.86%  23.330ms        16  1.4581ms  1.4540ms  1.4618ms  void nn<int=1>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   24.62%  12.255ms        16  765.95us  765.46us  766.81us  void nn<int=2>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   14.22%  7.0786ms        16  442.41us  441.72us  443.16us  void nn<int=4>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    8.77%  4.3651ms        16  272.82us  272.57us  273.69us  void nn<int=8>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    5.19%  2.5830ms        16  161.44us  159.26us  163.13us  void nn<int=16>(unsigned int*, float const *, float const *, unsigned int*, int, float)
==411890== Profiling application: ./combiXY2
==411890== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   49.75%  23.314ms        16  1.4571ms  1.4538ms  1.4621ms  void nn<int=1>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   25.47%  11.936ms        16  745.98us  743.38us  748.34us  void nn<int=2>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   13.26%  6.2147ms        16  388.42us  386.97us  389.69us  void nn<int=4>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    7.08%  3.3200ms        16  207.50us  206.56us  208.51us  void nn<int=8>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    4.07%  1.9070ms        16  119.19us  118.62us  120.99us  void nn<int=16>(unsigned int*, float const *, float const *, unsigned int*, int, float)

VinInn commented Jan 22, 2019

OK, I think you are fully right.
So we should keep numberOfBlocks the same and "reduce" the block size in y (and apparently a small block size is better); see the sketch below.
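
One hedged reading of this in code, again using a toy kernel like the earlier sketches: keep the total block count at n*STRIDE as in the hand-made version, keep each block at 64 threads, and move the inner-loop slices into gridDim.y.

// Hypothetical variant: same total number of blocks as the hand-made
// stride (n*STRIDE), small 64-thread blocks, with the inner-loop slice
// taken from blockIdx.y instead of blockIdx.x % STRIDE.
__global__ void nn2D(unsigned int* count, const float* x, const float* y,
                     int n, float cut) {
  int i = blockIdx.x;  // outer-loop index
  if (i >= n)
    return;
  for (int j = threadIdx.x + blockIdx.y * blockDim.x; j < n;
       j += gridDim.y * blockDim.x) {
    if (fabsf(x[i] - y[j]) < cut)
      atomicAdd(count, 1u);
  }
}

// launch: nn2D<<<dim3(n, STRIDE), 64>>>(count, x, y, n, cut);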

VinInn commented Jan 22, 2019

I updated the simple test and added more (nt, nb) combinations (of course the test is small, so there is no register pressure).

[innocent@workergpu16 cuda]$ nvprof ./hm
==2808519== NVPROF is profiling process 2808519, command: ./hm
==2808519== Profiling application: ./hm
==2808519== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   20.98%  20.966ms        16  1.3104ms  1.3053ms  1.3181ms  void nn<int=1, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   20.88%  20.867ms        16  1.3042ms  1.3027ms  1.3066ms  void nn<int=1, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   12.23%  12.229ms        16  764.31us  763.58us  765.50us  void nn<int=2, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   11.71%  11.703ms        16  731.44us  727.58us  733.47us  void nn<int=2, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    6.77%  6.7678ms        16  422.99us  422.49us  423.68us  void nn<int=4, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    6.09%  6.0875ms        16  380.47us  378.65us  381.56us  void nn<int=4, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    3.72%  3.7134ms        16  232.09us  231.33us  232.64us  void nn<int=8, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    3.23%  3.2245ms        16  201.53us  200.45us  202.56us  void nn<int=8, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    2.10%  2.0942ms        16  130.89us  130.59us  131.17us  void nn<int=256, int=1024>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    2.01%  2.0094ms        16  125.58us  124.64us  126.62us  void nn<int=16, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.88%  1.8839ms        16  117.74us  117.37us  118.46us  void nn<int=32, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.85%  1.8469ms        16  115.43us  114.43us  116.54us  void nn<int=16, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.66%  1.6597ms        16  103.73us  103.55us  103.94us  void nn<int=128, int=1024>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.64%  1.6428ms        16  102.67us  102.53us  102.78us  void nn<int=64, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.62%  1.6230ms        16  101.44us  95.552us  106.08us  void nn<int=32, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.45%  1.4507ms        16  90.667us  90.432us  91.007us  void nn<int=64, int=1024>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    0.16%  162.01us        32  5.0620us  5.0230us  5.3760us  [CUDA memcpy HtoD]
                    0.02%  20.768us        16  1.2980us  1.2480us  1.6640us  [CUDA memset]
      API calls:   76.11%  324.21ms         4  81.052ms  3.0660us  324.00ms  cudaMalloc
                   21.70%  92.450ms        32  2.8891ms  13.471us  6.1553ms  cudaMemcpy
                    1.45%  6.1771ms         1  6.1771ms  6.1771ms  6.1771ms  cudaDeviceSynchronize
                    0.42%  1.8039ms       256  7.0460us  4.5870us  28.812us  cudaLaunchKernel
                    0.13%  568.19us        96  5.9180us     113ns  227.21us  cuDeviceGetAttribute
                    0.07%  316.80us         4  79.199us  4.4870us  172.74us  cudaFree
                    0.06%  246.18us         1  246.18us  246.18us  246.18us  cuDeviceTotalMem
                    0.03%  149.01us        16  9.3130us  6.2560us  47.610us  cudaMemset
                    0.01%  48.370us         1  48.370us  48.370us  48.370us  cuDeviceGetName
                    0.00%  17.561us         5  3.5120us     291ns  15.272us  cudaGetDevice
                    0.00%  6.1770us         1  6.1770us  6.1770us  6.1770us  cudaGetDeviceCount
                    0.00%  4.2480us         1  4.2480us  4.2480us  4.2480us  cuDeviceGetPCIBusId
                    0.00%  1.3800us         3     460ns     113ns     995ns  cuDeviceGetCount
                    0.00%     602ns         2     301ns     163ns     439ns  cuDeviceGet
                    0.00%     219ns         1     219ns     219ns     219ns  cuDeviceGetUuid
[innocent@workergpu16 cuda]$ nvprof ./xy
==2808574== NVPROF is profiling process 2808574, command: ./xy
==2808574== Profiling application: ./xy
==2808574== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   21.87%  22.534ms        16  1.4083ms  1.4044ms  1.4146ms  void nn<int=1, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   21.62%  22.274ms        16  1.3921ms  1.3906ms  1.3944ms  void nn<int=1, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   11.79%  12.146ms        16  759.10us  758.30us  760.41us  void nn<int=2, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   11.27%  11.614ms        16  725.86us  723.16us  728.89us  void nn<int=2, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    6.57%  6.7738ms        16  423.36us  422.94us  424.06us  void nn<int=4, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    5.91%  6.0904ms        16  380.65us  378.97us  381.82us  void nn<int=4, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    3.60%  3.7082ms        16  231.76us  231.23us  233.31us  void nn<int=8, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    3.13%  3.2211ms        16  201.32us  200.41us  202.62us  void nn<int=8, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    2.17%  2.2351ms        16  139.69us  139.45us  140.00us  void nn<int=256, int=1024>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.95%  2.0126ms        16  125.78us  124.70us  126.34us  void nn<int=16, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.83%  1.8831ms        16  117.69us  117.15us  118.56us  void nn<int=32, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.80%  1.8524ms        16  115.77us  114.94us  117.02us  void nn<int=16, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.68%  1.7334ms        16  108.34us  108.19us  108.54us  void nn<int=128, int=1024>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.60%  1.6476ms        16  102.97us  102.78us  103.23us  void nn<int=64, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.59%  1.6427ms        16  102.67us  96.191us  107.52us  void nn<int=32, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.44%  1.4866ms        16  92.913us  92.543us  93.183us  void nn<int=64, int=1024>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    0.16%  161.57us        32  5.0480us  5.0240us  5.1840us  [CUDA memcpy HtoD]
                    0.02%  20.800us        16  1.3000us  1.2480us  1.6640us  [CUDA memset]
      API calls:   75.69%  326.87ms         4  81.716ms  2.9700us  326.66ms  cudaMalloc
                   21.99%  94.948ms        32  2.9671ms  11.706us  6.3266ms  cudaMemcpy

fwyzard commented Jan 23, 2019

Replaced by #260 or #261.

@fwyzard fwyzard closed this Jan 23, 2019
@fwyzard fwyzard deleted the VinInn_GPUFastTracksNNClus_part2 branch January 23, 2019 17:39
@fwyzard fwyzard removed this from the CMSSW_10_5_X_Patatrack milestone Mar 26, 2019