
Speed up the doublet finder #242

Closed

Conversation

@fwyzard fwyzard commented Jan 9, 2019

Second part of @VinInn's #238.

Introduce inner-loop parallelization in the doublet finder using the stride pattern already used in the "fishbone", and make use of a 2D grid instead of a hand-made stride.

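To make the two schemes concrete, here is a minimal sketch, not the PR code: the kernel names mirror the profiled nn<...> kernels below, but the pair check is a hypothetical stand-in that just counts pairs within a cut.

// Minimal sketch, not the PR code: the pair check is a hypothetical
// stand-in that counts pairs within a distance cut.

// Hand-made stride: blockIdx.x encodes both the outer-loop index and a
// slice of the inner loop, so the kernel is launched with n*STRIDE
// blocks of 64 threads.
template <int STRIDE>
__global__ void nnHM(unsigned int* count, const float* x, const float* y,
                     int n, float cut) {
  int i = blockIdx.x / STRIDE;      // outer-loop index
  int slice = blockIdx.x % STRIDE;  // slice of the inner loop
  if (i >= n)
    return;
  for (int j = threadIdx.x + slice * blockDim.x; j < n;
       j += STRIDE * blockDim.x) {
    if (fabsf(x[i] - y[j]) < cut)
      atomicAdd(count, 1u);
  }
}

// 2D indexing (a 2D thread block here, matching combiXY's launch shape):
// threadIdx.y provides the inner-loop parallelism, so the kernel is
// launched with n blocks of dim3(64, STRIDE) threads.
template <int STRIDE>
__global__ void nnXY(unsigned int* count, const float* x, const float* y,
                     int n, float cut) {
  int i = blockIdx.x;  // outer-loop index
  if (i >= n)
    return;
  for (int j = threadIdx.x + threadIdx.y * blockDim.x; j < n;
       j += blockDim.x * STRIDE) {  // assumes blockDim.y == STRIDE
    if (fabsf(x[i] - y[j]) < cut)
      atomicAdd(count, 1u);
  }
}
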
VinInn commented Jan 22, 2019

I confirm that on Volta the hand-made stride is definitely faster than the 2D grid, using my simple tests:

https://github.com/VinInn/ctest/blob/master/cuda/combiHM.cu
https://github.com/VinInn/ctest/blob/master/cuda/combiXY.cu

Maybe we should investigate with NVIDIA. (By the way, using clang does not change anything.)

HM

 nvprof hm
==2778312== NVPROF is profiling process 2778312, command: hm
==2778312== Profiling application: hm
==2778312== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   47.67%  20.940ms        16  1.3087ms  1.3046ms  1.3153ms  void nn<int=1>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   26.58%  11.676ms        16  729.76us  726.97us  732.41us  void nn<int=2>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   13.83%  6.0769ms        16  379.81us  378.46us  381.12us  void nn<int=4>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    7.31%  3.2094ms        16  200.59us  200.06us  201.05us  void nn<int=8>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    4.19%  1.8410ms        16  115.06us  114.24us  116.19us  void nn<int=16>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    0.37%  161.25us        32  5.0380us  5.0230us  5.0560us  [CUDA memcpy HtoD]
                    0.05%  20.736us        16  1.2960us  1.2480us  1.6320us  [CUDA memset]

2D

[innocent@workergpu16 cuda]$ nvprof xy
==2778260== NVPROF is profiling process 2778260, command: xy
==2778260== Profiling application: xy
==2778260== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   46.97%  22.544ms        16  1.4090ms  1.4035ms  1.4128ms  void nn<int=1>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   24.59%  11.805ms        16  737.84us  736.89us  738.78us  void nn<int=2>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   14.11%  6.7709ms        16  423.18us  422.59us  423.77us  void nn<int=4>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    8.77%  4.2076ms        16  262.97us  262.14us  263.81us  void nn<int=8>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    5.18%  2.4878ms        16  155.49us  154.21us  156.58us  void nn<int=16>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    0.34%  162.59us        32  5.0800us  5.0240us  5.3760us  [CUDA memcpy HtoD]
                    0.04%  20.864us        16  1.3040us  1.2480us  1.6320us  [CUDA memset]
      API calls:   86.70%  322.71ms         4  80.678ms  3.3450us  322.50ms  cudaMalloc
                   11.94%  44.463ms        32  1.3895ms  13.364us  2.9524ms  cudaMemcpy

fwyzard commented Jan 22, 2019

Is it possible that the cause is the different split between blocks and threads in the two cases (see the launch sketch after the list)?

  • combiHM.cu has 1024*STRIDE blocks, each with 64 threads
  • combiXY.cu has 1024 blocks, each with STRIDE*64 threads
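
For illustration only, the two launch shapes would look roughly like this: a hypothetical driver assuming the nnHM / nnXY sketches above (the buffer contents are irrelevant here).

// Hypothetical driver showing the two launch shapes side by side; it
// assumes the nnHM / nnXY kernel sketches from the first comment.
#include <cuda_runtime.h>

int main() {
  constexpr int N = 1024;    // outer-loop size used in the tests
  constexpr int STRIDE = 4;  // one of the tested values: 1, 2, 4, 8, 16
  float *x, *y;
  unsigned int* count;
  cudaMalloc(&x, N * sizeof(float));
  cudaMalloc(&y, N * sizeof(float));
  cudaMalloc(&count, sizeof(unsigned int));
  cudaMemset(x, 0, N * sizeof(float));
  cudaMemset(y, 0, N * sizeof(float));
  cudaMemset(count, 0, sizeof(unsigned int));

  // combiHM-style launch: 1024*STRIDE blocks, each with 64 threads.
  nnHM<STRIDE><<<N * STRIDE, 64>>>(count, x, y, N, 0.1f);

  // combiXY-style launch: 1024 blocks, each a 2D block of STRIDE*64 threads.
  nnXY<STRIDE><<<N, dim3(64, STRIDE)>>>(count, x, y, N, 0.1f);

  cudaDeviceSynchronize();
  cudaFree(x);
  cudaFree(y);
  cudaFree(count);
  return 0;
}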

fwyzard commented Jan 22, 2019

Updating the block size in combiXY to match that of combiHM, I get more similar performance on a V100 (except for the case STRIDE=1):

==411836== Profiling application: ./combiHM
==411836== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   47.85%  21.660ms        16  1.3537ms  1.3499ms  1.3588ms  void nn<int=1>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   26.52%  12.007ms        16  750.41us  747.86us  753.27us  void nn<int=2>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   13.71%  6.2061ms        16  387.88us  386.14us  389.82us  void nn<int=4>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    7.33%  3.3163ms        16  207.27us  206.17us  208.54us  void nn<int=8>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    4.21%  1.9058ms        16  119.11us  118.40us  121.15us  void nn<int=16>(unsigned int*, float const *, float const *, unsigned int*, int, float)
==411864== Profiling application: ./combiXY
==411864== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   46.86%  23.330ms        16  1.4581ms  1.4540ms  1.4618ms  void nn<int=1>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   24.62%  12.255ms        16  765.95us  765.46us  766.81us  void nn<int=2>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   14.22%  7.0786ms        16  442.41us  441.72us  443.16us  void nn<int=4>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    8.77%  4.3651ms        16  272.82us  272.57us  273.69us  void nn<int=8>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    5.19%  2.5830ms        16  161.44us  159.26us  163.13us  void nn<int=16>(unsigned int*, float const *, float const *, unsigned int*, int, float)
==411890== Profiling application: ./combiXY2
==411890== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   49.75%  23.314ms        16  1.4571ms  1.4538ms  1.4621ms  void nn<int=1>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   25.47%  11.936ms        16  745.98us  743.38us  748.34us  void nn<int=2>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   13.26%  6.2147ms        16  388.42us  386.97us  389.69us  void nn<int=4>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    7.08%  3.3200ms        16  207.50us  206.56us  208.51us  void nn<int=8>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    4.07%  1.9070ms        16  119.19us  118.62us  120.99us  void nn<int=16>(unsigned int*, float const *, float const *, unsigned int*, int, float)

VinInn commented Jan 22, 2019

OK, I think you are fully right.
So we should keep numberOfBlocks the same and "reduce" the block size in y (and apparently a small block size is better); see the sketch below.
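
One hedged reading of this in code, again using a toy kernel like the earlier sketches: keep the total block count at n*STRIDE as in the hand-made version, keep each block at 64 threads, and move the inner-loop slices into gridDim.y.

// Hypothetical variant: same total number of blocks as the hand-made
// stride (n*STRIDE), small 64-thread blocks, with the inner-loop slice
// taken from blockIdx.y instead of blockIdx.x % STRIDE.
__global__ void nn2D(unsigned int* count, const float* x, const float* y,
                     int n, float cut) {
  int i = blockIdx.x;  // outer-loop index
  if (i >= n)
    return;
  for (int j = threadIdx.x + blockIdx.y * blockDim.x; j < n;
       j += gridDim.y * blockDim.x) {
    if (fabsf(x[i] - y[j]) < cut)
      atomicAdd(count, 1u);
  }
}

// launch: nn2D<<<dim3(n, STRIDE), 64>>>(count, x, y, n, cut);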

VinInn commented Jan 22, 2019

I updated the simple test and added more (nt, nb) combinations (of course the test is small, so there is no register pressure).

[innocent@workergpu16 cuda]$ nvprof ./hm
==2808519== NVPROF is profiling process 2808519, command: ./hm
==2808519== Profiling application: ./hm
==2808519== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   20.98%  20.966ms        16  1.3104ms  1.3053ms  1.3181ms  void nn<int=1, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   20.88%  20.867ms        16  1.3042ms  1.3027ms  1.3066ms  void nn<int=1, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   12.23%  12.229ms        16  764.31us  763.58us  765.50us  void nn<int=2, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   11.71%  11.703ms        16  731.44us  727.58us  733.47us  void nn<int=2, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    6.77%  6.7678ms        16  422.99us  422.49us  423.68us  void nn<int=4, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    6.09%  6.0875ms        16  380.47us  378.65us  381.56us  void nn<int=4, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    3.72%  3.7134ms        16  232.09us  231.33us  232.64us  void nn<int=8, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    3.23%  3.2245ms        16  201.53us  200.45us  202.56us  void nn<int=8, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    2.10%  2.0942ms        16  130.89us  130.59us  131.17us  void nn<int=256, int=1024>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    2.01%  2.0094ms        16  125.58us  124.64us  126.62us  void nn<int=16, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.88%  1.8839ms        16  117.74us  117.37us  118.46us  void nn<int=32, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.85%  1.8469ms        16  115.43us  114.43us  116.54us  void nn<int=16, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.66%  1.6597ms        16  103.73us  103.55us  103.94us  void nn<int=128, int=1024>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.64%  1.6428ms        16  102.67us  102.53us  102.78us  void nn<int=64, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.62%  1.6230ms        16  101.44us  95.552us  106.08us  void nn<int=32, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.45%  1.4507ms        16  90.667us  90.432us  91.007us  void nn<int=64, int=1024>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    0.16%  162.01us        32  5.0620us  5.0230us  5.3760us  [CUDA memcpy HtoD]
                    0.02%  20.768us        16  1.2980us  1.2480us  1.6640us  [CUDA memset]
      API calls:   76.11%  324.21ms         4  81.052ms  3.0660us  324.00ms  cudaMalloc
                   21.70%  92.450ms        32  2.8891ms  13.471us  6.1553ms  cudaMemcpy
                    1.45%  6.1771ms         1  6.1771ms  6.1771ms  6.1771ms  cudaDeviceSynchronize
                    0.42%  1.8039ms       256  7.0460us  4.5870us  28.812us  cudaLaunchKernel
                    0.13%  568.19us        96  5.9180us     113ns  227.21us  cuDeviceGetAttribute
                    0.07%  316.80us         4  79.199us  4.4870us  172.74us  cudaFree
                    0.06%  246.18us         1  246.18us  246.18us  246.18us  cuDeviceTotalMem
                    0.03%  149.01us        16  9.3130us  6.2560us  47.610us  cudaMemset
                    0.01%  48.370us         1  48.370us  48.370us  48.370us  cuDeviceGetName
                    0.00%  17.561us         5  3.5120us     291ns  15.272us  cudaGetDevice
                    0.00%  6.1770us         1  6.1770us  6.1770us  6.1770us  cudaGetDeviceCount
                    0.00%  4.2480us         1  4.2480us  4.2480us  4.2480us  cuDeviceGetPCIBusId
                    0.00%  1.3800us         3     460ns     113ns     995ns  cuDeviceGetCount
                    0.00%     602ns         2     301ns     163ns     439ns  cuDeviceGet
                    0.00%     219ns         1     219ns     219ns     219ns  cuDeviceGetUuid
[innocent@workergpu16 cuda]$ nvprof ./xy
==2808574== NVPROF is profiling process 2808574, command: ./xy
==2808574== Profiling application: ./xy
==2808574== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   21.87%  22.534ms        16  1.4083ms  1.4044ms  1.4146ms  void nn<int=1, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   21.62%  22.274ms        16  1.3921ms  1.3906ms  1.3944ms  void nn<int=1, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   11.79%  12.146ms        16  759.10us  758.30us  760.41us  void nn<int=2, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                   11.27%  11.614ms        16  725.86us  723.16us  728.89us  void nn<int=2, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    6.57%  6.7738ms        16  423.36us  422.94us  424.06us  void nn<int=4, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    5.91%  6.0904ms        16  380.65us  378.97us  381.82us  void nn<int=4, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    3.60%  3.7082ms        16  231.76us  231.23us  233.31us  void nn<int=8, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    3.13%  3.2211ms        16  201.32us  200.41us  202.62us  void nn<int=8, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    2.17%  2.2351ms        16  139.69us  139.45us  140.00us  void nn<int=256, int=1024>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.95%  2.0126ms        16  125.78us  124.70us  126.34us  void nn<int=16, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.83%  1.8831ms        16  117.69us  117.15us  118.56us  void nn<int=32, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.80%  1.8524ms        16  115.77us  114.94us  117.02us  void nn<int=16, int=64>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.68%  1.7334ms        16  108.34us  108.19us  108.54us  void nn<int=128, int=1024>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.60%  1.6476ms        16  102.97us  102.78us  103.23us  void nn<int=64, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.59%  1.6427ms        16  102.67us  96.191us  107.52us  void nn<int=32, int=256>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    1.44%  1.4866ms        16  92.913us  92.543us  93.183us  void nn<int=64, int=1024>(unsigned int*, float const *, float const *, unsigned int*, int, float)
                    0.16%  161.57us        32  5.0480us  5.0240us  5.1840us  [CUDA memcpy HtoD]
                    0.02%  20.800us        16  1.3000us  1.2480us  1.6640us  [CUDA memset]
      API calls:   75.69%  326.87ms         4  81.716ms  2.9700us  326.66ms  cudaMalloc
                   21.99%  94.948ms        32  2.9671ms  11.706us  6.3266ms  cudaMemcpy

fwyzard commented Jan 23, 2019

Replaced by #260 or #261.

@fwyzard fwyzard closed this Jan 23, 2019
@fwyzard fwyzard deleted the VinInn_GPUFastTracksNNClus_part2 branch January 23, 2019 17:39
@fwyzard fwyzard removed this from the CMSSW_10_5_X_Patatrack milestone Mar 26, 2019