Reduce compile times of distance specializations #1307

ahendriksen · 2023-02-28T18:09:08Z

Following the findings in https://github.com/ahendriksen/raft/tree/investigate-compile-time-reduction-strategies#investigation-of-compile-times, this PR reduces the compile times of the pairwise distance specializations.
This is achieved by:

Reducing the number of included files in the translation units where kernels are instantiated, specifically spdlog and rmm are avoided.
Limiting loop unrolling in kernels with expensive operations in the inner loop.

Additional improvements geared towards iterative development:

The tests do not have to be recompiled when the internals of a pairwise distance kernel change. Before, a rebuilt was triggered due an include of raft/distance/distance.cuh.
Addition of a fine tuning benchmark for the pairwise distance kernels that separates building the kernel from the benchmark code. This dramatically speeds up development. Compiling an empty benchmark takes roughly 18 seconds on my machine. Whereas recompiling a kernel takes ~3.8 seconds. Without this addition, a commit like 35a2ad4 would require substantially more time to make sure that performance is not degraded.

Parallel build time before: 270 seconds (6 cores, SMT, 12 jobs)
Parallel build time before: 147 seconds (6 cores, SMT, 12 jobs)

Sum of compile times before: 3022.6 seconds
Sum of compile times after:  1816.2 seconds

Comparison of compile times between headers and compiled: 
path                                                                         before (s)     after (s)  change (s) change (%)
pairwise_test                                                                    None        0.486       None     None
ance/distance/specializations/detail/lp_unexpanded_double_double_double_int.cu.o  101.1          10.3       -90.8     -89.8%
src/distance/distance/specializations/detail/canberra_float_float_float_int.cu.o   52.9           6.3       -46.6     -88.0%
/distance/distance/specializations/detail/canberra_double_double_double_int.cu.o   48.5           6.4       -42.1     -86.8%
stance/distance/specializations/detail/jensen_shannon_float_float_float_int.cu.o   65.3          10.4       -55.0     -84.1%
istance/distance/specializations/detail/kl_divergence_float_float_float_int.cu.o   70.2          12.6       -57.6     -82.0%
stance/distance/specializations/detail/correlation_double_double_double_int.cu.o   46.7           8.9       -37.8     -80.9%
distance/specializations/detail/hellinger_expanded_double_double_double_int.cu.o   41.6           8.1       -33.5     -80.6%
nce/distance/specializations/detail/jensen_shannon_double_double_double_int.cu.o   74.6          15.1       -59.5     -79.7%
ir/src/distance/distance/specializations/detail/l1_double_double_double_int.cu.o   40.9           8.4       -32.5     -79.4%
ance/distance/specializations/detail/l2_unexpanded_double_double_double_int.cu.o   40.7           8.6       -32.1     -78.8%
distance/specializations/detail/hamming_unexpanded_double_double_double_int.cu.o   40.8           9.0       -31.7     -77.8%
istance/distance/specializations/detail/lp_unexpanded_float_float_float_int.cu.o   45.9          10.2       -35.7     -77.8%
src/distance/distance/specializations/detail/l_inf_double_double_double_int.cu.o   41.2           9.5       -31.8     -77.0%
istance/distance/specializations/detail/russel_rao_double_double_double_int.cu.o   29.5           7.2       -22.3     -75.6%
t.dir/src/distance/distance/specializations/detail/l1_float_float_float_int.cu.o   47.3          13.2       -34.1     -72.2%
ce/distance/specializations/detail/hamming_unexpanded_float_float_float_int.cu.o   47.0          13.3       -33.7     -71.6%
/distance/distance/specializations/detail/correlation_float_float_float_int.cu.o   49.4          14.1       -35.3     -71.5%
ce/distance/specializations/detail/hellinger_expanded_float_float_float_int.cu.o   43.6          12.5       -31.1     -71.4%
c/distance/distance/specializations/detail/russel_rao_float_float_float_int.cu.o   28.5           8.2       -20.3     -71.2%
ance/distance/specializations/detail/kl_divergence_double_double_double_int.cu.o   75.8          21.9       -53.9     -71.1%
istance/distance/specializations/detail/l2_unexpanded_float_float_float_int.cu.o   46.2          13.5       -32.7     -70.7%
ir/src/distance/distance/specializations/detail/l_inf_float_float_float_int.cu.o   43.1          12.7       -30.4     -70.6%
stance/distance/specializations/detail/l2_expanded_double_double_double_int.cu.o   52.3          24.9       -27.3     -52.3%
/distance/distance/specializations/detail/l2_expanded_float_float_float_int.cu.o   75.8          40.3       -35.5     -46.8%
rc/distance/distance/specializations/detail/cosine_double_double_double_int.cu.o   53.5          28.7       -24.8     -46.4%
r/src/distance/distance/specializations/detail/cosine_float_float_float_int.cu.o   83.9          50.1       -33.8     -40.3%
CMakeFiles/pairwise_test.dir/test/distance/fused_l2_nn.cu.o                        85.1          64.1       -21.1     -24.7%
wise_test.dir/src/distance/distance/specializations/fused_l2_nn_float_int64.cu.o   56.2          42.9       -13.3     -23.6%
irwise_test.dir/src/distance/distance/specializations/fused_l2_nn_float_int.cu.o   52.5          40.2       -12.3     -23.5%
CMakeFiles/pairwise_test.dir/test/distance/dist_lp_unexp.cu.o                      56.3          43.3       -13.0     -23.1%
CMakeFiles/pairwise_test.dir/test/distance/dist_russell_rao.cu.o                   55.7          44.0       -11.7     -21.0%
rwise_test.dir/src/distance/distance/specializations/fused_l2_nn_double_int.cu.o   45.3          36.4        -9.0     -19.8%
CMakeFiles/pairwise_test.dir/test/distance/dist_l2_unexp.cu.o                      54.6          44.1       -10.6     -19.3%
CMakeFiles/pairwise_test.dir/test/distance/dist_canberra.cu.o                      51.6          42.1        -9.6     -18.6%
CMakeFiles/pairwise_test.dir/test/distance/dist_l2_exp.cu.o                        53.1          43.4        -9.6     -18.2%
CMakeFiles/pairwise_test.dir/test/distance/dist_l_inf.cu.o                         53.2          43.9        -9.3     -17.5%
CMakeFiles/pairwise_test.dir/test/distance/dist_hellinger.cu.o                     53.1          44.0        -9.0     -17.0%
CMakeFiles/pairwise_test.dir/test/distance/dist_hamming.cu.o                       52.3          43.4        -8.9     -17.0%
CMakeFiles/pairwise_test.dir/test/distance/dist_l2_sqrt_exp.cu.o                   54.0          45.6        -8.4     -15.6%
CMakeFiles/pairwise_test.dir/test/distance/dist_l1.cu.o                            52.6          44.5        -8.1     -15.4%
CMakeFiles/pairwise_test.dir/test/distance/dist_kl_divergence.cu.o                 52.4          44.7        -7.7     -14.8%
ise_test.dir/src/distance/distance/specializations/fused_l2_nn_double_int64.cu.o   43.5          37.2        -6.4     -14.7%
CMakeFiles/pairwise_test.dir/test/distance/dist_cos.cu.o                           52.4          44.8        -7.6     -14.5%
CMakeFiles/pairwise_test.dir/test/distance/dist_jensen_shannon.cu.o                53.2          45.7        -7.6     -14.2%
CMakeFiles/pairwise_test.dir/test/distance/dist_inner_product.cu.o                 51.1          44.8        -6.3     -12.4%
istance/distance/specializations/detail/inner_product_float_float_float_int.cu.o   39.5          35.1        -4.5     -11.3%
CMakeFiles/pairwise_test.dir/test/distance/dist_correlation.cu.o                   51.7          46.8        -4.9     -9.5%
ance/distance/specializations/detail/inner_product_double_double_double_int.cu.o   37.1          33.9        -3.1     -8.5%
src/distance/distance/specializations/detail/kernels/gram_matrix_base_float.cu.o   45.3          41.7        -3.6     -8.0%
rc/distance/distance/specializations/detail/kernels/gram_matrix_base_double.cu.o   42.5          39.6        -2.9     -6.8%
stance/distance/specializations/detail/kernels/polynomial_kernel_double_int.cu.o   40.4          38.5        -1.9     -4.8%
CMakeFiles/pairwise_test.dir/test/distance/dist_adj.cu.o                          123.3         117.8        -5.4     -4.4%
CMakeFiles/pairwise_test.dir/test/distance/gram.cu.o                               55.3          53.4        -1.9     -3.5%
build.ninja                                                                         4.0           4.0        +0.0     +0.1%
istance/distance/specializations/detail/kernels/polynomial_kernel_float_int.cu.o   45.2          45.6        +0.4     +0.8%
.dir/src/distance/distance/specializations/detail/kernels/tanh_kernel_float.cu.o   45.2          46.0        +0.8     +1.7%
dir/src/distance/distance/specializations/detail/kernels/tanh_kernel_double.cu.o   39.0          39.8        +0.8     +2.1%
CMakeFiles/pairwise_test.dir/src/distance/distance/pairwise_distance.cu.o          39.6          50.1       +10.5     +26.6%

The calculation of the tile indices are now performed in ldgXY(). This will make it possible to remove all state related to the tile index out of the class in the next commit. Note that the calculation of the tile index can depend on which overloaded constructor is called(!)

This commit moves all grid and tile indexing logic into the caller. Contractions_NT is now only responsible for *intra*-tile indexing. Due to the complexity of the epilog function, the ldgNextGridStride function is not yet called from within the main loop. That is the next goal so that we have all the grid and tile indexing localized in the loop.

This commit removes the epilog function and moves its functionality into the run loop. The next step might be to see if the ldgNextGridStride() method has to be called the current location, or if performance is the same if its called at the start of the loop.

This results in subtle issues with non-square KernelPolicy, as found in fusedL2KNN.

This is more general than just for L1. Making use of it more is work in progress.

By adding yet another struct ^^

This did remove support for the CUTLASS kernels. Has to be put back.

I wasted a lot of time because I had not replaced the op::core() method of the l2_exp_distance_op after I copied it from l2_unexp_distance_op... If I copy something from the template and forget to fill it in, I get a compile error.

I am testing on CUDA 12, where it does not seem to work. Prior to my commits, the CUTLASS kernels were also not working. So not sure what's up. In any case: consider this untested.

This indicates that the operator uses expensive operations (pow, exp, log) in the inner loop. Therefore, unrolling and or veclen parameters should be adjusted

ahendriksen · 2023-03-21T13:51:23Z

sorry @ahendriksen, we had a high priority to fix the RAFT packaging for 23.04 (in time for GTC keynote) but this caused a bunch of conflicts in this PR. They should be straightforward to fix, though- the libraft source files directory changed from src/distance/.../ to /src/.../ so most of it is just renaming.

No problem! The merge was quite painless actually. Really looking forward to the GTC keynote and all the great progress we will make going forward!

divyegala

Two minor comments that I'll leave to you to decide if you wish to implement! LGTM

cpp/include/raft/distance/detail/fused_l2_nn.cuh

cpp/include/raft/distance/detail/pairwise_matrix/kernel_sm60.cuh

…he suggestions from dask team

…ild-2304-pin_dask

…imes

ahendriksen · 2023-03-22T16:15:32Z

I think we have to wait for #1363 to clear so that the CI passes for this run.

cjnolet · 2023-03-22T16:26:53Z

@ahendriksen yep, and CI hasn't been particularly friendly lately.... Jobs have been getting queued up for several hours at a time. If it helps, you can merge that branch into this one, then at least this one will be up to date and pass through CI so we can merge it right after.

…ializations-reduce-compile-times

cjnolet · 2023-03-22T16:33:13Z

We should be able to remove the need for the ops-codeowners review once the other PR gets merged

This reverts commit e2f1aa2.

…sue w/ the suggestions from dask team" This reverts commit 7958a32.

This reverts commit 6f4e77d.

…imes

cjnolet · 2023-03-22T22:08:29Z

/merge

This should cut compilation time for refine_d_int64_t_float.cu.o et al from ~900 seconds to 29 seconds. The refine specialization contain >100 instances of the ivfflat_interleaved_scan kernel, even though these should be seperately compiled by the ivfflat_search specializations. The call to ivf_flat_interleaved_scan is [here](https://github.com/rapidsai/raft/blob/56ac43ad93a319a61073dce1b3b937f6f13ade63/cpp/include/raft/neighbors/detail/refine.cuh#L121). Depends on (so please merge after) PR #1307. Authors: - Allard Hendriksen (https://github.com/ahendriksen) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #1360

ahendriksen added 30 commits February 8, 2023 14:56

pairwise_distance_base: Fix typo

a15d5fc

This results in subtle issues with non-square KernelPolicy, as found in fusedL2KNN.

Remove deprecated header

71c6da6

Replace lambdas by raft::void_op

4bbedf6

Use an operator for L1 distance

c3d1f6e

Add launch function

3e3478b

This is more general than just for L1. Making use of it more is work in progress.

l1: Replace run-time -> compile-time dispatch

264a9d2

pairwise matrix: move files into subdirectories

b232057

pairwise matrix: Untangle dispatching and kernel template parameters

06f6ffa

By adding yet another struct ^^

l2 unexp: Use pairwise matrix dispatch

2f41faa

l2 exp: Use pairwise matrix dispatch

7938614

This did remove support for the CUTLASS kernels. Has to be put back.

Add template for distance operator

7afe6cc

I wasted a lot of time because I had not replaced the op::core() method of the l2_exp_distance_op after I copied it from l2_unexp_distance_op... If I copy something from the template and forget to fill it in, I get a compile error.

Reenable cutlass-based kernels for CUDA 12.0

5fe3292

pairwise matrix l2: Add support for CUTLASS kernels

c623332

I am testing on CUDA 12, where it does not seem to work. Prior to my commits, the CUTLASS kernels were also not working. So not sure what's up. In any case: consider this untested.

Canberra: use dispatching mechanism

27511fc

Chebyshev: use pairwise matrix dispatch

58ce6f8

Correlation: use pairwise matrix dispatch

d397c17

Hamming: use pairwise matrix dispatch

7005a4f

Hellinger: use pairwise matrix dispatch

7831deb

Jensen-Shannon: use pairwise matrix dispatch

4dc72ce

remove old hamming code

b0d36c1

KL divergence: use pairwise matrix dispatch

e95a65b

Minkowski: use pairwise matrix dispatch

f1c105b

Russel-Rao: use pairwise matrix dispatch

ac66e3f

Cosine: use pairwise matrix dispatch

a89896a

Fix include for l1 op

16b2acd

kl_divergence: Use raft::log instead of raft::myLog

1326e34

distance_op: Add expensive_inner_loop marker

0169b26

This indicates that the operator uses expensive operations (pow, exp, log) in the inner loop. Therefore, unrolling and or veclen parameters should be adjusted

distance_ops: Use static shared_mem_size

1df17be

divyegala approved these changes Mar 21, 2023

View reviewed changes

cpp/include/raft/distance/detail/fused_l2_nn.cuh Show resolved Hide resolved

cpp/include/raft/distance/detail/pairwise_matrix/kernel_sm60.cuh Show resolved Hide resolved

cjnolet added 5 - Ready to Merge and removed 5 - Merge After Dependencies Depends on another PR: do not merge out of order labels Mar 21, 2023

ahendriksen mentioned this pull request Mar 21, 2023

[ENH] Add extern template for ivfflat_interleaved_scan #1360

Merged

cjnolet and others added 6 commits March 21, 2023 16:24

Pinning dask temporarily because a recent commit broke things

6f4e77d

Merge branch 'branch-23.04' into build-2304-pin_dask

9931ce3

Updating raft-dask recipe for now. Not yet able to fix the issue w/ t…

7958a32

…he suggestions from dask team

Merge branch 'build-2304-pin_dask' of github.com:cjnolet/raft into bu…

857710e

…ild-2304-pin_dask

Merge branch 'branch-23.04' into enh-specializations-reduce-compile-t…

490c844

…imes

Pinning dask for wheel

e2f1aa2

Merge remote-tracking branch 'rapids/pull-request/1363' into enh-spec…

2991617

…ializations-reduce-compile-times

ahendriksen requested review from a team as code owners March 22, 2023 16:31

github-actions bot added the python label Mar 22, 2023

cjnolet approved these changes Mar 22, 2023

View reviewed changes

cjnolet added 4 commits March 22, 2023 18:06

Revert "Pinning dask for wheel"

d57bca4

This reverts commit e2f1aa2.

Revert "Updating raft-dask recipe for now. Not yet able to fix the is…

b9383e1

…sue w/ the suggestions from dask team" This reverts commit 7958a32.

Revert "Pinning dask temporarily because a recent commit broke things"

4d16e5a

This reverts commit 6f4e77d.

Merge branch 'branch-23.04' into enh-specializations-reduce-compile-t…

3a86004

…imes

github-actions bot removed the python label Mar 22, 2023

rapids-bot bot merged commit 08e7012 into rapidsai:branch-23.04 Mar 23, 2023

ahendriksen mentioned this pull request Mar 24, 2023

[FEA] Have pairwisedistancebase take a distance_op instead of two lambdas #1323

Closed

ahendriksen mentioned this pull request Apr 21, 2023

[BUG] Potential CUDA compile time issue with loop unrolling or pow #1246

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce compile times of distance specializations #1307

Reduce compile times of distance specializations #1307

ahendriksen commented Feb 28, 2023 •

edited

Loading

ahendriksen commented Mar 21, 2023

divyegala left a comment

ahendriksen commented Mar 22, 2023

cjnolet commented Mar 22, 2023

cjnolet commented Mar 22, 2023

cjnolet commented Mar 22, 2023

Reduce compile times of distance specializations #1307

Reduce compile times of distance specializations #1307

Conversation

ahendriksen commented Feb 28, 2023 • edited Loading

ahendriksen commented Mar 21, 2023

divyegala left a comment

Choose a reason for hiding this comment

ahendriksen commented Mar 22, 2023

cjnolet commented Mar 22, 2023

cjnolet commented Mar 22, 2023

cjnolet commented Mar 22, 2023

ahendriksen commented Feb 28, 2023 •

edited

Loading