-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Forward-merge branch-23.04 to branch-23.06 #1368
Conversation
Remove faiss_mr.hpp - in favour of rapidsai/cuml#5281 Authors: - Ben Frederickson (https://github.com/benfred) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #1351
This depends on #1202 so those changes are also included here. This should also not be merged until that PR is merged. Authors: - Corey J. Nolet (https://github.com/cjnolet) - Ben Frederickson (https://github.com/benfred) Approvers: - Ben Frederickson (https://github.com/benfred) - Vyas Ramasubramani (https://github.com/vyasr) - Ray Douglass (https://github.com/raydouglass) URL: #1340
Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Divye Gala (https://github.com/divyegala) - Mark Sadang (https://github.com/msadang) URL: #1354
…#1333) Now that we no longer need to rely on the FAISS dependency, this PR: 1. Consolidates the `libraft-distance` and `libraft-nn` targets into a single `libraft` artifact 2. Introduces a new `raft::compiled` target to link against the new `libraft.so` binary. Removes `raft::distance`and `raft::nn` targets. 3. Consolidates `RAFT_DISTANCE_COMPILED` and `RAFT_NN_COMPILED` pre-processor vars into a single `RAFT_COMPILED` (to match similar pattern implementated by `spdlog`) 4. Consolidates `specializations.cuh` headers 5. Updates all docs, scripts, and build infra This change has been a long time coming and is intended to be a 23.04 feature. This is further going to require updates to several projects downstream. Here's a checklist to track that progress: - [x] offline announcement - [x] cuml rapidsai/cuml#5272 - [x] cugraph rapidsai/cugraph#3348 - [x] cuopt rapidsai/cuopt#1023 - [x] cugraph-ops rapidsai/cugraph-ops#429 This PR depended on #1340 (removing FAISS from the build) and on #1202 (replacing the FAISS bfknn w/ our own), both of which have been merged. Closes #824 Authors: - Corey J. Nolet (https://github.com/cjnolet) - Ben Frederickson (https://github.com/benfred) Approvers: - Sevag H (https://github.com/sevagh) - Divye Gala (https://github.com/divyegala) - Ben Frederickson (https://github.com/benfred) - Vyas Ramasubramani (https://github.com/vyasr) URL: #1333
The ivf_flat specialization header declarations used a wrong index type. The specializations for ivf flat are defined for int64_t. The raft_runtime interface also uses the int64_t instances. The ivf_flat specialization header, however, declared an interface using uint64_t. This is fixed with this PR. Should also reduce compile times for `src/distance/neighbors/ivf_flat_search.cu.o` Authors: - Allard Hendriksen (https://github.com/ahendriksen) Approvers: - Corey J. Nolet (https://github.com/cjnolet) - Divye Gala (https://github.com/divyegala) URL: #1358
This PR removes modification of the `__init__.py::version` attribute that occurs during the wheel build process. See rapidsai/ops#2592 for more information. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Sevag H (https://github.com/sevagh) URL: #1359
In dask/distributed#7580 get_worker was modified to return the worker of a task, thus it cannot be used by client.run, and we must now use dask_worker as the first argument to client.run to obtain the worker. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - Corey J. Nolet (https://github.com/cjnolet) - AJ Schmidt (https://github.com/ajschmidt8) URL: #1365
Following the findings in https://github.com/ahendriksen/raft/tree/investigate-compile-time-reduction-strategies#investigation-of-compile-times, this PR reduces the compile times of the pairwise distance specializations. This is achieved by: 1. Reducing the number of included files in the translation units where kernels are instantiated, specifically `spdlog` and `rmm` are avoided. 2. Limiting loop unrolling in kernels with expensive operations in the inner loop. Additional improvements geared towards iterative development: 1. The tests do not have to be recompiled when the internals of a pairwise distance kernel change. Before, a rebuilt was triggered due an include of `raft/distance/distance.cuh`. 2. Addition of a fine tuning benchmark for the pairwise distance kernels that separates building the kernel from the benchmark code. This dramatically speeds up development. Compiling an empty benchmark takes roughly 18 seconds on my machine. Whereas recompiling a kernel takes ~3.8 seconds. Without this addition, a commit like 35a2ad4 would require substantially more time to make sure that performance is not degraded. ![image](https://user-images.githubusercontent.com/4172822/225383120-5f8a82f9-0b46-4c39-bc1d-7b2a0551e881.png) ``` Parallel build time before: 270 seconds (6 cores, SMT, 12 jobs) Parallel build time before: 147 seconds (6 cores, SMT, 12 jobs) Sum of compile times before: 3022.6 seconds Sum of compile times after: 1816.2 seconds Comparison of compile times between headers and compiled: path before (s) after (s) change (s) change (%) pairwise_test None 0.486 None None ance/distance/specializations/detail/lp_unexpanded_double_double_double_int.cu.o 101.1 10.3 -90.8 -89.8% src/distance/distance/specializations/detail/canberra_float_float_float_int.cu.o 52.9 6.3 -46.6 -88.0% /distance/distance/specializations/detail/canberra_double_double_double_int.cu.o 48.5 6.4 -42.1 -86.8% stance/distance/specializations/detail/jensen_shannon_float_float_float_int.cu.o 65.3 10.4 -55.0 -84.1% istance/distance/specializations/detail/kl_divergence_float_float_float_int.cu.o 70.2 12.6 -57.6 -82.0% stance/distance/specializations/detail/correlation_double_double_double_int.cu.o 46.7 8.9 -37.8 -80.9% distance/specializations/detail/hellinger_expanded_double_double_double_int.cu.o 41.6 8.1 -33.5 -80.6% nce/distance/specializations/detail/jensen_shannon_double_double_double_int.cu.o 74.6 15.1 -59.5 -79.7% ir/src/distance/distance/specializations/detail/l1_double_double_double_int.cu.o 40.9 8.4 -32.5 -79.4% ance/distance/specializations/detail/l2_unexpanded_double_double_double_int.cu.o 40.7 8.6 -32.1 -78.8% distance/specializations/detail/hamming_unexpanded_double_double_double_int.cu.o 40.8 9.0 -31.7 -77.8% istance/distance/specializations/detail/lp_unexpanded_float_float_float_int.cu.o 45.9 10.2 -35.7 -77.8% src/distance/distance/specializations/detail/l_inf_double_double_double_int.cu.o 41.2 9.5 -31.8 -77.0% istance/distance/specializations/detail/russel_rao_double_double_double_int.cu.o 29.5 7.2 -22.3 -75.6% t.dir/src/distance/distance/specializations/detail/l1_float_float_float_int.cu.o 47.3 13.2 -34.1 -72.2% ce/distance/specializations/detail/hamming_unexpanded_float_float_float_int.cu.o 47.0 13.3 -33.7 -71.6% /distance/distance/specializations/detail/correlation_float_float_float_int.cu.o 49.4 14.1 -35.3 -71.5% ce/distance/specializations/detail/hellinger_expanded_float_float_float_int.cu.o 43.6 12.5 -31.1 -71.4% c/distance/distance/specializations/detail/russel_rao_float_float_float_int.cu.o 28.5 8.2 -20.3 -71.2% ance/distance/specializations/detail/kl_divergence_double_double_double_int.cu.o 75.8 21.9 -53.9 -71.1% istance/distance/specializations/detail/l2_unexpanded_float_float_float_int.cu.o 46.2 13.5 -32.7 -70.7% ir/src/distance/distance/specializations/detail/l_inf_float_float_float_int.cu.o 43.1 12.7 -30.4 -70.6% stance/distance/specializations/detail/l2_expanded_double_double_double_int.cu.o 52.3 24.9 -27.3 -52.3% /distance/distance/specializations/detail/l2_expanded_float_float_float_int.cu.o 75.8 40.3 -35.5 -46.8% rc/distance/distance/specializations/detail/cosine_double_double_double_int.cu.o 53.5 28.7 -24.8 -46.4% r/src/distance/distance/specializations/detail/cosine_float_float_float_int.cu.o 83.9 50.1 -33.8 -40.3% CMakeFiles/pairwise_test.dir/test/distance/fused_l2_nn.cu.o 85.1 64.1 -21.1 -24.7% wise_test.dir/src/distance/distance/specializations/fused_l2_nn_float_int64.cu.o 56.2 42.9 -13.3 -23.6% irwise_test.dir/src/distance/distance/specializations/fused_l2_nn_float_int.cu.o 52.5 40.2 -12.3 -23.5% CMakeFiles/pairwise_test.dir/test/distance/dist_lp_unexp.cu.o 56.3 43.3 -13.0 -23.1% CMakeFiles/pairwise_test.dir/test/distance/dist_russell_rao.cu.o 55.7 44.0 -11.7 -21.0% rwise_test.dir/src/distance/distance/specializations/fused_l2_nn_double_int.cu.o 45.3 36.4 -9.0 -19.8% CMakeFiles/pairwise_test.dir/test/distance/dist_l2_unexp.cu.o 54.6 44.1 -10.6 -19.3% CMakeFiles/pairwise_test.dir/test/distance/dist_canberra.cu.o 51.6 42.1 -9.6 -18.6% CMakeFiles/pairwise_test.dir/test/distance/dist_l2_exp.cu.o 53.1 43.4 -9.6 -18.2% CMakeFiles/pairwise_test.dir/test/distance/dist_l_inf.cu.o 53.2 43.9 -9.3 -17.5% CMakeFiles/pairwise_test.dir/test/distance/dist_hellinger.cu.o 53.1 44.0 -9.0 -17.0% CMakeFiles/pairwise_test.dir/test/distance/dist_hamming.cu.o 52.3 43.4 -8.9 -17.0% CMakeFiles/pairwise_test.dir/test/distance/dist_l2_sqrt_exp.cu.o 54.0 45.6 -8.4 -15.6% CMakeFiles/pairwise_test.dir/test/distance/dist_l1.cu.o 52.6 44.5 -8.1 -15.4% CMakeFiles/pairwise_test.dir/test/distance/dist_kl_divergence.cu.o 52.4 44.7 -7.7 -14.8% ise_test.dir/src/distance/distance/specializations/fused_l2_nn_double_int64.cu.o 43.5 37.2 -6.4 -14.7% CMakeFiles/pairwise_test.dir/test/distance/dist_cos.cu.o 52.4 44.8 -7.6 -14.5% CMakeFiles/pairwise_test.dir/test/distance/dist_jensen_shannon.cu.o 53.2 45.7 -7.6 -14.2% CMakeFiles/pairwise_test.dir/test/distance/dist_inner_product.cu.o 51.1 44.8 -6.3 -12.4% istance/distance/specializations/detail/inner_product_float_float_float_int.cu.o 39.5 35.1 -4.5 -11.3% CMakeFiles/pairwise_test.dir/test/distance/dist_correlation.cu.o 51.7 46.8 -4.9 -9.5% ance/distance/specializations/detail/inner_product_double_double_double_int.cu.o 37.1 33.9 -3.1 -8.5% src/distance/distance/specializations/detail/kernels/gram_matrix_base_float.cu.o 45.3 41.7 -3.6 -8.0% rc/distance/distance/specializations/detail/kernels/gram_matrix_base_double.cu.o 42.5 39.6 -2.9 -6.8% stance/distance/specializations/detail/kernels/polynomial_kernel_double_int.cu.o 40.4 38.5 -1.9 -4.8% CMakeFiles/pairwise_test.dir/test/distance/dist_adj.cu.o 123.3 117.8 -5.4 -4.4% CMakeFiles/pairwise_test.dir/test/distance/gram.cu.o 55.3 53.4 -1.9 -3.5% build.ninja 4.0 4.0 +0.0 +0.1% istance/distance/specializations/detail/kernels/polynomial_kernel_float_int.cu.o 45.2 45.6 +0.4 +0.8% .dir/src/distance/distance/specializations/detail/kernels/tanh_kernel_float.cu.o 45.2 46.0 +0.8 +1.7% dir/src/distance/distance/specializations/detail/kernels/tanh_kernel_double.cu.o 39.0 39.8 +0.8 +2.1% CMakeFiles/pairwise_test.dir/src/distance/distance/pairwise_distance.cu.o 39.6 50.1 +10.5 +26.6% ``` Authors: - Allard Hendriksen (https://github.com/ahendriksen) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet) - Divye Gala (https://github.com/divyegala) URL: #1307
raft-dask requires nccl to build, add to the dependencies.yaml so that when creating a clean raft conda environment - we can build all of raft out of the box Authors: - Ben Frederickson (https://github.com/benfred) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet) - AJ Schmidt (https://github.com/ajschmidt8) URL: #1361
This PR uses dependencies.yaml to generate the dependency lists in pyproject.toml Authors: - Vyas Ramasubramani (https://github.com/vyasr) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Corey J. Nolet (https://github.com/cjnolet) URL: #1364
FAILURE - Unable to forward-merge due to conflicts, manual merge is necessary. Do not use the |
Closes #1289 Authors: - Corey J. Nolet (https://github.com/cjnolet) - Ben Frederickson (https://github.com/benfred) Approvers: - Ben Frederickson (https://github.com/benfred) URL: #1292
UCX 1.14.0 was recently released and conda-forge package was updated in conda-forge/ucx-split-feedstock#111 with several packaging improvements. Relax the pin to allow installing UCX v1.14.x as well. Authors: - Peter Andreas Entschev (https://github.com/pentschev) - AJ Schmidt (https://github.com/ajschmidt8) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) URL: #1366
I noticed some folks on Twitter asking about an architecture diagram for RAFT. I think it's a good idea to provide this in the README. Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Ben Frederickson (https://github.com/benfred) - Divye Gala (https://github.com/divyegala) URL: #1370
The main changes are: - Add a one-block version. It uses single thread block for one row of a batch and is used when `len` is relatively small (<= 16384) - Avoid writing candidates to buffers when the number of candidates is larger than buffer length. - Add a parameter to control whether to use a fused filter in the last pass or use a standalone filter kernel. The later case is preferable when the leading bits of inputs are almost same. - Early stopping: when the target bucket contains `k` values, we can stop the computation earlier - Many implementation details are polished, like the initialization of `counter`, calculation of kernel launch parameters, and the scan step - Tests and benchmarks are updated to include the new implementations. New benchmarks are added to demonstrate the advantage of adaptive version. Authors: - Yong Wang (https://github.com/yong-wang) - Corey J. Nolet (https://github.com/cjnolet) - Tamas Bela Feher (https://github.com/tfeher) Approvers: - Tamas Bela Feher (https://github.com/tfeher) URL: #1175
This is a copy and modification of a user's project but I think this is going to be generally useful to users as the same types of challenges are going to come up again. In this case, the user wasn't able to build/link because they weren't using `rapids-cmake` to propagate important configuration settings. I think having a skeleton project available that we build in CI and keep up to date will help new users build more applications on RAFT. TODO: - [x] Make building the template optional - [x] Verify this can build in CMake and reuse already built/installed bits - [x] Add to docs / readme and reference in README.md - [x] Add a little example of invoking an API (maybe `pairwise_distances`?) to `main()` Authors: - Corey J. Nolet (https://github.com/cjnolet) - Ben Frederickson (https://github.com/benfred) Approvers: - Micka (https://github.com/lowener) - Dante Gama Dessavre (https://github.com/dantegd) - Divye Gala (https://github.com/divyegala) - AJ Schmidt (https://github.com/ajschmidt8) URL: #1312
Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Ben Frederickson (https://github.com/benfred) URL: #1373
This should cut compilation time for refine_d_int64_t_float.cu.o et al from ~900 seconds to 29 seconds. The refine specialization contain >100 instances of the ivfflat_interleaved_scan kernel, even though these should be seperately compiled by the ivfflat_search specializations. The call to ivf_flat_interleaved_scan is [here](https://github.com/rapidsai/raft/blob/56ac43ad93a319a61073dce1b3b937f6f13ade63/cpp/include/raft/neighbors/detail/refine.cuh#L121). Depends on (so please merge after) PR #1307. Authors: - Allard Hendriksen (https://github.com/ahendriksen) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #1360
Add the ability for a user to specify an epilogue function to run after the distance in the brute_force::knn call. This lets us remove faiss from cuml, by updating the hdbscan reachability code (rapidsai/cuml#5293) Authors: - Ben Frederickson (https://github.com/benfred) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #1371
New `bench/ann` artifact for comparing (C++ APIs for) GPU-acclerated algorithms end-to-end. Working on this w/ @tfeher but had to squash the original commits into a single commit. Things left to do: - [x] Separate `benchmarks` executables for each different algorithm - [x] Separate build targets for `ggnn` and `hnswlib` - [x] Revise `bench/ann` docs - [x] Break `factory.cuh` abd `benchmark.cu` / `benchmark.cpp` into individual files for each different algorithm to make it easier to plug in new algorithms. - [x] Separate into its own conda package closes #1211 Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Ray Douglass (https://github.com/raydouglass) - Ben Frederickson (https://github.com/benfred) - Tamas Bela Feher (https://github.com/tfeher) URL: #1304
@cjnolet FYI it would be good to sync these branches for development on the CUDA 12 recipes -- we want to align with the 23.06 branches of other RAPIDS packages. |
Closes #1377 Authors: - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Divye Gala (https://github.com/divyegala) URL: #1378
Add `--time` option to `build.sh` that enables compile time logging of `nvcc`. Also, add a script `cpp/scripts/analyze_nvcc_log.py` to find the translation units that take the longest time. Output looks like: ``` $ cpp/scripts/analyze_nvcc_log.py cpp/build/nvcc_compile_log.csv -- loading data -- analyzing data -- Ten longest translation units: phase index file cicc cudafe++ fatbinary gcc (compiling) gcc (preprocessing 1) gcc (preprocessing 4) ptxas total time 0 10 ions/detail/canberra_double_double_double_int.cu 42.431063 10.601856 0.020979 6.747153 3.721194 2.093567 1618.390375 1684.006186 1 11 zations/detail/canberra_float_float_float_int.cu 36.928960 9.804138 0.011537 6.796088 3.481156 1.790703 1584.262875 1643.075457 2 85 ors/specializations/refine_d_uint64_t_uint8_t.cu 602.935531 14.980877 0.529673 36.300566 6.270717 2.889723 933.622969 1597.530056 3 84 bors/specializations/refine_d_uint64_t_int8_t.cu 606.513281 16.243960 0.729282 39.981113 5.608029 3.028493 897.241469 1569.345628 4 53 stance/neighbors/ivfpq_search_int8_t_uint64_t.cu 841.049750 8.233967 1.025554 24.248578 4.069022 1.747108 631.193734 1511.567713 5 52 istance/neighbors/ivfpq_search_float_uint64_t.cu 837.241437 8.145278 1.042313 24.400606 3.433528 1.882623 627.786672 1503.932457 6 54 tance/neighbors/ivfpq_search_uint8_t_uint64_t.cu 846.706656 8.371286 1.025517 24.094691 3.432749 1.645345 618.319234 1503.595479 7 76 izations/detail/ivfpq_search_uint8_t_uint64_t.cu 698.726266 7.086368 1.050021 39.727723 3.259101 1.333935 406.509937 1157.693351 8 74 alizations/detail/ivfpq_search_float_uint64_t.cu 706.702516 6.905794 1.049731 39.923895 2.814361 2.057154 395.604000 1155.057450 9 75 lizations/detail/ivfpq_search_int8_t_uint64_t.cu 689.390281 6.483386 1.025864 39.865668 3.121696 1.297788 409.099562 1150.284245 10 83 hbors/specializations/refine_d_uint64_t_float.cu 334.705594 15.466444 0.680270 36.551977 5.405133 2.947568 715.708781 1111.465767 -- Plotting absolute compile times -- Wrote absolute compile time plot to cpp/build/nvcc_compile_log.csv.absolute.compile_times.png -- Plotting relative compile times -- Wrote relative compile time plot to cpp/build/nvcc_compile_log.csv.relative.compile_times.png ``` Authors: - Allard Hendriksen (https://github.com/ahendriksen) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #1262
Forward-merge triggered by push to
branch-23.04
that creates a PR to keepbranch-23.06
up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge.