CAGRA - separable compilation for distance computation #296

achirkin · 2024-08-16T12:06:04Z

Factor the compute_distance function and related template parameters out of the CAGRA search kernels.
This reduces the total number of kernel instances, thus reducing the binary size and the compile time.

The change, however, has a few drawbacks:

CUDA separable compilation needs to be enabled so that the compute_distance functions can be compiled in separate object files. I introduced a static library component for the affected sources to minimize the impact of the change.
Separable compilation and dynamic dispatch of the compute_distance function mean the compiler cannot optimize across the boundary between the kernel and compute_distance, which results in higher register usage and occasional register spilling. Most cases are optimized in this PR, but some compromises seem unavoidable.
Dynamic dispatch (constructing a dataset descriptor) requires an extra kernel call (xxx_init_kernel) to get the function pointer, which adds latency. This is mitigated to some extent by caching the constructed descriptor using a raft custom resource.
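
The dispatch scheme described in the drawbacks above can be illustrated with a host-only C++ analogy. This is a rough sketch under stated assumptions, not the cuVS implementation: the names `dataset_descriptor`, `make_descriptor`, `cached_descriptor`, and `nearest` are hypothetical, the plain function pointer stands in for the device function pointer obtained by the init kernel, and the static map stands in for the raft custom resource used for caching.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical names throughout: a host-side analogy, not the cuVS API.
enum class metric { l2, inner_product };

using distance_fn = float (*)(const float* a, const float* b, std::size_t dim);

float l2_distance(const float* a, const float* b, std::size_t dim) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < dim; ++i) {
    const float d = a[i] - b[i];
    acc += d * d;
  }
  return acc;
}

float ip_distance(const float* a, const float* b, std::size_t dim) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < dim; ++i) acc += a[i] * b[i];
  return -acc;  // negated so that a smaller value is always a better match
}

// The descriptor carries the data and a pointer to the distance function,
// so the search routine does not need the metric as a template parameter.
struct dataset_descriptor {
  const float* data;
  std::size_t dim;
  distance_fn compute_distance;
};

int init_calls = 0;  // counts the "expensive" descriptor constructions

// Stands in for the extra init kernel launch that fetches the function pointer.
dataset_descriptor make_descriptor(const float* data, std::size_t dim, metric m) {
  ++init_calls;
  return {data, dim, m == metric::l2 ? l2_distance : ip_distance};
}

// Caches descriptors so that repeated searches skip the construction cost.
const dataset_descriptor& cached_descriptor(const float* data, std::size_t dim, metric m) {
  using key_t = std::tuple<std::uintptr_t, std::size_t, int>;
  static std::map<key_t, dataset_descriptor> cache;
  const key_t key{reinterpret_cast<std::uintptr_t>(data), dim, static_cast<int>(m)};
  auto it = cache.find(key);
  if (it == cache.end()) it = cache.emplace(key, make_descriptor(data, dim, m)).first;
  return it->second;
}

// The "search kernel": compiled once, calls the distance through the pointer.
std::size_t nearest(const dataset_descriptor& desc, std::size_t n_rows, const float* query) {
  assert(n_rows > 0);
  std::size_t best = 0;
  float best_dist = desc.compute_distance(desc.data, query, desc.dim);
  for (std::size_t i = 1; i < n_rows; ++i) {
    const float d = desc.compute_distance(desc.data + i * desc.dim, query, desc.dim);
    if (d < best_dist) {
      best_dist = d;
      best = i;
    }
  }
  return best;
}
```

Here `nearest` exists as a single compiled function regardless of how many distance functions are added, which mirrors the reduction in kernel instances. The indirect call is also the point the optimizer cannot see through, which is the analogue of the register-usage drawback mentioned above.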

achirkin · 2024-08-19T11:03:26Z

Current WIP status, with full functionality restored after the refactoring:

CI build size 880 -> 663MB.
Slowdown is up to 2x
Bonus: multi-kernel version now naturally supports CAGRA-Q compression.

achirkin · 2024-08-26T06:36:01Z

Current WIP status:

CI build size 788 -> 600MB.
Slowdown is up to 20% for standard distance and up to 30% for VPQ distance. The worst case is the single-cta kernel on a big batch; the multi-kernel version actually sees some speedup in a few cases.
Multi-kernel version now supports CAGRA-Q compression.
A few tests are failing.
Disabled target_link_options(cuvs PRIVATE "${CMAKE_CURRENT_BINARY_DIR}/fatbin.ld") in CMakeLists.txt due to a linker error. Not sure how to resolve this at the moment.

achirkin · 2024-09-20T10:48:28Z

Update on performance: the worst-case slowdown now is around 6-7% on the deep-100M and wiki-all datasets.
Comparison against #324 (comment)
https://docs.google.com/spreadsheets/d/191f1sYsAUwPidncV4xDzgNtOhKx9sp6jJFvOfpB-nxs

tfeher

Thank you Artem for this PR! It is great what this PR achieves: the distance computation functions are finally decoupled from the search kernels, which reduces the number of times the search kernels are compiled. This significantly decreases the binary size.

This comes at a price: for some parameter combinations cagra::search will become up to 6% slower. Thank you for investigating another solution in #324. Since that has similar impact on runtime, but results in less reduction in binary size, I am in favor of the current PR.

Fixing CAGRA's binary size will enable us to add more features that will improve the performance (persistent kernel, fp16). Therefore, I would recommend that we go ahead and merge the current PR.

There is still one aspect that I find unfortunate: the complexity of selecting and dispatching the cagra kernels (and distance functions) is significantly increased. The persistent kernel PR will add another set of complications on top of this. To compensate, we shall improve the developer documentation: I have left a few comments along this line.

Still, I suspect that we could do better, so please open an issue (as a follow-up to this PR and #215) to re-evaluate and simplify the CAGRA kernel/distance selection and dispatch logic.

cpp/CMakeLists.txt

cpp/src/neighbors/detail/cagra/cagra_search.cuh

cpp/src/neighbors/detail/cagra/factory.cuh

cpp/src/neighbors/detail/cagra/device_common.hpp

cpp/src/neighbors/detail/cagra/compute_distance-ext.cuh

cpp/src/neighbors/detail/cagra/compute_distance_standard-impl.cuh

cpp/src/neighbors/detail/cagra/compute_distance_vpq-impl.cuh

tfeher

Thanks Artem for fixing the issues; the added comments are really useful! The PR looks good to me. Please create a follow-up issue to improve the naming of descriptors and simplify the call hierarchy.

cpp/src/neighbors/detail/cagra/compute_distance.hpp

cjnolet · 2024-09-24T19:27:29Z

cpp/CMakeLists.txt

@@ -463,7 +455,7 @@ if(NOT BUILD_CPU_ONLY)
  target_link_libraries(
    cuvs
    PUBLIC rmm::rmm raft::raft ${CUVS_CTK_MATH_DEPENDENCIES}
-    PRIVATE nvidia::cutlass::cutlass $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX>
+    PRIVATE nvidia::cutlass::cutlass $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX> cuvs-cagra-search


Was building a new artifact really necessary to improve the perf / binary size? Or was this just done to make the build more modular?

achirkin · 2024-09-24

Generally, separable compilation affects performance negatively, so I limited its scope by enabling the CMake CUDA_SEPARABLE_COMPILATION property only for the relevant files, via this component.

cjnolet · 2024-09-25T16:06:06Z

/merge

[WIP] CAGRA - separable compilation for distance computation

0dbe5b2

github-actions bot added cpp CMake labels Aug 16, 2024

achirkin and others added 2 commits August 16, 2024 14:06

Merge branch 'branch-24.10' into enh-cagra-separable-compilation

93b0439

Fix style

ba52b13

cjnolet assigned achirkin Aug 16, 2024

cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Aug 16, 2024

Add missing multi-kernel implementation

434e50a

achirkin and others added 13 commits August 19, 2024 15:08

Move common code out of virtual functions scope (aiming for more inlining)

6352550

Make small descriptor functions into fields

d161f79

Minor updates to improve reg count

35c3813

Refactor distance_core -> compute_distance, and update the instance list

4b5dcd3

Merge remote-tracking branch 'rapidsai/branch-24.10' into enh-cagra-separable-compilation

e5878db

Make the compute_distance instances controlled from a single place

385a8c4

Refactor usage of init_kernel to make sure it instantiated in the same place as compute_distance components

3f77cda

Reduce the register usage in distance functions

ddb0488

Partially implemented manual dispatch

c244ead

Merge branch 'branch-24.10' into enh-cagra-separable-compilation

7eb6a27

Finish manual dispatch

ff2fdbe

Change instance generator to have blockdim/team_size ratio 16

78a9809

Trying various minor things to reduce register spilling

6082bf7

achirkin and others added 6 commits August 26, 2024 15:24

Move the metric parameter to the compute_distance template

fc7d832

Further reduce register pressure by moving code out of the non-inlinable compute_distance_impl and being more explicit about the memory spaces (using lds/ldg)

118808e

Manually unroll device::team_sum

abec125

Remove the test of a compute_distance instance that is not compiled (team_size = 4)

cf0101c

Hide previously not hidden kernels

b3e6d26

Merge branch 'branch-24.10' into enh-cagra-separable-compilation

f231828

achirkin and others added 5 commits September 18, 2024 16:38

Merge branch 'branch-24.10' into enh-cagra-separable-compilation

27f6581

Don't apply swizzling when the bank conflicts are not possible (small team_size*vlen, and avoid bank conflicts in setup_workspace

b605061

Minor improvements to multi-cta kernel

478a824

Transpose query buffer instead of swizzling in VPQ distance to reduce instruction count

5090ebb

Merge branch 'branch-24.10' into enh-cagra-separable-compilation

9f069af

tfeher requested changes Sep 22, 2024

Merge branch 'branch-24.10' into enh-cagra-separable-compilation

6fac19b

achirkin force-pushed the enh-cagra-separable-compilation branch from 4322b2f to 6fac19b Compare September 23, 2024 11:15

achirkin added 6 commits September 23, 2024 13:25

VPQ distance: don't pass n_subspace as parameter, because it can be cheaply computed from dim and PQ_LEN

d0eb9b3

Docs and readability: device_common.hpp and factory.cuh

7bce6da

Remove unused distance instances (with uint64_t index type)

5154892

compute_distance.hpp: document and slightly simplify the dataset descriptor types

a0c54e3

Document the dataset/distance descriptor selection logic

9ba3e3f

Remove commented-out code sections

f77c1b0

achirkin requested a review from tfeher September 23, 2024 15:58

tfeher approved these changes Sep 23, 2024

achirkin removed the request for review from a team September 24, 2024 07:04

This was referenced Sep 24, 2024

Simplify CAGRA search call hierarchy #343

Open

CosineExpanded Distance Metric for CAGRA #197

Draft

achirkin removed the request for review from jameslamb September 24, 2024 08:19

cjnolet reviewed Sep 24, 2024

Merge branch 'branch-24.10' into enh-cagra-separable-compilation

eabb3ae

achirkin requested review from cjnolet and removed request for cjnolet September 25, 2024 07:54

Remove empty comment

f1426cf

cjnolet approved these changes Sep 25, 2024

rapids-bot bot merged commit 0a4298a into rapidsai:branch-24.10 Sep 25, 2024
54 checks passed
