Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CAGRA - separable compilation for distance computation #296

Merged

Conversation

achirkin
Copy link
Contributor

@achirkin achirkin commented Aug 16, 2024

Factor the compute_distance function and related template parameters out of the CAGRA search kernels.
This reduces the total number of kernel instances, thus reducing the binary size and the compile time.

The change, however, has a few drawbacks:

  • CUDA separable compilation needs to be enabled to allow compute_distance functions being compiled in separate object files. I introduced a static library component for the affected sources to minimize the impact of the change.
  • The separable compilation and dynamic dispatch of compute_distance function means the compiler cannot optimize across the kernel-compute_distance boundary, which results in higher register usage and occasional register spilling. Most of the cases are optimized in this PR, but some compromises seem unavoidable.
  • Dynamic dispatch (constructing a dataset descriptor) requires an extra kernel call (xxx_init_kernel) to get the function pointer, which adds extra latency. This is mitigated to some extent by caching the constructed descriptor using raft custom resource.

@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Aug 16, 2024
@achirkin
Copy link
Contributor Author

Current WIP status, with full functionality restored after the refactoring:

  • CI build size 880 -> 663MB.
  • Slowdown is up to 2x
  • Bonus: multi-kernel version now naturally supports CAGRA-Q compression.

@achirkin
Copy link
Contributor Author

achirkin commented Aug 26, 2024

Current WIP status:

  • CI build size 788 -> 600MB.
  • Slowdown is up to 20% for standard distance and up to 30% for VPQ distance. The worst case is the single-cta kernel on a big batch; the multi-kernel version actually sees some speedup in a few cases.
  • Multi-kernel version now supports CAGRA-Q compression.
  • A few tests failing
  • Disabled target_link_options(cuvs PRIVATE "${CMAKE_CURRENT_BINARY_DIR}/fatbin.ld") in CMakeLists.txt due to a linker error. Not sure how to resolve this at the moment.

@achirkin
Copy link
Contributor Author

Update on performance: the worst-case slowdown now is around 6-7% on the deep-100M and wiki-all datasets.
Comparison against #324 (comment)
https://docs.google.com/spreadsheets/d/191f1sYsAUwPidncV4xDzgNtOhKx9sp6jJFvOfpB-nxs

Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Artem for this PR! It is great what this PR achieves: finally the distance computation functions are decoupled from the search kernels and that reduces the number of times the search kernels are compiled. This significantly decreases of the binary size.

This comes at a price: for some parameter combinations cagra::search will become up to 6% slower. Thank you for investigating another solution in #324. Since that has similar impact on runtime, but results in less reduction in binary size, I am in favor of the current PR.

Fixing CAGRA's binary size will enable us to add more features that will improve the performance (persistent kernel, fp16). Therefore would recommend that we go ahead and merge the current PR.

There is still one aspect that I find unfortunate: the complexity of selecting and dispatching the cagra kernels (and distance functions) is significantly increased. The persistent kernel PR will add another set of complications on top of this. To compensate, we shall improve the developer documentation: I have left a few comments along this line.

Still, I suspect that we could do better, therefore please open an issue (as a follow-up of this an #215) to re-evaluate and simplify CAGRA kernel/distance selection and dispatch logic.

cpp/CMakeLists.txt Outdated Show resolved Hide resolved
cpp/src/neighbors/detail/cagra/factory.cuh Outdated Show resolved Hide resolved
cpp/src/neighbors/detail/cagra/factory.cuh Outdated Show resolved Hide resolved
cpp/src/neighbors/detail/cagra/factory.cuh Show resolved Hide resolved
cpp/src/neighbors/detail/cagra/device_common.hpp Outdated Show resolved Hide resolved
@achirkin achirkin force-pushed the enh-cagra-separable-compilation branch from 4322b2f to 6fac19b Compare September 23, 2024 11:15
@achirkin achirkin requested a review from tfeher September 23, 2024 15:58
Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Artem for fixing the issues, the added comments are really useful! The PR looks good to me. Please create a follow up issue to improve naming of descriptors and simplify call hierarchy.

@achirkin achirkin removed the request for review from a team September 24, 2024 07:04
@achirkin achirkin removed the request for review from jameslamb September 24, 2024 08:19
cpp/CMakeLists.txt Outdated Show resolved Hide resolved
@@ -463,7 +455,7 @@ if(NOT BUILD_CPU_ONLY)
target_link_libraries(
cuvs
PUBLIC rmm::rmm raft::raft ${CUVS_CTK_MATH_DEPENDENCIES}
PRIVATE nvidia::cutlass::cutlass $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX>
PRIVATE nvidia::cutlass::cutlass $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX> cuvs-cagra-search
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was building a new artifact really necessary to improve the perf / binary size? Or was this just done to make the build more modular?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, separable compilation affects performance negatively, so I reduced the AOE by setting cmake CUDA_SEPARABLE_COMPILATION on the relevant files only - via this component.

@achirkin achirkin requested review from cjnolet and removed request for cjnolet September 25, 2024 07:54
@cjnolet
Copy link
Member

cjnolet commented Sep 25, 2024

/merge

@rapids-bot rapids-bot bot merged commit 0a4298a into rapidsai:branch-24.10 Sep 25, 2024
54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

4 participants