CAGRA - separable compilation for distance computation #296
Conversation
…eparable-compilation
…e place as compute_distance components
Current WIP status:
…ble compute_distance_impl and being more explicit about the memory spaces (using lds/ldg)
… team_size*vlen, and avoid bank conflicts in setup_workspace
… instruction count
Update on performance: the worst-case slowdown now is around 6-7% on the deep-100M and wiki-all datasets.
Thank you Artem for this PR! It is great what this PR achieves: the distance computation functions are finally decoupled from the search kernels, which reduces the number of times the search kernels are compiled and thus significantly decreases the binary size.
This comes at a price: for some parameter combinations cagra::search will become up to 6% slower. Thank you for investigating another solution in #324. Since that has a similar impact on runtime but results in a smaller reduction in binary size, I am in favor of the current PR.
Fixing CAGRA's binary size will enable us to add more features that will improve performance (persistent kernel, fp16). Therefore I would recommend that we go ahead and merge the current PR.
There is still one aspect that I find unfortunate: the complexity of selecting and dispatching the CAGRA kernels (and distance functions) is significantly increased. The persistent kernel PR will add another set of complications on top of this. To compensate, we shall improve the developer documentation: I have left a few comments along these lines.
Still, I suspect that we could do better, therefore please open an issue (as a follow-up to this and #215) to re-evaluate and simplify the CAGRA kernel/distance selection and dispatch logic.
Force-pushed from 4322b2f to 6fac19b
…heaply computed from dim and PQ_LEN
Thanks Artem for fixing the issues; the added comments are really useful! The PR looks good to me. Please create a follow-up issue to improve the naming of the descriptors and simplify the call hierarchy.
@@ -463,7 +455,7 @@ if(NOT BUILD_CPU_ONLY)
   target_link_libraries(
     cuvs
     PUBLIC rmm::rmm raft::raft ${CUVS_CTK_MATH_DEPENDENCIES}
-    PRIVATE nvidia::cutlass::cutlass $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX>
+    PRIVATE nvidia::cutlass::cutlass $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX> cuvs-cagra-search
Was building a new artifact really necessary to improve the perf / binary size? Or was this just done to make the build more modular?
Generally, separable compilation affects performance negatively, so I limited its area of effect by setting the CMake CUDA_SEPARABLE_COMPILATION property on the relevant files only - via this component.
/merge
Factor the compute_distance function and related template parameters out of the CAGRA search kernels. This reduces the total number of kernel instances, thus reducing the binary size and the compile time.
The change, however, has a few drawbacks:
- The compute_distance functions are now compiled in separate object files, which requires separable compilation and device-side linking. I introduced a static library component for the affected sources to minimize the impact of the change.
- The search kernel calls the compute_distance function through a pointer, so the compiler cannot optimize across the kernel-compute_distance boundary, which results in higher register usage and occasional register spilling. Most of the cases are optimized in this PR, but some compromises seem unavoidable.
- An extra initialization kernel (xxx_init_kernel) must be launched to get the function pointer, which adds extra latency. This is mitigated to some extent by caching the constructed descriptor using a raft custom resource.
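For readers unfamiliar with the pattern, the sketch below illustrates the descriptor/init-kernel idea in generic CUDA C++, not the actual cuVS code: the address of a __device__ function can only be taken in device code, so a small kernel fills in the descriptor once, and the search kernel then calls the distance function through the stored pointer. All names here (dataset_descriptor, l2_distance, descriptor_init_kernel, search_kernel) are hypothetical.

```c++
// Minimal sketch of the init-kernel pattern, assuming generic CUDA C++;
// the type and function names are hypothetical, not the cuVS sources.
#include <cuda_runtime.h>

struct dataset_descriptor {
  // Device-side function pointer; only meaningful in device code.
  float (*compute_distance)(const float*, const float*, int);
  int dim;
};

__device__ float l2_distance(const float* a, const float* b, int dim)
{
  float acc = 0.0f;
  for (int i = 0; i < dim; ++i) {
    const float d = a[i] - b[i];
    acc += d * d;
  }
  return acc;
}

// The address of a __device__ function can only be taken on the device,
// hence the extra "init kernel" (and its launch latency) mentioned above.
__global__ void descriptor_init_kernel(dataset_descriptor* out, int dim)
{
  out->compute_distance = &l2_distance;
  out->dim              = dim;
}

// The search kernel calls the distance function through the pointer, so the
// compiler cannot inline or optimize across this boundary.
__global__ void search_kernel(
  const dataset_descriptor* desc, const float* query, const float* points, float* out, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = desc->compute_distance(query, points + i * desc->dim, desc->dim); }
}

int main()
{
  const int dim = 4, n = 2;
  dataset_descriptor* desc;
  float *query, *points, *dist;
  cudaMalloc(&desc, sizeof(dataset_descriptor));
  cudaMalloc(&query, dim * sizeof(float));
  cudaMalloc(&points, n * dim * sizeof(float));
  cudaMalloc(&dist, n * sizeof(float));
  cudaMemset(query, 0, dim * sizeof(float));
  cudaMemset(points, 0, n * dim * sizeof(float));

  // Initialize the descriptor once; the PR mitigates this launch cost by
  // caching the constructed descriptor (via a raft custom resource) and
  // reusing it across searches.
  descriptor_init_kernel<<<1, 1>>>(desc, dim);
  search_kernel<<<1, 32>>>(desc, query, points, dist, n);
  cudaDeviceSynchronize();

  cudaFree(dist); cudaFree(points); cudaFree(query); cudaFree(desc);
  return 0;
}
```

Compiling the distance implementations in their own translation units with CUDA_SEPARABLE_COMPILATION enabled only there appears to be the role of the cuvs-cagra-search component linked in the CMake change above.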