Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate CAGRA index type from internal idx type #1664

Merged
merged 7 commits into from
Jul 27, 2023

Conversation

tfeher
Copy link
Contributor

@tfeher tfeher commented Jul 23, 2023

This PR decouples CAGRA IdxT from the mdspan IndexType used in CAGRA API, by making the following change
mdspan<Type, IdxT> -> mdspan<Type, int64_t>.

In cagra::index<T, IdxT> the index type IdxT is the type that represents the neighbor indices returned during the ANN search, and the same type is also used as to store the neighborhood graph.

The build and search method take mdspan objects as input/output argument. These mdspans have an IndexType that is used during offset calculation when indexing the matrix. Prior to this PR, we used IdxT as the mdspan index type.

There is a strong motivation to keep IdxT as uint32_t in order to minimize the memory footprint of the KNN graph. At the same time, the size and offset calculations of mdspans that represent dataset and knn graph require 64-bit indexing for large datasets. This PR changes the mdspan extent IndexType to int64_t.

This PR does not affect the performance of CAGRA: the search kernels already used correct index type. Only a few operations are affected by this change, and these are outside perf sensitive regions.

@tfeher tfeher requested a review from a team as a code owner July 23, 2023 22:17
@github-actions github-actions bot added the cpp label Jul 23, 2023
@tfeher tfeher self-assigned this Jul 23, 2023
@tfeher tfeher added breaking Breaking change bug Something isn't working improvement Improvement / enhancement to an existing function and removed improvement Improvement / enhancement to an existing function labels Jul 23, 2023
@tfeher
Copy link
Contributor Author

tfeher commented Jul 23, 2023

This change is partially consistent with the IVF-Flat API, that also uses int64_t as mdspan index type (although there IdxT is still used as mdspan extent IndexType type).

I recommend that we update IVF-PQ user facing API to use the same int64_t as mdspan IndexType. @achirkin.

@achirkin
Copy link
Contributor

Oh, I actually wanted to change IVF-Flat api to match IVF-PQ for quite some time. The logic of the API in IVF-PQ is simply to use the extent types (IndexType) that reflect the allowed input sizes to avoid extra dimension checks at runtime:

  • build / extend types support input sizes that match IdxT, hence IndexType == IdxT there; the only limit is the number of possible values of IdxT.
  • search does not support batches larger than uint32_t internally, hence IndexType == uint32_t. Note also, the integer inputs k and n_queries in the raw pointer API also have the same uint32_t (in any case they should be the same as extent types of queries, neighbors, and distances mdspan types).

I strongly suggest to follow the same logic in IVF-Flat. In fact, it almost follows the same logic, except with bugs. (1) it does not support larger-than-uint32_t batch sizes as well, (2) its n_queries and k arguments are uint32_t. The outer API layer simply casts the IdxT extents to uint32_t, which is very wrong.

I'm not sure if that's the case with CAGRA though, as I am not so familiar with this code. However, I believe the logic I outline above makes most sense independent of implementation: the extent types guide the user, telling them what range of inputs is supported without unexpected failures at runtime (when e.g. the input size suddenly goes beyond the magical 4B after months of use). There's also not much sense in supporting search batch sizes beyond uint32_t.

@tfeher
Copy link
Contributor Author

tfeher commented Jul 24, 2023

Thanks @achirkin for the detailed comment!

However, I believe the logic I outline above makes most sense independent of implementation: the extent types guide the user, telling them what range of inputs is supported without unexpected failures at runtime (when e.g. the input size suddenly goes beyond the magical 4B after months of use).

It is logical what you write, but I am not convinced that this is the best way to communicate the restrictions to both our users and developers.

search does not support batches larger than uint32_t internally, hence IndexType == uint32_t.

Theoretically we still support a queries array with n_rows * n_cols > 4B, right? Representing such input with matrix_view<T, uint32_t> problematic:

  • if we would call queries.size() then it would lead to integer overflow,
  • if we would use the indexing operator of mdspan (which CAGRA does), that would also lead to integer overflow.

The same issue is present with ivf_pq::build: if someone would instantiate IVF-PQ build with IdxT = uint32_t, then we have the same problem that I have described above (although libraft.so only instantiates that API with int64_t index type).

The points above tell me that it for IVF-PQ, we shall fix mdspan::index_type as int64_t. The current status could lead to potential bugs further down the road. One reason we get away with this in IVF-PQ, is that its implementation was written before we embraced mdspan, and therefore it only uses mdspan as a struct to store the data pointer and shape values.

The logic of the API in IVF-PQ is simply to use the extent types (IndexType) that reflect the allowed input sizes to avoid extra dimension checks at runtime

Adding few runtime checks is a small price to pay for a less error prone implementation.

@achirkin
Copy link
Contributor

I believe encoding restrictions in types is always the best way, because it guarantees correctness and saves everybody from tracking tons of assertions across the codebase and duplicating (and forgetting to do so) these assertions in the docs.

The issue with having overflows in while using mdspan::size or indexing is easily avoided by transforming the extent type from 32-bit to 64-bit on the implementation side. Having uint32_t on the API side leaves you, as a developer, the choice of either casting the extents to a wider type for safety or doing the raw pointer arithmetic manually to maximize performance in a critical code path.

Therefore I think the runtime checks are neither a small price nor the less error-prone option.

@tfeher
Copy link
Contributor Author

tfeher commented Jul 24, 2023

The issue with having overflows in while using mdspan::size or indexing is easily avoided by transforming the extent type from 32-bit to 64-bit on the implementation side.

While I agree with the above statement, I would also add that this is not only a raft implementation detail:

Having uint32_t on the API side leaves our users with structs that are in inconsistent state: they cannot call mdspan::size (or rather required_span_size) on these objects any more.

I do not agree that having uint32_t index_type in the API is less error prone. We are effectively forcing a wrapper structure on our users that promises other features (indexing op, size queries) that would break when they try to use it on their side.

Having uint32_t on the API side leaves you, as a developer, the choice of either casting the extents to a wider type for safety or doing the raw pointer arithmetic manually to maximize performance in a critical code path.

But that would also need to be documented in developer docs and diligently followed by raft developers. This will definitely lead to errors further down the road.

Note that we still keep uint32_t everywhere it is applicable in the critical code path.

@achirkin
Copy link
Contributor

Ok then if we don't want to encode the possible range of inputs in the type of extents, I agree, we should just replace extents type (mdspan IndexType) with int64_t and never use templates for it in the public API (decouple it completely from the data type IdxT that we use to represent indices/keys). This includes both IVF-Flat and IVF-PQ as neither of them fix the extent types to int64_t in any of the public api functions (I'm not talking about instantiations).

@dantegd dantegd mentioned this pull request Jul 24, 2023
8 tasks
Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@cjnolet
Copy link
Member

cjnolet commented Jul 27, 2023

/merge

@rapids-bot rapids-bot bot merged commit a20f497 into rapidsai:branch-23.08 Jul 27, 2023
raydouglass pushed a commit that referenced this pull request Jul 31, 2023
First verstion of a CAGRA API in pylibraft. 

Todos:

- [x] C++ raft_runtime instantiations and void overloads
- [x] Cython API
- [x] Solve issue of `cagra_types.hpp` including `#include <raft/util/pow2_utils.cuh>` that makes it need nvcc, blocking a clean C++ only cython build
- [x] Check in pytests
- [x] Add examples to docstrings 
- [x] Accommodate for parameter rename of #1676
- [x] Accomodate changes of #1664 
- [x] Move out of experimental namespace

Authors:
   - Dante Gama Dessavre (https://github.com/dantegd)
   - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
   - Corey J. Nolet (https://github.com/cjnolet)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change bug Something isn't working cpp Vector Search
Projects
Development

Successfully merging this pull request may close these issues.

3 participants