Separate CAGRA index type from internal idx type #1664

tfeher · 2023-07-23T22:17:22Z

This PR decouples CAGRA IdxT from the mdspan IndexType used in CAGRA API, by making the following change
mdspan<Type, IdxT> -> mdspan<Type, int64_t>.

In cagra::index<T, IdxT> the index type IdxT is the type that represents the neighbor indices returned during the ANN search, and the same type is also used as to store the neighborhood graph.

The build and search method take mdspan objects as input/output argument. These mdspans have an IndexType that is used during offset calculation when indexing the matrix. Prior to this PR, we used IdxT as the mdspan index type.

There is a strong motivation to keep IdxT as uint32_t in order to minimize the memory footprint of the KNN graph. At the same time, the size and offset calculations of mdspans that represent dataset and knn graph require 64-bit indexing for large datasets. This PR changes the mdspan extent IndexType to int64_t.

This PR does not affect the performance of CAGRA: the search kernels already used correct index type. Only a few operations are affected by this change, and these are outside perf sensitive regions.

tfeher · 2023-07-23T22:31:45Z

This change is partially consistent with the IVF-Flat API, that also uses int64_t as mdspan index type (although there IdxT is still used as mdspan extent IndexType type).

I recommend that we update IVF-PQ user facing API to use the same int64_t as mdspan IndexType. @achirkin.

achirkin · 2023-07-24T06:26:36Z

Oh, I actually wanted to change IVF-Flat api to match IVF-PQ for quite some time. The logic of the API in IVF-PQ is simply to use the extent types (IndexType) that reflect the allowed input sizes to avoid extra dimension checks at runtime:

build / extend types support input sizes that match IdxT, hence IndexType == IdxT there; the only limit is the number of possible values of IdxT.
search does not support batches larger than uint32_t internally, hence IndexType == uint32_t. Note also, the integer inputs k and n_queries in the raw pointer API also have the same uint32_t (in any case they should be the same as extent types of queries, neighbors, and distances mdspan types).

I strongly suggest to follow the same logic in IVF-Flat. In fact, it almost follows the same logic, except with bugs. (1) it does not support larger-than-uint32_t batch sizes as well, (2) its n_queries and k arguments are uint32_t. The outer API layer simply casts the IdxT extents to uint32_t, which is very wrong.

I'm not sure if that's the case with CAGRA though, as I am not so familiar with this code. However, I believe the logic I outline above makes most sense independent of implementation: the extent types guide the user, telling them what range of inputs is supported without unexpected failures at runtime (when e.g. the input size suddenly goes beyond the magical 4B after months of use). There's also not much sense in supporting search batch sizes beyond uint32_t.

tfeher · 2023-07-24T11:37:17Z

Thanks @achirkin for the detailed comment!

However, I believe the logic I outline above makes most sense independent of implementation: the extent types guide the user, telling them what range of inputs is supported without unexpected failures at runtime (when e.g. the input size suddenly goes beyond the magical 4B after months of use).

It is logical what you write, but I am not convinced that this is the best way to communicate the restrictions to both our users and developers.

search does not support batches larger than uint32_t internally, hence IndexType == uint32_t.

Theoretically we still support a queries array with n_rows * n_cols > 4B, right? Representing such input with matrix_view<T, uint32_t> problematic:

if we would call queries.size() then it would lead to integer overflow,
if we would use the indexing operator of mdspan (which CAGRA does), that would also lead to integer overflow.

The same issue is present with ivf_pq::build: if someone would instantiate IVF-PQ build with IdxT = uint32_t, then we have the same problem that I have described above (although libraft.so only instantiates that API with int64_t index type).

The points above tell me that it for IVF-PQ, we shall fix mdspan::index_type as int64_t. The current status could lead to potential bugs further down the road. One reason we get away with this in IVF-PQ, is that its implementation was written before we embraced mdspan, and therefore it only uses mdspan as a struct to store the data pointer and shape values.

The logic of the API in IVF-PQ is simply to use the extent types (IndexType) that reflect the allowed input sizes to avoid extra dimension checks at runtime

Adding few runtime checks is a small price to pay for a less error prone implementation.

achirkin · 2023-07-24T11:56:36Z

I believe encoding restrictions in types is always the best way, because it guarantees correctness and saves everybody from tracking tons of assertions across the codebase and duplicating (and forgetting to do so) these assertions in the docs.

The issue with having overflows in while using mdspan::size or indexing is easily avoided by transforming the extent type from 32-bit to 64-bit on the implementation side. Having uint32_t on the API side leaves you, as a developer, the choice of either casting the extents to a wider type for safety or doing the raw pointer arithmetic manually to maximize performance in a critical code path.

Therefore I think the runtime checks are neither a small price nor the less error-prone option.

tfeher · 2023-07-24T12:36:37Z

The issue with having overflows in while using mdspan::size or indexing is easily avoided by transforming the extent type from 32-bit to 64-bit on the implementation side.

While I agree with the above statement, I would also add that this is not only a raft implementation detail:

Having uint32_t on the API side leaves our users with structs that are in inconsistent state: they cannot call mdspan::size (or rather required_span_size) on these objects any more.

I do not agree that having uint32_t index_type in the API is less error prone. We are effectively forcing a wrapper structure on our users that promises other features (indexing op, size queries) that would break when they try to use it on their side.

Having uint32_t on the API side leaves you, as a developer, the choice of either casting the extents to a wider type for safety or doing the raw pointer arithmetic manually to maximize performance in a critical code path.

But that would also need to be documented in developer docs and diligently followed by raft developers. This will definitely lead to errors further down the road.

Note that we still keep uint32_t everywhere it is applicable in the critical code path.

achirkin · 2023-07-24T14:28:08Z

Ok then if we don't want to encode the possible range of inputs in the type of extents, I agree, we should just replace extents type (mdspan IndexType) with int64_t and never use templates for it in the public API (decouple it completely from the data type IdxT that we use to represent indices/keys). This includes both IVF-Flat and IVF-PQ as neither of them fix the extent types to int64_t in any of the public api functions (I'm not talking about instantiations).

cjnolet

LGTM.

cjnolet · 2023-07-27T14:48:05Z

/merge

First verstion of a CAGRA API in pylibraft. Todos: - [x] C++ raft_runtime instantiations and void overloads - [x] Cython API - [x] Solve issue of `cagra_types.hpp` including `#include <raft/util/pow2_utils.cuh>` that makes it need nvcc, blocking a clean C++ only cython build - [x] Check in pytests - [x] Add examples to docstrings - [x] Accommodate for parameter rename of #1676 - [x] Accomodate changes of #1664 - [x] Move out of experimental namespace Authors: - Dante Gama Dessavre (https://github.com/dantegd) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Corey J. Nolet (https://github.com/cjnolet)

Separate cagra index type from internal idx type

d1b53b1

tfeher requested a review from a team as a code owner July 23, 2023 22:17

github-actions bot added the cpp label Jul 23, 2023

Merge branch 'branch-23.08' into cagra_matrix_idx

a617988

tfeher self-assigned this Jul 23, 2023

tfeher added breaking Breaking change bug Something isn't working improvement Improvement / enhancement to an existing function and removed improvement Improvement / enhancement to an existing function labels Jul 23, 2023

dantegd mentioned this pull request Jul 24, 2023

CAGRA Python wrappers #1665

Merged

8 tasks

tfeher added 2 commits July 25, 2023 14:32

Merge remote-tracking branch 'origin/branch-23.08' into cagra_matrix_idx

69effcd

Merge branch 'branch-23.08' into cagra_matrix_idx

6436d96

tfeher added the Vector Search label Jul 26, 2023

tfeher added 3 commits July 26, 2023 17:31

Merge branch 'branch-23.08' into cagra_matrix_idx

f43d8e9

Merge branch 'branch-23.08' into cagra_matrix_idx

a00202a

Merge branch 'branch-23.08' into cagra_matrix_idx

d7d40bb

cjnolet approved these changes Jul 27, 2023

View reviewed changes

rapids-bot bot merged commit a20f497 into rapidsai:branch-23.08 Jul 27, 2023

cjnolet mentioned this pull request Jul 27, 2023

[BUG] IVF-PQ search should use int64_t extent and index type #1684

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Separate CAGRA index type from internal idx type #1664

Separate CAGRA index type from internal idx type #1664

tfeher commented Jul 23, 2023

tfeher commented Jul 23, 2023

achirkin commented Jul 24, 2023

tfeher commented Jul 24, 2023

achirkin commented Jul 24, 2023

tfeher commented Jul 24, 2023

achirkin commented Jul 24, 2023

cjnolet left a comment

cjnolet commented Jul 27, 2023

Separate CAGRA index type from internal idx type #1664

Separate CAGRA index type from internal idx type #1664

Conversation

tfeher commented Jul 23, 2023

tfeher commented Jul 23, 2023

achirkin commented Jul 24, 2023

tfeher commented Jul 24, 2023

achirkin commented Jul 24, 2023

tfeher commented Jul 24, 2023

achirkin commented Jul 24, 2023

cjnolet left a comment

Choose a reason for hiding this comment

cjnolet commented Jul 27, 2023