IVF-PQ: manipulating individual lists #1298
Conversation
```cpp
template <typename T, typename IdxT>
void reconstruct_list_data(raft::device_resources const& res,
                           const index<IdxT>& index,
                           device_matrix_view<T, uint32_t, row_major> out_vectors,
                           uint32_t label,
                           uint32_t n_skip)
```
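For context, a minimal usage sketch of the proposed helper (a sketch only, assuming an already-built `float` index `idx`; the `label` and sizes are illustrative):

```cpp
// Minimal sketch, assuming an ivf_pq::index<int64_t> `idx` built elsewhere and
// that list `label` holds at least n_skip + n_take vectors.
raft::device_resources res;
uint32_t label  = 0;    // which IVF list (cluster) to read from
uint32_t n_skip = 0;    // how many vectors to skip at the start of the list
uint32_t n_take = 128;  // how many vectors to reconstruct

auto out_vectors = raft::make_device_matrix<float, uint32_t>(res, n_take, idx.dim());
reconstruct_list_data(res, idx, out_vectors.view(), label, n_skip);
```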
@cjnolet, could you please check if this interface is suitable for integration with FAISS?
A couple of important points that the FAISS developers have pointed out:
1. We need to be sure that we don't assume indexes are drawn from a strictly monotonically increasing set.
2. Following from 1, what's been asked for is a `reconstruct_batch` API which accepts a collection of ids (which are not necessarily contiguous) and reconstructs those vectors (sketched below).
From what I can tell, the API you are proposing here doesn't implement the above, so I would say it's not suitable for the FAISS integration.
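For illustration, such a batch interface might look roughly like the following (a sketch only; the name `reconstruct_batch` comes from the FAISS ask above, everything else is an assumption, not a proposed signature):

```cpp
// Hypothetical shape of a batch-reconstruction interface: `ids` are the
// user-supplied vector ids (not assumed contiguous, sorted, or monotonic);
// each id must be located in the index before its vector can be decoded.
template <typename T, typename IdxT>
void reconstruct_batch(raft::device_resources const& res,
                       const index<IdxT>& index,
                       raft::device_vector_view<const IdxT, uint32_t> ids,
                       raft::device_matrix_view<T, uint32_t, row_major> out_vectors);
```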
A quick clarification: are the ids we should pass (A) the same indices we pass during construction, or (B) internal ids of the ivf-pq index?
If (B), should these be a pair (label, collection of in-cluster-ids), or a collection of (label, in-cluster-id) pairs? The former can perhaps be slightly faster (both shapes are sketched below).
If (A), I can hardly imagine an efficient function doing this: we don't sort or organize the input indices in any way, so a full search of the database would be needed for every index. Perhaps we can split the task in two: search for the indices first, then call interface (B)?
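To make the two shapes of interface (B) concrete, a purely hypothetical sketch (names and parameter types are assumptions, not proposed signatures):

```cpp
// (B1) one label plus a collection of in-cluster ids: the whole batch comes
//      from a single list, so the list only has to be located once.
template <typename T, typename IdxT>
void reconstruct_from_list(raft::device_resources const& res,
                           const index<IdxT>& index,
                           uint32_t label,
                           raft::device_vector_view<const uint32_t, uint32_t> in_cluster_ids,
                           raft::device_matrix_view<T, uint32_t, row_major> out_vectors);

// (B2) a collection of (label, in-cluster-id) pairs: more general, but every
//      element may touch a different list.
template <typename T, typename IdxT>
void reconstruct_from_pairs(raft::device_resources const& res,
                            const index<IdxT>& index,
                            raft::device_matrix_view<const uint32_t, uint32_t, row_major> label_id_pairs,
                            raft::device_matrix_view<T, uint32_t, row_major> out_vectors);
```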
Also don't you need something like this interface to be able to convert our GPU index to their CPU index type (more-or-less efficiently, converting one cluster at a time)?
> Also don't you need something like this interface to be able to convert our GPU index to their CPU index type (more-or-less efficiently, converting one cluster at a time)?
There are a couple of different asks here. We have an ask to be able to convert from one interleaved (and potentially non-interleaved) encoded format to another, which is required for converting between CPU and GPU indices. That ask is for us to have functions available that can be used by their CodePacker API.
In addition to this, there's the ask to be able to reconstruct a batch of vectors given a set of indices, which might not be in order. This is the corresponding issue for that ask (which comes directly from Jeff Johnson). The indices passed into the `reconstruct_batch` would also coincide with the indices specified when invoking `extend` here.
> That says to the world "We will not change these APIs, nor the way they function, without deprecation first".
Oh, now I see what you mean. I agree, we should export `detail::pack_list_data` and `detail::unpack_list_data` into a public namespace, to make sure we don't change their signatures and break the FAISS integration by accident. I'm also not opposed to the idea of moving the construct/reconstruct and other similar functions to the `ivf_pq::helpers` public namespace.
The only thing that bothers me is that the overloads we need for the codepacker really go against the current design of the public API. Maybe, as an exception, we export these two into something like `ivf_pq::codepacker` to discourage other users from using them? Or even implement the whole CodePacker API in RAFT in that namespace?
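For clarity, a sketch of the namespace nesting being discussed (the nesting is the point; the parameter lists are intentionally omitted and are not the actual signatures):

```cpp
// Sketch of the proposed nesting; the real signatures live in the PR, not here.
namespace raft::neighbors::ivf_pq::helpers {
namespace codepacker {

// Unpack interleaved list_data into flat PQ codes.
void unpack(/* res, list_data, pq_bits, offset, out codes */);

// Pack flat PQ codes back into the interleaved list_data layout.
void pack(/* res, codes, pq_bits, offset, in/out list_data */);

}  // namespace codepacker
}  // namespace raft::neighbors::ivf_pq::helpers
```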
> to discourage other users from using them?
I really don't think this is a good way to approach any public API. Either the API is public and users can use it, or it's not and they touch that code at their own risk. Trying to force this middle ground for some users and not others is not sustainable, and I'm not sure why there's a problem with just exposing it outright.
I also don't agree that these functions shouldn't be part of the public API: if users need the ability to call a function, it needs to be wrapped into the public API so it can be clearly documented, not just exported.
We tried exporting initially when we first started exposing formal public APIs, and that was a mistake. Anything the user should be interacting with directly should be available and clearly documented (which for almost all of the functions in RAFT just means wrapped).
I'm not opposed to using a nested "CodePacker" namespace, but I'd prefer that it is nested inside helpers, so we stay consistent with respect to the standard API namespaces and the "extra" functionality.
I have added a public API for `pack_list_data`.
Note, for the codepacker we'd probably need to expose this overload (raft/cpp/include/raft/neighbors/detail/ivf_pq_build.cuh, lines 575–580 at 926f510):
```cpp
inline void unpack_list_data(
  device_matrix_view<uint8_t, uint32_t, row_major> codes,
  device_mdspan<const uint8_t, list_spec<uint32_t, uint32_t>::list_extents, row_major> list_data,
  std::variant<uint32_t, const uint32_t*> offset_or_indices,
  uint32_t pq_bits,
  rmm::cuda_stream_view stream)
```
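The `std::variant` parameter selects between a contiguous chunk and a scattered gather; a small hedged illustration (everything besides the variant type itself is assumed):

```cpp
// Contiguous: unpack codes.extent(0) rows of the list starting at row 100.
std::variant<uint32_t, const uint32_t*> where = uint32_t{100};

// Scattered: unpack exactly the rows named by a device array of row indices
// (one entry per output row).
// const uint32_t* d_row_indices = ...;  // device pointer, codes.extent(0) entries
// where = d_row_indices;
```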
I had missed this message before. I have now exposed the pack/unpack versions that take `list_data` directly, under the `ivf_pq::helpers::codepacker` namespace.
In our implementation we still need to provide the `pq_bits` parameter; otherwise it is very similar to the CodePacker API.
One question: what does `block` refer to in CodePacker? In our API it refers to a set of interleaved vectors, which can be the data for a whole cluster or less.
Thanks Artem for the PR! As far as I understand, we are still waiting for additional helpers to enable looking up vectors by their original indices. Until that is ready, I have looked at the implementation details. It looks good so far; you will find my comments below.
I think this generally looks good, though I'll need to go back over the codepacker API just to make sure we'll be able to fill it in with these functions. Just a few other (mostly minor) API-ish things from my side.
> Note, although the public API now provides a way to read raw flat PQ codes, it's not suitable for the CodePacker, because it doesn't use raw pointers. However, one of the corresponding functions in the detail namespace should hopefully work for this purpose (detail::pack_list_data and detail::unpack_list_data).
I think we should use `raft::span` here to expose these in `raft::neighbors::ivf_pq::helpers`, especially since we know the `block` will be of size `block_size` and the `flat_codes` will be of size `nvec * code_size`.
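A sketch of what such span-based wrappers could look like (illustrative signatures under the assumptions above, not the final API):

```cpp
// Illustrative only: raft::device_span-based wrappers whose extents carry the
// sizes mentioned above (block_size bytes for the interleaved block,
// nvec * code_size bytes for the flat codes).
void pack(raft::device_resources const& res,
          raft::device_span<const uint8_t> flat_codes,  // nvec * code_size bytes
          uint32_t pq_bits,
          uint32_t offset,
          raft::device_span<uint8_t> block);            // block_size bytes

void unpack(raft::device_resources const& res,
            raft::device_span<const uint8_t> block,
            uint32_t pq_bits,
            uint32_t offset,
            raft::device_span<uint8_t> flat_codes);
```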
I'm not sure what you mean by using `raft::span` here.
@cjnolet Since Artem is on leave, I am addressing the open issues. Will push an update later today.
I have addressed the open issues.
LGTM. Thanks Tamas!
/merge
Add public functions for reading and writing into individual ivf-pq lists (clusters), in the input space (reconstructed data) and in flat PQ codes. Partially solves (IVF-PQ) rapidsai#1205

Authors:
- Artem M. Chirkin (https://github.com/achirkin)
- Corey J. Nolet (https://github.com/cjnolet)
- Tamas Bela Feher (https://github.com/tfeher)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1298