Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support vector deletion in ANN IVF #1831

Merged
merged 12 commits into from
Nov 6, 2023

Conversation

lowener
Copy link
Contributor

@lowener lowener commented Sep 18, 2023

PR based on the new Bitset feature (#1803) to support vector deletion in ANN.
Closes #1177.
Closes #1620.
This PR adds ivf_to_sample_filter that acts as an intermediate filter to use an IVF index with a bitset filter.

@cjnolet
Copy link
Member

cjnolet commented Sep 19, 2023

@lowener it looks like these changes are making great progress and should be wrapped up soon. I think it's time to start proactively gathering benchmarks for the CAGRA deletion all the way through. Please focus exclusively on CAGRA for 23.10 since that's what we've promised. We can add deletion to the others in 23.12 (or 23.10 if there's time after CAGRA).

Benchmarks will be super important. I would make sure to benchmark various different cases. Maybe stepping up in powers of 2 would be helpful? (2%, 4%, 8% 16% 32% 64%).

@cjnolet cjnolet added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 26, 2023
@lowener lowener changed the title [FEA] Support vector deletion in ANN [FEA] Support vector deletion in ANN IVF Sep 27, 2023
@lowener lowener force-pushed the 23.10-ann-deletion branch from 9661814 to 6c6008e Compare October 15, 2023 22:51
@lowener lowener changed the base branch from branch-23.10 to branch-23.12 October 15, 2023 22:51
@github-actions github-actions bot added the CMake label Oct 17, 2023
@lowener
Copy link
Contributor Author

lowener commented Oct 17, 2023

Benchmark result on my local machine.
10M rows, 128 dim, 1k queries, 255k.
Deletion ratio of 0%, 2%, 4%, 8%, 16%, 32%, 64%

IVF-Flat:

KNN/float/int64_t/ivf_flat_filter_knn/0/0/0/manual_time       4860 ms         5640 ms            1 10000000#128#1000#255#0#NO_COPY#BUILD_SEARCH
KNN/float/int64_t/ivf_flat_filter_knn/1/0/0/manual_time       4920 ms         5699 ms            1 10000000#128#1000#255#0.02#NO_COPY#BUILD_SEARCH
KNN/float/int64_t/ivf_flat_filter_knn/2/0/0/manual_time       4941 ms         5720 ms            1 10000000#128#1000#255#0.04#NO_COPY#BUILD_SEARCH
KNN/float/int64_t/ivf_flat_filter_knn/3/0/0/manual_time       4973 ms         5752 ms            1 10000000#128#1000#255#0.08#NO_COPY#BUILD_SEARCH
KNN/float/int64_t/ivf_flat_filter_knn/4/0/0/manual_time       4979 ms         5758 ms            1 10000000#128#1000#255#0.16#NO_COPY#BUILD_SEARCH
KNN/float/int64_t/ivf_flat_filter_knn/5/0/0/manual_time       4985 ms         5764 ms            1 10000000#128#1000#255#0.32#NO_COPY#BUILD_SEARCH
KNN/float/int64_t/ivf_flat_filter_knn/6/0/0/manual_time       4968 ms         5747 ms            1 10000000#128#1000#255#0.64#NO_COPY#BUILD_SEARCH

IVF-PQ:

KNN/float/int64_t/ivf_pq_filter_knn/0/0/0/manual_time        20101 ms        20880 ms            1 10000000#128#1000#255#0#NO_COPY#BUILD_SEARCH
KNN/float/int64_t/ivf_pq_filter_knn/1/0/0/manual_time        20292 ms        21071 ms            1 10000000#128#1000#255#0.02#NO_COPY#BUILD_SEARCH
KNN/float/int64_t/ivf_pq_filter_knn/2/0/0/manual_time        20215 ms        20993 ms            1 10000000#128#1000#255#0.04#NO_COPY#BUILD_SEARCH
KNN/float/int64_t/ivf_pq_filter_knn/3/0/0/manual_time        20336 ms        21113 ms            1 10000000#128#1000#255#0.08#NO_COPY#BUILD_SEARCH
KNN/float/int64_t/ivf_pq_filter_knn/4/0/0/manual_time        20371 ms        21149 ms            1 10000000#128#1000#255#0.16#NO_COPY#BUILD_SEARCH
KNN/float/int64_t/ivf_pq_filter_knn/5/0/0/manual_time        20331 ms        21110 ms            1 10000000#128#1000#255#0.32#NO_COPY#BUILD_SEARCH
KNN/float/int64_t/ivf_pq_filter_knn/6/0/0/manual_time        20383 ms        21161 ms            1 10000000#128#1000#255#0.64#NO_COPY#BUILD_SEARCH

@lowener lowener marked this pull request as ready for review October 17, 2023 23:10
@lowener lowener requested review from a team as code owners October 17, 2023 23:10
Copy link
Contributor

@wphicks wphicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question in hopes of making the filtering API consistent across index types, but otherwise this looks very good. The minimal overhead in the benchmark results is very impressive.

* @tparam filter_t
*/
template <typename index_t, typename filter_t>
struct ivf_to_sample_filter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to provide a unified filtering interface, would it make sense to wrap any filter that is passed to search_with_filtering for IVF methods in this struct? I have trouble imagining a scenario in which the outer search would really want to make use of a filter that takes into account the in-cluster ID rather than the overall ID, and that would establish a rule that the filter's signature is just (query_index, sample_index) -> bool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexanderguzhva is the original creator of this filtering for IVF indexes. Do you have any examples or use-cases of filtering applied to a cluster/ids in the cluster?
If we can't find a use-case then I think it make sense to wrap every filter with ivf_to_sample_filter for a consistent filtering API.

Copy link
Contributor

@alexanderguzhva alexanderguzhva Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lowener Yes. https://github.com/rapidsai/raft/blob/branch-23.08/cpp/include/raft/neighbors/sample_filter_types.hpp contains commented out samples, which resemble the way it was originally used in Meta. If I remember correctly, there was some additional semantic information, associated with every sample in every cluster, and a cluster id was necessary to be known.
Technically, it is trivial to write a template wrapper that accepts two types of delegates, both (query_index, sample_index) -> bool and (query_index, cluster_index, sample_index) -> bool and put one inside ivfflat / ivfpq code. This is how I would do it.

@wphicks
Copy link
Contributor

wphicks commented Nov 3, 2023

/merge

@rapids-bot rapids-bot bot merged commit 28eb0b3 into rapidsai:branch-23.12 Nov 6, 2023
57 checks passed
@lowener lowener deleted the 23.10-ann-deletion branch November 6, 2023 21:38
benfred pushed a commit to benfred/raft that referenced this pull request Nov 8, 2023
PR based on the new Bitset feature (rapidsai#1803) to support vector deletion in ANN.
Closes rapidsai#1177. 
Closes rapidsai#1620.
This PR adds `ivf_to_sample_filter` that acts as an intermediate filter to use an IVF index with a bitset filter.

Authors:
  - Micka (https://github.com/lowener)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - William Hicks (https://github.com/wphicks)

URL: rapidsai#1831
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review CMake cpp improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

IVF-Flat sample filter is not applied in all required places [FEA] Remove vectors from IVF Index
4 participants