Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce result collector for HNSW #3161

Closed
wants to merge 1 commit into from

Conversation

heemin32
Copy link

@heemin32 heemin32 commented Dec 7, 2023

Introduce result collector for HNSW

The result collector is to be used by caller to implement their own logic to collect result. One of example is deduplicating result based on group id. #3087

@heemin32 heemin32 marked this pull request as ready for review December 8, 2023 18:33
@mdouze
Copy link
Contributor

mdouze commented Dec 12, 2023

Thanks for the PR.

It would indeed be useful to have a ResultCollector object to support knn / range search transparently, and possibly deduplication.
The result collector would be a class of the form:

struct ResultCollector {
    float threshold; // result will be added if distance < threshold 
    virtual bool add_result(idx_t id, float dis) = 0; // returns whether the result was added and the threshold was updated.
};

It can be provided to the Index's search function for a given query (like HNSW::search_from_candidates).

The advantage of having an abstract class for ResultCollector rather than a template is that it can be provided from an external caller. Besides, since the comparison with the threshold (with C::cmp) is performed prior to calling the virtual function, there is no possibly slow virtual function call for all computed distances, only for the ones that pass the comparison.

There are already ResultCollector objects in IndexIVFPQ and IndexIVFPQFastScan.

The ResultCollector in this PR is of too limited scope to be useful, since it does not enable the range search support and calling externally.

@heemin32
Copy link
Author

heemin32 commented Dec 12, 2023

Thanks for the comment. I get the point that it does not enable range search. However, what do you mean by saying "calling externally"? Are you saying that we should compare the distance before calling the virtual function to avoid possible performance degradation?

Also, there are already two methods in Index class: search and range_search. With your proposal of supporting ResultCollector for both knn and range search, a new method covering both knn and range search should be introduced. I assume you are okay with that?

@mdouze
Copy link
Contributor

mdouze commented Dec 13, 2023

I don't think the resultcollector could replace search and range_search functions because these take batches of queries while the resultcollector is for lower level calls and not thread-safe.
It would remain an internal function of the indexes.

@heemin32
Copy link
Author

  1. If ResultCollector is not passed through top level method like search and range_search, it cannot be overridden by another collector.
  2. Because it is not thread-safe, we need factory or something else to create ResultCollector per thread anyway

Another option is passing ResultCollector factory through HNSWSearchParameter as this PR and make its interface more generic so that both knn search and range search can be supported transparently.

@mdouze
Copy link
Contributor

mdouze commented Dec 14, 2023

Please hold on implementing anything. We are currently discussing how to refactor the code with more ResultHandlers.

@heemin32
Copy link
Author

Good to know that discussion is going on regarding ResultHandler. Hope that new plan also include the deduplication use case.

@heemin32 heemin32 closed this Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants