Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request]Vector deduplication #3087

Open
heemin32 opened this issue Oct 10, 2023 · 11 comments
Open

[Feature Request]Vector deduplication #3087

heemin32 opened this issue Oct 10, 2023 · 11 comments

Comments

@heemin32
Copy link

heemin32 commented Oct 10, 2023

Summary

I would like to see vector deduplication support in Faiss.

There will be a set of vectors which are under a same group. During KNN search, I would like to get k nearest vectors for each group but not more than one from the same group.

For example, let's say a user have a set of large documents and it wants to search the document using vector representation of the document. Because the document is big, user might need to split the original document(parent document) into smaller documents(child document) and generate vector for each smaller document using ml model. The vector data of child document will be indexed and searched using faiss library.

The problem is that, because search is happening in the vector date of child documents, it does not guarantee to return k vector of child documents with distinct parent document. Ideally, it should dedupe the search result per parent document and return k top vector of child document of which parent document is all distinct.

We could extend https://github.com/facebookresearch/faiss/blob/main/faiss/impl/IDSelector.h or add new IDDeduper to deduplicate the results.

@mdouze
Copy link
Contributor

mdouze commented Oct 11, 2023

That's an interesting database functionality.
I don't see immediately how this could be supported in Faiss, but let's keep it in mind.

@heemin32 heemin32 changed the title [Feature Request]Parent join support [Feature Request]Vector deduplication support Nov 6, 2023
@heemin32 heemin32 changed the title [Feature Request]Vector deduplication support [Feature Request]Vector deduplication Nov 6, 2023
@navneet1v
Copy link

navneet1v commented Nov 7, 2023

@mdouze I see @heemin32 has updated the proposal, I think this will be a good feature to go in side by side of IDSelector Interface or even extending that interface.

This will provide the a better way to get top K similar results, if vectors are related in some fashion. Please provide your thoughts.

@heemin32
Copy link
Author

heemin32 commented Nov 15, 2023

@mdouze Any further thought on this? As @navneet1v mentioned, we can add something like IDDeduper(or IDGroup) and pass it through search parameter. During the result collection, we can dedupe the ID and only store the nearest ID/Distance.

@mdouze
Copy link
Contributor

mdouze commented Nov 20, 2023

AFAICS this is of broad enough interest to put into main Faiss.
You have to realize that adding a new result filtering criterion means touching ~20 index implementations, at least to raise an error signaling it is not supported.
Workarounds are:

  • over-fetch and dedup afterwards
  • implement search externally.

@heemin32
Copy link
Author

I think we can support the feature incrementally starting hnsw index. For all other index, we can throw an error message like what we do for search parameter. FAISS_THROW_IF_NOT_MSG(!params, "search params not supported for this index");

@navneet1v
Copy link

@mdouze
The workarounds provided won't work well.

Thanks for pointing this, I agree that we might need to touch more than ~20 index implementation, but as @heemin32 pointed out we will start with HNSW and for all other we will make sure that we are sending errors. Apart from this will there be anything that we are missing?

@mdouze
Copy link
Contributor

mdouze commented Nov 20, 2023

Sorry for my message above, I meant this is not of broad enough interest to put in Faiss.

@navneet1v
Copy link

navneet1v commented Nov 20, 2023

@mdouze is there any faiss mailing list on which we can send out this feature and see if there are users interested in this feature?

Reason why I saying this is because the use case mentioned in the description is very popular use case in today's world with RAG/Semantic Search use cases. Given that embeddings models cannot create embedding for large documents in 1 call, so we need to create M embeddings for 1 document by splitting the larger document in smaller chunks. Similarly while doing search we want to make sure that chunks of same documents are not returned back as K results.

Please let us know your thoughts.

@heemin32
Copy link
Author

heemin32 commented Nov 22, 2023

@mdouze I raised a draft PR for vector deduplication in HNSW. #3140
I am going to continue to work to make the PR to meet the bar to be merged in the main branch. Could you take a look and provide a feedback? Any objection on continuing the work?

@heemin32
Copy link
Author

heemin32 commented Dec 7, 2023

@mdouze I added a new PR which introduce a result collector so that caller can implement deduplication logic outside of faiss repo. Could you take a look and provide a comment? #3161

@heemin32
Copy link
Author

heemin32 commented Jan 25, 2024

After the refactoring around ResultHandler, #3190, I came up with new plan to support grouping of result.

  1. Create IDGrouper similar to IDSelector. Add two grouper to start with: Bitmap, SparseBitmap
  2. Add IDGrouper to SearchParameter
  3. Create GroupedHeapBlockResultHandler similar to HeapBlockResultHandler. This class will have two additional data: an array of group id corresponding to an array of ids, a map from group id to index in an array of ids
  4. Implement a method to handle heap with an array of group id to be used inside GroupedHeapBlockResultHandler
  5. Modify search method to switch result handler if search parameter has IDGrouper

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants