[Feature Request]Vector deduplication #3087

heemin32 · 2023-10-10T18:21:19Z

Summary

I would like to see vector deduplication support in Faiss.

There will be a set of vectors which are under a same group. During KNN search, I would like to get k nearest vectors for each group but not more than one from the same group.

For example, let's say a user have a set of large documents and it wants to search the document using vector representation of the document. Because the document is big, user might need to split the original document(parent document) into smaller documents(child document) and generate vector for each smaller document using ml model. The vector data of child document will be indexed and searched using faiss library.

The problem is that, because search is happening in the vector date of child documents, it does not guarantee to return k vector of child documents with distinct parent document. Ideally, it should dedupe the search result per parent document and return k top vector of child document of which parent document is all distinct.

We could extend https://github.com/facebookresearch/faiss/blob/main/faiss/impl/IDSelector.h or add new IDDeduper to deduplicate the results.

mdouze · 2023-10-11T15:06:07Z

That's an interesting database functionality.
I don't see immediately how this could be supported in Faiss, but let's keep it in mind.

navneet1v · 2023-11-07T00:40:57Z

@mdouze I see @heemin32 has updated the proposal, I think this will be a good feature to go in side by side of IDSelector Interface or even extending that interface.

This will provide the a better way to get top K similar results, if vectors are related in some fashion. Please provide your thoughts.

heemin32 · 2023-11-15T21:58:02Z

@mdouze Any further thought on this? As @navneet1v mentioned, we can add something like IDDeduper(or IDGroup) and pass it through search parameter. During the result collection, we can dedupe the ID and only store the nearest ID/Distance.

mdouze · 2023-11-20T07:53:43Z

AFAICS this is of broad enough interest to put into main Faiss.
You have to realize that adding a new result filtering criterion means touching ~20 index implementations, at least to raise an error signaling it is not supported.
Workarounds are:

over-fetch and dedup afterwards
implement search externally.

heemin32 · 2023-11-20T17:35:48Z

I think we can support the feature incrementally starting hnsw index. For all other index, we can throw an error message like what we do for search parameter. FAISS_THROW_IF_NOT_MSG(!params, "search params not supported for this index");

navneet1v · 2023-11-20T19:37:18Z

@mdouze
The workarounds provided won't work well.

Thanks for pointing this, I agree that we might need to touch more than ~20 index implementation, but as @heemin32 pointed out we will start with HNSW and for all other we will make sure that we are sending errors. Apart from this will there be anything that we are missing?

mdouze · 2023-11-20T23:03:55Z

Sorry for my message above, I meant this is not of broad enough interest to put in Faiss.

navneet1v · 2023-11-20T23:29:27Z

@mdouze is there any faiss mailing list on which we can send out this feature and see if there are users interested in this feature?

Reason why I saying this is because the use case mentioned in the description is very popular use case in today's world with RAG/Semantic Search use cases. Given that embeddings models cannot create embedding for large documents in 1 call, so we need to create M embeddings for 1 document by splitting the larger document in smaller chunks. Similarly while doing search we want to make sure that chunks of same documents are not returned back as K results.

Please let us know your thoughts.

heemin32 · 2023-11-22T20:27:09Z

@mdouze I raised a draft PR for vector deduplication in HNSW. #3140
I am going to continue to work to make the PR to meet the bar to be merged in the main branch. Could you take a look and provide a feedback? Any objection on continuing the work?

heemin32 · 2023-12-07T00:38:06Z

@mdouze I added a new PR which introduce a result collector so that caller can implement deduplication logic outside of faiss repo. Could you take a look and provide a comment? #3161

heemin32 · 2024-01-25T23:03:21Z

After the refactoring around ResultHandler, #3190, I came up with new plan to support grouping of result.

Create IDGrouper similar to IDSelector. Add two grouper to start with: Bitmap, SparseBitmap
Add IDGrouper to SearchParameter
Create GroupedHeapBlockResultHandler similar to HeapBlockResultHandler. This class will have two additional data: an array of group id corresponding to an array of ids, a map from group id to index in an array of ids
Implement a method to handle heap with an array of group id to be used inside GroupedHeapBlockResultHandler
Modify search method to switch result handler if search parameter has IDGrouper

mdouze added the question label Oct 11, 2023

heemin32 changed the title ~~[Feature Request]Parent join support~~ [Feature Request]Vector deduplication support Nov 6, 2023

heemin32 changed the title ~~[Feature Request]Vector deduplication support~~ [Feature Request]Vector deduplication Nov 6, 2023

mdouze mentioned this issue Nov 23, 2023

Add vector deduplication for HNSW #3140

Closed

This was referenced Nov 30, 2023

Support for ANN Indexing in Nested k-NN Vectors in OpenSearch opensearch-project/k-NN#1333

Closed

Introduce result collector for HNSW #3161

Closed

heemin32 mentioned this issue Feb 7, 2024

Support grouping of vector data in HNSW #3227

Open

asadoughi added feature request Implementation Integration and removed question labels Jun 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request]Vector deduplication #3087

[Feature Request]Vector deduplication #3087

heemin32 commented Oct 10, 2023 •

edited

Loading

mdouze commented Oct 11, 2023

navneet1v commented Nov 7, 2023 •

edited

Loading

heemin32 commented Nov 15, 2023 •

edited

Loading

mdouze commented Nov 20, 2023

heemin32 commented Nov 20, 2023

navneet1v commented Nov 20, 2023

mdouze commented Nov 20, 2023

navneet1v commented Nov 20, 2023 •

edited

Loading

heemin32 commented Nov 22, 2023 •

edited

Loading

heemin32 commented Dec 7, 2023

heemin32 commented Jan 25, 2024 •

edited

Loading

[Feature Request]Vector deduplication #3087

[Feature Request]Vector deduplication #3087

Comments

heemin32 commented Oct 10, 2023 • edited Loading

Summary

mdouze commented Oct 11, 2023

navneet1v commented Nov 7, 2023 • edited Loading

heemin32 commented Nov 15, 2023 • edited Loading

mdouze commented Nov 20, 2023

heemin32 commented Nov 20, 2023

navneet1v commented Nov 20, 2023

mdouze commented Nov 20, 2023

navneet1v commented Nov 20, 2023 • edited Loading

heemin32 commented Nov 22, 2023 • edited Loading

heemin32 commented Dec 7, 2023

heemin32 commented Jan 25, 2024 • edited Loading

heemin32 commented Oct 10, 2023 •

edited

Loading

navneet1v commented Nov 7, 2023 •

edited

Loading

heemin32 commented Nov 15, 2023 •

edited

Loading

navneet1v commented Nov 20, 2023 •

edited

Loading

heemin32 commented Nov 22, 2023 •

edited

Loading

heemin32 commented Jan 25, 2024 •

edited

Loading