[Feature Request]Vector deduplication #3087
Comments
That's an interesting database functionality.
@mdouze I see @heemin32 has updated the proposal. I think this would be a good feature to add alongside the IDSelector interface, or even as an extension of that interface. It would provide a better way to get the top K similar results when vectors are related in some fashion. Please share your thoughts.
@mdouze Any further thoughts on this? As @navneet1v mentioned, we can add something like IDDeduper (or IDGroup) and pass it through the search parameters. During result collection, we can dedupe the IDs and store only the nearest ID/distance for each group.
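To make the result-collection idea concrete, here is a minimal sketch in plain Python. It is not Faiss code: `GroupedTopK`, `group_of`, and the sample IDs are hypothetical names chosen for illustration, and a real implementation would live in the C++ result handlers.

```python
import heapq
from typing import Callable, Dict, List, Tuple


class GroupedTopK:
    """Hypothetical collector that keeps only the nearest hit per group."""

    def __init__(self, k: int, group_of: Callable[[int], int]) -> None:
        self.k = k
        self.group_of = group_of  # maps a vector ID to its group ID (assumption)
        self.best: Dict[int, Tuple[float, int]] = {}

    def add(self, distance: float, vec_id: int) -> None:
        # Replace the stored hit only if this one is closer within its group.
        group = self.group_of(vec_id)
        current = self.best.get(group)
        if current is None or distance < current[0]:
            self.best[group] = (distance, vec_id)

    def results(self) -> List[Tuple[float, int]]:
        # The k nearest groups, each represented by its nearest vector.
        return heapq.nsmallest(self.k, self.best.values())


# Vector ID -> group ID; vectors 0 and 1 share a group, as do 3 and 4.
groups = {0: 10, 1: 10, 2: 11, 3: 12, 4: 12}
collector = GroupedTopK(k=2, group_of=groups.__getitem__)
for dist, vec_id in [(0.1, 0), (0.2, 1), (0.3, 2), (0.25, 4), (0.5, 3)]:
    collector.add(dist, vec_id)
top = collector.results()
```

This version stores one entry per group seen, so memory grows with the number of distinct groups visited; a production collector would likely bound that with a heap of size k.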
AFAICS this is of broad enough interest to put into main Faiss.
I think we can support the feature incrementally, starting with the HNSW index. For all other indexes, we can throw an error, as we do for search parameters.
@mdouze Thanks for pointing this out. I agree that we might need to touch more than ~20 index implementations, but as @heemin32 pointed out, we will start with HNSW, and for all the others we will make sure that we return errors. Apart from this, is there anything we are missing?
Sorry for my message above, I meant this is not of broad enough interest to put in Faiss.
@mdouze Is there a Faiss mailing list where we can send out this feature and see whether users are interested in it? The reason I ask is that the use case mentioned in the description is very popular today with RAG/semantic search. Embedding models cannot create an embedding for a large document in one call, so we need to create M embeddings for one document by splitting the larger document into smaller chunks. Similarly, during search we want to make sure that chunks of the same document are not returned back as Please let us know your thoughts.
After the refactoring around ResultHandler, #3190, I came up with a new plan to support grouping of results.
Summary
I would like to see vector deduplication support in Faiss.
There will be a set of vectors that belong to the same group. During KNN search, I would like to get the k nearest vectors, with no more than one vector from any given group.
For example, let's say a user has a set of large documents and wants to search them using vector representations. Because each document is big, the user might need to split the original document (parent document) into smaller documents (child documents) and generate a vector for each child document using an ML model. The vector data of the child documents will be indexed and searched using the Faiss library.
The problem is that, because the search happens over the vector data of child documents, it is not guaranteed to return k child-document vectors with distinct parent documents. Ideally, it should dedupe the search results per parent document and return the top k child-document vectors whose parent documents are all distinct.
We could extend https://github.com/facebookresearch/faiss/blob/main/faiss/impl/IDSelector.h or add a new IDDeduper to deduplicate the results.
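The difference between a plain top-k and the desired deduplicated top-k can be illustrated with a small plain-Python sketch. It assumes the hits are already sorted by ascending distance and that a child-to-parent mapping is available; `dedupe_sorted_hits`, `parent_of`, and the sample data are hypothetical and not part of Faiss.

```python
from typing import Dict, List, Tuple


def dedupe_sorted_hits(
    hits: List[Tuple[float, int]],
    parent_of: Dict[int, str],
    k: int,
) -> List[Tuple[float, int]]:
    """Keep the first (nearest) hit per parent, up to k distinct parents."""
    seen = set()
    out: List[Tuple[float, int]] = []
    for distance, child_id in hits:
        parent = parent_of[child_id]
        if parent in seen:
            continue  # a nearer chunk of this parent is already kept
        seen.add(parent)
        out.append((distance, child_id))
        if len(out) == k:
            break
    return out


# Child vector ID -> parent document (illustrative data).
parent_of = {0: "doc_a", 1: "doc_a", 2: "doc_a", 3: "doc_b", 4: "doc_c"}
# Hits sorted by ascending distance, as a KNN search would return them.
hits = [(0.10, 0), (0.12, 1), (0.15, 2), (0.30, 3), (0.40, 4)]

plain_top3 = hits[:3]  # all three hits come from doc_a
deduped_top3 = dedupe_sorted_hits(hits, parent_of, k=3)
```

Applied as a post-processing step, this only works if the caller over-fetches enough candidates to cover k distinct parents, which is why deduplicating inside the search is attractive.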