-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using multiple indices in FAISS seems broken #3327
Comments
Hi @einarbmag I was able to reproduce the bug that you described. Thank you for the detailed description. I think you are right about why the bug occurs, which already points to a way to fix it. 🙂 If you want, you would be more than welcome to try to create a pull request with a fix yourself. Otherwise, I'll discuss with the team how to address this issue in our next sprint. |
Hey guys, as you know, I like to ⛏️ dig deep into code history.
|
vector_id = Column(String(100), unique=True, nullable=True) |
Is this behavior still desired? I don't see why vector_id
with the same value cannot exist in different indexes. But I'm not aware of all the internals...
Anyway, this uniqueness generated bugs, as reported in #1954.
So, in #1961, in update_embeddings
, vector_id = self.faiss_indexes[index].ntotal
was replaced by line 347:
haystack/haystack/document_stores/faiss.py
Lines 331 to 347 in db6e575
if update_existing_embeddings is True: | |
if filters is None: | |
self.faiss_indexes[index].reset() | |
self.reset_vector_ids(index) | |
else: | |
raise Exception("update_existing_embeddings=True is not supported with filters.") | |
if not self.faiss_indexes.get(index): | |
raise ValueError("Couldn't find a FAISS index. Try to init the FAISSDocumentStore() again ...") | |
document_count = self.get_document_count(index=index) | |
if document_count == 0: | |
logger.warning("Calling DocumentStore.update_embeddings() on an empty index") | |
return | |
logger.info("Updating embeddings for %s docs...", document_count) | |
vector_id = sum(index.ntotal for index in self.faiss_indexes.values()) |
Current bug
It is caused by vector_id = sum(index.ntotal for index in self.faiss_indexes.values())
.
In fact, in update_embeddings
, index2
is reset, but vector_id
starts from 3, making the retrieval ineffective.
Possible solutions
If you agree that vector_id
may not be unique, a quite easy solution is to revert #1961.
In any case, I am eager to hear your opinion and I am ready for the challenge of opening a PR 😄
@ZanSara any thoughts?
Hey @anakin87! 😊 Thank you once more for this research! I believe your analysis is correct and I don't see why we should have such mismatch between the So:
|
Describe the bug
I'm trying to use FAISS document store and want to store multiple use cases in the same document store, i.e. specify different "index" during writing and querying. Everything works as expected for the first index I create, but for the subsequent ones, the query results are unexpectedly limited.
I looked into it, and I think what's happening is:
in the FAISS db, documents in each new index are keyed by vector_id, and this gets reset to 0 in each index. However, in the SQL db, the "vector_id" key isn't reset to 0 for each index. So when
get_documents_by_vector_ids
gets called with the vector_id's from the FAISS lookup, you get the wrong results, and most likely no results at all, when querying any of the subsequently created indices.Error message
No error
Expected behavior
The vector ids in the SQL db should match those in the FAISS db for a given index.
Additional context
Add any other context about the problem here, like document types / preprocessing steps / settings of reader etc.
To Reproduce
FAQ Check
System:
The text was updated successfully, but these errors were encountered: