Incremental update of FAISSDocumentStore leads to incomplete query results #5071
Comments
Btw, if you change the parameter update_existing_embeddings to True, it works flawlessly. However, I do not consider it a viable workaround, as it leads to recalculating the embeddings of the whole document store. In my case, this takes way too long and defeats the idea of incremental updates/deletes of the document store.
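For reference, the workaround mentioned above boils down to a single flag; the cost then grows with the total corpus size rather than with the size of the increment:

```python
# Works, but recomputes embeddings for ALL documents in the store,
# not only for the newly added ones.
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=True)
```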
Ok, so the issue is reproducible and seems to be strictly related to the use of `overwrite`. Here is the reduced minimal example:

```python
import logging

logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

from haystack import Document
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import DocumentSearchPipeline

document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    embedding_field="embedding",
    embedding_dim=384,
    similarity="cosine",
    duplicate_documents="overwrite",  # using "skip" works, but the meta-data of the documents is not properly stored
)
retriever = EmbeddingRetriever(
    document_store=document_store, embedding_model="sentence-transformers/all-MiniLM-L6-v2", scale_score=False
)
pipe = DocumentSearchPipeline(retriever=retriever)
docs = [Document(f"doc {i}") for i in range(100)]

# First write
print("Docs to write: ", len(docs))
document_store.write_documents(docs)
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)
print("Docs count: ", len(document_store.get_all_documents()))  # expected 100, returned 100
prediction = pipe.run(query="doc 30", params={"Retriever": {"top_k": 10}})
print("Result count: ", len(prediction["documents"]))  # expected 10, returned 10

# Emptying the store here fixes the issue:
# document_store.delete_documents()
# print("Docs count: ", len(document_store.get_all_documents()))  # expected 0, returned 0

# Second write
print("Docs to write: ", len(docs))
document_store.write_documents(docs)
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)
print("Docs count: ", len(document_store.get_all_documents()))  # expected 100, returned 100
prediction = pipe.run(query="doc 30", params={"Retriever": {"top_k": 10}})
print("Result count: ", len(prediction["documents"]))  # expected 10, returned 5
```

Changing `overwrite` to `skip` solves the issue, a sign that there is definitely a bug in how `overwrite` works. I haven't changed any other FAISS parameter, which might also be involved in the failure, nor the model used. Interestingly, emptying the document store in between the two writes also fixes the issue (see the commented code).
However, when using `skip` there is also a bug, and that's why I changed my code. If I remember correctly, meta-data was not stored properly when using `skip`. But this is a different story for another bug report.
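If memory serves, the behaviour looked roughly like the following hypothetical sketch (the `version` meta field is only illustrative, and this would need re-checking before filing the separate report):

```python
from haystack import Document

# Assumption: with duplicate_documents="skip" and the default id_hash_keys,
# the document ID is derived from the content alone, so the second write is
# skipped entirely and the updated meta never reaches the store.
document_store.write_documents([Document(content="doc 0", meta={"version": 1})])
document_store.write_documents([Document(content="doc 0", meta={"version": 2})])
# get_all_documents() would then still report {"version": 1} for this document.
```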
Describe the bug
I'm using the FAISSDocumentStore and its index to handle incremental updates of the document base. For that, I must be able to delete old documents and add new ones. Since updating the embeddings takes some time, I want to update embeddings only for newly added documents. However, after deleting a document from the FAISSDocumentStore and adding a new one, querying the FAISSDocumentStore leads to incomplete query results: instead of returning the top_k documents, fewer documents get returned, although there are enough documents in the FAISSDocumentStore.
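For context, the incremental flow I'm aiming for looks roughly like this sketch, where `old_ids` and `new_docs` are hypothetical placeholders computed elsewhere:

```python
# Remove stale documents by ID, add the new ones, and embed only the
# documents that don't have an embedding yet.
document_store.delete_documents(ids=old_ids)  # old_ids: IDs of outdated documents
document_store.write_documents(new_docs)      # new_docs: the freshly added documents
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)
```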
To Reproduce
As a working example, I have changed the covid-FAQ example a bit (a reduced minimal version appears in the comments above).
Expected behavior
Both printed counts of returned query results should be 10, as specified by the top_k parameter. However, the last one returns only 6.