Incremental update of FAISSDocumentStore leads to incomplete query results #5071

Closed
1 task done
hansblafoo opened this issue Jun 2, 2023 · 4 comments · Fixed by #5221

hansblafoo commented Jun 2, 2023

Describe the bug
I'm using the FAISSDocumentStore and its index to handle incremental updates of the document base. For that, I must be able to delete old documents and add new ones. Since updating the embeddings takes some time, I want to update embeddings only for newly added documents. However, after deleting a document from the FAISSDocumentStore and adding a new document, querying the FAISSDocumentStore yields incomplete results: instead of returning the top_k documents, fewer documents are returned, even though there are enough documents in the FAISSDocumentStore.
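To make the intended workflow explicit, this is roughly the pattern I am trying to implement (a condensed sketch of the full reproduction below; old_doc_ids and new_docs stand in for the application's data):

from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever

document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    embedding_field="embedding",
    embedding_dim=384,
    similarity="cosine",
    duplicate_documents="overwrite",
)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)

# old_doc_ids and new_docs are placeholders for the application's data:
# remove outdated documents, write the new ones, and embed only documents
# that do not have an embedding yet
document_store.delete_documents(ids=old_doc_ids)
document_store.write_documents(new_docs)
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)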

To Reproduce
As a working example, I have changed the covid-FAQ-example a bit:

import logging
import pandas as pd

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

from haystack.telemetry import disable_telemetry

from haystack.document_stores import FAISSDocumentStore


from haystack.nodes import EmbeddingRetriever
from haystack.utils import fetch_archive_from_http

from haystack.pipelines import FAQPipeline
from haystack.utils import print_answers


disable_telemetry()

document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    embedding_field="embedding",
    embedding_dim=384,
    similarity="cosine",
    duplicate_documents="overwrite"  # using skip does not work either, as the metadata of the documents is then not stored properly
)


retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    scale_score=False,
)

# Download
doc_dir = "data/covid"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir, proxies={'http': '127.0.0.1:3129', 'https': '127.0.0.1:3129'})

# Get dataframe with columns "question", "answer" and some custom metadata
df = pd.read_csv(f"{doc_dir}/small_faq_covid.csv")
# Minimal cleaning
df.fillna(value="", inplace=True)
df["question"] = df["question"].apply(lambda x: x.strip())
print(df.head())

df = df.rename(columns={"question": "content"})

# Convert Dataframe to list of dicts and index them in our DocumentStore
docs_to_index = df.to_dict(orient="records")
document_store.write_documents(docs_to_index)

document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)

pipe = FAQPipeline(retriever=retriever)

prediction = pipe.run(query="How is the virus spreading?", params={"Retriever": {"top_k": 10}})

#expected result: 209, returned result: 209
print(len(document_store.get_all_documents()))
#expected result: 10, returned result: 10
print("Result count: " + str(len(prediction['answers'])))

#delete first found document
idToDelete = prediction["documents"][0].id
document_store.delete_documents(ids=[idToDelete])
#expected result: 208, returned result: 208
print(len(document_store.get_all_documents()))

#re-add all documents and update embeddings for new documents
document_store.write_documents(docs_to_index)

document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)

prediction = pipe.run(query="How is the virus spreading?", params={"Retriever": {"top_k": 10}})
#expected result: 10, returned result: 6
print("Result count: " + str(len(prediction['answers'])))


Expected behavior
Both prints of the number of returned query results should show 10, as specified by the top_k parameter. However, the last one shows only 6.

FAQ Check

System:

  • OS: Windows 10
  • GPU/CPU: Intel Core i7-6600, Haystack running CPU-only
  • Haystack version (commit or version number): 1.16.1
  • DocumentStore:
  • Reader:
  • Retriever:
hansblafoo (Author) commented

Btw, if you change the parameter update_existing_embeddings to True, it works flawlessly. However, I do not consider this a viable workaround, as it leads to recalculating the embeddings of the whole document store. In my case, this takes way too long and defeats the purpose of incremental updates/deletes of the document store.
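For reference, the workaround amounts to a single change in the call above (it works, but recomputes embeddings for every document in the store on each run):

# Workaround: re-embed everything instead of only the new documents
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=True)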

@ZanSara ZanSara changed the title Incremental delete and update of FAISSDocumentStore leads to incomplete query results Incremental update of FAISSDocumentStore leads to incomplete query results Jun 21, 2023

ZanSara commented Jun 21, 2023

Ok so the issue is reproducible and seems to be strictly related to the use of overwrite. Here is the reduced minimal example:

import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

from haystack import Document
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import DocumentSearchPipeline


document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    embedding_field="embedding",
    embedding_dim=384,
    similarity="cosine",
    duplicate_documents="overwrite",  # using skip works, but the meta-data of the documents is not properly stored
)
retriever = EmbeddingRetriever(
    document_store=document_store, embedding_model="sentence-transformers/all-MiniLM-L6-v2", scale_score=False
)
pipe = DocumentSearchPipeline(retriever=retriever)
docs = [Document(f"doc {i}") for i in range(100)]

# First write
print("Docs to write: ", len(docs))
document_store.write_documents(docs)
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)
print("Docs count: ", len(document_store.get_all_documents()))  # expected 100, returned 100

prediction = pipe.run(query="doc 30", params={"Retriever": {"top_k": 10}})
print("Result count: ", len(prediction["documents"]))  # expected 10, returned 10

# document_store.delete_documents()
# print("Docs count: ", len(document_store.get_all_documents()))  # expected 0, returned 0

# Second write
print("Docs to write: ", len(docs))
document_store.write_documents(docs)
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)
print("Docs count: ", len(document_store.get_all_documents()))  # expected 100, returned 100

prediction = pipe.run(query="doc 30", params={"Retriever": {"top_k": 10}})
print("Result count: ", len(prediction["documents"]))  # expected 10, returned 5

Changing overwrite to skip solves the issue, a sign that there is definitely a bug in how overwrite works. I haven't changed any other FAISS parameter or the model used, although those might also be involved in the failure.

Interestingly, emptying the document store in between the two writes also fixes the issue (see commented code).
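Concretely, either of the following variations of the snippet above avoids the incomplete results (these are sketches of the two workarounds, not a fix for the underlying bug):

# Variation 1: create the store with "skip" instead of "overwrite"
document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    embedding_field="embedding",
    embedding_dim=384,
    similarity="cosine",
    duplicate_documents="skip",  # works here, but see the note on metadata above
)

# Variation 2: keep "overwrite", but empty the store between the two writes
document_store.delete_documents()
document_store.write_documents(docs)
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)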


hansblafoo commented Jun 21, 2023 via email

anakin87 (Member) commented

Hey @ZanSara...
It seems very similar to #2834.
