Incremental update of FAISSDocumentStore leads to incomplete query results #5071

Closed
1 task done
hansblafoo opened this issue Jun 2, 2023 · 4 comments · Fixed by #5221

hansblafoo commented Jun 2, 2023

Describe the bug
I'm using the FAISSDocumentStore and its index to handle incremental updates of the document base. For that, I must be able to delete old documents and add new ones. Since updating the embeddings takes some time, I want to update embeddings only for newly added documents. However, after deleting a document from the FAISSDocumentStore and adding a new document, querying the FAISSDocumentStore yields incomplete results: instead of returning the top_k documents, fewer documents are returned, even though there are enough documents in the FAISSDocumentStore.
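To make the intended workflow explicit, this is roughly the pattern I am trying to implement (a condensed sketch of the full reproduction below; old_doc_ids and new_docs stand in for the application's data):

from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever

document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    embedding_field="embedding",
    embedding_dim=384,
    similarity="cosine",
    duplicate_documents="overwrite",
)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)

# old_doc_ids and new_docs are placeholders for the application's data:
# remove outdated documents, write the new ones, and embed only documents
# that do not have an embedding yet
document_store.delete_documents(ids=old_doc_ids)
document_store.write_documents(new_docs)
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)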

To Reproduce
As a working example, I have changed the covid-FAQ-example a bit:

import logging
import pandas as pd

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

from haystack.telemetry import disable_telemetry

from haystack.document_stores import FAISSDocumentStore


from haystack.nodes import EmbeddingRetriever
from haystack.utils import fetch_archive_from_http

from haystack.pipelines import FAQPipeline
from haystack.utils import print_answers


disable_telemetry()

document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    embedding_field="embedding",
    embedding_dim=384,
    similarity="cosine",
    duplicate_documents="overwrite"  # using skip does not work either, as the metadata of the documents is then not stored properly
)


retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    scale_score=False,
)

# Download
doc_dir = "data/covid"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir, proxies={'http': '127.0.0.1:3129', 'https': '127.0.0.1:3129'})

# Get dataframe with columns "question", "answer" and some custom metadata
df = pd.read_csv(f"{doc_dir}/small_faq_covid.csv")
# Minimal cleaning
df.fillna(value="", inplace=True)
df["question"] = df["question"].apply(lambda x: x.strip())
print(df.head())

df = df.rename(columns={"question": "content"})

# Convert Dataframe to list of dicts and index them in our DocumentStore
docs_to_index = df.to_dict(orient="records")
document_store.write_documents(docs_to_index)

document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)

pipe = FAQPipeline(retriever=retriever)

prediction = pipe.run(query="How is the virus spreading?", params={"Retriever": {"top_k": 10}})

#expected result: 209, returned result: 209
print(len(document_store.get_all_documents()))
#expected result: 10, returned result: 10
print("Result count: " + str(len(prediction['answers'])))

#delete first found document
idToDelete = prediction["documents"][0].id
document_store.delete_documents(ids=[idToDelete])
#expected result: 208, returned result: 208
print(len(document_store.get_all_documents()))

#re-add all documents and update embeddings for new documents
document_store.write_documents(docs_to_index)

document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)

prediction = pipe.run(query="How is the virus spreading?", params={"Retriever": {"top_k": 10}})
#expected result: 10, returned result: 6
print("Result count: " + str(len(prediction['answers'])))


Expected behavior
Both prints of the number of returned query results should show 10, as specified by the top_k parameter. However, the last one shows only 6.

FAQ Check

System:

  • OS: Windows 10
  • GPU/CPU: Intel Core i7-6600, Haystack running CPU-only
  • Haystack version (commit or version number): 1.16.1
  • DocumentStore:
  • Reader:
  • Retriever:
hansblafoo (Author) commented

Btw, if you change the parameter update_existing_embeddings to True, it works flawlessly. However, I do not consider this a viable workaround, as it leads to recalculating the embeddings of the whole document store. In my case, this takes way too long and defeats the purpose of incremental updates/deletes of the document store.
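For reference, the workaround amounts to a single change in the call above (it works, but recomputes embeddings for every document in the store on each run):

# Workaround: re-embed everything instead of only the new documents
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=True)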

@ZanSara ZanSara changed the title Incremental delete and update of FAISSDocumentStore leads to incomplete query results Incremental update of FAISSDocumentStore leads to incomplete query results Jun 21, 2023

ZanSara commented Jun 21, 2023

Ok so the issue is reproducible and seems to be strictly related to the use of overwrite. Here is the reduced minimal example:

import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

from haystack import Document
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import DocumentSearchPipeline


document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    embedding_field="embedding",
    embedding_dim=384,
    similarity="cosine",
    duplicate_documents="overwrite",  # using skip works, but the meta-data of the documents is not properly stored
)
retriever = EmbeddingRetriever(
    document_store=document_store, embedding_model="sentence-transformers/all-MiniLM-L6-v2", scale_score=False
)
pipe = DocumentSearchPipeline(retriever=retriever)
docs = [Document(f"doc {i}") for i in range(100)]

# First write
print("Docs to write: ", len(docs))
document_store.write_documents(docs)
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)
print("Docs count: ", len(document_store.get_all_documents()))  # expected 100, returned 100

prediction = pipe.run(query="doc 30", params={"Retriever": {"top_k": 10}})
print("Result count: ", len(prediction["documents"]))  # expected 10, returned 10

# document_store.delete_documents()
# print("Docs count: ", len(document_store.get_all_documents()))  # expected 0, returned 0

# Second write
print("Docs to write: ", len(docs))
document_store.write_documents(docs)
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)
print("Docs count: ", len(document_store.get_all_documents()))  # expected 100, returned 100

prediction = pipe.run(query="doc 30", params={"Retriever": {"top_k": 10}})
print("Result count: ", len(prediction["documents"]))  # expected 10, returned 5

Changing overwrite to skip solves the issue, a sign that there is definitely a bug in how overwrite works. I haven't changed any other FAISS parameter or the model used, although those might also be involved in the failure.

Interestingly, emptying the document store in between the two writes also fixes the issue (see commented code).
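Concretely, either of the following variations of the snippet above avoids the incomplete results (these are sketches of the two workarounds, not a fix for the underlying bug):

# Variation 1: create the store with "skip" instead of "overwrite"
document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    embedding_field="embedding",
    embedding_dim=384,
    similarity="cosine",
    duplicate_documents="skip",  # works here, but see the note on metadata above
)

# Variation 2: keep "overwrite", but empty the store between the two writes
document_store.delete_documents()
document_store.write_documents(docs)
document_store.update_embeddings(retriever=retriever, update_existing_embeddings=False)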


hansblafoo commented Jun 21, 2023 via email

anakin87 (Member) commented

Hey @ZanSara...
It seems very similar to #2834.
