
Embedding Dimension of FAISSDocumentStore not working #3090

Closed
kalki7 opened this issue Aug 23, 2022 · 8 comments · Fixed by #3183
Labels
Contributions wanted! Looking for external contributions topic:document_store

Comments

@kalki7
Contributor

kalki7 commented Aug 23, 2022

Describe the bug
When the FAISSDocumentStore is initialized, the embedding_dim is used to create a FAISS index of the respective dimension, but when the embeddings are generated at line 363 of faiss.py

embeddings = retriever.embed_documents(document_batch)

the embedding_dim is not carried over, which causes the following AssertionError while trying to add the embeddings to the FAISS index (created with the given embedding_dim) at line 371 of faiss.py

self.faiss_indexes[index].add(embeddings_to_index)

The default embedding_dim is 768, which also throws the same assertion error. The embeddings from retriever.embed_documents(document_batch) are of size 384. Thus, when the document store is initialized as follows with embedding_dim=384, it works as expected:

document_store_faiss = FAISSDocumentStore(faiss_index_factory_str="Flat", embedding_dim=384, similarity="cosine", return_embedding=True)
Upon further inspection, I noticed that

document_store_faiss.update_embeddings(retriever=retriever_faiss)

calls the function that generates the embeddings for each document with the respective sentence-transformer at line 363 of faiss.py:

embeddings = retriever.embed_documents(document_batch)

The length of these generated embeddings (384, as that's the output size of the chosen model) doesn't match the embedding_dim that was used to initialize the FAISS index. Thus, when the FAISSDocumentStore is initialized with embedding_dim=384, the desired output is achieved.

I'm not sure if I'm doing something wrong or if it's an actual issue; please advise.
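For what it's worth, the same assertion can be reproduced directly at the FAISS level; a minimal sketch, assuming faiss-cpu and numpy are installed:

import faiss
import numpy as np

index = faiss.IndexFlat(768)  # index built for 768-dim vectors, like the default embedding_dim
vectors = np.random.rand(4, 384).astype("float32")  # 384-dim embeddings, like all-MiniLM-L6-v2 output
index.add(vectors)  # raises AssertionError (d == self.d) in replacement_add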

Error message
Traceback (most recent call last):
File ".\test.py", line 17, in
document_store_faiss.update_embeddings(retriever=retriever_faiss)
File "C:\Users\Employee\Downloads\iCog\elastic\ttt\haystack\haystack\document_stores\faiss.py", line 371, in update_embeddings
self.faiss_indexes[index].add(embeddings_to_index)
File "C:\Users\Employee\anaconda3\envs\faiss\lib\site-packages\faiss_init_.py", line 247, in replacement_add
assert d == self.d
AssertionError

To Reproduce

from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever

document_store_faiss = FAISSDocumentStore(faiss_index_factory_str="Flat", similarity="cosine", return_embedding=True)
document_store_faiss.write_documents(diction)  # diction: the list of documents to index
retriever_faiss = EmbeddingRetriever(document_store_faiss, embedding_model="all-MiniLM-L6-v2", model_format="sentence_transformers")
document_store_faiss.update_embeddings(retriever=retriever_faiss)

System:

  • OS: Windows 11
  • GPU/CPU: CPU
  • Haystack version (commit or version number): 1.7.2rc0
  • DocumentStore: FAISSDocumentStore
@bglearning
Contributor

Hi @kalki7,

Ah yes, I think you have it right: the embedding_dim set during FAISSDocumentStore initialization must match the dimension produced by the retriever model that will later update the embeddings. So one must know it at initialization time.

@bogdankostic Is the above understanding correct? I guess it means we can't arbitrarily use any model in the retriever once the store is initialized, which is probably not the end of the world but maybe inconvenient. Or is there an approach to make this a non-issue?

@bogdankostic
Contributor

Yes, the dimension of the embeddings that will be produced by the retriever must be known prior to initializing the document store. If we later want to switch to an embedding model that uses a different embedding size, we need to re-instantiate the document store.
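One way to avoid guessing the dimension up front is to ask the model for it before creating the store; a minimal sketch, assuming the sentence-transformers package is installed:

from sentence_transformers import SentenceTransformer
from haystack.document_stores import FAISSDocumentStore

# Look up the model's output dimension (384 for all-MiniLM-L6-v2) and pass it to the store
dim = SentenceTransformer("all-MiniLM-L6-v2").get_sentence_embedding_dimension()
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", embedding_dim=dim, similarity="cosine")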

@kalki7
Contributor Author

kalki7 commented Aug 24, 2022

Understood @bglearning, so essentially it's a mismatch. Would it be okay for me to raise a PR just to add a small check to ensure that the dimension of the embeddings is the same as the embedding_dim of the initialized FAISS index?

@bglearning
Contributor

bglearning commented Aug 25, 2022

@kalki7 Ya, better error messaging is always good. But before you start the PR, it's probably best to align on: where do you propose putting the check, and what error message would best communicate the situation?

@kalki7
Contributor Author

kalki7 commented Aug 25, 2022

@bglearning I would suggest adding a check after the embeddings of the first document_batch are generated in the update_embeddings function. We can check whether the shape of the FAISS index matches the shape of the embeddings, and raise a ValueError if not, stating:
"The shape of the FAISS index doesn't match the shape of the embeddings from the transformer. Please re-initialize your FAISSDocumentStore with embedding_dim={shape_of_embeddings}."

@bglearning
Contributor

So after faiss.py#L363? I guess that makes sense. I tried to see if there's a way to know the embedding_dim of the retriever before then (an early exit would be great), but it doesn't seem apparent. @bogdankostic What do you think? i) Is it worth having the check + message? ii) Where is it best to put it?

@bogdankostic
Contributor

I think having the check + an informative message makes sense :) We could also add this to the other DocumentStores.

@ZanSara ZanSara added the Contributions wanted! Looking for external contributions label Sep 5, 2022
kalki7 added a commit to kalki7/haystack that referenced this issue Sep 8, 2022
@kalki7
Contributor Author

kalki7 commented Sep 8, 2022

Understood, I'll raise a PR

kalki7 added a commit to kalki7/haystack that referenced this issue Sep 10, 2022
kalki7 added a commit to kalki7/haystack that referenced this issue Sep 14, 2022
masci pushed a commit that referenced this issue Sep 15, 2022
* match index dim with embed dim (#3090)

* aligned messages across all docstores

* aligned messages across all docstores (#3090)

* aligned messages across all docstores (#3090)
brandenchan pushed a commit that referenced this issue Sep 21, 2022