
Embedding Dimension of FAISSDocumentStore not working #3090

Closed
kalki7 opened this issue Aug 23, 2022 · 8 comments · Fixed by #3183
Labels
Contributions wanted! Looking for external contributions topic:document_store

Comments

@kalki7
Contributor

kalki7 commented Aug 23, 2022

Describe the bug
When the FAISSDocumentStore is initialized, the embedding_dim is used to create a FAISS index of the respective dimension, but when the embeddings are generated at line 363 of faiss.py

embeddings = retriever.embed_documents(document_batch)

the embedding_dim is not carried over, which causes the following AssertionError while trying to add the embeddings to the FAISS index (created with the given embedding_dim) at line 371 of faiss.py

self.faiss_indexes[index].add(embeddings_to_index)

The default embedding_dim is 768, which also throws the same assertion error. The embeddings from retriever.embed_documents(document_batch) are of size 384. Thus, when the document store is initialized as follows with embedding_dim=384, it works as expected:

document_store_faiss = FAISSDocumentStore(faiss_index_factory_str="Flat", embedding_dim=384, similarity="cosine", return_embedding=True)
Upon further inspection, I noticed that

document_store_faiss.update_embeddings(retriever=retriever_faiss)

calls the function that generates the embeddings for each document with the respective sentence-transformer at line 363 of faiss.py:

embeddings = retriever.embed_documents(document_batch)

The length of these generated embeddings (384, as that's the output size of the chosen model) doesn't match the embedding_dim that was used to initialize the FAISS index. Thus, when the FAISSDocumentStore is initialized with embedding_dim=384, the desired output is achieved.

I'm not sure if I'm doing something wrong or if it's an actual issue; please advise.
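For what it's worth, the same assertion can be reproduced directly at the FAISS level; a minimal sketch, assuming faiss-cpu and numpy are installed:

import faiss
import numpy as np

index = faiss.IndexFlat(768)  # index built for 768-dim vectors, like the default embedding_dim
vectors = np.random.rand(4, 384).astype("float32")  # 384-dim embeddings, like all-MiniLM-L6-v2 output
index.add(vectors)  # raises AssertionError (d == self.d) in replacement_add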

Error message
Traceback (most recent call last):
File ".\test.py", line 17, in
document_store_faiss.update_embeddings(retriever=retriever_faiss)
File "C:\Users\Employee\Downloads\iCog\elastic\ttt\haystack\haystack\document_stores\faiss.py", line 371, in update_embeddings
self.faiss_indexes[index].add(embeddings_to_index)
File "C:\Users\Employee\anaconda3\envs\faiss\lib\site-packages\faiss_init_.py", line 247, in replacement_add
assert d == self.d
AssertionError

To Reproduce

from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever

document_store_faiss = FAISSDocumentStore(faiss_index_factory_str="Flat", similarity="cosine", return_embedding=True)
document_store_faiss.write_documents(diction)  # diction: the list of documents to index
retriever_faiss = EmbeddingRetriever(document_store_faiss, embedding_model="all-MiniLM-L6-v2", model_format="sentence_transformers")
document_store_faiss.update_embeddings(retriever=retriever_faiss)

System:

  • OS: Windows 11
  • GPU/CPU: CPU
  • Haystack version (commit or version number): 1.7.2rc0
  • DocumentStore: FAISSDocumentStore
@bglearning
Contributor

Hi @kalki7,

Ah yes, I think you have it right: the embedding_dim set during FAISSDocumentStore initialization must match the dimension produced by the retriever model that will later update the embeddings. So one must know it at initialization time.

@bogdankostic Is the above understanding correct? I guess it means we can't arbitrarily use any model in the retriever once the store is initialized, which is probably not the end of the world but maybe inconvenient. Or is there an approach to make this a non-issue?

@bogdankostic
Contributor

Yes, the dimension of the embeddings that will be produced by the retriever must be known prior to initializing the document store. If we later want to switch to an embedding model that uses a different embedding size, we need to re-instantiate the document store.
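One way to avoid guessing the dimension up front is to ask the model for it before creating the store; a minimal sketch, assuming the sentence-transformers package is installed:

from sentence_transformers import SentenceTransformer
from haystack.document_stores import FAISSDocumentStore

# Look up the model's output dimension (384 for all-MiniLM-L6-v2) and pass it to the store
dim = SentenceTransformer("all-MiniLM-L6-v2").get_sentence_embedding_dimension()
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", embedding_dim=dim, similarity="cosine")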

@kalki7
Contributor Author

kalki7 commented Aug 24, 2022

Understood @bglearning, so essentially it's a mismatch. Would it be okay for me to raise a PR just to add a small check to ensure that the dimension of the embeddings is the same as the embedding_dim of the initialized FAISS index?

@bglearning
Contributor

bglearning commented Aug 25, 2022

@kalki7 Ya, better error messaging is always good. But before you start the PR, it's probably best to align on: where do you propose putting the check, and what error message would best communicate the situation?

@kalki7
Contributor Author

kalki7 commented Aug 25, 2022

@bglearning I would suggest adding a check after the embeddings of the first document_batch are generated in the update_embeddings function. We can check whether the shape of the FAISS index matches the shape of the embeddings, and raise a ValueError if not, stating:
"The shape of the FAISS index doesn't match the shape of the embeddings from the transformer. Please re-initialize your FAISSDocumentStore with embedding_dim={shape_of_embeddings}."

@bglearning
Contributor

So after faiss.py#L363? I guess that makes sense. I tried to see if there's a way to know the embedding_dim of the retriever before then (an early exit would be great), but it doesn't seem apparent. @bogdankostic What do you think? i) Is it worth having the check + message? ii) Where is it best to put it?

@bogdankostic
Contributor

I think having the check + an informative message makes sense :) We could also add this to the other DocumentStores.

@ZanSara ZanSara added the Contributions wanted! Looking for external contributions label Sep 5, 2022
kalki7 added a commit to kalki7/haystack that referenced this issue Sep 8, 2022
@kalki7
Contributor Author

kalki7 commented Sep 8, 2022

Understood, I'll raise a PR

kalki7 added a commit to kalki7/haystack that referenced this issue Sep 10, 2022
kalki7 added a commit to kalki7/haystack that referenced this issue Sep 14, 2022
masci pushed a commit that referenced this issue Sep 15, 2022
* match index dim with embed dim (#3090)

* aligned messages across all docstores

* aligned messages across all docstores (#3090)

* aligned messages across all docstores (#3090)
brandenchan pushed a commit that referenced this issue Sep 21, 2022