Embedding Dimension of FAISSDocumentStore not working #3090
Comments
Hi @kalki7, Ah yes, I think you have it correct that the @bogdankostic Is the above understanding correct? I guess it means we can't arbitrarily use any model in the retriever once the store is initialized, which is probably not the end of the world but may be inconvenient. Or is there an approach to make this a non-issue?
Yes, the dimension of the embeddings that the retriever will produce must be known before initializing the document store. If we later want to switch to an embedding model with a different embedding size, we would need to reinstantiate the document store.
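One way to learn the dimension before creating the store is to embed a single probe input and read the length of the result. The sketch below is only an illustration under assumptions: `probe_embedding_dim` and `fake_embed` are hypothetical names, and `embed_fn` stands for whatever callable wraps the real retriever. It is not part of Haystack.

```python
from typing import Callable, List, Sequence


def probe_embedding_dim(
    embed_fn: Callable[[List[str]], Sequence[Sequence[float]]],
    probe_text: str = "dimension probe",
) -> int:
    """Embed one probe text and return the length of the resulting vector.

    `embed_fn` is a hypothetical callable that maps a list of texts to a
    sequence of vectors (nested lists or a 2D array both work).
    """
    vectors = embed_fn([probe_text])
    if len(vectors) != 1:
        raise ValueError(f"expected 1 vector for 1 probe text, got {len(vectors)}")
    return len(vectors[0])


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in embedder (assumption): produces 384-dimensional zero vectors."""
    return [[0.0] * 384 for _ in texts]


dim = probe_embedding_dim(fake_embed)
```

The value returned here could then be passed as `embedding_dim` when the document store is initialized, so the index and the model agree from the start.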
Understood @bglearning, so essentially it's a mismatch. Would it be okay for me to raise a PR just to add a small check for this, to ensure that the dimension of the embeddings is the same as the embedding_dim of the initialized FAISS index?
@kalki7 Ya, better error messaging is always good. But before you start the PR, it's probably best to align on: where do you propose putting the check, and what error message might best communicate the situation?
@bglearning I would think adding a check after the embeddings of the first document_batch are generated in the
So after faiss.py#L363? I guess that makes sense. I tried seeing if there is a way to know the embedding_dim of the retriever before then (early exit would be great) but it doesn't seem apparent. @bogdankostic What do you think? i) Is it worth having the check+message? ii) Where best to put it?
I think having the check + an informative message makes sense :) We could also add this to the other DocumentStores.
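A check of the kind discussed above could look roughly like the following. This is a minimal sketch, not the code from any actual PR: `check_embedding_dim` is a hypothetical helper name, and the error wording is only a suggestion. It accepts nested lists or a 2D array, since it relies on nothing but `len()`.

```python
from typing import Sequence


def check_embedding_dim(embeddings: Sequence[Sequence[float]], expected_dim: int) -> None:
    """Raise a descriptive error if any embedding does not match the index dimension.

    `expected_dim` is the embedding_dim the document store was initialized with.
    """
    for position, vector in enumerate(embeddings):
        actual_dim = len(vector)
        if actual_dim != expected_dim:
            raise ValueError(
                f"Embedding at position {position} has dimension {actual_dim}, "
                f"but the document store was initialized with "
                f"embedding_dim={expected_dim}. Re-initialize the document store "
                f"with embedding_dim={actual_dim} or use a retriever whose model "
                f"outputs {expected_dim}-dimensional embeddings."
            )


# Example: a batch of 384-dimensional vectors checked against a 768-dimensional index.
batch = [[0.0] * 384 for _ in range(2)]
try:
    check_embedding_dim(batch, expected_dim=768)
    error_message = ""
except ValueError as err:
    error_message = str(err)
```

Raising a `ValueError` with both numbers in the message turns the bare `AssertionError` into something the user can act on immediately.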
Understood, I'll raise a PR |
Describe the bug
When the FAISSDocumentStore is initialized, the embedding_dim is used to create a FAISS index of that dimension, but when the embeddings are generated at line 363, faiss.py
embeddings = retriever.embed_documents(document_batch)
the embedding_dim is not carried over, which causes the following AssertionError when adding the embeddings to the FAISS index, which was created with the given embedding_dim, at line 371, faiss.py
self.faiss_indexes[index].add(embeddings_to_index)
The default embedding_dim is 768, which also throws the same assertion error. The embeddings from retriever.embed_documents(document_batch) are of size 384. Thus, when the document store is initialized as follows with embedding_dim=384, it works as expected
document_store_faiss = FAISSDocumentStore(faiss_index_factory_str="Flat", embedding_dim=384, similarity="cosine", return_embedding=True)
Upon further inspection, I noticed that the call
document_store_faiss.update_embeddings(retriever=retriever_faiss)
invokes the function that generates the embeddings for each document with the respective sentence-transformer at line 363, faiss.py
embeddings = retriever.embed_documents(document_batch)
The length of these generated embeddings (384, as that's the output size of the chosen model) doesn't match the embedding_dim which was used to initialize the FAISS index. Thus, when the FAISSDocumentStore is initialized with embedding_dim=384, the desired output is achieved.
I'm not sure if I'm doing something wrong or if it's an actual issue, please do advise on the same.
Error message
Traceback (most recent call last):
  File ".\test.py", line 17, in <module>
    document_store_faiss.update_embeddings(retriever=retriever_faiss)
  File "C:\Users\Employee\Downloads\iCog\elastic\ttt\haystack\haystack\document_stores\faiss.py", line 371, in update_embeddings
    self.faiss_indexes[index].add(embeddings_to_index)
  File "C:\Users\Employee\anaconda3\envs\faiss\lib\site-packages\faiss\__init__.py", line 247, in replacement_add
    assert d == self.d
AssertionError
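The `assert d == self.d` in the traceback compares the dimension of the incoming vectors with the dimension the index was created with. The toy class below mimics that behaviour so the failure can be seen without installing anything. `ToyFlatIndex` is a hypothetical stand-in written for illustration, not the FAISS API.

```python
class ToyFlatIndex:
    """Hypothetical stand-in for a fixed-dimension vector index (not FAISS itself)."""

    def __init__(self, d: int) -> None:
        self.d = d          # dimension fixed at creation time
        self.vectors = []   # stored vectors

    def add(self, vectors) -> None:
        # Validate the whole batch first, then store it.
        for vector in vectors:
            assert len(vector) == self.d
        self.vectors.extend(list(vector) for vector in vectors)


# A batch of two 384-dimensional vectors, like the output of the chosen model.
batch = [[0.0] * 384 for _ in range(2)]

# Index created with the default dimension of 768: adding the batch fails.
index_768 = ToyFlatIndex(768)
try:
    index_768.add(batch)
    mismatch_raised = False
except AssertionError:
    mismatch_raised = True

# Index created with the matching dimension of 384: adding the batch succeeds.
index_384 = ToyFlatIndex(384)
index_384.add(batch)
```

As in the report, the only difference between the failing and the working case is the dimension given when the index is created.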
To Reproduce
System: