Updating Embeddings for Faiss DPR on Large Dataset (Batchmode) #601
Comments
Hey @vinchg, thanks for raising this issue. It's highly relevant. So in general: for larger datasets like this, we would recommend not using SQLite but rather Postgres, and then connecting in Python via:
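A minimal sketch of what that could look like (assuming a local Postgres database named `haystack` already exists and a Postgres driver such as `psycopg2` is installed; parameter names can differ slightly between Haystack versions):

```python
from haystack.document_store.faiss import FAISSDocumentStore

# Point the SQL backend at Postgres instead of the default SQLite file.
# The connection string follows the standard SQLAlchemy format:
#   postgresql://<user>:<password>@<host>:<port>/<database>
document_store = FAISSDocumentStore(
    sql_url="postgresql://postgres:password@localhost:5432/haystack",
    faiss_index_factory_str="HNSW",  # keep the HNSW index used in this issue
)
```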
However, this will probably only increase memory efficiency but not resolve the underlying problem.
Do you already have any embeddings in the document store, or is it a fresh one?
As a temporary workaround, you could generate the embeddings in batches yourself before calling `write_documents()`, along the lines of the sketch below.
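A rough sketch of that workaround (assuming `all_docs` is your full list of document dicts and `retriever` is the DPR retriever; the exact `embed_passages()` signature varies between versions and may expect `Document` objects rather than raw texts):

```python
batch_size = 10_000

for i in range(0, len(all_docs), batch_size):
    batch = all_docs[i:i + batch_size]
    # Compute DPR passage embeddings for this batch only.
    embeddings = retriever.embed_passages([d["text"] for d in batch])
    # Attach each embedding to its document dict.
    for doc, emb in zip(batch, embeddings):
        doc["embedding"] = emb
    # Flush the batch to the document store so the whole dataset
    # never has to sit in memory at once.
    document_store.write_documents(batch)
```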
Again, we should in any case add the batch functionality to `update_embeddings()` soon. @lalitpagaria, would this maybe be something that is of interest to you and that you would like to work on? If not, @tanaysoni can take care of it.
@tholor I would like to work on it, but only after a week. So if it is not something urgent, I can take it up.
Ok awesome! That sounds perfectly fine. Thank you @lalitpagaria :)
Thank you for the response. I'll give Postgres a test.
It is a fresh one. I also want to mention another issue I found, unrelated to the db. I pruned my dataset to 1.3 million documents and reran the original configuration with SQLite. I added tqdm to the tokenization step and it's giving me a ~40 hour estimate (around 9 secs per iteration) just to tokenize the dataset. *edit: Comparatively, I tried tokenizing the same dataset outside of the API and it took 17 minutes. I'm working around it currently by calculating my own embeddings and adding them to the docs, but this might also be something to look into.
@vinchg Not sure if you saw this, but we fixed this one in deepset-ai/FARM#638. A loop with O(n²) was causing the trouble... Memory efficiency should soon be improved by #620, and we'll afterwards also introduce a batch mode for `update_embeddings()`.
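(Just to illustrate the class of bug, not the actual FARM code: rebuilding a list inside a loop makes the whole pass quadratic in the number of items, while appending in place keeps it linear.)

```python
batches = [[i] * 100 for i in range(1_000)]  # toy data

# O(n^2): each iteration copies the entire accumulated list so far.
all_samples = []
for batch in batches:
    all_samples = all_samples + batch

# O(n): extend() appends in place without re-copying previous items.
all_samples = []
for batch in batches:
    all_samples.extend(batch)
```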
Thank you guys! I'm a bit busy with other things atm, so I'm not sure when I'll get around to testing, but when I do, I'll post my results.
Question
As mentioned in the title, I'm trying to update the embeddings for Faiss (HNSW) with a DPR retriever on a large dataset (~15 million documents). Following the tutorial steps, I'm writing the documents to a local SQLite db (~15 GB .db file). However, calling update_embeddings uses all my RAM (64 GB) and all my swap (64 GB) and proceeds to run out of memory after running for hours. I'm fairly certain the line that's consuming so much memory is:
haystack/haystack/document_store/faiss.py, line 158 (commit 3f81c93)
More specifically:
haystack/haystack/document_store/sql.py, line 124 (commit 3f81c93)
The call to query.all() is what's causing the issue.
I'm not very familiar with SQLAlchemy and SQLite in general, but it seems that there is some inefficient memory usage when querying the DB. Is there an alternative way of dealing with large datasets like this? It would be preferable if there were a batch option for update_embeddings so that a group of embeddings would be flushed to disk before proceeding.
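For illustration, the generic SQLAlchemy pattern that avoids materializing the whole result set is yield_per(), which streams rows in fixed-size chunks instead of loading everything up front like query.all(). This is not Haystack's actual code; the toy model below is purely hypothetical:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class DocumentRow(Base):  # hypothetical stand-in for the ORM class in sql.py
    __tablename__ = "document"
    id = Column(Integer, primary_key=True)
    text = Column(String)

engine = create_engine("sqlite:///haystack_test.db")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

batch_size = 10_000

# yield_per() keeps roughly batch_size rows in memory at a time,
# unlike query.all(), which loads every row before returning anything.
for row in session.query(DocumentRow).yield_per(batch_size):
    pass  # e.g. convert the row to a Document and embed it batch by batch
```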
On a side note, some type of progress indicator or option to enable one when calling write_documents or update_embeddings would be useful (considering these operations can take an hour or more).