Updating Embeddings for Faiss DPR on Large Dataset (Batchmode) #601

Closed
vinchg opened this issue Nov 17, 2020 · 8 comments
Labels: type:feature (New feature or request)

vinchg commented Nov 17, 2020

Question
As mentioned in the title, I'm trying to update the embeddings for Faiss (HNSW) with a DPR retriever on a large dataset (~15 million documents). Following the tutorial steps, I'm writing the documents to a local SQLite database (a ~15 GB .db file). However, calling update_embeddings uses all my RAM (64 GB) and all my swap (64 GB), and the process runs out of memory after running for hours. I'm fairly certain the line that's consuming so much memory is:

documents = self.get_all_documents(index=index)

More specifically:

documents = [self._convert_sql_row_to_document(row) for row in query.all()]

The call to query.all() is what's causing the issue.

I'm not very familiar with SQLAlchemy or SQLite in general, but it seems that querying the DB this way makes inefficient use of memory. Is there an alternative way of dealing with large datasets like this? It would be preferable if update_embeddings had a batch option, so that a group of embeddings would be flushed to disk before proceeding.
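The batch-wise reading asked for above can be sketched with the standard-library sqlite3 module. This is only an illustration of the idea, not Haystack's implementation: the `document` table, its `id` and `text` columns, and the `iter_document_batches` helper are all hypothetical names. With SQLAlchemy, `Query.yield_per()` may serve a similar purpose.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_document_batches(
    conn: sqlite3.Connection, batch_size: int = 10_000
) -> Iterator[List[Tuple[str, str]]]:
    """Yield (id, text) rows in fixed-size batches instead of loading them all at once."""
    # Hypothetical schema: a `document` table with `id` and `text` columns.
    cursor = conn.execute("SELECT id, text FROM document ORDER BY id")
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# Tiny in-memory demo so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id TEXT PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO document VALUES (?, ?)",
    [(str(i).zfill(2), f"passage {i}") for i in range(10)],
)

# Only one batch of rows is held in memory at a time.
batch_sizes = [len(batch) for batch in iter_document_batches(conn, batch_size=4)]
```

Each batch could then be embedded and written out before the next one is fetched, which keeps peak memory proportional to the batch size rather than to the whole table.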

On a side note, some type of progress indicator or option to enable one when calling write_documents or update_embeddings would be useful (considering these operations can take an hour or more).

tholor (Member) commented Nov 18, 2020

Hey @vinchg ,

Thanks for raising this issue. It's highly relevant.

So in general: for larger datasets like this, we would recommend not using SQLite but rather Postgres.
You can easily spin one up via Docker:

docker run --name haystack-postgres -p 5432:5432 -e POSTGRES_PASSWORD=password -d postgres
docker exec -it haystack-postgres psql -U postgres -c "CREATE DATABASE haystack;"

and then connect in Python via:

document_store = FAISSDocumentStore(
    sql_url="postgresql://postgres:password@localhost:5432/haystack",
    faiss_index_factory_str=index_type,
)

However, this will probably only increase memory efficiency but not resolve the underlying problem.
I totally agree that we should add some batch functionality to update_embeddings and a tqdm progress bar.

The call to query.all() is what's causing the issue.

Do you already have any embeddings in the document store or is that a fresh one?

Is there an alternative way of dealing with large datasets like this?

As a temporary workaround, you could generate the embeddings in batches yourself before calling write_documents() and attach them to your documents before writing to the DocumentStore. Rough sketch:

from haystack import Document

# `retriever` and `document_store` are assumed to have been created already
dicts = [{"text": "some text"}]  # ... one dict per document
docs = [Document.from_dict(d) for d in dicts]

# get the embeddings one batch of docs at a time
batch_size = 32
for i in range(0, len(docs), batch_size):
    batch_docs = docs[i:i + batch_size]
    batch_emb = retriever.embed_passages(batch_docs)

    # attach embeddings to the docs
    for emb, doc in zip(batch_emb, batch_docs):
        doc.embedding = emb

# later: write everything to the doc store
document_store.write_documents(docs)

Again, we should in any case add the batch functionality soon in update_embeddings() to simplify usage here ...
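A batch mode along these lines might look roughly like the sketch below. It is not the eventual Haystack API: `batched`, `embed_and_write_in_batches`, `embed_fn` and `write_fn` are hypothetical names, and a toy embedder and an in-memory list stand in for the retriever and the document store so the example runs on its own.

```python
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group any iterable into lists of at most `batch_size` items."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def embed_and_write_in_batches(
    docs: Iterable[dict],
    embed_fn: Callable[[List[dict]], List[List[float]]],
    write_fn: Callable[[List[dict]], None],
    batch_size: int = 32,
) -> int:
    """Embed and persist one batch at a time; returns the number of docs written."""
    written = 0
    for batch in batched(docs, batch_size):
        embeddings = embed_fn(batch)
        for doc, emb in zip(batch, embeddings):
            doc["embedding"] = emb
        write_fn(batch)  # flush this batch before touching the next one
        written += len(batch)
    return written


# Stand-ins (assumptions): a toy embedder and a list acting as the store.
def fake_embed(batch: List[dict]) -> List[List[float]]:
    return [[float(len(d["text"]))] for d in batch]


store: List[dict] = []
docs = ({"text": f"passage {i}"} for i in range(70))  # generator, never fully materialised
total = embed_and_write_in_batches(docs, fake_embed, store.extend, batch_size=32)
```

Because the documents arrive through a generator and each batch is written before the next is embedded, only `batch_size` documents need to be in memory at once on the producer side.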

@lalitpagaria Would this maybe be something that is of interest to you and you would like to work on? If not, @tanaysoni can take care of it.

tholor changed the title from "Updating Embeddings for Faiss DPR on Large Dataset" to "Updating Embeddings for Faiss DPR on Large Dataset (Batchmode)" on Nov 18, 2020
tholor added this to the #5 milestone on Nov 18, 2020
lalitpagaria (Contributor) commented

@tholor I would like to work on it, but only after a week. So if it is not something urgent, I can take it up.

tholor (Member) commented Nov 18, 2020

Ok awesome! That sounds perfectly fine. Thank you @lalitpagaria :)

vinchg (Author) commented Nov 18, 2020

Thank you for the response. I'll give postgres a test.

Do you already have any embeddings in the document store or is that a fresh one?

It is a fresh one.

I want to mention another issue I found - unrelated to the db. I pruned my dataset to a size of 1.3 mil and reran on the original configuration with SQLite. The call to query.all() resolves quickly, but then proceeds to loop (for 15+ hours) here:
https://github.com/deepset-ai/FARM/blob/f8660466d5b78db8cb91603ef88d5988a12956a1/farm/data_handler/processor.py#L338

Originating from:
https://github.com/deepset-ai/FARM/blob/f8660466d5b78db8cb91603ef88d5988a12956a1/farm/data_handler/processor.py#L415

I added tqdm and it's giving me a ~40 hour estimation (around 9 secs per iter) just to tokenize the dataset. I am using facebook/dpr-ctx_encoder-single-nq-base for the passage embedding model with a max seq len of 256. For reference, I'm working with a 10900F and a 3090.

The issue is here:
https://github.com/deepset-ai/FARM/blob/f8660466d5b78db8cb91603ef88d5988a12956a1/farm/data_handler/processor.py#L2001

*edit: Comparatively, I tried tokenizing the same dataset outside of the API and it took 17 minutes. I'm working around it currently by calculating my own embeddings and adding them to the docs, but this might also be something to look into.

tholor (Member) commented Nov 19, 2020

Hey @vinchg ,
The speed issue that you mention seems related to #602. We will investigate and optimize it (probably via batching and/or multiprocessing)!

tholor (Member) commented Dec 3, 2020

I want to mention another issue I found - unrelated to the db. I pruned my dataset to a size of 1.3 mil and reran on the original configuration with SQLite. The call to query.all() resolves quickly, but then proceeds to loop (for 15+ hours) here:
https://github.com/deepset-ai/FARM/blob/f8660466d5b78db8cb91603ef88d5988a12956a1/farm/data_handler/processor.py#L338
Originating from:
https://github.com/deepset-ai/FARM/blob/f8660466d5b78db8cb91603ef88d5988a12956a1/farm/data_handler/processor.py#L415

@vinchg Not sure if you saw this, but we fixed this one in deepset-ai/FARM#638. A loop with O(n²) was causing the trouble...
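The thread does not show the loop that was fixed, so the following is only a generic illustration of how an O(n²) loop can creep into preprocessing, not FARM's actual code. The function names and the `doc-…` ids are made up for the example.

```python
from typing import Dict, List


def match_quadratic(ids: List[str], wanted: List[str]) -> List[int]:
    """list.index() scans from the start on every call: O(n) per lookup, O(n^2) overall."""
    return [ids.index(w) for w in wanted]


def match_linear(ids: List[str], wanted: List[str]) -> List[int]:
    """Build a dict once, then each lookup is O(1) on average: O(n) overall."""
    position: Dict[str, int] = {doc_id: i for i, doc_id in enumerate(ids)}
    return [position[w] for w in wanted]


ids = [f"doc-{i}" for i in range(2000)]
wanted = list(reversed(ids))

# Both versions return the same answer; only the amount of work differs.
same = match_quadratic(ids, wanted) == match_linear(ids, wanted)
```

At a few thousand items the difference is barely noticeable, but at 1.3 million items a pattern like the first function can plausibly turn minutes into many hours.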

Memory efficiency should soon be improved by #620 and we'll afterwards also introduce a batch mode for update_embeddings()...

brandenchan (Contributor) commented

Hi @vinchg, the new PR #733 should significantly improve the memory efficiency of update_embeddings(). Could you give this a try and let us know if it helps?

vinchg (Author) commented Jan 14, 2021

Thank you guys! I'm a bit busy with other things atm, so I'm not sure when I'll get around to testing, but when I do, I'll post my results.
