Updating Embeddings for Faiss DPR on Large Dataset (Batchmode) #601
Comments
Hey @vinchg, thanks for raising this issue. It's highly relevant. So in general: for larger datasets like this, we would recommend not using SQLite but rather Postgres, and then connecting in Python via:
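A minimal sketch of what that could look like (assuming a local Postgres database named `haystack` already exists and a Postgres driver such as `psycopg2` is installed; parameter names can differ slightly between Haystack versions):

```python
from haystack.document_store.faiss import FAISSDocumentStore

# Point the SQL backend at Postgres instead of the default SQLite file.
# The connection string follows the standard SQLAlchemy format:
#   postgresql://<user>:<password>@<host>:<port>/<database>
document_store = FAISSDocumentStore(
    sql_url="postgresql://postgres:password@localhost:5432/haystack",
    faiss_index_factory_str="HNSW",  # keep the HNSW index used in this issue
)
```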
However, this will probably only increase memory efficiency but not resolve the underlying problem.
Do you already have any embeddings in the document store, or is it a fresh one?
As a temporary workaround, you could generate the embeddings in batches yourself before calling `write_documents()`, along the lines of the sketch below.
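A rough sketch of that workaround (assuming `all_docs` is your full list of document dicts and `retriever` is the DPR retriever; the exact `embed_passages()` signature varies between versions and may expect `Document` objects rather than raw texts):

```python
batch_size = 10_000

for i in range(0, len(all_docs), batch_size):
    batch = all_docs[i:i + batch_size]
    # Compute DPR passage embeddings for this batch only.
    embeddings = retriever.embed_passages([d["text"] for d in batch])
    # Attach each embedding to its document dict.
    for doc, emb in zip(batch, embeddings):
        doc["embedding"] = emb
    # Flush the batch to the document store so the whole dataset
    # never has to sit in memory at once.
    document_store.write_documents(batch)
```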
Again, we should in any case add the batch functionality to `update_embeddings()` soon. @lalitpagaria, would this maybe be something that is of interest to you and that you would like to work on? If not, @tanaysoni can take care of it.
@tholor I would like to work on it, but only after a week. So if it is not something urgent, I can take it up.
Ok awesome! That sounds perfectly fine. Thank you @lalitpagaria :)
Thank you for the response. I'll give Postgres a test.
It is a fresh one. I also want to mention another issue I found, unrelated to the db. I pruned my dataset to 1.3 million documents and reran the original configuration with SQLite. I added tqdm to the tokenization step and it's giving me a ~40 hour estimate (around 9 secs per iteration) just to tokenize the dataset. *edit: Comparatively, I tried tokenizing the same dataset outside of the API and it took 17 minutes. I'm working around it currently by calculating my own embeddings and adding them to the docs, but this might also be something to look into.
@vinchg Not sure if you saw this, but we fixed this one in deepset-ai/FARM#638. A loop with O(n²) was causing the trouble... Memory efficiency should soon be improved by #620, and we'll afterwards also introduce a batch mode for `update_embeddings()`.
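(Just to illustrate the class of bug, not the actual FARM code: rebuilding a list inside a loop makes the whole pass quadratic in the number of items, while appending in place keeps it linear.)

```python
batches = [[i] * 100 for i in range(1_000)]  # toy data

# O(n^2): each iteration copies the entire accumulated list so far.
all_samples = []
for batch in batches:
    all_samples = all_samples + batch

# O(n): extend() appends in place without re-copying previous items.
all_samples = []
for batch in batches:
    all_samples.extend(batch)
```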
Thank you guys! I'm a bit busy with other things atm, so I'm not sure when I'll get around to testing, but when I do, I'll post my results.
Question
As mentioned in the title, I'm trying to update the embeddings for Faiss (HNSW) with a DPR retriever on a large dataset (~15 million documents). Following the tutorial steps, I'm writing the documents to a local SQLite db (~15 GB .db file). However, calling update_embeddings uses all my RAM (64 GB) and all my swap (64 GB) and proceeds to run out of memory after running for hours. I'm fairly certain the line that's consuming so much memory is:
haystack/haystack/document_store/faiss.py, line 158 (commit 3f81c93)
More specifically:
haystack/haystack/document_store/sql.py, line 124 (commit 3f81c93)
The call to query.all() is what's causing the issue.
I'm not very familiar with SQLAlchemy and SQLite in general, but it seems that there is some inefficient memory usage when querying the DB. Is there an alternative way of dealing with large datasets like this? It would be preferable if there were a batch option for update_embeddings so that a group of embeddings would be flushed to disk before proceeding.
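For illustration, the generic SQLAlchemy pattern that avoids materializing the whole result set is yield_per(), which streams rows in fixed-size chunks instead of loading everything up front like query.all(). This is not Haystack's actual code; the toy model below is purely hypothetical:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class DocumentRow(Base):  # hypothetical stand-in for the ORM class in sql.py
    __tablename__ = "document"
    id = Column(Integer, primary_key=True)
    text = Column(String)

engine = create_engine("sqlite:///haystack_test.db")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

batch_size = 10_000

# yield_per() keeps roughly batch_size rows in memory at a time,
# unlike query.all(), which loads every row before returning anything.
for row in session.query(DocumentRow).yield_per(batch_size):
    pass  # e.g. convert the row to a Document and embed it batch by batch
```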
On a side note, some type of progress indicator or option to enable one when calling write_documents or update_embeddings would be useful (considering these operations can take an hour or more).