
Updating Embeddings for Faiss DPR on Big Dataset very slow #602

Closed
nliu86 opened this issue Nov 18, 2020 · 9 comments

Labels: type:feature New feature or request

nliu86 commented Nov 18, 2020

Updating embeddings for FAISS DPR on a big dataset seems quite slow to me.
I'm fairly certain the bottleneck is this call to dataset_from_dicts inside _get_predictions in DPR:
dataset, tensor_names, baskets = self.processor.dataset_from_dicts(
dicts, indices=[i for i in range(len(dicts))], return_baskets=True
)

It converts the documents to datasets. With around 100,000 documents, this call takes about half an hour; with around 400,000 documents, it effectively never finishes.

It seems we are using a single worker to process the documents. This could be improved with parallel processing and batch processing; a rough sketch of what I mean follows.
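
Something along these lines is what I have in mind (just a sketch, not existing Haystack code: dataset_from_dicts is the real FARM call, but the chunking and pool wiring here are hypothetical, and the processor would need to be picklable or re-created per worker):

# Sketch only: chunk the documents and run dataset_from_dicts on each
# chunk in a pool of worker processes.
from functools import partial
from multiprocessing import Pool

def process_chunk(chunk, processor):
    # Each worker converts its own slice of documents to a dataset.
    return processor.dataset_from_dicts(
        chunk, indices=list(range(len(chunk))), return_baskets=True
    )

def parallel_dataset_from_dicts(dicts, processor, num_workers=8, chunk_size=10_000):
    chunks = [dicts[i:i + chunk_size] for i in range(0, len(dicts), chunk_size)]
    with Pool(num_workers) as pool:
        # Assumes the processor can be pickled and sent to the workers.
        return pool.map(partial(process_chunk, processor=processor), chunks)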

Another thing I noticed is that we store all embeddings on the GPU while predicting the passage embeddings. I'm not sure this will work on a big dataset (an alternative is sketched below):
all_embeddings = {"query": torch.tensor([]).to(self.device), "passages": torch.tensor([]).to(self.device)}

Finally, I want to thank you guys for such a wonderful package. In just two days, I built a question answering demo with your great tool. You guys are awesome!

Thanks,

Nick

nliu86 added the type:feature label on Nov 18, 2020
tholor (Member) commented Nov 18, 2020

Hey Nick,

Thanks for reporting!

We will investigate it:

1. Speed of processor.dataset_from_dicts
We should first check whether there's potential to optimize the processing function directly.
Do you use the fast tokenizers for DPR via use_fast_tokenizers=True in the DPR init? (See the example after this list.)
Similar to other parts of Haystack (DPR training, Reader inference, ...), we could of course leverage multiprocessing here. We just wanted to keep this part as simple as possible and hadn't yet seen preprocessing become a bottleneck with the fast tokenizers.
But if multiprocessing speeds it up significantly, we should add it here.

2. Memory efficiency of all_embeddings
We will introduce a batch mode for update_embeddings. That should resolve your concern without our having to touch all_embeddings in the lower-level _get_predictions(). See #601.
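
For reference on point 1, enabling the fast tokenizers looks roughly like this (the model names are the common DPR defaults; swap in whatever you use, and document_store is your existing FAISSDocumentStore):

# Rough example of a DPR init with fast tokenizers enabled.
from haystack.retriever.dense import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,  # your existing FAISSDocumentStore
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_fast_tokenizers=True,
)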

tholor added this to the #5 milestone on Nov 18, 2020
nliu86 (Author) commented Nov 19, 2020

Thanks @tholor for your quick response! I didn't use use_fast_tokenizers=True yesterday. Today I tried the option and the processor is similarly slow, so I think there would be a real benefit to leveraging multiprocessing here. By my estimation, without multiprocessing we would need 100-200 hours to tokenize all of Wikipedia.

brandenchan (Contributor) commented Nov 24, 2020

Hi Nick, I have been looking into this issue and found a loop in one of the subfunctions of dataset_from_dicts that was causing the run time to scale quadratically with the number of documents. It seems your intuition was correct! Have a look at the issue linked above if you want the details; a toy illustration of the pattern follows.
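
To illustrate the general pattern (a toy example only, not the actual FARM code):

# Toy illustration: scanning everything processed so far for every new
# item makes the whole conversion O(n^2).
def match_ids_quadratic(ids, items):
    # One O(n) scan per id, so O(n^2) overall as the corpus grows.
    return [next(it for it in items if it["id"] == i) for i in ids]

def match_ids_linear(ids, items):
    # Build the index once, then every lookup is O(1), so O(n) overall.
    by_id = {it["id"]: it for it in items}
    return [by_id[i] for i in ids]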

Can you test this proposed change by installing the remove_quadratic_loop branch of FARM?

UPDATE: The remove_quadratic_loop branch has been merged into master

brandenchan (Contributor) commented
Your point about the memory efficiency of all_embeddings is also a good one! I've started working on a fix in #618.

brandenchan (Contributor) commented
@nliu86 The speed and memory improvements have now been implemented. I hope they help your case! Let us know whether you notice a difference in performance.

nliu86 (Author) commented Dec 4, 2020

Hi @brandenchan, thank you for making all the improvements! After I run pip install --editable ., how do I make sure my Haystack install uses the latest FARM code so that I can take advantage of your speed improvements?

brandenchan (Contributor) commented
@nliu86 After running pip install --editable . in your environment to install Haystack, you'll need to install FARM from its git repo with the following commands.

git clone https://github.com/deepset-ai/FARM.git
cd FARM
pip install -e .

Then run pip list to check that the version of FARM you are using comes from the repo you just cloned!
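
If you prefer to check from Python (assuming FARM's package imports as farm), the printed path should point into the repo you just cloned:

# Sanity check: the path should look like .../FARM/farm/__init__.py,
# confirming the editable install is the one being picked up.
import farm
print(farm.__file__)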

nliu86 (Author) commented Dec 7, 2020

@brandenchan Thanks for your helpful instructions. I'm able to test your improvements now. Here are the numbers:

Before your improvements:
100K documents: 27 minutes for dataset_from_dicts
500K documents: dataset_from_dicts effectively never finished

After your improvements:
100K documents: 2 minutes for dataset_from_dicts
500K documents: 10 minutes for dataset_from_dicts

It's clear your improvements work. Thank you for the great job!

brandenchan (Contributor) commented

Really glad to hear it!
