
Updating Embeddings for Faiss DPR on Big Dataset very slow #602

Closed
nliu86 opened this issue Nov 18, 2020 · 9 comments

Labels: type:feature New feature or request

nliu86 commented Nov 18, 2020

Updating embeddings for FAISS DPR on a big dataset seems quite slow to me.
I'm fairly certain the bottleneck is this call to dataset_from_dicts inside _get_predictions in DPR:
dataset, tensor_names, baskets = self.processor.dataset_from_dicts(
dicts, indices=[i for i in range(len(dicts))], return_baskets=True
)

It converts the documents to datasets. With around 100,000 documents, this call takes about half an hour; with around 400,000 documents, it effectively never finishes.

It seems we are using a single worker to process the documents. This could be improved with parallel processing and batch processing; a rough sketch of what I mean follows.
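
Something along these lines is what I have in mind (just a sketch, not existing Haystack code: dataset_from_dicts is the real FARM call, but the chunking and pool wiring here are hypothetical, and the processor would need to be picklable or re-created per worker):

# Sketch only: chunk the documents and run dataset_from_dicts on each
# chunk in a pool of worker processes.
from functools import partial
from multiprocessing import Pool

def process_chunk(chunk, processor):
    # Each worker converts its own slice of documents to a dataset.
    return processor.dataset_from_dicts(
        chunk, indices=list(range(len(chunk))), return_baskets=True
    )

def parallel_dataset_from_dicts(dicts, processor, num_workers=8, chunk_size=10_000):
    chunks = [dicts[i:i + chunk_size] for i in range(0, len(dicts), chunk_size)]
    with Pool(num_workers) as pool:
        # Assumes the processor can be pickled and sent to the workers.
        return pool.map(partial(process_chunk, processor=processor), chunks)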

Another thing I noticed is that we store all embeddings on the GPU while predicting the passage embeddings. I'm not sure this will work on a big dataset (an alternative is sketched below):
all_embeddings = {"query": torch.tensor([]).to(self.device), "passages": torch.tensor([]).to(self.device)}

Finally, I want to thank you guys for such a wonderful package. In just two days, I built a question answering demo with your great tool. You guys are awesome!

Thanks,

Nick

nliu86 added the type:feature label on Nov 18, 2020
tholor (Member) commented Nov 18, 2020

Hey Nick,

Thanks for reporting!

We will investigate it:

1. Speed of processor.dataset_from_dicts
We should first check whether there's potential to optimize the processing function directly.
Do you use the fast tokenizers for DPR via use_fast_tokenizers=True in the DPR init? (See the example after this list.)
Similar to other parts of Haystack (DPR training, Reader inference, ...), we could of course leverage multiprocessing here. We just wanted to keep this part as simple as possible and hadn't yet seen preprocessing become a bottleneck with the fast tokenizers.
But if multiprocessing speeds it up significantly, we should add it here.

2. Memory efficiency of all_embeddings
We will introduce a batch mode for update_embeddings. That should resolve your concern without our having to touch all_embeddings in the lower-level _get_predictions(). See #601.
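
For reference on point 1, enabling the fast tokenizers looks roughly like this (the model names are the common DPR defaults; swap in whatever you use, and document_store is your existing FAISSDocumentStore):

# Rough example of a DPR init with fast tokenizers enabled.
from haystack.retriever.dense import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,  # your existing FAISSDocumentStore
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_fast_tokenizers=True,
)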

tholor added this to the #5 milestone on Nov 18, 2020
nliu86 (Author) commented Nov 19, 2020

Thanks @tholor for your quick response! I didn't use use_fast_tokenizers=True yesterday. Today I tried the option and the processor is similarly slow, so I think there would be a real benefit to leveraging multiprocessing here. By my estimation, without multiprocessing we would need 100-200 hours to tokenize all of Wikipedia.

brandenchan (Contributor) commented Nov 24, 2020

Hi Nick, I have been looking into this issue and found a loop in one of the subfunctions of dataset_from_dicts that was causing the run time to scale quadratically with the number of documents. It seems your intuition was correct! Have a look at the issue linked above if you want the details; a toy illustration of the pattern follows.
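
To illustrate the general pattern (a toy example only, not the actual FARM code):

# Toy illustration: scanning everything processed so far for every new
# item makes the whole conversion O(n^2).
def match_ids_quadratic(ids, items):
    # One O(n) scan per id, so O(n^2) overall as the corpus grows.
    return [next(it for it in items if it["id"] == i) for i in ids]

def match_ids_linear(ids, items):
    # Build the index once, then every lookup is O(1), so O(n) overall.
    by_id = {it["id"]: it for it in items}
    return [by_id[i] for i in ids]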

Can you test this proposed change by installing the remove_quadratic_loop branch of FARM?

UPDATE: The remove_quadratic_loop branch has been merged into master

brandenchan (Contributor) commented
Your point about the memory efficiency of all_embeddings is also a good one! I've started working on a fix in #618.

brandenchan (Contributor) commented
@nliu86 The speed and memory improvements have now been implemented. I hope they help your case! Let us know whether you notice a difference in performance.

nliu86 (Author) commented Dec 4, 2020

Hi @brandenchan, thank you for making all the improvements! After I run pip install --editable ., how do I make sure my Haystack install uses the latest FARM code so that I can take advantage of your speed improvements?

brandenchan (Contributor) commented
@nliu86 After running pip install --editable . in your environment to install Haystack, you'll need to install FARM from its git repo with the following commands.

git clone https://github.com/deepset-ai/FARM.git
cd FARM
pip install -e .

Then run pip list to check that the version of FARM you are using comes from the repo you just cloned!
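
If you prefer to check from Python (assuming FARM's package imports as farm), the printed path should point into the repo you just cloned:

# Sanity check: the path should look like .../FARM/farm/__init__.py,
# confirming the editable install is the one being picked up.
import farm
print(farm.__file__)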

nliu86 (Author) commented Dec 7, 2020

@brandenchan Thanks for your helpful instructions. I'm able to test your improvements now. Here are the numbers:

Before your improvements:
100K documents: 27 minutes for dataset_from_dicts
500K documents: dataset_from_dicts effectively never finished

After your improvements:
100K documents: 2 minutes for dataset_from_dicts
500K documents: 10 minutes for dataset_from_dicts

It's clear your improvements work. Thank you for the great job!

brandenchan (Contributor) commented

Really glad to hear it!
