Updating Embeddings for Faiss DPR on Big Dataset very slow #602
Comments
Hey Nick, thanks for reporting! We will investigate it:
1. Speed of `dataset_from_dicts`
2. Memory efficiency of `all_embeddings`
Thanks @tholor for your quick response! I didn't use `use_fast_tokenizers=True` yesterday. Today I tried this option and the processor is similarly slow, so I think there is a benefit here to leveraging multiprocessing. By my estimate, without multiprocessing we would need 100-200 hours to tokenize the whole of Wikipedia.
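(A minimal sketch of the multiprocessing idea, not FARM's actual code: `tokenize_doc` is a hypothetical stand-in for whatever per-document work `dataset_from_dicts` does internally, reduced here to a whitespace split so the example runs.)

```python
from multiprocessing import Pool

def tokenize_doc(doc: dict) -> list:
    # Placeholder: in the real pipeline this would be the processor's
    # per-document featurization (tokenize + convert tokens to IDs).
    return doc["text"].split()

def tokenize_all(dicts: list, num_workers: int = 8) -> list:
    # Fan the documents out over worker processes; a large chunksize
    # keeps inter-process overhead low when documents are small.
    with Pool(processes=num_workers) as pool:
        return pool.map(tokenize_doc, dicts, chunksize=1000)

if __name__ == "__main__":
    docs = [{"text": "some passage text"} for _ in range(10)]
    print(len(tokenize_all(docs)))
```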
Hi Nick, I have been looking into this issue and found a loop in one of the subfunctions of `dataset_from_dicts` that was causing run time to scale quadratically as the number of documents increased. It seems your intuition was correct! Have a look at the issue that I linked above if you want some details. Can you test this proposed change by installing the updated code?
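(The offending subfunction isn't quoted in the thread, so purely as an illustration, here is the classic way a hidden loop turns linear work quadratic. The numbers fit: under quadratic scaling, 4x the documents, 100k to 400k, means roughly 16x the runtime, so half an hour becomes about eight hours.)

```python
def build_quadratic(items):
    result = []
    for item in items:
        result = result + [item]   # copies all of `result` each iteration -> O(n^2) overall
    return result

def build_linear(items):
    result = []
    for item in items:
        result.append(item)        # amortized O(1) per item -> O(n) overall
    return result
```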
Your point about the memory efficiency of `all_embeddings` is also a good one! I've started working on a fix in #618.
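(A rough sketch of the direction such a fix can take, using a hypothetical `embed_batches` helper rather than Haystack's real API: keep only the current batch on the GPU and collect results on the CPU, concatenating once at the end instead of growing a GPU tensor batch by batch.)

```python
import torch

def embed_batches(model, batches, device):
    # Sketch: GPU memory stays bounded by one batch, not the whole corpus.
    chunks = []
    with torch.no_grad():
        for batch in batches:
            emb = model(batch.to(device))  # (batch_size, dim) on the GPU
            chunks.append(emb.cpu())       # move off the GPU immediately
    # A single concatenation at the end avoids the repeated copies that
    # growing a tensor with torch.cat inside the loop would cause.
    return torch.cat(chunks, dim=0)
```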
@nliu86 The speed and memory improvements have been implemented now. Hope they help your case! Let us know if you notice any difference in performance.
Hi @brandenchan, thank you for making all the improvements! When I run `pip install --editable .`, how do I make sure my Haystack will use the latest FARM code so that I can take advantage of your speed improvements?
@nliu86 After running […]. Then run […].
@brandenchan Thanks for your helpful instructions. I'm able to test your improvement now. Here are the numbers: […] After your improvements: […] It's clear your improvements work. Thank you for the great job!
Really glad to hear it!
Original issue:

It seems to me that updating embeddings for Faiss DPR on a big dataset is quite slow.
I'm fairly certain the problem is in this call to `dataset_from_dicts` inside DPR's `_get_predictions` function:
```python
dataset, tensor_names, baskets = self.processor.dataset_from_dicts(
    dicts, indices=[i for i in range(len(dicts))], return_baskets=True
)
```
It converts documents into datasets. When the number of documents is around 100,000, this call takes about half an hour. But when the number of documents is around 400,000, it takes almost forever.
It seems we are using a single worker to process the documents. We could improve this with parallel processing and batch processing, e.g. along the lines of the sketch below.
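(A hedged sketch of the batching idea, reusing the call shape quoted above; whether `indices` should be chunk-local or global offsets isn't clear from the snippet alone, so treat this as the general pattern rather than drop-in code.)

```python
from torch.utils.data import ConcatDataset

def dataset_from_dicts_chunked(processor, dicts, chunk_size=10_000):
    # Converting bounded chunks keeps each call in the regime where
    # dataset_from_dicts is still fast, even if it scales super-linearly.
    datasets, tensor_names = [], None
    for start in range(0, len(dicts), chunk_size):
        chunk = dicts[start:start + chunk_size]
        dataset, tensor_names, _baskets = processor.dataset_from_dicts(
            chunk, indices=[i for i in range(len(chunk))], return_baskets=True
        )
        datasets.append(dataset)
    return ConcatDataset(datasets), tensor_names
```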
Another thing I noticed is that we store all embeddings on the GPU when we run prediction for the passages. I'm not sure this will work if we have a big dataset:
```python
all_embeddings = {"query": torch.tensor([]).to(self.device), "passages": torch.tensor([]).to(self.device)}
```
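(For scale, an editorial estimate rather than a number from the thread: assuming the standard DPR Wikipedia split of roughly 21 million passages and 768-dimensional float32 embeddings, holding all passage embeddings on the GPU would take about 21,000,000 × 768 × 4 bytes ≈ 64 GB, far more than the memory of a single GPU.)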
Finally, I want to thank you guys for such a wonderful package. In just two days, I built a question answering demo with your great tool. You guys are awesome!
Thanks,
Nick