fix: Replace multiprocessing tokenization with batched fast tokenization #3089
Conversation
I remember assessing the full switch to fast tokenizers ~1-1.5 years ago. Back then the blocker was that not all popular model architectures (i.e., their tokenizers) were supported by fast tokenizers. Could you please verify that this is now the case for the most common ones we see being used by our community (telemetry might help here)? roberta, electra, minilm, t5 are the first ones that pop into my mind, but there are many more...
Adding it to my todo list for tomorrow.
The most up-to-date list is available in the transformers docs.
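One quick way to verify this programmatically is sketched below; the checkpoint names are just common examples, not an exhaustive list.

```python
from transformers import AutoTokenizer

# Check whether a fast (Rust-backed) tokenizer is available for some commonly
# used checkpoints; is_fast is False when only the slow Python implementation
# can be loaded.
for name in [
    "roberta-base",
    "google/electra-base-discriminator",
    "microsoft/MiniLM-L12-H384-uncased",
    "t5-base",
]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, "fast:", tokenizer.is_fast)
```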
batch_size = self.max_multiprocessing_chunksize
for i in tqdm(range(0, num_dicts, batch_size), desc="Preprocessing dataset", unit=" Dicts"):
    processing_batch = dicts[i : i + batch_size]
    indices = [i for i in range(len(processing_batch))]  # indices is a remnant from the multiprocessing era
I think you can also use
indices = list(range(len(processing_batch)))
which is a bit faster than the list comprehension for this use case.
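For reference, a minimal sketch of how to check that difference (the batch length of 512 is just an illustrative value):

```python
import timeit

# Compare building the indices list via a comprehension vs. list(range(...));
# n approximates a typical processing_batch length.
n = 512
t_comprehension = timeit.timeit(lambda: [i for i in range(n)], number=10_000)
t_list_range = timeit.timeit(lambda: list(range(n)), number=10_000)
print(f"comprehension: {t_comprehension:.4f}s, list(range): {t_list_range:.4f}s")
```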
@sjrl I updated the branch a bit more and completed some rudimentary performance measurements. I found the old tokenizer batch_encode_plus method to be twice as slow as the default call on the tokenizer. We can remove batch_encode_plus in a separate PR, but I added it here to ensure CI is green. As we are not only doing tokenization but also some 'basketization' of data, the old multiprocessing approach is still faster (especially because we used fast tokenizers already), but I don't think the slowdown is as dramatic as we feared. I'll do more measurements on larger bodies of text. So far it looks good.
As @sjrl is currently away, would you please look at this one, @julian-risch?
@vblagoje I'm back today, but having @julian-risch's eyes on this as well would be great! It looks good to me. Do you have some final timing comparisons of the preprocessing before and after the changes?
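For context, a comparison along these lines could look as follows. This is a minimal sketch, not the actual benchmark script; roberta-base and the dummy texts are placeholder assumptions.

```python
import time
from transformers import AutoTokenizer

# Compare the (deprecated) batch_encode_plus with the tokenizer's default
# __call__ on the same batch of texts.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
texts = ["This is a sample sentence for the tokenization benchmark."] * 512

start = time.perf_counter()
tokenizer.batch_encode_plus(texts, return_offsets_mapping=True, add_special_tokens=False)
print(f"batch_encode_plus: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
tokenizer(texts, return_offsets_mapping=True, add_special_tokens=False)
print(f"__call__:          {time.perf_counter() - start:.3f}s")
```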
Looks very good to me and nice to see a small optimization regarding array copying (np.asarray) as well. 👍 There is one # TODO remove indices comment that needs further explanation though. Do you plan to address it before merging?
@@ -41,7 +37,7 @@ def __init__(
     eval_batch_size: Optional[int] = None,
     distributed: bool = False,
     automatic_loading: bool = True,
-    max_multiprocessing_chunksize: int = 2000,
+    max_multiprocessing_chunksize: int = 512,
Having a power of two makes sense 👍 Any intuition on why 512? 2048 would have been closer to the previous number.
Because this parameter corresponds to the batch size now and needs to work with a single process, right?
In a future release we should rename this parameter to batch_size, as there is no multiprocessing anymore. A separate PR could add a deprecation warning and a new batch_size parameter. We could then support both parameter names for some time until we remove max_multiprocessing_chunksize completely.
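A possible shape for that deprecation, as a hypothetical sketch rather than the actual Haystack code (the class here is a simplified stand-in):

```python
import warnings
from typing import Optional


class DataSilo:  # simplified, hypothetical stand-in for the real class
    def __init__(
        self,
        batch_size: Optional[int] = None,
        max_multiprocessing_chunksize: Optional[int] = None,
    ):
        if max_multiprocessing_chunksize is not None:
            warnings.warn(
                "max_multiprocessing_chunksize is deprecated; use batch_size instead.",
                DeprecationWarning,
            )
            # Honor the old name only if the new one was not given explicitly.
            if batch_size is None:
                batch_size = max_multiprocessing_chunksize
        # Fall back to the new default of 512 when neither parameter is set.
        self.batch_size = batch_size if batch_size is not None else 512
```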
Good point. Here is how I chose 512: this value translates directly into the length of the list of str segments passed to the tokenizer per call. The HF documentation says that, for the multithreading in the underlying tokenizers to have any effect, we should pass large arrays, and the default batch size for tokenization in HF datasets is 1000, so 512 seemed like a sensible default. We can increase it if we see a significant effect. In summary, we can definitely play with this value a bit more.
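To illustrate the idea of handing the fast tokenizer large batches in a single call, here is a minimal sketch; the model name, texts, and batch size are illustrative assumptions only.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
texts = ["some document text to tokenize"] * 2048
batch_size = 512  # each call hands 512 strings to the Rust tokenizer at once

for start in range(0, len(texts), batch_size):
    batch = texts[start : start + batch_size]
    # A single call on a large list lets the fast (Rust) tokenizer
    # parallelize internally instead of tokenizing text by text.
    encoded = tokenizer(batch, add_special_tokens=False, return_offsets_mapping=True)
```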
f"Got ya {num_cpus_used} parallel workers to convert {num_dicts} dictionaries " | ||
f"to pytorch datasets (chunksize = {multiprocessing_chunk_size})..." | ||
) | ||
log_ascii_workers(num_cpus_used, logger) |
In case we remove multiprocessing from inference as well in a separate PR, we could then remove the implementation of log_ascii_workers completely from haystack/modeling/utils.py.
for i in tqdm(range(0, num_dicts, batch_size), desc="Preprocessing dataset", unit=" Dicts"):
    processing_batch = dicts[i : i + batch_size]
    dataset, tensor_names, problematic_sample_ids = self.processor.dataset_from_dicts(
        dicts=processing_batch, indices=list(range(len(processing_batch)))  # TODO remove indices
There is a # TODO remove indices here. Could you please explain what this is about?
Right, multiprocessing relied on it: "indices used during multiprocessing so that IDs assigned to our baskets are unique". It is still in the Processor signature, but we can remove it in the future!
    texts, return_offsets_mapping=True, return_special_tokens_mask=True, add_special_tokens=False, verbose=False
)

# Extract relevant data
tokenids_batch = tokenized_docs_batch["input_ids"]
offsets_batch = []
for o in tokenized_docs_batch["offset_mapping"]:
-    offsets_batch.append(np.array([x[0] for x in o]))
+    offsets_batch.append(np.asarray([x[0] for x in o], dtype="int16"))
Looks like a nice small speed optimization. 👍
There are some np.array calls in the document stores (faiss, milvus, inmemory, ...) that we could maybe also change to np.asarray, if the arrays don't need to be copied. If the speedup is significant here, we should optimize the document stores in a separate PR. It's about arrays of document IDs and embeddings there.
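For context, the difference only matters when the input is already an ndarray; a minimal sketch:

```python
import numpy as np

existing = np.arange(10_000, dtype="int64")

copied = np.array(existing)    # always makes a copy by default
aliased = np.asarray(existing) # returns the same object if dtype and layout already match

print(copied is existing)   # False
print(aliased is existing)  # True

# When the input is a Python list (as in the offsets code above), both calls
# must allocate a new array, so there is little to gain there.
from_list = np.asarray([x for x in range(100)])
```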
Unfortunately, I didn't detect a significant speedup from the use of np.asarray.
You are right, @sjrl, we should have this. I'll prepare them soon.
Ok @sjrl and @julian-risch, I roughly benchmarked tokenizing the SQuAD train set. The old multi-process approach is approximately twice as fast as the current single-process approach. I tried fine-tuning the number of processes (max_processes) for the old approach, but the best result I could get was 11 sec for tokenizing the train dataset. With the new single-process approach, the train dataset tokenization took 23 sec. Should we try tokenizing some bigger datasets? It would be cool if you could give it a quick spin as well, @sjrl, just to make sure.
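A rough way to reproduce this kind of measurement for the tokenization step alone is sketched below; it is not the exact benchmark script, and the model name, batch size, and use of the HF datasets SQuAD split are assumptions. It also skips the "basketization" step, so absolute times will differ.

```python
import time
from datasets import load_dataset
from transformers import AutoTokenizer

# Time batched single-process tokenization over the SQuAD train contexts.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
contexts = load_dataset("squad", split="train")["context"]

batch_size = 512
start = time.perf_counter()
for i in range(0, len(contexts), batch_size):
    tokenizer(contexts[i : i + batch_size], add_special_tokens=False, return_offsets_mapping=True)
print(f"Tokenized {len(contexts)} contexts in {time.perf_counter() - start:.1f}s")
```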
Related Issues
Proposed Changes:
- Replace multiprocessing tokenization with batched fast tokenization (batch_encode_plus, encode_plus)

How did you test it?
No test yet

Todo
This PR is mainly in the draft stage; need to keep the CI churning.