Tokenizer encode very slow #398
Comments
Similar observation here. I trained a tokenizer and get the tokens with:

def huggingface_tokenize(text, tokenizer):
    return tokenizer.encode(text).tokens

Compared to a standard sklearn tokenizer:

%timeit huggingface_tokenize(text, tokenizer)
>>> 25.6 µs ± 347 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit sklearn_tokenize(text)
>>> 1.87 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Is this to be expected or is there a way to speed this up?
Did you install from source? Currently installing that way will work, but Rust is in debug mode and not in release mode, and that makes a huge difference. If you want to fix it right now, you just need to change

rust_extensions=[RustExtension("tokenizers.tokenizers", binding=Binding.PyO3)]

into

rust_extensions=[RustExtension("tokenizers.tokenizers", binding=Binding.PyO3, debug=False)]

We'll be updating that soon so that we can't shoot ourselves in the foot in the future. Otherwise, could either of you share the files & code necessary to reproduce? I'd be happy to find the bottlenecks and maybe fix them.
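For reference, here's a rough sketch of what that looks like in a full setup.py, assuming a source checkout of the Python bindings. Only the RustExtension line comes from the comment above; the surrounding fields are illustrative placeholders.

```python
# Minimal setup.py sketch showing the release-mode fix described above.
# Requires setuptools-rust to be installed; fields other than the
# RustExtension line are illustrative.
from setuptools import setup
from setuptools_rust import Binding, RustExtension

setup(
    name="tokenizers",
    rust_extensions=[
        # debug=False builds the Rust extension with optimizations
        # (release mode), which is what makes encoding fast.
        RustExtension("tokenizers.tokenizers", binding=Binding.PyO3, debug=False)
    ],
    zip_safe=False,
)
```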
Unfortunately, I cannot share the data, but here's a slightly different reproducible example:

import pandas as pd
from tokenizers import SentencePieceBPETokenizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
data = fetch_20newsgroups()
vec = CountVectorizer()
vec.fit(data.data)
text = data.data[0]
sklearn_tokenize = vec.build_tokenizer()
%timeit sklearn_tokenize(text)
>>> 40.5 µs ± 454 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
data_path = 'tokenize_benchmark.txt'
pd.Series(data.data).to_csv(data_path, header=False, index=False)
tokenizer = SentencePieceBPETokenizer()
tokenizer.train([data_path], vocab_size=16_000, min_frequency=2, limit_alphabet=1000)
def huggingface_tokenize(text, tokenizer):
    return tokenizer.encode(text).tokens

%timeit huggingface_tokenize(text, tokenizer)
>>> 228 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I've been lurking on this through mail notifications. I think part of the difference might be due to accessing the tokens on the returned Encoding, which goes through tokenizers/bindings/python/src/encoding.rs, lines 89 to 92 in 62c3d40.
First, there might actually be an implementation for this already. Edit: the implementation exists, so that might cut some inefficiency. The other question is what the sklearn tokenizer is actually doing in comparison.
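One way to check how much of the time goes into materializing the Python list of tokens is to time the raw encode call separately from the attribute access. A rough sketch, reusing the tokenizer and text objects from the example above:

```python
# Time the Rust-side encoding alone vs. encoding plus converting the result
# into Python objects, to see how much the .tokens access contributes.
%timeit tokenizer.encode(text)           # encoding only
%timeit tokenizer.encode(text).tokens    # encoding + building a Python list of strings
%timeit tokenizer.encode(text).ids       # encoding + building a Python list of ints
```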
Hi @sobayed, thanks for the example, that was helpful! As @sebpuetz mentioned, you are actually comparing two very different algorithms: sklearn's tokenizer is a simple regex-based word splitter, while SentencePieceBPETokenizer runs a full subword (BPE) algorithm.
The two are vastly different: the first one will yield quite a few "Unk" tokens, or you will need a huge vocabulary (which means machine learning models will be huge). Here is an example:

# 'supercool' is not in the train data
vec.vocabulary_['supercool']
# KeyError: 'supercool'

# On the other hand, the BPE algorithm manages to split this word into parts
tokenizer.encode('supercool').tokens
# ['▁super', 'c', 'ool']

I hope that explains the observed differences. That does not prevent us from finding more optimisations in the future to bring that down even further.
@traboukos I hope your example falls into a similar category (or the debug-mode problem I mentioned), but it's hard to say without access to your tokenizer or data file, or ones that can reproduce the problem (for instance, the BPE algorithm is known to be notably slow on languages that don't use whitespace).
Hi @Narsil, many thanks for your explanations! I'm actually aware of the differences in the algorithms. My question was mainly whether the method I'm currently using is the fastest way to get the tokens from a trained tokenizer, or whether there is a more efficient way. If this is already the best that is possible at the moment, I'm fine with that.
Well, you can get some speedups if you use

tokenizer.encode_batch([text, text, ...])

But it depends on your use case: depending on your code, you might get a deadlock if you are already using threading in Python (see #311). To actually get the speedup, you also need to make sure parallelism is enabled.
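A small sketch of what batch encoding looks like, with a hypothetical list of texts; the exact parallelism configuration depends on your setup, per the caveat above:

```python
# Encode many texts in one call; encode_batch can process inputs in
# parallel on the Rust side, unlike repeated single encode() calls.
texts = [text] * 1000  # e.g. 1,000 documents (placeholder)
encodings = tokenizer.encode_batch(texts)

# Each element is an Encoding object, just like with encode()
print(encodings[0].tokens[:10])
print(encodings[0].ids[:10])
```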
For my current use case, batch encoding is unfortunately not an option. Still good to know about it for the future, though! Many thanks again for your help 👍
Hi All,
I have trained a tokenizer on my own dataset, which consists of files with 50,000 lines of about 5,000 tokens each. The training process seems fast: all cores are utilised and it finishes in around 30 minutes for my dataset. However, the encoding process of single sentences, or even in batch, appears really slow. The following code finishes in 30 seconds for a file of 50,000 lines.
For context: opening a file, reading the lines into memory, looping in Python over every line, splitting on spaces, and replacing every token with an int from a dict lookup finishes in 6 seconds.
Are the speeds mentioned above normal? Trying to tokenize on the fly using TensorFlow datasets is hopeless currently, since my GPUs get 2% utilisation. Do I need to save the dataset in tokenized form? That is also a costly process, since it needs to be performed daily for a lot of data.
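For reference, here's a minimal sketch of the pure-Python baseline described above (read lines, split on space, map tokens to ints via a dict). The file names and the unknown-token handling are hypothetical placeholders, not part of the original report.

```python
# Pure-Python baseline: read lines, split on whitespace, and replace each
# token with an int via a dict lookup.
# 'vocab.txt' and 'my_dataset.txt' are hypothetical placeholders.
vocab = {}
with open("vocab.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        vocab[line.strip()] = i

unk_id = len(vocab)  # fallback id for tokens not in the vocabulary

def baseline_encode(path):
    encoded = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            encoded.append([vocab.get(tok, unk_id) for tok in line.split(" ")])
    return encoded

ids = baseline_encode("my_dataset.txt")
```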