Tokenizer encode very slow #398

Closed
traboukos opened this issue Sep 7, 2020 · 9 comments
@traboukos

Hi All,

I have trained a tokenizer on my own dataset, which consists of files with 50,000 lines of about 5,000 tokens each. Training seems fast: all cores are utilised and it finishes in around 30 minutes. However, encoding single sentences, or even batches, is really slow. The following code takes 30 seconds for a file of 50,000 lines.

from tokenizers import Tokenizer

t = Tokenizer.from_file('my-trained-tokenizer.json')
t.enable_padding(length=512)
t.enable_truncation(max_length=512)

# opening the file and reading the lines in memory takes 250ms
with open('my-file.txt') as f:
    lines = [f'[start]{l}' for l in f]

# encode_batch on the inputs takes 30 seconds for 50,000 lines
inputs = t.encode_batch(lines)

# looping and encoding one by one takes several minutes
for l in lines:
    ids = t.encode(l).ids

For context: opening the file, reading the lines into memory, looping in Python over every line, splitting on spaces, and replacing every token with an int via a dict lookup finishes in 6 seconds.
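
Roughly, that baseline looks like this (a sketch; vocab is a hypothetical token-to-id dict built beforehand, and 0 stands in for an unknown-token id):

vocab = {}  # hypothetical token -> int mapping built beforehand
UNK_ID = 0

# Pure-Python baseline: whitespace split + dict lookup per token
with open('my-file.txt') as f:
    baseline_ids = [
        [vocab.get(tok, UNK_ID) for tok in line.split(' ')]
        for line in f
    ]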

Are the speeds mentioned above normal? Trying to tokenize on the fly using TensorFlow datasets is hopeless currently, since my GPUs sit at 2% utilisation. Do I need to save the dataset in tokenized form? That is also costly, since it needs to be done daily for a lot of data.

sobayed commented Sep 14, 2020

Similar observation here. I trained a tokenizer (Tokenizer(vocabulary_size=64000, model=SentencePieceBPE, unk_token=<unk>, replacement=▁, add_prefix_space=True, dropout=None)) on a dataset of 200k records. I want to use the tokenizer in a scikit-learn pipeline, so I'm only interested in the resulting tokens of a text:

def huggingface_tokenize(text, tokenizer):
    return tokenizer.encode(text).tokens

Compared to a standard sklearn CountVectorizer, tokenization of an individual text is ~13x slower. See the benchmark:

%timeit huggingface_tokenize(text, tokenizer)
>>> 25.6 µs ± 347 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit sklearn_tokenize(text)
>>> 1.87 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Is this to be expected or is there a way to speed this up?

Narsil (Collaborator) commented Sep 14, 2020

Did you install tokenizers from source, with pip install -e .?

Currently installing that way will work, but Rust is compiled in debug mode rather than release mode, which makes a huge difference. If you want to fix it right now, you just need to change

rust_extensions=[RustExtension("tokenizers.tokenizers", binding=Binding.PyO3)]

into

rust_extensions=[RustExtension("tokenizers.tokenizers", binding=Binding.PyO3, debug=False)],

We'll be updating that soon so that we can't shoot ourselves in the foot in the future.

Otherwise, could either of you share the files and code necessary to reproduce this? I'd be happy to find the bottlenecks and maybe fix them.
That would be a huge help, thanks.

sobayed commented Sep 15, 2020

I installed tokenizers from PyPI, not from source.

Unfortunately, I cannot share the data but here's a slightly different reproducible example:

import pandas as pd

from tokenizers import SentencePieceBPETokenizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

data = fetch_20newsgroups()

vec = CountVectorizer()
vec.fit(data.data)
text = data.data[0]
sklearn_tokenize = vec.build_tokenizer()
%timeit sklearn_tokenize(text)
>>> 40.5 µs ± 454 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

data_path = 'tokenize_benchmark.txt'
pd.Series(data.data).to_csv(data_path, header=False, index=False)
tokenizer = SentencePieceBPETokenizer()
tokenizer.train([data_path], vocab_size=16_000, min_frequency=2, limit_alphabet=1000)

def huggingface_tokenize(text, tokenizer):
    return tokenizer.encode(text).tokens

%timeit huggingface_tokenize(text, tokenizer)
>>> 228 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Narsil self-assigned this Sep 15, 2020
sebpuetz (Contributor) commented Sep 15, 2020

I've been lurking on this through mail notifications. I think part of the difference might be due to accessing the tokens attribute on the returned Encoding, which copies all tokens twice:

#[getter]
fn get_tokens(&self) -> Vec<String> {
    self.encoding.get_tokens().to_vec()
}

First, to_vec() allocates a new String for each token in the Encoding and a new Vec with at least n_tokens capacity. Then a PyList is allocated, and each token from the new Vec is converted to a PyString and pushed to the PyList. The PyList might even grow a few times and get re-allocated; I'm currently not sure about the implementation of the Vec -> PyList conversion.

There might actually be an implementation for &[String] -> &PyList which could save at least the intermediate step in Rust. I don't think the conversion to PyList can be optimized a lot, though.

edit: The implementation exists, so that might cut some of the inefficiency:
https://github.com/PyO3/pyo3/blob/a0960f891801c0534856cb90fa90451828579470/src/types/list.rs#L165-L179
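
On the Python side, a practical consequence is that every access to enc.tokens goes through that getter and builds a fresh list, so it is worth reading it once per Encoding rather than repeatedly in a loop (a minimal sketch; the file path is hypothetical):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file('my-trained-tokenizer.json')  # hypothetical path
enc = tokenizer.encode('some example text')

# Read the tokens once and reuse the resulting list; each attribute access
# would otherwise pay the Rust -> Python conversion cost again.
tokens = enc.tokens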

The other question is: what does sklearn_tokenize() do? Is it just whitespace tokenizing? If so, then you're comparing very different methods and algorithms, and it wouldn't be surprising at all that it's much faster.

Narsil (Collaborator) commented Sep 15, 2020

Hi @sobayed,

Thanks for the example, that was helpful! As @sebpuetz mentioned, you are actually comparing two very different algorithms.

The sklearn example seems to be doing roughly whitespace splitting with some normalization.
huggingface does a BPE encoding algorithm.

The two are vastly different: the first will either yield quite a few "UNK" tokens or require a huge vocabulary size (which means machine learning models will be huge).
BPE, on the other hand, was designed so that unknown words are split into parts that should make sense for the language.
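
For reference, the sklearn side here is roughly equivalent to a regex split (a sketch, assuming CountVectorizer's default lowercasing and token_pattern):

import re

# CountVectorizer's default token_pattern keeps runs of word characters of
# length >= 2; lowercasing is applied by the default preprocessor beforehand.
token_pattern = re.compile(r'(?u)\b\w\w+\b')

def sklearn_like_tokenize(text):
    return token_pattern.findall(text.lower())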

Here is an example of the difference on a word the vectorizer has never seen:

# 'supercool' is not in the train data
vec.vocabulary_['supercool']
# KeyError: 'supercool'

# On the other hand the BPE algorithm manages to split this word into parts
tokenizer.encode('supercool').tokens
#  ['▁super', 'c', 'ool']

I hope that explains the observed differences. That doesn't prevent us from finding more optimisations in the future to bring that number down even further.

@traboukos I hope your case falls into a similar category (or the debug-mode problem I mentioned), but it's hard to say without access to your tokenizer or data file, or ones that reproduce the problem (for instance, the BPE algorithm is known to be notably slow on languages that don't use whitespace).

sobayed commented Sep 15, 2020

Hi @Narsil, many thanks for your explanations! I'm actually aware of the differences in the algorithms. My question was mainly whether the method I'm currently using is the fastest way to get the tokens using a trained tokenizer or whether there is a more efficient way.

If this is already the best that is possible at the moment, I'm fine with that.

Narsil (Collaborator) commented Sep 15, 2020

Well, you can get some speedup if you use encode_batch instead of encode, since it can use parallelization.

tokenizer.encode_batch([text, text, ....])

But it depends on your use case: depending on your code, you might get a deadlock if you are already using threading in Python: #311

To actually get the speedup you need to run with TOKENIZERS_PARALLELISM=1 mycommand.py; you should see a warning otherwise.
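
For example (a sketch; the tokenizer file path is hypothetical, and the environment variable should be set before the tokenizer does any work):

import os
os.environ['TOKENIZERS_PARALLELISM'] = '1'  # or run as: TOKENIZERS_PARALLELISM=1 python mycommand.py

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file('my-trained-tokenizer.json')  # hypothetical path
texts = ['first text', 'second text', 'third text']

# encode_batch hands the whole list to Rust and can parallelize across texts
encodings = tokenizer.encode_batch(texts)
token_lists = [enc.tokens for enc in encodings]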

sobayed commented Sep 15, 2020

For my current use case, batch encoding is unfortunately not an option. Still good to know about it though for the future! Many thanks again for your help 👍


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label May 16, 2024
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) May 22, 2024