Slow tokenizer-loading when many tokens (~17k) are added by a user. #31134

jaeminSon · 2024-05-30T07:40:51Z

System Info

Ubuntu 20.04.2 LTS
transformers==4.40.1

Who can help?

@ArthurZucker

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Add many tokens to the tokenizer using the tokenizer.add_tokens method.
Save it.
Load it (loading takes a very long time).
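The steps above could be scripted roughly as follows. This is a minimal sketch rather than the script used for the report: the helper names `make_tokens` and `time_reload`, the placeholder token format, and the choice of checkpoint are assumptions made for illustration. Only `AutoTokenizer.from_pretrained`, `add_tokens`, and `save_pretrained` come from transformers.

```python
import tempfile
import time


def make_tokens(n):
    """Return n distinct placeholder tokens (hypothetical naming scheme)."""
    return [f"<extra_token_{i}>" for i in range(n)]


def time_reload(model_name, n_tokens):
    """Add n_tokens to a tokenizer, save it, and time loading it back."""
    # Imported here so make_tokens stays usable without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.add_tokens(make_tokens(n_tokens))

    with tempfile.TemporaryDirectory() as save_dir:
        tokenizer.save_pretrained(save_dir)
        # Only the reload is timed, since that is the slow step reported here.
        start = time.perf_counter()
        AutoTokenizer.from_pretrained(save_dir)
        return time.perf_counter() - start
```

Calling, for example, `time_reload("gpt2", 17_000)` (the checkpoint name is an assumption) returns the reload time in seconds. Comparing it against a small token count such as `time_reload("gpt2", 10)` should show whether the slowdown scales with the number of added tokens.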

Reason: this list comprehension takes a long time:

transformers/src/transformers/tokenization_utils_fast.py

Line 173 in 2b9e252

tokens_to_add = [

Using the VS Code debugger, I found that when the list comprehension omits the comparison, as in the following, it is fast.

tokens_to_add = [
    token
    for index, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0])
]
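One plausible explanation, which this issue does not confirm, is that the right-hand side of the comparison is rebuilt on every iteration, so the filter costs O(n) work n times. The sketch below illustrates that pattern with a toy class and shows how hoisting the lookup out of the comprehension avoids it. `ToyTokenizer`, `tokens_to_add_naive`, and `tokens_to_add_hoisted` are hypothetical names, and the toy property only imitates the assumed behaviour; it is not the transformers implementation.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer; not the transformers class."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self.rebuilds = 0  # how many times the mapping was rebuilt

    @property
    def added_tokens_decoder(self):
        # Assumed behaviour: the mapping is rebuilt on every access.
        self.rebuilds += 1
        return {index: token for index, token in enumerate(self._tokens)}


def tokens_to_add_naive(tokenizer, candidates):
    # The property is evaluated once per candidate, so total work is quadratic.
    return [
        token
        for index, token in sorted(candidates.items(), key=lambda x: x[0])
        if token not in tokenizer.added_tokens_decoder.values()
    ]


def tokens_to_add_hoisted(tokenizer, candidates):
    # Evaluate the property once and keep a set for O(1) membership tests.
    existing = set(tokenizer.added_tokens_decoder.values())
    return [
        token
        for index, token in sorted(candidates.items(), key=lambda x: x[0])
        if token not in existing
    ]
```

Both functions return the same list; the difference is that the naive version touches the property once per candidate, while the hoisted version touches it once in total. The `rebuilds` counter on the toy class makes that visible without timing anything.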

There is a comment saying:

# The following logic will be replace with a single add_tokens once a fix is pushed to tokenizers
# allows converting a slow -> fast, non-legacy: if the `tokenizer.json` does not have all the added tokens
# uses the information stored in `added_tokens_decoder`.
# this is costly for fast tokenizers as we re-compute the regex again. But not all tokens are added tokens

Is there any plan to replace this code?

Expected behavior

Faster tokenizer initialization


ArthurZucker · 2024-06-05T12:44:46Z

Completely agree. #30574 will fix it! I just need to fix a breaking change made in tokenizers before I can release, and that should address this.

jaeminSon · 2024-06-10T06:16:02Z

Thanks for your time addressing this issue!

ArthurZucker · 2024-06-18T13:19:34Z

#31404 is fixing this as well! Kudos to @ydshieh

jaeminSon closed this as completed Jun 10, 2024
