Slow tokenizer-loading when many tokens (~17k) are added by a user. #31134

Closed
2 of 4 tasks
jaeminSon opened this issue May 30, 2024 · 3 comments

Comments

@jaeminSon

jaeminSon commented May 30, 2024

System Info

Ubuntu 20.04.2 LTS
transformers==4.40.1

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Add many tokens to the tokenizer using the tokenizer.add_tokens method.
  2. Save it.
  3. Load it (this takes too long).
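
The three steps above could be scripted roughly as follows. This is a sketch, not the script used for this report: the checkpoint name `gpt2`, the `<extra_i>` token format, and the helper names `make_tokens` and `time_reload` are assumptions chosen for illustration. Only `add_tokens`, `save_pretrained`, and `from_pretrained` are transformers calls.

```python
import tempfile
import time


def make_tokens(n):
    """Generate n distinct placeholder tokens (hypothetical naming scheme)."""
    return [f"<extra_{i}>" for i in range(n)]


def time_reload(model_name="gpt2", n_tokens=17_000):
    """Add tokens, save, and return the seconds spent loading the saved tokenizer."""
    # Imported lazily so make_tokens() works without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)  # assumed checkpoint
    tokenizer.add_tokens(make_tokens(n_tokens))  # step 1: add many tokens

    with tempfile.TemporaryDirectory() as save_dir:
        tokenizer.save_pretrained(save_dir)  # step 2: save
        start = time.perf_counter()
        AutoTokenizer.from_pretrained(save_dir)  # step 3: load (the slow part)
        return time.perf_counter() - start
```

Calling `time_reload()` downloads the checkpoint on first use. Comparing it with a small run such as `time_reload(n_tokens=10)` should show how much of the load time is due to the added tokens.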

Reason: this list comprehension takes a long time.

Using the VS Code debugger, I found that when the list comprehension omits the comparison, as in the following, it is fast.

tokens_to_add = [
    token
    for index, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0])
]
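
One plausible way a comparison inside such a comprehension can dominate, sketched under assumptions: if the container being compared against is rebuilt on every access (for example, a property that reconstructs a mapping each time), the comprehension does O(n) work per token and O(n²) work overall. The class and function names below (`FakeTokenizer`, `rebuild_count`, `tokens_to_add_slow`, `tokens_to_add_fast`) are hypothetical and do not reproduce the transformers implementation. They only illustrate the pattern and the effect of hoisting the lookup out of the loop.

```python
class FakeTokenizer:
    """Hypothetical stand-in; NOT the transformers implementation."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self.rebuild_count = 0  # number of times the mapping was rebuilt

    @property
    def added_tokens_decoder(self):
        # Assumption: the mapping is rebuilt on every access (O(n) each time).
        self.rebuild_count += 1
        return {index: token for index, token in enumerate(self._tokens)}


def tokens_to_add_slow(tokenizer, added_tokens_decoder):
    # The property is re-evaluated for every candidate token: O(n^2) overall.
    return [
        token
        for index, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0])
        if token not in tokenizer.added_tokens_decoder.values()
    ]


def tokens_to_add_fast(tokenizer, added_tokens_decoder):
    # Evaluate the property once and test membership against a set.
    known = set(tokenizer.added_tokens_decoder.values())
    return [
        token
        for index, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0])
        if token not in known
    ]
```

Both functions return the same list; the difference is that the second touches the expensive property once instead of once per token.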

There is a comment saying:

# The following logic will be replace with a single add_tokens once a fix is pushed to tokenizers
# allows converting a slow -> fast, non-legacy: if the `tokenizer.json` does not have all the added tokens
# uses the information stored in `added_tokens_decoder`.
# this is costly for fast tokenizers as we re-compute the regex again. But not all tokens are added tokens

Is there any plan to replace this code?

Expected behavior

Faster tokenizer initialization

@ArthurZucker
Collaborator

Completely agree. #30574 will fix it! I just need to fix a breaking change made in tokenizers before I can release, and that should address this.

@jaeminSon
Author

Thanks for your time addressing this issue!

@ArthurZucker
Collaborator

#31404 is fixing this as well! Kudos to @ydshieh.
