Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix convert_tokens_to_string when decoder is None #34569

Merged

Conversation

dszeto
Copy link
Contributor

@dszeto dszeto commented Nov 1, 2024

What does this PR do?

In convert_tokens_to_string of src/transformers/tokenization_utils_fast.py, self.backend_tokenizer.decoder can be None when the tokenizer is trained by tokenizers.trainers.WordLevelTrainer. This fix adds a check and falls back to joining tokens with a space when no decoder is found. This follows the default behavior in the Rust implementation of Tokenizers.

Original question posted on https://discuss.huggingface.co/t/pretrainedtokenizerfast-convert-tokens-to-string-always-assumes-the-presence-of-decoder/114978

All existing tokenizer tests are passing. Style changes were made by make fixup.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker

@AlonKellner-Jounce
Copy link

I want to emphasize my support for this PR with a simply reproducible error (and a real use-case) that this PR fixes:
When running vLLM with huggingface models which have a PreTrainedTokenizerFast tokenizer, for example:

python3 -m vllm.entrypoints.openai.api_server --model jounce/dummy-phi3

Currently fails like so:

ERROR 11-11 09:00:29 engine.py:158]   File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/detokenizer.py", line 122, in decode_sequence_inplace
ERROR 11-11 09:00:29 engine.py:158]     read_offset) = detokenize_incrementally(
ERROR 11-11 09:00:29 engine.py:158]                    ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-11 09:00:29 engine.py:158]   File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/detokenizer.py", line 301, in detokenize_incrementally
ERROR 11-11 09:00:29 engine.py:158]     prefix_text = tokenizer.convert_tokens_to_string(
ERROR 11-11 09:00:29 engine.py:158]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-11 09:00:29 engine.py:158]   File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_fast.py", line 641, in convert_tokens_to_string
ERROR 11-11 09:00:29 engine.py:158]     return self.backend_tokenizer.decoder.decode(tokens)
ERROR 11-11 09:00:29 engine.py:158]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-11 09:00:29 engine.py:158] AttributeError: 'NoneType' object has no attribute 'decode'

Great PR, tnx :)

@dszeto dszeto force-pushed the convert-tokens-to-string-wordlevel branch from 1dd4100 to bf0e086 Compare November 11, 2024 17:38
@dszeto dszeto force-pushed the convert-tokens-to-string-wordlevel branch from bf0e086 to 07ec57f Compare November 12, 2024 00:50
@dszeto
Copy link
Contributor Author

dszeto commented Nov 13, 2024

Hey @ArthurZucker , would be awesome if you could take a quick look at this.

Please let me know if someone else should review.

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are quite a few unrelated changes that made their way here, can you revert them so we can merge? 🤗

@ArthurZucker ArthurZucker merged commit 74db22f into huggingface:main Nov 25, 2024
24 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* Fix convert_tokens_to_string when decoder is None

* revert unrelated changs

---------

Co-authored-by: Arthur Zucker <[email protected]>
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024
* Fix convert_tokens_to_string when decoder is None

* revert unrelated changs

---------

Co-authored-by: Arthur Zucker <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants