
Respect add_prefix_space option in LlamaTokenizerFast #29694

Closed
wants to merge 4 commits

Conversation

aoxolotl

@aoxolotl aoxolotl commented Mar 17, 2024

What does this PR do?

Respect add_prefix_space option in Llama tokenizer (Fixes #29625)

The add_prefix_space option in the Llama tokenizer was set but never passed on to the super().__init__() call of PreTrainedTokenizerFast. As a result, the SPIECE_UNDERLINE token was prepended even when add_prefix_space=False.
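A minimal sketch of the forwarding bug, using simplified stand-in classes (assumption: these are toy classes for illustration, not the actual transformers source):

```python
# Toy stand-ins for the real classes (assumption: simplified for illustration).
class PreTrainedTokenizerFast:
    def __init__(self, add_prefix_space=True, **kwargs):
        # The parent configures the backend tokenizer, so it must
        # actually receive the option for it to have any effect.
        self.add_prefix_space = add_prefix_space


class LlamaTokenizerFast(PreTrainedTokenizerFast):
    def __init__(self, add_prefix_space=False, **kwargs):
        # Before the fix, the option was accepted here but never forwarded,
        # so the parent silently fell back to its own default.
        super().__init__(add_prefix_space=add_prefix_space, **kwargs)


print(LlamaTokenizerFast(add_prefix_space=False).add_prefix_space)  # False
```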

Minimal example

When add_prefix_space is False

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_prefix_space=False)
>>> tokenizer.tokenize('overheard')
['over', 'he', 'ard']

When add_prefix_space is True

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_prefix_space=True)
>>> tokenizer.tokenize('overheard')
['\u2581over', 'he', 'ard']

Fixes #29625

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @scruel

@aoxolotl aoxolotl marked this pull request as draft March 17, 2024 00:26
@aoxolotl aoxolotl force-pushed the llama_add_prefix_space branch from 23763dc to 29ffaeb Compare March 18, 2024 14:04
@aoxolotl aoxolotl changed the title Respect add_prefix_space option in LlamaTokenizerFast and T5TokenizerFast Respect add_prefix_space option in LlamaTokenizerFast Mar 18, 2024
@aoxolotl aoxolotl marked this pull request as ready for review March 18, 2024 14:54
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks! Could you add a test to make sure this works? 🤗

@aoxolotl
Author

Sure! Will update the pull request with the tests

@aoxolotl
Author

aoxolotl commented Mar 24, 2024

Hey @ArthurZucker, I have a question about the tests. There already seems to be an add_prefix_space test present in test_tokenization_llama.py. However, this strangely passes even without the modifications above. The difference seems to come from using AutoTokenizer (as in the mentioned issue) vs using the LlamaTokenizer* classes directly. Is there a difference between the init routes we take in the above two scenarios?

Edit:
The difference actually comes from using hf-internal-testing/llama-tokenizer-non-normalized vs meta-llama/Llama-2-7b-hf. Examples:

>>> hf_tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer-non-normalized", add_prefix_space=False, legacy=False)
>>> hf_tokenizer.tokenize('overheard')
['over', 'he', 'ard']

>>> tokenizer = LlamaTokenizerFast.from_pretrained("meta-llama/Llama-2-7b-hf", add_prefix_space=False, legacy=False)
>>> tokenizer.tokenize('overheard')
['\u2581over', 'he', 'ard']
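The checkpoint dependence above can be mimicked with a toy model (assumption: hypothetical stand-in logic; the real difference lives in each checkpoint's serialized tokenizer.json, e.g. a normalization step that prepends the underline unconditionally):

```python
# Toy model of the two checkpoints (assumption: hypothetical stand-in logic,
# hard-coded to the 'overheard' example; not the real tokenization pipeline).
SPIECE_UNDERLINE = "\u2581"

def fast_tokenize(text, baked_in_prepend, add_prefix_space):
    """Toy tokenizer: always splits into over/he/ard after optional prefixing."""
    # A prepend step baked into the checkpoint's serialized config runs
    # unconditionally, so add_prefix_space=False cannot suppress it.
    if baked_in_prepend or add_prefix_space:
        text = SPIECE_UNDERLINE + text
    pieces = ["over", "he", "ard"]
    if text.startswith(SPIECE_UNDERLINE):
        pieces[0] = SPIECE_UNDERLINE + pieces[0]
    return pieces

# Behaves like the non-normalized testing checkpoint: the flag is respected.
print(fast_tokenize("overheard", baked_in_prepend=False, add_prefix_space=False))
# Behaves like the Llama-2-7b-hf checkpoint: the prefix appears regardless.
print(fast_tokenize("overheard", baked_in_prepend=True, add_prefix_space=False))
```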

@ArthurZucker
Collaborator

Actually I think we need to wait for #28881 to have a proper fix for Llama! Feel free to skip it. T5 should work!

@scruel
Contributor

scruel commented Mar 25, 2024

Could you take a look at my review?
(screenshot: review shown as "Pending")

@ArthurZucker
Collaborator

pending means it's not been submitted!

@scruel
Contributor

scruel commented Mar 25, 2024

pending means it's not been submitted!

Oh, I see. Is it that HF repos restrict review actions by others?

@ArthurZucker
Collaborator

No, pressing the submit review button should be enough

@@ -737,6 +737,7 @@ def as_tensor(value, dtype=None):

def is_tensor(obj):
return isinstance(obj, mx.array)

Contributor


Mind removing this empty line?

Author


Sure, updating in latest commit

@scruel
Contributor

scruel commented Mar 25, 2024

No, pressing the submit review button should be enough

It works! Thanks!

@aoxolotl
Author

@scruel Removed in latest commit!
@ArthurZucker I will make a separate PR for the T5 tokenizer with any necessary tests, leaving this PR for LlamaTokenizerFast-specific changes.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker
Collaborator

#28881 should actually fix this!


Successfully merging this pull request may close these issues.

add_prefix_space won't be respected by Llama tokenizer
4 participants