Respect add_prefix_space option in LlamaTokenizerFast #29694
Thanks! Could you add a test to make sure this works? 🤗
Sure! Will update the pull request with the tests
Hey @ArthurZucker, I have a question about the tests. There already seems to be an
Edit:
>>> hf_tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer-non-normalized", add_prefix_space=False, legacy=False)
>>> hf_tokenizer.tokenize('overheard')
['over', 'he', 'ard']
>>> tokenizer = LlamaTokenizerFast.from_pretrained("meta-llama/Llama-2-7b-hf", add_prefix_space=False, legacy=False)
>>> tokenizer.tokenize('overheard')
['\u2581over', 'he', 'ard']
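For comparing the two outputs above in a test, a small check on the first token may be enough. The following is only a sketch: the helper name starts_with_prefix_space is hypothetical, and the token lists are copied from the REPL sessions above rather than produced by loading a tokenizer.

```python
SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker


def starts_with_prefix_space(tokens):
    """Return True if the first token carries the leading marker.

    Hypothetical helper for illustration; not part of transformers.
    """
    return bool(tokens) and tokens[0].startswith(SPIECE_UNDERLINE)


# Token lists copied verbatim from the two sessions above.
non_normalized_tokens = ["over", "he", "ard"]
llama2_tokens = ["\u2581over", "he", "ard"]

print(starts_with_prefix_space(non_normalized_tokens))
print(starts_with_prefix_space(llama2_tokens))
```

With add_prefix_space=False, the expectation is that the check returns False; the second checkpoint above does not yet satisfy it.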
Actually I think we need to wait for #28881, to have a proper fix for Llama! Feel free to skip it. T5 should work!
pending means it's not been submitted!
Oh, I see. Does this HF repo limit review actions by others?
No, pressing the
@@ -737,6 +737,7 @@ def as_tensor(value, dtype=None):

def is_tensor(obj):
    return isinstance(obj, mx.array)
|
Mind removing this empty line?
Sure, updating in latest commit
It works! Thanks!
@scruel Removed in latest commit!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
#28881 should actually fix this!
What does this PR do?
Respect the add_prefix_space option in the Llama tokenizer (Fixes #29625).
The add_prefix_space option in the Llama tokenizer was set but not passed to the super().__init__() method of PreTrainedTokenizerFast. This resulted in the SPIECE_UNDERLINE token being added even when add_prefix_space=False.
Minimal example
When add_prefix_space is False
When add_prefix_space is True
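The failure mode described in this PR (an option set on the subclass but never forwarded to the parent constructor) can be sketched with toy classes. Everything below is hypothetical and for illustration only: ToyFastBase, BuggyTokenizer and FixedTokenizer are stand-ins, not the transformers classes, and the whitespace-based tokenize is not how the real tokenizer splits text.

```python
SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker


class ToyFastBase:
    """Hypothetical stand-in for the fast-tokenizer base class."""

    def __init__(self, add_prefix_space=True, **kwargs):
        # The "backend" behaviour is fixed here, using only what the
        # subclass actually forwards to this constructor.
        self._first_prefix = SPIECE_UNDERLINE if add_prefix_space else ""

    def tokenize(self, text):
        words = text.split()
        if not words:
            return []
        # Later words always get the marker (it replaces the space);
        # the first word gets it only when a prefix space is added.
        return [self._first_prefix + words[0]] + [
            SPIECE_UNDERLINE + w for w in words[1:]
        ]


class BuggyTokenizer(ToyFastBase):
    def __init__(self, add_prefix_space=True, **kwargs):
        self.add_prefix_space = add_prefix_space  # set locally...
        super().__init__(**kwargs)  # ...but not passed to the parent


class FixedTokenizer(ToyFastBase):
    def __init__(self, add_prefix_space=True, **kwargs):
        self.add_prefix_space = add_prefix_space
        # Forward the option so the parent can respect it.
        super().__init__(add_prefix_space=add_prefix_space, **kwargs)


buggy = BuggyTokenizer(add_prefix_space=False)
fixed = FixedTokenizer(add_prefix_space=False)
print(buggy.add_prefix_space, buggy.tokenize("overheard"))
print(fixed.add_prefix_space, fixed.tokenize("overheard"))
```

In the buggy variant the attribute reads False while the marker is still prepended, which mirrors the symptom reported in #29625.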
Fixes #29625
Who can review?
@ArthurZucker @scruel