# `add_prefix_space` won't be respected by Llama tokenizer (#29625)
Hey, I took a peek under the hood and it looks like setting `add_prefix_space=False` is not respected:

```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_prefix_space=False)
>>> tokenizer.tokenize('overheard')
['▁over', 'he', 'ard']
```

Mind if I take this up @ArthurZucker & @scruel?

Edit: For completeness, showing that behavior is unchanged when …
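For context, a minimal sketch of one way to cross-check the flag against the slow tokenizer. It assumes a `transformers` release where the slow Llama tokenizer accepts `add_prefix_space` (per the fix referenced later in this thread), `sentencepiece` installed, and access to the gated checkpoint:

```python
# Hedged sketch: force the slow (sentencepiece-backed) tokenizer, which is
# assumed here to honor add_prefix_space, and compare with the fast one above.
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    add_prefix_space=False,
    use_fast=False,  # take the slow-tokenizer path
)
print(slow.tokenize("overheard"))  # expected without the prefix space: ['over', 'he', 'ard']
```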
---
You can always take it by creating a PR.

---
Thank you, I made a pull request. This was happening in …

---
Thanks, I'll review ASAP!

---
Closing as #28881 fixed it!

---
@ArthurZucker are you sure this is fixed? I am still experiencing this in 4.41.0: … I can also still not see it being used here: …

---
You need to set …

---
It is used in …

---
This is very confusing and not transparent to the user at all.

---
I agree with you; on main there is this:

```python
if add_prefix_space is not None:
    logger.warning_once(
        "You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers"
    )
    kwargs["from_slow"] = True
```

which should give you a warning and automatically convert it.

---
But it does not seem to be taken into account. @itazap would be nice if you can investigate and open a PR to make sure it forces `from_slow`:

```python
In [1]: from transformers import AutoTokenizer

In [2]: tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-2-7b-hf", add_prefix_space=False)
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565

In [3]: tokenizer.encode("Hey")
Out[3]: [1, 18637]

In [4]: tokenizer.tokenize("Hey")
Out[4]: ['▁Hey']

In [5]: tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-2-7b-hf", add_prefix_space=False, from_slow=True)

In [6]: tokenizer.tokenize("Hey")
Out[6]: ['H', 'ey']

In [7]: tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-2-7b-hf", add_prefix_space=False)

In [8]: tokenizer.tokenize("Hey")
Out[8]: ['▁Hey']
```
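Building on the session above, a hedged sketch of one way to make the setting stick across loads: convert once with `from_slow=True`, then save the resulting fast tokenizer so that later `from_pretrained` calls read it back from disk. The local directory name is made up for illustration:

```python
from transformers import AutoTokenizer

# Convert from the slow tokenizer once so that add_prefix_space=False
# actually takes effect in the fast tokenizer.
tok = AutoTokenizer.from_pretrained(
    "meta-llama/llama-2-7b-hf", add_prefix_space=False, from_slow=True
)

# Save and reload; the serialized tokenizer files should now carry the
# no-prefix-space behavior without needing from_slow=True again.
tok.save_pretrained("llama2-no-prefix-space")  # hypothetical local path
reloaded = AutoTokenizer.from_pretrained("llama2-no-prefix-space")
print(reloaded.tokenize("Hey"))  # expected: ['H', 'ey']
```

---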
^^ Thanks. Another thing I noted is that if I specify …

---
I think it should be taken into account!

---
Apparently not; you need to manually set it.

---
When I manually add it to `tokenizer_config.json` on main, it works.
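A hedged sketch of what that manual edit might look like; only the relevant key is shown, and a real `tokenizer_config.json` contains many other fields:

```json
{
  "add_prefix_space": false
}
```

---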
* add prefix space ignored in llama #29625
* adding test with add_prefix_space=False
* ruff

Co-authored-by: Ita Zaporozhets <[email protected]>

---
I am still struggling to understand how exactly this works with a combination of all the different settings for the tokenizer. I believe a tutorial / docs description would be very helpful there. For example, this breaks: … I assume because it forces …

---
This will be supported very soon, @itazap is working on making all of this a lot simpler and clearer!

---
Looking forward to it :) @itazap Please tag me on any PRs, happy to give feedback on this in general.

---
I kinda agree with you and will see what I can do on the …

---
### System Info

`transformers` version: 4.38.2, with `sentencepiece==0.2.0` and `protobuf==4.25.3` installed.

### Who can help?

@ArthurZucker

### Information

### Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

### Reproduction

Also tried `add_dummy_prefix_space=False`; the output is still the same.
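A hedged sketch of that variant (the kwarg name is taken verbatim from the report; per the report it leaves the output unchanged):

```python
from transformers import AutoTokenizer

# Reported variant: passing the alternative kwarg mentioned above instead.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf", add_dummy_prefix_space=False
)
print(tokenizer.tokenize("overheard"))  # reportedly still ['▁over', 'he', 'ard']
```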
### Expected behavior

The tokenize result should not add a prefix space (`SPIECE_UNDERLINE`).