
add_prefix_space won't be respected by Llama tokenizer #29625

Closed
2 of 4 tasks
scruel opened this issue Mar 13, 2024 · 19 comments · Fixed by #30964

@scruel (Contributor) commented Mar 13, 2024

System Info

  • transformers version: 4.38.2
  • Platform: Linux-6.5.0-14-generic-x86_64-with-glibc2.35
  • Python version: 3.10.13
  • Huggingface_hub version: 0.21.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

With sentencepiece==0.2.0 and protobuf==4.25.3 installed

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", local_files_only=True, add_prefix_space=False)
>>> tokenizer.tokenize("overheard")
['▁over', 'he', 'ard']

I also tried add_dummy_prefix_space=False; the output is still the same.

Expected behavior

The tokenization result should not include a prefix space (SPIECE_UNDERLINE, '▁').
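
For illustration, here is the output the reporter expects (and which a later comment in this thread confirms once the flag is honored):

>>> tokenizer.tokenize("overheard")
['over', 'he', 'ard']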

@scruel scruel changed the title add_prefix_space can be set for Llama tokenizer add_prefix_space won't be respected by Llama tokenizer Mar 13, 2024
@aoxolotl commented Mar 15, 2024

Hey, I took a peek under the hood, and it looks like setting add_prefix_space only sets kwargs["from_slow"] = True in tokenization_llama_fast.py; the super().__init__() call should also receive this parameter when it is set (see the sketch at the end of this comment).
Passing it through seems to work in preliminary tests:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_prefix_space=False)
>>> tokenizer.tokenize('overheard')
['over', 'he', 'ard']

Mind if I take this up @ArthurZucker & @scruel?

Edit: for completeness, showing that the behavior is unchanged when add_prefix_space=True:

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_prefix_space=True)
>>> tokenizer.tokenize('overheard')
['▁over', 'he', 'ard']
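
A minimal sketch of the kind of change this diagnosis implies, assuming the fix is simply to forward the parameter (the class name is hypothetical and this is not the actual upstream patch):

from transformers import PreTrainedTokenizerFast

class PatchedLlamaTokenizerFast(PreTrainedTokenizerFast):
    def __init__(self, add_prefix_space=None, **kwargs):
        if add_prefix_space is not None:
            # The flag only takes effect when the fast tokenizer is rebuilt
            # from the slow (SentencePiece) one.
            kwargs["from_slow"] = True
        # Key change: forward add_prefix_space instead of dropping it.
        super().__init__(add_prefix_space=add_prefix_space, **kwargs)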

@scruel (Contributor, Author) commented Mar 16, 2024

You can always take it by creating a PR.

@aoxolotl

Thank you, I made a pull request. This was happening in T5TokenizerFast as well.

@ArthurZucker (Collaborator)

Thanks, I'll review ASAP!

@huggingface huggingface deleted a comment from github-actions bot Apr 15, 2024
@huggingface huggingface deleted a comment from github-actions bot May 10, 2024
@ArthurZucker (Collaborator)

Closing as #28881 fixed it!

@psinger commented May 21, 2024

@ArthurZucker are you sure this is fixed? I am still experiencing this in 4.41.0:
[screenshot omitted]

I also still can't see it being used here:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama_fast.py#L153

@ArthurZucker (Collaborator)

You need to set from_slow=True to trigger the conversion.
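
For anyone landing here, a minimal workaround sketch based on that hint; it assumes sentencepiece and protobuf are installed so the conversion can run:

from transformers import AutoTokenizer

# Forcing conversion from the slow tokenizer makes add_prefix_space take effect.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf", add_prefix_space=False, from_slow=True
)
print(tokenizer.tokenize("overheard"))  # expected: ['over', 'he', 'ard']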

@ArthurZucker (Collaborator)

It is used in convert_slow 😉

@psinger commented May 22, 2024

This is very confusing and not transparent to the user at all.
If I just use the AutoTokenizer class with default settings, I would expect this to work rather than silently do nothing.
It should at least emit a warning, or better, set from_slow automatically.

@ArthurZucker (Collaborator)

I agree with you; on main there is this:

        if add_prefix_space is not None:
            logger.warning_once(
                "You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers"
            )
            kwargs["from_slow"] = True

which should give you a warning and automatically convert it.

@ArthurZucker (Collaborator)

But it does not seem to be taken into account. @itazap it would be nice if you could investigate and open a PR to make sure it forces from_slow:

In [1]: from transformers import AutoTokenizer
In [2]: tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-2-7b-hf",add_prefix_space=False)
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565

In [3]: tokenizer.encode("Hey")
Out[3]: [1, 18637]

In [4]: tokenizer.tokenize("Hey")
Out[4]: ['▁Hey']

In [5]: tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-2-7b-hf",add_prefix_space=False, from_slow=True)

In [6]: tokenizer.tokenize("Hey")
Out[6]: ['H', 'ey']

In [7]: tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-2-7b-hf",add_prefix_space=False)

In [8]: tokenizer.tokenize("Hey")
Out[8]: ['▁Hey']

@psinger commented May 22, 2024

^^ Thanks

Another thing I noted is that if I specify from_slow in tokenizer_config.json, it is ignored. Is this expected behavior?

itazap pushed a commit that referenced this issue May 22, 2024
@ArthurZucker (Collaborator)

I think it should be taken into account!

@psinger commented May 23, 2024

Apparently not; I need to set it manually.

@ArthurZucker (Collaborator)

When I manually add it to tokenizer_config.json on main, it works.
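
For reference, a hedged sketch of adding the flag to a local copy of tokenizer_config.json as described; the path is hypothetical and the keys are assumptions based on this thread:

import json
import pathlib

cfg_path = pathlib.Path("my-llama-checkout/tokenizer_config.json")  # hypothetical local path
cfg = json.loads(cfg_path.read_text())
cfg.update({"from_slow": True, "add_prefix_space": False})
cfg_path.write_text(json.dumps(cfg, indent=2))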

itazap added a commit that referenced this issue May 24, 2024
* add prefix space ignored in llama #29625

* adding test with add_prefix_space=False

* ruff

---------

Co-authored-by: Ita Zaporozhets <[email protected]>
@psinger commented Jun 5, 2024

I am still struggling to understand how exactly this works with the combination of all the different tokenizer settings. I believe a tutorial or docs section would be very helpful here.

For example, this breaks:

tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-base",
    add_prefix_space=False,
)

I assume it's because it forces from_slow=True but cannot find a SentencePiece model file. But why can't I use add_prefix_space with just a fast tokenizer?

@ArthurZucker

@ArthurZucker (Collaborator)

This will be supported very soon; @itazap is working on making all of this a lot simpler and clearer!
And yes, that is what's happening, but it should not: we should not need the sentencepiece dependency to update the prefix space. That's a mistake on my part, sorry about it 😢
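
Until that lands, a heavily hedged exploration sketch for fast-only checkpoints: inspect how the backend tokenizer applies the prefix space before changing anything. Metaspace's argument differs across tokenizers versions (older releases take add_prefix_space, newer ones prepend_scheme), so treat this as exploratory rather than an official API:

from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")
# Some models apply the prefix space in a normalizer, others in a Metaspace
# pre-tokenizer; inspect both before touching anything.
print(tok.backend_tokenizer.normalizer)
print(tok.backend_tokenizer.pre_tokenizer)
if isinstance(tok.backend_tokenizer.pre_tokenizer, pre_tokenizers.Metaspace):
    # Newer tokenizers versions: prepend_scheme="never" disables the prefix space.
    tok.backend_tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(prepend_scheme="never")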

@psinger commented Jun 6, 2024

Looking forward to it :) @itazap
Actually, it would be great if one could pass add_prefix_space to the tokenize function itself, instead of needing to pass it when creating the tokenizer. Currently it is really inflexible if one wants to tokenize separate parts and then concatenate them afterwards.
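
In the meantime, a hedged sketch of that concatenation use case with what exists today: keep two tokenizer instances, one with and one without the prefix space, and merge the token ids (this piggybacks on the from_slow workaround shown earlier):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tok_no_prefix = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf", add_prefix_space=False, from_slow=True
)

# First part tokenized normally; the continuation gets no injected prefix space.
ids = tok.encode("over", add_special_tokens=False)
ids += tok_no_prefix.encode("heard", add_special_tokens=False)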

Please tag me on any PRs; happy to give feedback on this in general.

@ArthurZucker (Collaborator)

I kinda agree with you and will see what I can do on the tokenizers side, but this might need a lot of changes (supporting a new argument). It can already be done today by setting the attribute before each call, which is not super optimal.
