[LLaMA3] 'add_bos_token=True, add_eos_token=True' seems not taking effect #30947
Comments
I'm having the same issue. Neither of these changes the encodings.
Hey! This is related to #30607: the tokenizer for Llama3 is a fast tokenizer, so that's something which should be handled on the post-processor side.
@ArthurZucker I think it can be set directly via the post-processor:

from tokenizers import processors

bos = "<|begin_of_text|>"
eos = "<|end_of_text|>"
tokenizer._tokenizer.post_processor = processors.Sequence(
    [
        processors.ByteLevel(trim_offsets=False),
        processors.TemplateProcessing(
            single=f"{bos}:0 $A:0 {eos}:0",
            pair=f"{bos}:0 $A:0 {bos}:1 $B:1 {eos}:1",
            special_tokens=[
                (bos, tokenizer.bos_token_id),
                (eos, tokenizer.eos_token_id),
            ],
        ),
    ]
)

Now I'm worried that the padding tokens won't get added properly, but that's a different issue...
The padding token is unrelated; it's added if you ask the tokenizer to pad the input!
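To illustrate the distinction the two comments above are making, here is a minimal pure-Python sketch (a mock, not the real transformers API): BOS/EOS insertion is a post-processing step applied to every encoding, while padding is a separate step that only runs when explicitly requested. The token ids are Llama-3-style values taken from this thread; the PAD id is an assumption for illustration.

```python
# Illustrative mock: BOS/EOS wrapping (post-processing) vs. padding are
# independent steps. Ids 128000/128001 come from the thread; PAD is assumed.
BOS, EOS, PAD = 128000, 128001, 128255

def post_process(ids, add_bos=True, add_eos=False):
    """Mimics what TemplateProcessing does: wrap the ids with BOS/EOS."""
    out = list(ids)
    if add_bos:
        out.insert(0, BOS)
    if add_eos:
        out.append(EOS)
    return out

def pad(ids, length):
    """Padding is applied after post-processing, and only on request."""
    return ids + [PAD] * (length - len(ids))

ids = post_process([6151, 11], add_bos=True, add_eos=True)
print(ids)          # [128000, 6151, 11, 128001]
print(pad(ids, 6))  # [128000, 6151, 11, 128001, 128255, 128255]
```

The point is that fixing the post-processor (as in the snippet above) does not interfere with padding, which is controlled by a separate code path.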
In case anyone else is blocked by this issue, I copied code from #31316 into a function which patches the tokenizer to support dynamically setting add_bos_token and add_eos_token. Running this script:

from transformers import AutoTokenizer

model_id = "yujiepan/llama-3.1-tiny-random"
text = "a b"

print("Load plain tokenizer\n")
tokenizer = AutoTokenizer.from_pretrained(model_id)
print("   Default:", tokenizer(text)["input_ids"])
tokenizer.add_eos_token = True
print("   Add EOS:", tokenizer(text)["input_ids"])

print("\nLoad and patch tokenizer\n")
tokenizer2 = AutoTokenizer.from_pretrained(model_id)
force_support(tokenizer2)
tokenizer2.add_eos_token = True
print("   Add EOS:", tokenizer2(text)["input_ids"])
tokenizer2.add_eos_token = False
print(" Don't add:", tokenizer2(text)["input_ids"])

shows that only the patched tokenizer responds to add_eos_token.
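The actual patch lives in #31316; as a rough illustration of the pattern such a patch follows, here is a hypothetical sketch on a stand-in class (not the real transformers API): turn add_eos_token into a property whose setter rebuilds the tokenizer's behavior, so assignment after loading actually takes effect. The class name, toy tokenization, and ids are all assumptions for the demo.

```python
# Hypothetical sketch of the "force_support" idea: make add_eos_token a
# property whose setter changes encoding behavior. FakeFastTokenizer is a
# stand-in; a real patch would rebuild the fast tokenizer's post-processor.
class FakeFastTokenizer:
    bos_token_id, eos_token_id = 128000, 128001

    def __init__(self):
        self._add_eos = False

    @property
    def add_eos_token(self):
        return self._add_eos

    @add_eos_token.setter
    def add_eos_token(self, value):
        # A real patch would call processors.TemplateProcessing(...) here.
        self._add_eos = bool(value)

    def __call__(self, text):
        ids = [ord(c) for c in text if c != " "]  # toy "tokenization"
        ids = [self.bos_token_id] + ids
        if self._add_eos:
            ids.append(self.eos_token_id)
        return {"input_ids": ids}

tok = FakeFastTokenizer()
print(tok("a b")["input_ids"])  # [128000, 97, 98]
tok.add_eos_token = True
print(tok("a b")["input_ids"])  # [128000, 97, 98, 128001]
```

Without such a property, assigning tokenizer.add_eos_token = True on a fast tokenizer merely sets a plain attribute that nothing reads, which is exactly the behavior the unpatched half of the script demonstrates.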
System Info
Platform = Windows
PyTorch = 2.3.0
Transformers = 4.41.0
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
All of the statements above produce
[128000, 6151, 11, 1268, 527, 499, 3432, 30]
Expected behavior
I think when using

tokenizer = AutoTokenizer.from_pretrained(LLaMAPath, add_bos_token=True, add_eos_token=True)

we should get [128000, 6151, 11, 1268, 527, 499, 3432, 30, 128001], and when using

tokenizer = AutoTokenizer.from_pretrained(LLaMAPath, add_bos_token=False, add_eos_token=False)

we should get [6151, 11, 1268, 527, 499, 3432, 30].
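The expected encodings above differ from the reported output only by the special token ids. A quick sanity check in plain Python, using only the ids reported in this issue (128000 is <|begin_of_text|>, 128001 is <|end_of_text|>):

```python
# Ids taken from the issue report: the base encoding without special tokens,
# plus the Llama-3 BOS/EOS ids.
BOS, EOS = 128000, 128001
base = [6151, 11, 1268, 527, 499, 3432, 30]

# Expected with add_bos_token=True, add_eos_token=True:
with_both = [BOS] + base + [EOS]
# Expected with add_bos_token=False, add_eos_token=False:
without = base

print(with_both)  # [128000, 6151, 11, 1268, 527, 499, 3432, 30, 128001]
print(without)    # [6151, 11, 1268, 527, 499, 3432, 30]
```

The bug report boils down to the fact that the actual output is always [BOS] + base, regardless of the add_bos_token/add_eos_token arguments.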