SPLIT PR: eos bos tokens #31316
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 5d11492 to 95c185d
Thanks for working on this! It seems like this adds support for kwargs in `tokenize` and `encode` for slow tokenizers, but not for fast ones, and it relies on sentencepiece's `add_bos` / `add_eos` rather than the transformers code for this. For Llama it should work, though, which is totally fine; it's also fine to only support this in fast tokenizers. I don't know what's best, but anyway it's low priority!
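For reference, a minimal sketch of the sentencepiece mechanism being referred to; the `tokenizer.model` path is an assumed local spm file, and this is not the PR's code:

```python
# SentencePiece's own encode() accepts add_bos / add_eos flags, so a slow tokenizer
# could forward such kwargs straight to its sp_model instead of adding the ids itself.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # assumed local spm model
ids = sp.encode("Hello world", add_bos=True, add_eos=True)
print(ids[0] == sp.bos_id(), ids[-1] == sp.eos_id())  # expected: True True
```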
@@ -827,3 +829,18 @@ def test_special_tokens_strip(self):
        self.assertEqual(input_ids, [284, 1, 156])
        tokens = self.tokenizer.tokenize("No <s> ▁He")
        self.assertEqual(tokens, ["▁No", "<s>", "▁He"])  # spaces are eaten by rstrip / lstrip

    @unittest.skip("@require_read_token does not work? getting gated repo error")
this should work, rebasing should make it work!
I don't think this is the function we want to modify for the slow tokenizer. This will collide with:

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
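To illustrate the collision, a self-contained toy example; the ids and helper below are illustrative, not the actual transformers implementation:

```python
# If the spm encode path already adds bos/eos, and build_inputs_with_special_tokens
# adds them again based on the add_bos_token / add_eos_token flags, the special
# tokens end up duplicated.
BOS_ID, EOS_ID = 1, 2  # assumed ids, for illustration only

def build_inputs_with_special_tokens(token_ids_0, add_bos_token=True, add_eos_token=True):
    bos = [BOS_ID] if add_bos_token else []
    eos = [EOS_ID] if add_eos_token else []
    return bos + token_ids_0 + eos

ids_from_spm = [BOS_ID, 284, 156, EOS_ID]  # pretend spm already added bos/eos
print(build_inputs_with_special_tokens(ids_from_spm))
# [1, 1, 284, 156, 2, 2] -> bos/eos added twice
```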
Thanks for the review @ArthurZucker! Maybe it's best to keep the support only for fast for now! wdyt?
yeah that does not sound bad, we can also ship for slow later on
Force-pushed from cf963df to c305105
Nice and simple! Thanks
LGTM! My last question is whether or not we should somehow preserve the state of the old post_processor. Will go with no for now.
Thanks! A little bit of documentation for this would be nice.
@property
def add_eos_token(self):
    return self._add_eos_token

@property
def add_bos_token(self):
    return self._add_bos_token

@add_eos_token.setter
def add_eos_token(self, value):
    self._add_eos_token = value
    self.update_post_processor()

@add_bos_token.setter
def add_bos_token(self, value):
    self._add_bos_token = value
    self.update_post_processor()
Cool! Let's document add_eos and add_bos
sounds good! should it only be an autodoc? like in gemma [here](https://huggingface.co/docs/transformers/v4.44.0/en/model_doc/gemma#transformers.GemmaTokenizer:~:text=for%20BPE%2Ddropout.-,add_bos_token%20(bool%2C%20optional%2C%20defaults%20to%20True)%20%E2%80%94%20Whether%20or%20not%20to%20add%20an%20bos_token%20at%20the%20start%20of%20sequences.,-add_eos_token%20(bool)
Yeah, or in the init's doc (of the kwargs).
@ArthurZucker sorry, I'm not sure what the init doc of the kwargs is?
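For context, the "init doc" presumably refers to the `Args` section of the class docstring where the kwargs are described; a hedged sketch, with the wording borrowed from the Gemma tokenizer docs linked above (the exact placement and defaults are assumptions):

```python
class PreTrainedTokenizerFast:
    """
    ...
    Args:
        add_bos_token (`bool`, *optional*):
            Whether or not to add a `bos_token` at the start of sequences.
        add_eos_token (`bool`, *optional*):
            Whether or not to add an `eos_token` at the end of sequences.
    """
```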
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", add_bos_token=True, add_eos_token=True)
self.assertEqual(tokenizer("hello")["input_ids"][0], tokenizer.bos_token_id)  # bos token
self.assertEqual(tokenizer("hello")["input_ids"][-1], tokenizer.eos_token_id)  # eos token
Let's also test setting the token, asserting that the property is set and accessible!
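A hedged sketch of such a test (checkpoint name taken from the snippet above; the assertions describe the intended behavior):

```python
# Set the flag after construction, then check both the property and the encoding.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.add_eos_token = True                 # setter should also rebuild the post processor
assert tokenizer.add_eos_token is True         # property is set and accessible
assert tokenizer("hello")["input_ids"][-1] == tokenizer.eos_token_id
```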
LGTM! Just wondering if we are going to break anything wrt the template processor. Let's go with this for now IMO.

Can you also test that we can load these `add_bos_token` flags from saved tokenizers in the tests?
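A hedged sketch of the requested round-trip test (the temporary directory and checkpoint name are illustrative):

```python
# Save a tokenizer with the flags set, reload it, and check that the flags survived.
import tempfile
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", add_bos_token=True, add_eos_token=True
)
with tempfile.TemporaryDirectory() as tmp_dir:
    tokenizer.save_pretrained(tmp_dir)
    reloaded = AutoTokenizer.from_pretrained(tmp_dir)

assert reloaded.add_bos_token and reloaded.add_eos_token
assert reloaded("hello")["input_ids"][0] == reloaded.bos_token_id
```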
def update_post_processor(self):
    """
    Updates the underlying post processor with the current `bos_token` and `eos_token`.
Suggested change:
- Updates the underlying post processor with the current `bos_token` and `eos_token`.
+ Overwrites the underlying post processor with the current `bos_token` and `eos_token`.
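For context, a rough sketch of what a Llama-style `update_post_processor` does (reconstructed from memory of `LlamaTokenizerFast`, which the PR description says the new method is based on; details in the actual diff may differ):

```python
# Rebuild the tokenizers TemplateProcessing post processor from the current
# bos/eos tokens and the add_bos_token / add_eos_token flags.
from tokenizers import processors

def update_post_processor(self):
    bos, bos_id = self.bos_token, self.bos_token_id
    eos, eos_id = self.eos_token, self.eos_token_id

    # Template strings such as "<s>:0 $A:0 </s>:0" describe the single/pair layouts.
    single = f"{bos + ':0 ' if self.add_bos_token else ''}$A:0{' ' + eos + ':0' if self.add_eos_token else ''}"
    pair = f"{single}{' ' + bos + ':1' if self.add_bos_token else ''} $B:1{' ' + eos + ':1' if self.add_eos_token else ''}"

    special_tokens = []
    if self.add_bos_token:
        special_tokens.append((bos, bos_id))
    if self.add_eos_token:
        special_tokens.append((eos, eos_id))

    # Overwrite whatever post processor was there before.
    self._tokenizer.post_processor = processors.TemplateProcessing(
        single=single, pair=pair, special_tokens=special_tokens
    )
```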
@@ -77,6 +77,8 @@ loaded very simply into 🤗 transformers. Take a look at the [Using tokenizers
    - batch_decode
    - decode
    - encode
    - add_bos_token
    - add_eos_token
@ArthurZucker is this the correct place for the docs or is there a different init doc location you are thinking of?
Force-pushed from 9615bd4 to bd019ee
Force-pushed from a1916cb to 89fde69
Fix for 2 issues:

1. `add_bos_token` & `add_eos_token` flags are ignored for `PreTrainedTokenizerFast`: issue discussed here and here.
2. `add_special_tokens` does not update `bos_token` or `eos_token` - e.g. `add_special_tokens({'bos_token': '<new_bos>'})`.

TASKS:

- `update_post_processor` function in `PreTrainedTokenizerFast`, based on the Llama tokenizer; allows reading of the bos / eos token flags. **SUPPORTS FAST ONLY**
- Slow required updating kwargs to be passed into `sp_model`, so that bos / eos tokens can be added accordingly.

Reviewer: @ArthurZucker

NOTE: the hub token seems to not have access to Llama 3; the tests should pass once this is addressed.
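A hedged end-to-end sketch of the behavior this PR targets (checkpoint name taken from the tests above; the assertions describe the intended post-fix behavior rather than verified output):

```python
from transformers import AutoTokenizer

# 1) add_bos_token / add_eos_token are honored by the fast tokenizer.
tok = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", add_bos_token=True, add_eos_token=True
)
ids = tok("hello")["input_ids"]
assert ids[0] == tok.bos_token_id and ids[-1] == tok.eos_token_id

# 2) Per the PR description, add_special_tokens should propagate a new bos_token
#    (and the post processor along with it).
tok.add_special_tokens({"bos_token": "<new_bos>"})
assert tok.bos_token == "<new_bos>"
```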