
SPLIT PR: eos bos tokens #31316

Open · wants to merge 31 commits into main from bos_eos_token_fix

Conversation

@itazap (Collaborator) commented Jun 7, 2024

Fixes two issues:

  1. The add_bos_token and add_eos_token flags are ignored by PreTrainedTokenizerFast (discussed in the linked issues).
  2. add_special_tokens does not update bos_token or eos_token, e.g. add_special_tokens({'bos_token': '<new_bos>'}).
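
A quick illustration of the intended behavior after the fix (the checkpoint name and the <new_bos> token here are only examples):

from transformers import AutoTokenizer

# Issue 1: the flags should now be honored by PreTrainedTokenizerFast.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", add_bos_token=True, add_eos_token=True
)
ids = tokenizer("hello")["input_ids"]
assert ids[0] == tokenizer.bos_token_id
assert ids[-1] == tokenizer.eos_token_id

# Issue 2: add_special_tokens should now update bos_token as well.
tokenizer.add_special_tokens({"bos_token": "<new_bos>"})
assert tokenizer.bos_token == "<new_bos>"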

TASKS:

  • Added an update_post_processor function in PreTrainedTokenizerFast, based on LlamaTokenizerFast, which allows the bos / eos token flags to be read (see the sketch below).

SUPPORTS FAST ONLY:
The slow path would require updating the kwargs passed into sp_model, so that BOS/EOS tokens can be added accordingly.
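
The approach mirrors LlamaTokenizerFast.update_post_processor; a simplified sketch of the idea, using the TemplateProcessing post processor from the tokenizers library (error handling omitted):

from tokenizers import processors

def update_post_processor(self):
    # Rebuild the fast tokenizer's post processor from the current flags and tokens.
    bos, eos = self.bos_token, self.eos_token
    single = f"{bos + ':0 ' if self.add_bos_token else ''}$A:0{' ' + eos + ':0' if self.add_eos_token else ''}"
    pair = f"{single}{' ' + bos + ':1' if self.add_bos_token else ''} $B:1{' ' + eos + ':1' if self.add_eos_token else ''}"
    special_tokens = []
    if self.add_bos_token:
        special_tokens.append((bos, self.bos_token_id))
    if self.add_eos_token:
        special_tokens.append((eos, self.eos_token_id))
    self._tokenizer.post_processor = processors.TemplateProcessing(
        single=single, pair=pair, special_tokens=special_tokens
    )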

Reviewer: @ArthurZucker

NOTE: the hub token does not seem to have access to Llama 3; the gated-repo test should pass once that is addressed.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap force-pushed the bos_eos_token_fix branch 7 times, most recently from 5d11492 to 95c185d, June 21, 2024 09:11
@itazap marked this pull request as ready for review June 21, 2024 09:25
@itazap requested a review from ArthurZucker June 21, 2024 09:25
@ArthurZucker (Collaborator) left a comment

Thanks for working on this! It seems like this adds support for kwargs in tokenize and encode for slow tokenizers, but not for fast ones, and it relies on the spm add_bos and add_eos rather than the transformers code for this!
For Llama it should work though, which is totally fine; it's also fine to only support this in fast tokenizers. I don't know what's best, but anyway it's low priority!
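
For reference, a minimal sketch of what leaning on spm for this looks like (the model path is hypothetical):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # hypothetical path

# The sentencepiece Python API can prepend/append BOS/EOS itself,
# which is what the slow path would rely on instead of transformers' own logic.
ids = sp.encode("Hello world", out_type=int, add_bos=True, add_eos=True)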

@@ -827,3 +829,18 @@ def test_special_tokens_strip(self):
self.assertEqual(input_ids, [284, 1, 156])
tokens = self.tokenizer.tokenize("No <s> ▁He")
self.assertEqual(tokens, ["▁No", "<s>", "▁He"]) # spaces are eaten by rstrip / lstrip

@unittest.skip("@require_read_token does not work? getting gated repo error")
Collaborator

this should work, rebasing should make it work!

Collaborator

I don't think this is the function we want to modify for the slow tokenizer.
It will collide with:

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
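
For context, a rough sketch of what the slow Llama tokenizer's build_inputs_with_special_tokens already does (a paraphrase, not the exact source):

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    # BOS/EOS are added here based on add_bos_token / add_eos_token,
    # so adding them again elsewhere on the slow path would duplicate tokens.
    bos = [self.bos_token_id] if self.add_bos_token else []
    eos = [self.eos_token_id] if self.add_eos_token else []
    output = bos + token_ids_0 + eos
    if token_ids_1 is not None:
        output = output + bos + token_ids_1 + eos
    return output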

@itazap (Collaborator Author) commented Jul 11, 2024

Thanks for the review @ArthurZucker! Maybe it's best to keep the support only for fast for now! wdyt?

Collaborator

Yeah, that doesn't sound bad; we can also ship slow support later on.

@itazap force-pushed the bos_eos_token_fix branch from cf963df to c305105, July 29, 2024 13:29
@itazap requested a review from ArthurZucker July 29, 2024 14:56
@ArthurZucker (Collaborator) left a comment

Nice and simple! Thanks

@itazap requested review from ArthurZucker and ydshieh July 31, 2024 09:46
@ArthurZucker (Collaborator) left a comment

LGTM! My last inquiry is whether or not we should somehow preserve the state of the old post_processor. Will go with no for now.

@itazap requested a review from ArthurZucker August 5, 2024 12:20
@ArthurZucker (Collaborator) left a comment

Thanks! A little bit of documentation for this would be good.

Comment on lines +919 to +934
@property
def add_eos_token(self):
return self._add_eos_token

@property
def add_bos_token(self):
return self._add_bos_token

@add_eos_token.setter
def add_eos_token(self, value):
self._add_eos_token = value
self.update_post_processor()

@add_bos_token.setter
def add_bos_token(self, value):
self._add_bos_token = value
self.update_post_processor()
Collaborator

Cool! Let's document add_eos and add_bos

Collaborator

Yeah, or in the init's doc (of the kwargs)

Collaborator Author

@ArthurZucker sorry, I'm not sure what the init doc of kwargs is?

Comment on lines 838 to 840
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", add_bos_token=True, add_eos_token=True)
self.assertEqual(tokenizer("hello")["input_ids"][0], tokenizer.bos_token_id) # bos token
self.assertEqual(tokenizer("hello")["input_ids"][-1], tokenizer.eos_token_id) # eos token
Collaborator

let's test setting the token, asserting the property is set, and accessible!
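
A minimal sketch of such a test (the checkpoint and exact assertions are illustrative; AutoTokenizer is assumed to be imported in the test module):

def test_set_add_bos_eos_properties(self):
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
    tokenizer.add_bos_token = True   # the setter should also refresh the post processor
    tokenizer.add_eos_token = True
    self.assertTrue(tokenizer.add_bos_token)
    self.assertTrue(tokenizer.add_eos_token)
    ids = tokenizer("hello")["input_ids"]
    self.assertEqual(ids[0], tokenizer.bos_token_id)
    self.assertEqual(ids[-1], tokenizer.eos_token_id)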

@itazap requested a review from ArthurZucker August 9, 2024 12:15
@ArthurZucker (Collaborator) left a comment

LGTM! Just wondering if we're going to break anything with respect to the template processor. Let's go with this for now IMO.
Can you also add a test that these add_bos_token settings can be loaded back from saved tokenizers? (A sketch follows below.)
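
A round-trip sketch of that check (the checkpoint is illustrative; tempfile and AutoTokenizer are assumed to be imported):

def test_add_bos_eos_survive_save_and_reload(self):
    tokenizer = AutoTokenizer.from_pretrained(
        "meta-llama/Meta-Llama-3-8B", add_bos_token=True, add_eos_token=True
    )
    with tempfile.TemporaryDirectory() as tmp_dir:
        tokenizer.save_pretrained(tmp_dir)
        reloaded = AutoTokenizer.from_pretrained(tmp_dir)
    # The flags should survive serialization and still shape the post processor.
    self.assertTrue(reloaded.add_bos_token)
    self.assertTrue(reloaded.add_eos_token)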


def update_post_processor(self):
"""
Updates the underlying post processor with the current `bos_token` and `eos_token`.
Collaborator

Suggested change:
-    Updates the underlying post processor with the current `bos_token` and `eos_token`.
+    Overwrites the underlying post processor with the current `bos_token` and `eos_token`.

@@ -77,6 +77,8 @@ loaded very simply into 🤗 transformers. Take a look at the [Using tokenizers
- batch_decode
- decode
- encode
- add_bos_token
- add_eos_token
Collaborator Author

@ArthurZucker is this the correct place for the docs, or is there a different init doc location you are thinking of?

@itazap force-pushed the bos_eos_token_fix branch from 9615bd4 to bd019ee, August 22, 2024 14:24
@ydshieh self-assigned this Aug 29, 2024