SPLIT PR: eos bos tokens #31316
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from 5d11492 to 95c185d
Thanks for working on this! It seems like this adds support for kwargs in `tokenize` and `encode` for slow tokenizers, but not for fast ones, and it relies on sentencepiece's `add_bos` / `add_eos` rather than the transformers code for this. For Llama it should work, though, which is totally fine; it's also fine to only support this in fast tokenizers. I don't know what's best, but anyway it's low priority!
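For reference, a minimal sketch of the sentencepiece mechanism being referred to; the `tokenizer.model` path is an assumed local spm file, and this is not the PR's code:

```python
# SentencePiece's own encode() accepts add_bos / add_eos flags, so a slow tokenizer
# could forward such kwargs straight to its sp_model instead of adding the ids itself.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # assumed local spm model
ids = sp.encode("Hello world", add_bos=True, add_eos=True)
print(ids[0] == sp.bos_id(), ids[-1] == sp.eos_id())  # expected: True True
```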
@@ -827,3 +829,18 @@ def test_special_tokens_strip(self):
        self.assertEqual(input_ids, [284, 1, 156])
        tokens = self.tokenizer.tokenize("No <s> ▁He")
        self.assertEqual(tokens, ["▁No", "<s>", "▁He"])  # spaces are eaten by rstrip / lstrip

    @unittest.skip("@require_read_token does not work? getting gated repo error")
this should work, rebasing should make it work!
I don't think this is the function we want to modify for the slow tokenizer. This will collide with:

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
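To illustrate the collision, a self-contained toy example; the ids and helper below are illustrative, not the actual transformers implementation:

```python
# If the spm encode path already adds bos/eos, and build_inputs_with_special_tokens
# adds them again based on the add_bos_token / add_eos_token flags, the special
# tokens end up duplicated.
BOS_ID, EOS_ID = 1, 2  # assumed ids, for illustration only

def build_inputs_with_special_tokens(token_ids_0, add_bos_token=True, add_eos_token=True):
    bos = [BOS_ID] if add_bos_token else []
    eos = [EOS_ID] if add_eos_token else []
    return bos + token_ids_0 + eos

ids_from_spm = [BOS_ID, 284, 156, EOS_ID]  # pretend spm already added bos/eos
print(build_inputs_with_special_tokens(ids_from_spm))
# [1, 1, 284, 156, 2, 2] -> bos/eos added twice
```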
Thanks for the review @ArthurZucker! Maybe it's best to keep the support only for fast for now! wdyt?
yeah that does not sound bad, we can also ship for slow later on
Force-pushed from cf963df to c305105
Nice and simple! Thanks
LGTM! My last question is whether or not we should somehow preserve the state of the old post_processor. Will go with no for now.
Thanks! A little bit of documentation for this would be nice.
@property
def add_eos_token(self):
    return self._add_eos_token

@property
def add_bos_token(self):
    return self._add_bos_token

@add_eos_token.setter
def add_eos_token(self, value):
    self._add_eos_token = value
    self.update_post_processor()

@add_bos_token.setter
def add_bos_token(self, value):
    self._add_bos_token = value
    self.update_post_processor()
Cool! Let's document add_eos and add_bos
sounds good! should it only be an autodoc? like in gemma [here](https://huggingface.co/docs/transformers/v4.44.0/en/model_doc/gemma#transformers.GemmaTokenizer:~:text=for%20BPE%2Ddropout.-,add_bos_token%20(bool%2C%20optional%2C%20defaults%20to%20True)%20%E2%80%94%20Whether%20or%20not%20to%20add%20an%20bos_token%20at%20the%20start%20of%20sequences.,-add_eos_token%20(bool)
Yeah, or in the init's doc (of the kwargs).
@ArthurZucker sorry, I'm not sure what the init doc of the kwargs is?
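For context, the "init doc" presumably refers to the `Args` section of the class docstring where the kwargs are described; a hedged sketch, with the wording borrowed from the Gemma tokenizer docs linked above (the exact placement and defaults are assumptions):

```python
class PreTrainedTokenizerFast:
    """
    ...
    Args:
        add_bos_token (`bool`, *optional*):
            Whether or not to add a `bos_token` at the start of sequences.
        add_eos_token (`bool`, *optional*):
            Whether or not to add an `eos_token` at the end of sequences.
    """
```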
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", add_bos_token=True, add_eos_token=True)
self.assertEqual(tokenizer("hello")["input_ids"][0], tokenizer.bos_token_id)  # bos token
self.assertEqual(tokenizer("hello")["input_ids"][-1], tokenizer.eos_token_id)  # eos token
Let's also test setting the token, asserting that the property is set and accessible!
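A hedged sketch of such a test (checkpoint name taken from the snippet above; the assertions describe the intended behavior):

```python
# Set the flag after construction, then check both the property and the encoding.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.add_eos_token = True                 # setter should also rebuild the post processor
assert tokenizer.add_eos_token is True         # property is set and accessible
assert tokenizer("hello")["input_ids"][-1] == tokenizer.eos_token_id
```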
LGTM! Just wondering if we are going to break anything wrt the template processor. Let's go with this for now IMO.

Can you also test that we can load these `add_bos_token` flags from saved tokenizers in the tests?
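A hedged sketch of the requested round-trip test (the temporary directory and checkpoint name are illustrative):

```python
# Save a tokenizer with the flags set, reload it, and check that the flags survived.
import tempfile
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", add_bos_token=True, add_eos_token=True
)
with tempfile.TemporaryDirectory() as tmp_dir:
    tokenizer.save_pretrained(tmp_dir)
    reloaded = AutoTokenizer.from_pretrained(tmp_dir)

assert reloaded.add_bos_token and reloaded.add_eos_token
assert reloaded("hello")["input_ids"][0] == reloaded.bos_token_id
```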
def update_post_processor(self):
    """
    Updates the underlying post processor with the current `bos_token` and `eos_token`.
Suggested change:
- Updates the underlying post processor with the current `bos_token` and `eos_token`.
+ Overwrites the underlying post processor with the current `bos_token` and `eos_token`.
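For context, a rough sketch of what a Llama-style `update_post_processor` does (reconstructed from memory of `LlamaTokenizerFast`, which the PR description says the new method is based on; details in the actual diff may differ):

```python
# Rebuild the tokenizers TemplateProcessing post processor from the current
# bos/eos tokens and the add_bos_token / add_eos_token flags.
from tokenizers import processors

def update_post_processor(self):
    bos, bos_id = self.bos_token, self.bos_token_id
    eos, eos_id = self.eos_token, self.eos_token_id

    # Template strings such as "<s>:0 $A:0 </s>:0" describe the single/pair layouts.
    single = f"{bos + ':0 ' if self.add_bos_token else ''}$A:0{' ' + eos + ':0' if self.add_eos_token else ''}"
    pair = f"{single}{' ' + bos + ':1' if self.add_bos_token else ''} $B:1{' ' + eos + ':1' if self.add_eos_token else ''}"

    special_tokens = []
    if self.add_bos_token:
        special_tokens.append((bos, bos_id))
    if self.add_eos_token:
        special_tokens.append((eos, eos_id))

    # Overwrite whatever post processor was there before.
    self._tokenizer.post_processor = processors.TemplateProcessing(
        single=single, pair=pair, special_tokens=special_tokens
    )
```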
@@ -77,6 +77,8 @@ loaded very simply into 🤗 transformers. Take a look at the [Using tokenizers
    - batch_decode
    - decode
    - encode
    - add_bos_token
    - add_eos_token
@ArthurZucker is this the correct place for the docs or is there a different init doc location you are thinking of?
Force-pushed from 9615bd4 to bd019ee
Force-pushed from a1916cb to 89fde69
Fix for 2 issues:

1. `add_bos_token` & `add_eos_token` flags are ignored for `PreTrainedTokenizerFast`: issue discussed here and here.
2. `add_special_tokens` does not update `bos_token` or `eos_token` - e.g. `add_special_tokens({'bos_token': '<new_bos>'})`.

TASKS:

- `update_post_processor` function in `PreTrainedTokenizerFast`, based on the Llama tokenizer; allows reading of the bos / eos token flags. **SUPPORTS FAST ONLY**
- Slow required updating kwargs to be passed into `sp_model`, so that bos / eos tokens can be added accordingly.

Reviewer: @ArthurZucker

NOTE: the hub token seems to not have access to Llama 3; the tests should pass once this is addressed.
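A hedged end-to-end sketch of the behavior this PR targets (checkpoint name taken from the tests above; the assertions describe the intended post-fix behavior rather than verified output):

```python
from transformers import AutoTokenizer

# 1) add_bos_token / add_eos_token are honored by the fast tokenizer.
tok = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", add_bos_token=True, add_eos_token=True
)
ids = tok("hello")["input_ids"]
assert ids[0] == tok.bos_token_id and ids[-1] == tok.eos_token_id

# 2) Per the PR description, add_special_tokens should propagate a new bos_token
#    (and the post processor along with it).
tok.add_special_tokens({"bos_token": "<new_bos>"})
assert tok.bos_token == "<new_bos>"
```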