🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string #15775
Conversation
Tokenizers fixed:
- AlbertTokenizer
- BarthezTokenizer
- CamembertTokenizer
- FNetTokenizer
- M2M100Tokenizer
- MBart50Tokenizer
- PegasusTokenizer
- Speech2TextTokenizer
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thank you so much for working to resolve this issue! It was indeed a behavior that was troublesome! 😄
I also really appreciate that you included a test in your first pull request!
I think one of the difficulties of the problem you are attacking is how to readjust the spaces around these special tokens. Thinking about it quickly, I believe the result should be identical to the one returned by the Rust version of the tokenizer.
reverse_text = tokenizer.convert_tokens_to_string(tokens_including_special)

self.assertEqual(len(tokenizer.tokenize(reverse_text)), len(input_ids))
What I don't necessarily know how to handle in this `convert_tokens_to_string` function is how to re-add spaces around each special token. I wonder if we should be aligned with what the fast version of the tokenizer would output. If it's something that seems relevant to you, I would add this test:
if self.test_rust_tokenizer:
    rust_tokenizer = self.get_rust_tokenizer()
    special_tokens_string_rust = rust_tokenizer.convert_tokens_to_string(special_tokens)
    self.assertEqual(special_tokens_string, special_tokens_string_rust)
Initially, I didn't go for the "compare to Rust tokenizer" approach, since not all the tokenizers in question have this luxury. Thinking about it now that you mention it, this code will definitely add another layer of reliability to the existing test.
Okay, after adding this test, a few previously passing tests failed. This is because of the space you mentioned in another comment. I removed the space, so now the code looks like this:
def convert_tokens_to_string(self, tokens):
    ...
    out_string += self.sp_model.decode(current_sub_tokens) + token  # << a space used to be here
    ...
And now all the tests pass, including `T5Tokenizer`.
I will note that I changed calls to `self.sp_model.decode_pieces` to `self.sp_model.decode`, as `decode`, if needed, delegates to `decode_pieces`, and this makes the code uniform, in my opinion.
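As a side note on that delegation, here is a minimal standalone sketch (the model path and input text are placeholders for illustration, not from this PR):

import sentencepiece as spm

# Hypothetical model file, for illustration only.
sp = spm.SentencePieceProcessor(model_file="spiece.model")
pieces = sp.encode("Hello world", out_type=str)  # e.g. ["▁Hello", "▁world"]
# decode accepts string pieces and delegates to decode_pieces internally,
# so both calls return the same detokenized text.
assert sp.decode(pieces) == sp.decode_pieces(pieces)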
for token in tokens:
    # make sure that special tokens are not decoded using sentencepiece model
    if token in self.all_special_tokens:
        out_string += self.sp_model.decode(current_sub_tokens) + token + " "
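For context, the surrounding method from PR #8435 looks roughly like the reconstruction below (a sketch assembled from the fragments quoted in this thread, not the verbatim diff):

def convert_tokens_to_string(self, tokens):
    """Converts a sequence of tokens (string) to a single string."""
    current_sub_tokens = []
    out_string = ""
    for token in tokens:
        # make sure that special tokens are not decoded using sentencepiece model
        if token in self.all_special_tokens:
            out_string += self.sp_model.decode(current_sub_tokens) + token + " "
            current_sub_tokens = []
        else:
            current_sub_tokens.append(token)
    # decode whatever sub-tokens remain after the last special token
    out_string += self.sp_model.decode(current_sub_tokens)
    return out_string.strip()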
This is related to my previous comment, but I'm not sure what the behavior should be here to add spaces. How did you choose this strategy?
This is the exact code from PR #8435. I assumed this implementation would be relevant for all SPM tokenizers, but it may well be that this assumption is incorrect.
After running the test you suggested, I do believe that this extra space is an error, at least compared to the Rust tokenizers. Removing the space passes the test 👍
EDIT:
I forgot to run the whole test suite for the `T5Tokenizer`, and now I see there's a problem. If I keep the space, `test_fast_and_slow_same_result` passes but `test_sentencepiece_tokenize_and_convert_tokens_to_string` fails. If I remove the space, it's vice versa, and I'm not sure what the correct behavior should be.
I won't be able to look at this `T5Tokenizer` issue again until next Monday, but I promise to come back to you early next week.
Thank you very much! I will be waiting patiently 😄
So, after feeling dumb for not noticing the `pytest -n auto` argument, which is why the tests took almost 3 hours to run, I found the problem: when accepting the change, it removed a constructor argument I passed. I fixed that, and now only one test is failing:
This is weird, since the `"microsoft/deberta-v2-xlarge"` tokenizer works in both its slow and fast versions. I believe the problem is the test vocab. Is there something I can do to fix it, other than changing the test sentence (which seems quite drastic)?
I see! Can you try to change this line (in `test_tokenization_deberta_v2.py`):

tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB)

to

tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>")

I think it's because `SAMPLE_VOCAB` defines the unknown token as `"<unk>"` and not as `"[UNK]"`, which is the default value for `DebertaV2Tokenizer`.
EDIT: I've inspected the content inside, and it's probably even better to replace it with
tokenizer = DebertaV2Tokenizer(
SAMPLE_VOCAB,
bos_token="[CLS]",
eos_token="[SEP]",
unk_token="<unk>",
sep_token="[SEP]",
pad_token="<pad>",
cls_token="[CLS]",
mask_token="[MASK]",
)
as that is how the special tokens are defined inside `SAMPLE_VOCAB`.
Interesting. This change breaks `test_get_vocab`, which makes sense, but `test_sentencepiece_tokenize_and_decode` at least now shows the "real" problem, which is what I fixed for the other slow tokenizers:

AssertionError: '[CLS]<unk>his is text to test the tokenizer.[SEP]' != '[CLS] <unk>his is text to test the tokenizer.[SEP]'

It's the spacing thing. For some tokenizers it was a simple fix; for others it required overriding `_decode`, but I can fix this. I'll need to check how the fast tokenizer works to mimic it space-between-special-tokens-wise.
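For context, the spacing behavior under discussion is driven by `_decode`; below is a condensed sketch of the base slow-tokenizer pattern (an approximation for orientation, not the exact override from this PR):

def _decode(self, token_ids, skip_special_tokens=False, spaces_between_special_tokens=True, **kwargs):
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
    # split the token stream at special tokens so they bypass the SentencePiece model
    sub_texts, current_sub_text = [], []
    for token in filtered_tokens:
        if token in self.all_special_tokens:
            if current_sub_text:
                sub_texts.append(self.convert_tokens_to_string(current_sub_text))
                current_sub_text = []
            sub_texts.append(token)
        else:
            current_sub_text.append(token)
    if current_sub_text:
        sub_texts.append(self.convert_tokens_to_string(current_sub_text))
    # this flag controls exactly the space-between-special-tokens behavior discussed here
    return " ".join(sub_texts) if spaces_between_special_tokens else "".join(sub_texts)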
EDIT:
The `test_get_vocab` test is fixed; I just did `tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>")`.
Hi @SaulLu,
I made DeBERTa v2 act like the fast tokenizer. It's very ad-hoc, since the behavior of the special tokens is a bit weird, but it passes all the tests 🙂
Would love to hear your feedback!
Instead of explicitly ignoring LayoutXLMTokenizer in the test, override the test in LayoutLMTokenizationTest and do nothing in it.
Instead of comparing lengths of re-tokenized text and input_ids, check that converting all special tokens to string yields a string with all special tokens.
The convert_tokens_to_string method is now implemented in each relevant SentencePiece tokenizer.
Mimic Rust tokenizer behavior for the following tokenizers: Albert, Barthez, Camembert, MBart50, T5.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Hi @SaulLu, how are you? 🙂 Anything new regarding this PR? 🙏

This PR is still in progress 🙂

The Albert changes resolved the lack of …
Sorry this PR has been left stale for so long @beneyal. I'll take over the reviews from now on, as Lucile has been super busy on other projects!
First of all, this PR contains some breaking changes (for the better). Could you add three 🚨 in the title so that when we draft the next release notes, we clearly see it and can mention it in our section about breaking changes that fix standing bugs?
Then I think the main outstanding comment was on the Deberta-v2 tokenizer.
tests/test_tokenization_common.py (outdated)

@@ -369,11 +369,39 @@ def test_sentencepiece_tokenize_and_convert_tokens_to_string(self):
    # check if converting back to original text works
    reverse_text = tokenizer.convert_tokens_to_string(tokens)

    # All tokenizers pass this test without the below commented out code.
Suggested change (removing the leftover comment):

    # All tokenizers pass this test without the below commented out code.

To clean up?
I'll remove the comment 🙂
if token == self.special_tokens[2]:  # if token == <unk>
    seen_unk = True
This special code should indeed be removed. I think it's okay to exclude Deberta-v2 from the new test and have `decode` do what makes the most sense, even if it breaks compatibility with the fast tokenizer, if we think there is a bug there. We don't guarantee a 100% match between the two in all situations.
@sgugger Wonderful to have you on board! 🚀 I can't comment on the Deberta snippet, so I'll try here: we basically want the code for Deberta V2 to be the same as the code for the other tokenizers like Albert, i.e., without the special `<unk>` handling. I also changed the title as requested 🙂
Yes, that would be great! And if you could fix the merge conflicts/rebase on main, that would also be awesome!

@sgugger I have a rather silly question: I see on GitHub there are two conflicts: …
…to 15003-albert-tokenizer-special-tokens
If you have merged your branch with upstream then just a push should be enough to resolve the conflicts. To rebase just do this on your branch:
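(The exact commands were dropped from this comment when the page was captured; a typical sequence, assuming the main repository is configured as the `upstream` remote and the target branch is `main`, would be:)

git fetch upstream
git rebase upstream/main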
then you'll need to force-push.
Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested.
@sgugger Done 🙂 Whisper's test failed because of an HTTP timeout, but the rest seem fine. Deberta V2 is now excluded from the SentencePiece tests, but has the same logic as Albert.
Almost there, one last nit and this should be ready for merge!
@@ -30,14 +30,14 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):

    tokenizer_class = DebertaV2Tokenizer
    rust_tokenizer_class = DebertaV2TokenizerFast
-   test_sentencepiece = True
+   test_sentencepiece = False
I think this removes more tests that were passing. It might be better to just override the failing test to pass here.
Good idea, on it!
Pushed the fix. Please feel free to nit some more, I'm happy to work on this PR again! 🥳
Thanks for iterating! Let's just add some comments on why those tests are skipped and we'll be good to merge!
@sgugger Done, thanks for the suggestions!

Thanks again and sorry for the long wait on this PR!
Hi @beneyal

First, thank you for working on this. I have a question, as we have a CI failure. With the following code snippet:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
sequence_ids = model.generate(input_ids)
sequences = tokenizer.batch_decode(sequence_ids)
print(sequences)

we have different results.

Before this PR:

['<pad> <extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']

After this PR:

['<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']

We no longer have the space between <pad> and <extra_id_0>.
@ydshieh Glad to do it! Yes, this is intended behavior, as it mimics the fast tokenizer:

from transformers import T5Tokenizer, T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
tokenizer_fast = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
inputs_ids_fast = tokenizer_fast("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
assert input_ids.tolist() == inputs_ids_fast.tolist()

sequence_ids = model.generate(input_ids)
sequences = tokenizer.batch_decode(sequence_ids)
sequences_fast = tokenizer_fast.batch_decode(sequence_ids)
print(sequences, sequences == sequences_fast)  # outputs True
🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in `convert_tokens_to_string` (huggingface#15775)

* Add test for SentencePiece not adding special tokens to strings
* Add SentencePieceStringConversionMixin to fix issue 15003
* Fix conversion from tokens to string for most SentencePiece tokenizers. Tokenizers fixed: AlbertTokenizer, BarthezTokenizer, CamembertTokenizer, FNetTokenizer, M2M100Tokenizer, MBart50Tokenizer, PegasusTokenizer, Speech2TextTokenizer
* Fix MarianTokenizer, adjust SentencePiece test to accommodate vocab
* Fix DebertaV2Tokenizer
* Ignore LayoutXLMTokenizer in SentencePiece string conversion test
* Run 'make style' and 'make quality'
* Clean convert_tokens_to_string test: instead of explicitly ignoring LayoutXLMTokenizer in the test, override the test in LayoutLMTokenizationTest and do nothing in it
* Remove commented out code
* Improve robustness of convert_tokens_to_string test: instead of comparing lengths of re-tokenized text and input_ids, check that converting all special tokens to string yields a string with all special tokens
* Inline and remove SentencePieceStringConversionMixin: the convert_tokens_to_string method is now implemented in each relevant SentencePiece tokenizer
* Run 'make style' and 'make quality'
* Revert removal of space in convert_tokens_to_string
* Remove redundant import
* Revert test text to original
* Uncomment the lowercasing of the reverse_text variable
* Mimic Rust tokenizer behavior for tokenizers: Albert, Barthez, Camembert, MBart50, T5
* Fix accidentally skipping test in wrong tokenizer
* Add test for equivalent Rust and slow tokenizer behavior
* Override _decode in BigBirdTokenizer to mimic Rust behavior
* Override _decode in FNetTokenizer to mimic Rust behavior
* Override _decode in XLNetTokenizer to mimic Rust behavior
* Remove unused 're' import
* Update DebertaV2Tokenizer to mimic Rust tokenizer
* Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested
* Ignore problematic tests in Deberta V2
* Add comment on why the Deberta V2 tests are skipped
Hello! 👋
This PR fixes issue #15003 and is also related to issue #11716.
First, I've expanded the `test_sentencepiece_tokenize_and_convert_tokens_to_string` test in `test_tokenization_common.py`, which, in addition to checking a round-trip from string to tokens and back, checks whether the tokens corresponding to `input_ids` (obtained via `convert_ids_to_tokens`), passed through `convert_tokens_to_string` and then re-tokenized, give the same number of tokens as `input_ids`. This is because not all SentencePiece tokenizers have a `*Fast` flavor, so I couldn't compare behaviors the way the `T5Tokenizer` test does. (A minimal sketch of the check appears after the list below.)

This test failed for the following tokenizers:
- `AlbertTokenizer`
- `BarthezTokenizer`
- `CamembertTokenizer`
- `DebertaV2Tokenizer`
- `FNetTokenizer`
- `LayoutXLMTokenizer`
- `M2M100Tokenizer`
- `MarianTokenizer`
- `MBart50Tokenizer`
- `PegasusTokenizer`
- `Speech2TextTokenizer`
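As referenced above, a minimal sketch of the expanded check (the names and test sentence are illustrative, not the exact test code):

# Sketch of the round-trip check described above; `tokenizer` is any slow
# SentencePiece tokenizer under test.
text = "This is text to test the tokenizer."
input_ids = tokenizer(text).input_ids
tokens = tokenizer.convert_ids_to_tokens(input_ids)  # includes special tokens
reverse_text = tokenizer.convert_tokens_to_string(tokens)
# Re-tokenizing the reconstructed string should yield as many tokens as
# input_ids; it fails if convert_tokens_to_string drops the special tokens.
self.assertEqual(len(tokenizer.tokenize(reverse_text)), len(input_ids))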
Most of these tokenizers had the same implementation for `convert_tokens_to_string`:
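(The shared implementation isn't quoted in the thread; it was essentially the following one-liner, reproduced here for context:)

def convert_tokens_to_string(self, tokens):
    """Converts a sequence of tokens (string) to a single string."""
    return self.sp_model.decode_pieces(tokens)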
All tokenizers that had this implementation, or something that can be reduced to it, are now inheriting `SentencePieceStringConversionMixin` (defined in `tokenization_utils.py`), and the implementation uses the same code as PR #8435. The rest have a variation of this code, and perhaps a future refactor can DRY it a bit.

The only tokenizer that doesn't pass the test, and is therefore excluded from it, is `LayoutXLMTokenizer`, whose `convert_tokens_to_string` method doesn't use the SentencePiece model in its implementation, so I figured the lack of special tokens is intentional rather than a byproduct of SentencePiece like the rest.

I think this PR is relevant for @NielsRogge, who initially opened the issue on `AlbertTokenizer`, and following the PR guidelines, I think this is also relevant to @LysandreJik and @patrickvonplaten.

This has been a very fun project, and hopefully you will allow me to contribute more in the future 🙂
Thanks,
Ben