
🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string #15775

Merged
merged 29 commits into from
Nov 2, 2022

Conversation

beneyal
Contributor

@beneyal beneyal commented Feb 22, 2022

Hello! 👋

This PR fixes issue #15003 and is also related to issue #11716.

First, I've expanded the test_sentencepiece_tokenize_and_convert_tokens_to_string test in test_tokenization_common.py: in addition to checking a round-trip from string to tokens and back, it now checks that taking the tokens corresponding to input_ids (via convert_ids_to_tokens), passing them through convert_tokens_to_string, and re-tokenizing the result yields the same number of tokens as input_ids. I chose this check because not all SentencePiece tokenizers have a *Fast counterpart, so I couldn't compare slow and fast behavior the way the T5Tokenizer test does.
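As an illustration of that check (not the actual transformers test — ToyTokenizer is a hypothetical stand-in for a real slow tokenizer):

```python
# Hypothetical sketch of the extended check: encode, convert ids back to
# tokens, rebuild the string, re-tokenize, and compare token counts.
class ToyTokenizer:
    def __init__(self):
        self.all_special_tokens = ["<s>", "</s>"]
        self.vocab = {"<s>": 0, "</s>": 1}

    def tokenize(self, text):
        return [t for t in text.split() if t]

    def encode(self, text):
        tokens = ["<s>"] + self.tokenize(text) + ["</s>"]
        for t in tokens:
            self.vocab.setdefault(t, len(self.vocab))
        return [self.vocab[t] for t in tokens]

    def convert_ids_to_tokens(self, ids):
        reverse = {i: t for t, i in self.vocab.items()}
        return [reverse[i] for i in ids]

    def convert_tokens_to_string(self, tokens):
        # A correct implementation keeps the special tokens in the output.
        return " ".join(tokens)


def roundtrip_token_count_matches(tokenizer, text):
    input_ids = tokenizer.encode(text)
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    reverse_text = tokenizer.convert_tokens_to_string(tokens)
    # If special tokens were silently dropped, the re-tokenized text
    # would have fewer tokens than input_ids.
    return len(tokenizer.tokenize(reverse_text)) == len(input_ids)


print(roundtrip_token_count_matches(ToyTokenizer(), "this is a test"))  # True
```

A tokenizer whose convert_tokens_to_string drops `<s>`/`</s>` would fail this check, which is exactly the bug in issue #15003.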

This test failed for the following tokenizers:

  • AlbertTokenizer
  • BarthezTokenizer
  • CamembertTokenizer
  • DebertaV2Tokenizer
  • FNetTokenizer
  • LayoutXLMTokenizer
  • M2M100Tokenizer
  • MarianTokenizer
  • MBart50Tokenizer
  • PegasusTokenizer
  • Speech2TextTokenizer

Most of these tokenizers had the same implementation for convert_tokens_to_string:

def convert_tokens_to_string(self, tokens):
    return self.sp_model.decode(tokens)

All tokenizers that had this implementation, or something reducible to it, now inherit SentencePieceStringConversionMixin (defined in tokenization_utils.py), whose implementation uses the same code as PR #8435. The rest have a variation of this code; perhaps a future refactor can DRY it up a bit.
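As a rough sketch of how such a fixed conversion can keep special tokens intact — DummySP and SketchTokenizer are illustrative stand-ins, not the actual mixin code:

```python
# Hedged sketch of the shared conversion logic after the fix. DummySP stands in
# for a real SentencePiece model, where "▁" marks word starts.
class DummySP:
    def decode(self, pieces):
        return "".join(pieces).replace("▁", " ").strip()


class SketchTokenizer:
    all_special_tokens = ["<s>", "</s>"]

    def __init__(self):
        self.sp_model = DummySP()

    def convert_tokens_to_string(self, tokens):
        current_sub_tokens = []
        out_string = ""
        for token in tokens:
            # Special tokens must not be decoded by the SentencePiece model,
            # which would otherwise drop them from the output.
            if token in self.all_special_tokens:
                out_string += self.sp_model.decode(current_sub_tokens) + token
                current_sub_tokens = []
            else:
                current_sub_tokens.append(token)
        out_string += self.sp_model.decode(current_sub_tokens)
        return out_string.strip()


tok = SketchTokenizer()
print(tok.convert_tokens_to_string(["<s>", "▁Hello", "▁world", "</s>"]))
# → <s>Hello world</s>
```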

The only tokenizer that doesn't pass the test, and is therefore excluded from it, is LayoutXLMTokenizer, whose convert_tokens_to_string method doesn't use the SentencePiece model in its implementation, and so I figured the lack of special tokens is intentional, rather than a byproduct of SentencePiece like the rest.

I think this PR is relevant for @NielsRogge who initially opened the issue on AlbertTokenizer, and following the PR guidelines, I think this is also relevant to @LysandreJik and @patrickvonplaten.

This has been a very fun project, and hopefully you will allow me to contribute more in the future 🙂

Thanks,
Ben

@HuggingFaceDocBuilder

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@patil-suraj patil-suraj requested a review from SaulLu February 23, 2022 10:48
Contributor

@SaulLu SaulLu left a comment

Thank you so much for working to resolve this issue! It was indeed a behavior that was troublesome! 😄

I also really appreciate that you put a test in your first pull request!

I think one of the difficulties of the problem you are attacking is how to readjust spaces around these special tokens. Thinking about it quickly, I think the result should be identical to the one returned by the rust version of the tokenizer.

tests/test_tokenization_common.py (four outdated review threads, resolved)
reverse_text = tokenizer.convert_tokens_to_string(tokens_including_special)

self.assertEqual(len(tokenizer.tokenize(reverse_text)), len(input_ids))

Contributor

What I don't necessarily know how to handle in this convert_tokens_to_string function is how to re-add spaces around each special token. I wonder if we should align with what the fast version of the tokenizer would output. If that seems relevant to you, I would add this test:

Suggested change:

    if self.test_rust_tokenizer:
        rust_tokenizer = self.get_rust_tokenizer()
        special_tokens_string_rust = rust_tokenizer.convert_tokens_to_string(special_tokens)
        self.assertEqual(special_tokens_string, special_tokens_string_rust)

Contributor Author

Initially, I didn't go for the "compare to Rust tokenizer" since not all tokenizers in question have this luxury. Thinking about it now that you mention it, this code will definitely add another layer of reliability to the existing test.

Contributor Author

Okay, after adding this test, a few previously passing tests failed. This is because of the space you mentioned in another comment. I removed the space, so now the code looks like this:

def convert_tokens_to_string(self, tokens):
    ...
    out_string += self.sp_model.decode(current_sub_tokens) + token  # << a space used to be here
    ...

And now all the tests pass, including T5Tokenizer.
I will note that I changed calls to self.sp_model.decode_pieces to self.sp_model.decode, since decode delegates to decode_pieces when needed; this makes the code uniform, in my opinion.

src/transformers/tokenization_utils.py (outdated review thread, resolved)
for token in tokens:
    # make sure that special tokens are not decoded using sentencepiece model
    if token in self.all_special_tokens:
        out_string += self.sp_model.decode(current_sub_tokens) + token + " "
Contributor

This is related to my previous comment, but I'm not sure what the behavior should be here to add spaces. How did you choose this strategy?

Contributor Author

This is the exact code from PR #8435, I assumed this implementation would be relevant for all SPM tokenizers, but it may well be that this assumption is incorrect.

Contributor Author

@beneyal beneyal Feb 24, 2022

After running the test you suggested, I do believe that this extra space is an error, at least compared to the Rust tokenizers. Removing the space passes the test 👍

EDIT:
I forgot to run the whole test suite for the T5Tokenizer and now I see there's a problem. If I keep the space, test_fast_and_slow_same_result passes but test_sentencepiece_tokenize_and_convert_tokens_to_string fails. If I remove the space, then it's vice-versa, and I'm not sure what the correct behavior should be.
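To make the two behaviors concrete, here is a toy comparison (DummySP is a hypothetical stand-in for the SentencePiece model; this only illustrates the difference between keeping and removing the space, not which behavior is correct):

```python
# Toy illustration of the two spacing strategies under discussion.
class DummySP:
    def decode(self, pieces):
        return "".join(pieces).replace("▁", " ").strip()


def to_string(tokens, specials, sp, space_after_special):
    out, sub = "", []
    for token in tokens:
        if token in specials:
            out += sp.decode(sub) + token
            if space_after_special:
                out += " "  # the trailing-space variant
            sub = []
        else:
            sub.append(token)
    return (out + sp.decode(sub)).strip()


sp = DummySP()
tokens = ["<pad>", "<extra_id_0>", "▁park"]
specials = {"<pad>", "<extra_id_0>"}
print(to_string(tokens, specials, sp, True))   # → <pad> <extra_id_0> park
print(to_string(tokens, specials, sp, False))  # → <pad><extra_id_0>park
```

The first variant inserts a space after every special token; the second glues special tokens to whatever follows, which is the behavior the fast tokenizers exhibit for some models.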

Contributor

I won't be able to look at this T5Tokenizer issue again until next Monday, but I promise to come back to you early next week.

Contributor Author

Thank you very much! I will be waiting patiently 😄

Contributor Author

All of them 😅 Once the testing finishes (currently 50%), I'll know exactly why. For context, I accepted the changes in main over what was in the file of my branch. Didn't do anything else with that file.


Contributor Author

So, after feeling dumb for not noticing the pytest -n auto argument (which is why the tests took almost 3 hours to run), I found the problem: accepting the change had removed a constructor argument I passed. I fixed that, and now only one test is failing:


This is weird, since using the "microsoft/deberta-v2-xlarge" tokenizer works on both slow and fast. I believe the problem is the test vocab. Is there something I can do to fix it, other than changing the test sentence (which seems quite drastic)?

Contributor

@SaulLu SaulLu Apr 28, 2022

I see! Can you try to change this line (in test_tokenization_deberta_v2.py):

tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB)

to tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>")?

I think it's because the SAMPLE_VOCAB defines the unknown token as "<unk>" and not as "[UNK]" which is the default value for DebertaV2Tokenizer.

EDIT: I've inspected the content inside and it's probably even better to replace it by

tokenizer = DebertaV2Tokenizer(
    SAMPLE_VOCAB,
    bos_token="[CLS]",
    eos_token="[SEP]",
    unk_token="<unk>",
    sep_token="[SEP]",
    pad_token="<pad>",
    cls_token="[CLS]",
    mask_token="[MASK]",
)

as that's how the special tokens are defined inside SAMPLE_VOCAB.

Contributor Author

@beneyal beneyal Apr 28, 2022

Interesting. This change breaks test_get_vocab, which makes sense, but test_sentencepiece_tokenize_and_decode at least now shows the "real" problem, which is what I fixed for the other slow tokenizers:

 AssertionError: '[CLS]<unk>his is text to test the tokenizer.[SEP]' != '[CLS] <unk>his is text to test the tokenizer.[SEP]'

It's the spacing thing. For some tokenizers, it was a simple fix, for others, it required overriding _decode, but I can fix this. I'll need to check how the fast tokenizer works to mimic it space-between-special-tokens-wise.

EDIT:
The test_get_vocab test is fixed, I just did tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>").

Contributor Author

Hi @SaulLu,

I made DeBERTa v2 act like the fast tokenizer. It's very ad-hoc, since the behavior of the special tokens is a bit weird, but it passes all the tests 🙂

Would love to hear your feedback!

src/transformers/models/pegasus/tokenization_pegasus.py (outdated review thread, resolved)
src/transformers/models/mbart50/tokenization_mbart50.py (outdated review thread, resolved)
Instead of explicitly ignoring LayoutXLMTokenizer in the test,
override the test in LayoutLMTokenizationTest and do nothing in it.
Instead of comparing lengths of re-tokenized text and input_ids,
check that converting all special tokens to string yields a string
with all special tokens.
The convert_tokens_to_string method is now implemented
in each relevant SentencePiece tokenizer.
@beneyal beneyal requested a review from SaulLu February 25, 2022 16:56
@github-actions github-actions bot closed this Apr 26, 2022
@huggingface huggingface deleted a comment from github-actions bot Apr 26, 2022
@SaulLu SaulLu reopened this Apr 26, 2022
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@beneyal
Contributor Author

beneyal commented Sep 16, 2022

Hi @SaulLu, how are you? 🙂

Anything new regarding this PR? 🙏

@beneyal
Contributor Author

beneyal commented Oct 11, 2022

This PR is still in progress 🙂

@huggingface huggingface deleted a comment from github-actions bot Oct 11, 2022
@dvfabbri

The Albert changes resolved the lack of [MASK] for me. Are there plans to get this merged?

Collaborator

@sgugger sgugger left a comment

Sorry this PR has been left stale for so long @beneyal. I'll take over the reviews from now on as Lucile has been super busy on other projects!

First of all, this PR contains some breaking changes (for the better). Could you add three 🚨 in the title so that when we draft the next release notes, we clearly see it and can mention it in our section about breaking changes that fix standing bugs?

Then I think the main outstanding comment was on the Deberta-v2 tokenizer.

@@ -369,11 +369,39 @@ def test_sentencepiece_tokenize_and_convert_tokens_to_string(self):
# check if converting back to original text works
reverse_text = tokenizer.convert_tokens_to_string(tokens)

# All tokenizers pass this test without the below commented out code.
Collaborator

Suggested change (removing the leftover comment):

-    # All tokenizers pass this test without the below commented out code.

To clean up?

Contributor Author

I'll remove the comment 🙂

Comment on lines 358 to 359
if token == self.special_tokens[2]:  # if token == <unk>
    seen_unk = True
Collaborator

This special code should indeed be removed. I think it's okay to exclude Deberta-v2 from the new test and have it decode in whatever way makes the most sense, even if that breaks compatibility with the fast tokenizer, if we think there is a bug there. We don't guarantee a 100% match between the two in all situations.

@beneyal beneyal changed the title Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string 🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string Nov 2, 2022
@beneyal
Contributor Author

beneyal commented Nov 2, 2022

@sgugger Wonderful to have you on board! 🚀

I can't comment on the Deberta snippet, so I'll try here: we basically want the code for Deberta V2 to be the same as the code for the other tokenizers like Albert, i.e., without the special seen_unk hack?

I also changed the title as requested 🙂

@sgugger
Collaborator

sgugger commented Nov 2, 2022

Yes, that would be great! And if you could fix the merge conflicts/rebase on main that would also be awesome!

@beneyal
Contributor Author

beneyal commented Nov 2, 2022

@sgugger I have a rather silly question: I see in GitHub there are two conflicts: tests/deberta_v2/test_tokenization_deberta_v2.py and tests/layoutxlm/test_tokenization_layoutxlm.py. When merging upstream/main into my branch, there are no conflicts. Regarding rebasing, I've never done it, so I'm not familiar with that functionality or with what to rebase into what. Could you please help me with this? 😅

@sgugger
Collaborator

sgugger commented Nov 2, 2022

If you have merged your branch with upstream then just a push should be enough to resolve the conflicts.

To rebase just do this on your branch:

git rebase origin/main

then you'll need to force-push

git push -u origin branch_name --force

@beneyal
Contributor Author

beneyal commented Nov 2, 2022

@sgugger Done 🙂 Whisper's test failed because of an HTTP timeout, but the rest seem fine. Deberta V2 is now excluded from the sentencepiece tests but has the same logic as Albert.

Collaborator

@sgugger sgugger left a comment

Almost there, one last nit and this should be ready for merge!

@@ -30,14 +30,14 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):

     tokenizer_class = DebertaV2Tokenizer
     rust_tokenizer_class = DebertaV2TokenizerFast
-    test_sentencepiece = True
+    test_sentencepiece = False
Collaborator

I think this removes more tests that were passing. Might be better to just override the failing test to just pass, here.

Contributor Author

Good idea, on it!

Contributor Author

Pushed the fix. Please feel free to nit some more, I'm happy to work on this PR again! 🥳

@beneyal beneyal requested review from sgugger and removed request for SaulLu November 2, 2022 18:03
Collaborator

@sgugger sgugger left a comment

Thanks for iterating! Let's just add some comments on why those tests are skipped and we'll be good to merge!

@beneyal
Contributor Author

beneyal commented Nov 2, 2022

@sgugger Done, thanks for the suggestions!

@sgugger sgugger merged commit 9f9ddcc into huggingface:main Nov 2, 2022
@sgugger
Collaborator

sgugger commented Nov 2, 2022

Thanks again and sorry for the long wait on this PR!

@ydshieh
Collaborator

ydshieh commented Nov 3, 2022

Hi @beneyal First, thank you for working on this. I have a question as we have a CI failure. With the following code snippet,

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids

sequence_ids = model.generate(input_ids)
sequences = tokenizer.batch_decode(sequence_ids)
print(sequences)

we have different results:

Before this PR

['<pad> <extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']

After this PR

['<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']

We no longer have the space between <pad> and <extra_id_0>. Is this expected? 🙏 Thank you.

@beneyal
Contributor Author

beneyal commented Nov 3, 2022

@ydshieh Glad to do it!

Yes, this is intended behavior, as it mimics the fast tokenizer:

from transformers import T5Tokenizer, T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
tokenizer_fast = T5TokenizerFast.from_pretrained("t5-small")

model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
inputs_ids_fast = tokenizer_fast("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids

assert input_ids.tolist() == inputs_ids_fast.tolist()

sequence_ids = model.generate(input_ids)

sequences = tokenizer.batch_decode(sequence_ids)
sequences_fast = tokenizer_fast.batch_decode(sequence_ids)

print(sequences, sequences == sequences_fast)  # outputs True

amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Nov 3, 2022
…kens in `convert_tokens_to_string` (huggingface#15775)

* Add test for SentencePiece not adding special tokens to strings

* Add SentencePieceStringConversionMixin to fix issue 15003

* Fix conversion from tokens to string for most SentencePiece tokenizers

Tokenizers fixed:
- AlbertTokenizer
- BarthezTokenizer
- CamembertTokenizer
- FNetTokenizer
- M2M100Tokenizer
- MBart50Tokenizer
- PegasusTokenizer
- Speech2TextTokenizer

* Fix MarianTokenizer, adjust SentencePiece test to accommodate vocab

* Fix DebertaV2Tokenizer

* Ignore LayoutXLMTokenizer in SentencePiece string conversion test

* Run 'make style' and 'make quality'

* Clean convert_tokens_to_string test

Instead of explicitly ignoring LayoutXLMTokenizer in the test,
override the test in LayoutLMTokenizationTest and do nothing in it.

* Remove commented out code

* Improve robustness of convert_tokens_to_string test

Instead of comparing lengths of re-tokenized text and input_ids,
check that converting all special tokens to string yields a string
with all special tokens.

* Inline and remove SentencePieceStringConversionMixin

The convert_tokens_to_string method is now implemented
in each relevant SentencePiece tokenizer.

* Run 'make style' and 'make quality'

* Revert removal of space in convert_tokens_to_string

* Remove redundant import

* Revert test text to original

* Uncomment the lowercasing of the reverse_text variable

* Mimic Rust tokenizer behavior for tokenizers

- Albert
- Barthez
- Camembert
- MBart50
- T5

* Fix accidentally skipping test in wrong tokenizer

* Add test for equivalent Rust and slow tokenizer behavior

* Override _decode in BigBirdTokenizer to mimic Rust behavior

* Override _decode in FNetTokenizer to mimic Rust behavior

* Override _decode in XLNetTokenizer to mimic Rust behavior

* Remove unused 're' import

* Update DebertaV2Tokenizer to mimic Rust tokenizer

* Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested.

* Ignore problematic tests in Deberta V2

* Add comment on why the Deberta V2 tests are skipped
mpierrau pushed a commit to mpierrau/transformers that referenced this pull request Dec 15, 2022