🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string #15775
Conversation
Tokenizers fixed:
- AlbertTokenizer
- BarthezTokenizer
- CamembertTokenizer
- FNetTokenizer
- M2M100Tokenizer
- MBart50Tokenizer
- PegasusTokenizer
- Speech2TextTokenizer
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thank you so much for working to resolve this issue! It was indeed a behavior that was troublesome! 😄
I also really appreciate that you included a test in your first pull request!
I think one of the difficulties of the problem you are attacking is how to readjust the spaces around these special tokens. Thinking about it quickly, I believe the result should be identical to the one returned by the Rust version of the tokenizer.
reverse_text = tokenizer.convert_tokens_to_string(tokens_including_special)

self.assertEqual(len(tokenizer.tokenize(reverse_text)), len(input_ids))
What I don't necessarily know how to handle in this `convert_tokens_to_string` function is how to re-add spaces around each special token. I wonder if we should be aligned with what the fast version of the tokenizer would output. If it's something that seems relevant to you, I would add this test:
if self.test_rust_tokenizer:
    rust_tokenizer = self.get_rust_tokenizer()
    special_tokens_string_rust = rust_tokenizer.convert_tokens_to_string(special_tokens)
    self.assertEqual(special_tokens_string, special_tokens_string_rust)
Initially, I didn't go for the "compare to Rust tokenizer" approach, since not all the tokenizers in question have this luxury. Thinking about it now that you mention it, this code will definitely add another layer of reliability to the existing test.
Okay, after adding this test, a few previously passing tests failed. This is because of the space you mentioned in another comment. I removed the space, so now the code looks like this:
def convert_tokens_to_string(self, tokens):
    ...
    out_string += self.sp_model.decode(current_sub_tokens) + token  # << a space used to be here
    ...
And now all the tests pass, including `T5Tokenizer`.
I will note that I changed calls to `self.sp_model.decode_pieces` to `self.sp_model.decode`, as `decode`, if needed, delegates to `decode_pieces`, and this makes the code uniform, in my opinion.
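As a side note on that delegation, here is a minimal standalone sketch (the model path and input text are placeholders for illustration, not from this PR):

import sentencepiece as spm

# Hypothetical model file, for illustration only.
sp = spm.SentencePieceProcessor(model_file="spiece.model")
pieces = sp.encode("Hello world", out_type=str)  # e.g. ["▁Hello", "▁world"]
# decode accepts string pieces and delegates to decode_pieces internally,
# so both calls return the same detokenized text.
assert sp.decode(pieces) == sp.decode_pieces(pieces)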
for token in tokens:
    # make sure that special tokens are not decoded using sentencepiece model
    if token in self.all_special_tokens:
        out_string += self.sp_model.decode(current_sub_tokens) + token + " "
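For context, the surrounding method from PR #8435 looks roughly like the reconstruction below (a sketch assembled from the fragments quoted in this thread, not the verbatim diff):

def convert_tokens_to_string(self, tokens):
    """Converts a sequence of tokens (string) to a single string."""
    current_sub_tokens = []
    out_string = ""
    for token in tokens:
        # make sure that special tokens are not decoded using sentencepiece model
        if token in self.all_special_tokens:
            out_string += self.sp_model.decode(current_sub_tokens) + token + " "
            current_sub_tokens = []
        else:
            current_sub_tokens.append(token)
    # decode whatever sub-tokens remain after the last special token
    out_string += self.sp_model.decode(current_sub_tokens)
    return out_string.strip()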
This is related to my previous comment, but I'm not sure what the behavior should be here to add spaces. How did you choose this strategy?
This is the exact code from PR #8435. I assumed this implementation would be relevant for all SPM tokenizers, but it may well be that this assumption is incorrect.
After running the test you suggested, I do believe that this extra space is an error, at least compared to the Rust tokenizers. Removing the space passes the test 👍
EDIT:
I forgot to run the whole test suite for the `T5Tokenizer`, and now I see there's a problem. If I keep the space, `test_fast_and_slow_same_result` passes but `test_sentencepiece_tokenize_and_convert_tokens_to_string` fails. If I remove the space, it's vice versa, and I'm not sure what the correct behavior should be.
I won't be able to look at this `T5Tokenizer` issue again until next Monday, but I promise to come back to you early next week.
Thank you very much! I will be waiting patiently 😄
So, after feeling dumb for not noticing the `pytest -n auto` argument, which is why the tests took almost 3 hours to run, I found the problem: when accepting the change, it removed a constructor argument I passed. I fixed that, and now only one test is failing:
This is weird, since the `"microsoft/deberta-v2-xlarge"` tokenizer works in both its slow and fast versions. I believe the problem is the test vocab. Is there something I can do to fix it, other than changing the test sentence (which seems quite drastic)?
I see! Can you try to change this line (in `test_tokenization_deberta_v2.py`):

tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB)

to

tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>")

I think it's because `SAMPLE_VOCAB` defines the unknown token as `"<unk>"` and not as `"[UNK]"`, which is the default value for `DebertaV2Tokenizer`.
EDIT: I've inspected the content inside, and it's probably even better to replace it with
tokenizer = DebertaV2Tokenizer(
SAMPLE_VOCAB,
bos_token="[CLS]",
eos_token="[SEP]",
unk_token="<unk>",
sep_token="[SEP]",
pad_token="<pad>",
cls_token="[CLS]",
mask_token="[MASK]",
)
as that is how the special tokens are defined inside `SAMPLE_VOCAB`.
Interesting. This change breaks `test_get_vocab`, which makes sense, but `test_sentencepiece_tokenize_and_decode` at least now shows the "real" problem, which is what I fixed for the other slow tokenizers:

AssertionError: '[CLS]<unk>his is text to test the tokenizer.[SEP]' != '[CLS] <unk>his is text to test the tokenizer.[SEP]'

It's the spacing thing. For some tokenizers it was a simple fix; for others it required overriding `_decode`, but I can fix this. I'll need to check how the fast tokenizer works to mimic it space-between-special-tokens-wise.
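For context, the spacing behavior under discussion is driven by `_decode`; below is a condensed sketch of the base slow-tokenizer pattern (an approximation for orientation, not the exact override from this PR):

def _decode(self, token_ids, skip_special_tokens=False, spaces_between_special_tokens=True, **kwargs):
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
    # split the token stream at special tokens so they bypass the SentencePiece model
    sub_texts, current_sub_text = [], []
    for token in filtered_tokens:
        if token in self.all_special_tokens:
            if current_sub_text:
                sub_texts.append(self.convert_tokens_to_string(current_sub_text))
                current_sub_text = []
            sub_texts.append(token)
        else:
            current_sub_text.append(token)
    if current_sub_text:
        sub_texts.append(self.convert_tokens_to_string(current_sub_text))
    # this flag controls exactly the space-between-special-tokens behavior discussed here
    return " ".join(sub_texts) if spaces_between_special_tokens else "".join(sub_texts)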
EDIT:
The `test_get_vocab` test is fixed; I just did `tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>")`.
Hi @SaulLu,
I made DeBERTa v2 act like the fast tokenizer. It's very ad-hoc, since the behavior of the special tokens is a bit weird, but it passes all the tests 🙂
Would love to hear your feedback!
Instead of explicitly ignoring LayoutXLMTokenizer in the test, override the test in LayoutLMTokenizationTest and do nothing in it.
Instead of comparing lengths of re-tokenized text and input_ids, check that converting all special tokens to string yields a string with all special tokens.
The convert_tokens_to_string method is now implemented in each relevant SentencePiece tokenizer.
Mimic Rust tokenizer behavior for the following tokenizers: Albert, Barthez, Camembert, MBart50, T5.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Hi @SaulLu, how are you? 🙂 Anything new regarding this PR? 🙏

This PR is still in progress 🙂

The Albert changes resolved the lack of …
Sorry this PR has been left stale for so long @beneyal. I'll take over the reviews from now on, as Lucile has been super busy on other projects!
First of all, this PR contains some breaking changes (for the better). Could you add three 🚨 in the title so that when we draft the next release notes, we clearly see it and can mention it in our section about breaking changes that fix standing bugs?
Then I think the main outstanding comment was on the Deberta-v2 tokenizer.
tests/test_tokenization_common.py (outdated)

@@ -369,11 +369,39 @@ def test_sentencepiece_tokenize_and_convert_tokens_to_string(self):
    # check if converting back to original text works
    reverse_text = tokenizer.convert_tokens_to_string(tokens)

    # All tokenizers pass this test without the below commented out code.
Suggested change (removing the leftover comment):

    # All tokenizers pass this test without the below commented out code.

To clean up?
I'll remove the comment 🙂
if token == self.special_tokens[2]:  # if token == <unk>
    seen_unk = True
This special code should indeed be removed. I think it's okay to exclude Deberta-v2 from the new test and have `decode` do what makes the most sense, even if it breaks compatibility with the fast tokenizer, if we think there is a bug there. We don't guarantee a 100% match between the two in all situations.
@sgugger Wonderful to have you on board! 🚀 I can't comment on the Deberta snippet, so I'll try here: we basically want the code for Deberta V2 to be the same as the code for the other tokenizers like Albert, i.e., without the special `<unk>` handling. I also changed the title as requested 🙂
Yes, that would be great! And if you could fix the merge conflicts/rebase on main, that would also be awesome!

@sgugger I have a rather silly question: I see on GitHub there are two conflicts: …
…to 15003-albert-tokenizer-special-tokens
If you have merged your branch with upstream then just a push should be enough to resolve the conflicts. To rebase just do this on your branch:
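(The exact commands were dropped from this comment when the page was captured; a typical sequence, assuming the main repository is configured as the `upstream` remote and the target branch is `main`, would be:)

git fetch upstream
git rebase upstream/main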
then you'll need to force-push.
Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested.
@sgugger Done 🙂 Whisper's test failed because of an HTTP timeout, but the rest seem fine. Deberta V2 is now excluded from the SentencePiece tests, but has the same logic as Albert.
Almost there, one last nit and this should be ready for merge!
@@ -30,14 +30,14 @@ class DebertaV2TokenizationTest(TokenizerTesterMixin, unittest.TestCase):

    tokenizer_class = DebertaV2Tokenizer
    rust_tokenizer_class = DebertaV2TokenizerFast
-   test_sentencepiece = True
+   test_sentencepiece = False
I think this removes more tests that were passing. It might be better to just override the failing test to pass here.
Good idea, on it!
Pushed the fix. Please feel free to nit some more, I'm happy to work on this PR again! 🥳
Thanks for iterating! Let's just add some comments on why those tests are skipped and we'll be good to merge!
@sgugger Done, thanks for the suggestions!

Thanks again and sorry for the long wait on this PR!
Hi @beneyal

First, thank you for working on this. I have a question, as we have a CI failure. With the following code snippet:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
sequence_ids = model.generate(input_ids)
sequences = tokenizer.batch_decode(sequence_ids)
print(sequences)

we have different results.

Before this PR:

['<pad> <extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']

After this PR:

['<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']

We no longer have the space between <pad> and <extra_id_0>.
@ydshieh Glad to do it! Yes, this is intended behavior, as it mimics the fast tokenizer:

from transformers import T5Tokenizer, T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
tokenizer_fast = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
inputs_ids_fast = tokenizer_fast("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
assert input_ids.tolist() == inputs_ids_fast.tolist()

sequence_ids = model.generate(input_ids)
sequences = tokenizer.batch_decode(sequence_ids)
sequences_fast = tokenizer_fast.batch_decode(sequence_ids)
print(sequences, sequences == sequences_fast)  # outputs True
🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in `convert_tokens_to_string` (huggingface#15775)

* Add test for SentencePiece not adding special tokens to strings
* Add SentencePieceStringConversionMixin to fix issue 15003
* Fix conversion from tokens to string for most SentencePiece tokenizers. Tokenizers fixed: AlbertTokenizer, BarthezTokenizer, CamembertTokenizer, FNetTokenizer, M2M100Tokenizer, MBart50Tokenizer, PegasusTokenizer, Speech2TextTokenizer
* Fix MarianTokenizer, adjust SentencePiece test to accommodate vocab
* Fix DebertaV2Tokenizer
* Ignore LayoutXLMTokenizer in SentencePiece string conversion test
* Run 'make style' and 'make quality'
* Clean convert_tokens_to_string test: instead of explicitly ignoring LayoutXLMTokenizer in the test, override the test in LayoutLMTokenizationTest and do nothing in it
* Remove commented out code
* Improve robustness of convert_tokens_to_string test: instead of comparing lengths of re-tokenized text and input_ids, check that converting all special tokens to string yields a string with all special tokens
* Inline and remove SentencePieceStringConversionMixin: the convert_tokens_to_string method is now implemented in each relevant SentencePiece tokenizer
* Run 'make style' and 'make quality'
* Revert removal of space in convert_tokens_to_string
* Remove redundant import
* Revert test text to original
* Uncomment the lowercasing of the reverse_text variable
* Mimic Rust tokenizer behavior for tokenizers: Albert, Barthez, Camembert, MBart50, T5
* Fix accidentally skipping test in wrong tokenizer
* Add test for equivalent Rust and slow tokenizer behavior
* Override _decode in BigBirdTokenizer to mimic Rust behavior
* Override _decode in FNetTokenizer to mimic Rust behavior
* Override _decode in XLNetTokenizer to mimic Rust behavior
* Remove unused 're' import
* Update DebertaV2Tokenizer to mimic Rust tokenizer
* Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested
* Ignore problematic tests in Deberta V2
* Add comment on why the Deberta V2 tests are skipped
Hello! 👋
This PR fixes issue #15003 and is also related to issue #11716.
First, I've expanded the `test_sentencepiece_tokenize_and_convert_tokens_to_string` test in `test_tokenization_common.py`, which, in addition to checking a round-trip from string to tokens and back, checks whether the tokens corresponding to `input_ids` (obtained via `convert_ids_to_tokens`), passed through `convert_tokens_to_string` and then re-tokenized, give the same number of tokens as `input_ids`. This is because not all SentencePiece tokenizers have a `*Fast` flavor, so I couldn't compare behaviors the way the `T5Tokenizer` test does. (A minimal sketch of the check appears after the list below.)

This test failed for the following tokenizers:
- `AlbertTokenizer`
- `BarthezTokenizer`
- `CamembertTokenizer`
- `DebertaV2Tokenizer`
- `FNetTokenizer`
- `LayoutXLMTokenizer`
- `M2M100Tokenizer`
- `MarianTokenizer`
- `MBart50Tokenizer`
- `PegasusTokenizer`
- `Speech2TextTokenizer`
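As referenced above, a minimal sketch of the expanded check (the names and test sentence are illustrative, not the exact test code):

# Sketch of the round-trip check described above; `tokenizer` is any slow
# SentencePiece tokenizer under test.
text = "This is text to test the tokenizer."
input_ids = tokenizer(text).input_ids
tokens = tokenizer.convert_ids_to_tokens(input_ids)  # includes special tokens
reverse_text = tokenizer.convert_tokens_to_string(tokens)
# Re-tokenizing the reconstructed string should yield as many tokens as
# input_ids; it fails if convert_tokens_to_string drops the special tokens.
self.assertEqual(len(tokenizer.tokenize(reverse_text)), len(input_ids))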
Most of these tokenizers had the same implementation for `convert_tokens_to_string`:
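(The shared implementation isn't quoted in the thread; it was essentially the following one-liner, reproduced here for context:)

def convert_tokens_to_string(self, tokens):
    """Converts a sequence of tokens (string) to a single string."""
    return self.sp_model.decode_pieces(tokens)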
All tokenizers that had this implementation, or something that can be reduced to it, are now inheriting `SentencePieceStringConversionMixin` (defined in `tokenization_utils.py`), and the implementation uses the same code as PR #8435. The rest have a variation of this code, and perhaps a future refactor can DRY it a bit.

The only tokenizer that doesn't pass the test, and is therefore excluded from it, is `LayoutXLMTokenizer`, whose `convert_tokens_to_string` method doesn't use the SentencePiece model in its implementation, so I figured the lack of special tokens is intentional rather than a byproduct of SentencePiece like the rest.

I think this PR is relevant for @NielsRogge, who initially opened the issue on `AlbertTokenizer`, and following the PR guidelines, I think this is also relevant to @LysandreJik and @patrickvonplaten.

This has been a very fun project, and hopefully you will allow me to contribute more in the future 🙂
Thanks,
Ben