🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in `convert_tokens_to_string` (#15775)
This is related to my previous comment, but I'm not sure what the behavior should be here to add spaces. How did you choose this strategy?
This is the exact code from PR #8435. I assumed this implementation would be relevant for all SPM tokenizers, but that assumption may well be incorrect.
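For readers following along, a minimal sketch of the pattern under discussion: special tokens are not part of the SentencePiece vocab, so `convert_tokens_to_string` has to decode the regular pieces in chunks and splice the special tokens back in verbatim. This is a pure-Python stand-in; `fake_sp_decode` replaces the real `self.sp_model.decode` call, and the leading-space behavior before each special token is exactly the detail being debated here.

```python
SPIECE_UNDERLINE = "\u2581"  # "▁", SentencePiece's word-boundary marker

def fake_sp_decode(pieces):
    """Stand-in for self.sp_model.decode(pieces): joins pieces and
    turns the underline marker back into spaces."""
    return "".join(pieces).replace(SPIECE_UNDERLINE, " ").strip()

def convert_tokens_to_string(tokens, all_special_tokens):
    current_sub_tokens = []
    out_string = ""
    prev_is_special = False
    for token in tokens:
        # Special tokens are not in the SentencePiece vocab, so they
        # are appended verbatim instead of being decoded.
        if token in all_special_tokens:
            if not prev_is_special:
                out_string += " "  # the extra space discussed in this thread
            out_string += fake_sp_decode(current_sub_tokens) + token
            prev_is_special = True
            current_sub_tokens = []
        else:
            current_sub_tokens.append(token)
            prev_is_special = False
    out_string += fake_sp_decode(current_sub_tokens)
    return out_string.strip()
```

With this toy decoder, `convert_tokens_to_string(["▁Hello", "▁world", "</s>"], {"</s>"})` yields `"Hello world</s>"`; whether a space should precede `</s>` is the open question in the comments below.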
After running the test you suggested, I do believe that this extra space is an error, at least compared to the Rust tokenizers. Removing the space passes the test 👍
EDIT:
I forgot to run the whole test suite for the `T5Tokenizer`, and now I see there's a problem. If I keep the space, `test_fast_and_slow_same_result` passes but `test_sentencepiece_tokenize_and_convert_tokens_to_string` fails. If I remove the space, it's vice versa, and I'm not sure what the correct behavior should be.
I won't be able to look at this issue with the `T5Tokenizer` again until next Monday, but I promise to come back to you early next week.
Thank you very much! I will be waiting patiently 😄
All of them 😅 Once the testing finishes (currently 50%), I'll know exactly why. For context, I accepted the changes in `main` over what was in the file on my branch, and didn't do anything else with that file.
So, after feeling dumb for not noticing the `pytest -n auto` argument, which is why the tests took almost 3 hours to run, I found the problem: when accepting the change, it removed a constructor argument I passed. I fixed that, and now only one test is failing.

This is weird, since using the "microsoft/deberta-v2-xlarge" tokenizer works on both slow and fast. I believe the problem is the test vocab. Is there something I can do to fix it, other than changing the test sentence (which seems quite drastic)?
I see! Can you try to change this line in `test_tokenization_deberta_v2.py` (transformers/tests/deberta_v2/test_tokenization_deberta_v2.py, line 42 in dced262) to:

`tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>")`

I think it's because `SAMPLE_VOCAB` defines the unknown token as `"<unk>"` and not as `"[UNK]"`, which is the default value for `DebertaV2Tokenizer`.

EDIT: I've inspected the content inside and it's probably even better to replace it by
as it's how the special tokens are defined inside `SAMPLE_VOCAB`
Interesting. This change breaks `test_get_vocab`, which makes sense, but `test_sentencepiece_tokenize_and_decode` at least now shows the "real" problem, which is what I fixed for the other slow tokenizers: it's the spacing thing. For some tokenizers it was a simple fix; for others it required overriding `_decode`, but I can fix this. I'll need to check how the fast tokenizer works to mimic it space-between-special-tokens-wise.

EDIT:
The `test_get_vocab` test is fixed, I just did `tokenizer = DebertaV2Tokenizer(SAMPLE_VOCAB, unk_token="<unk>")`.
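For context on the behavior being mimicked: in `PreTrainedTokenizer._decode`, the fast-tokenizer-compatible output joins special-token chunks and decoded sub-token chunks with single spaces when `spaces_between_special_tokens` is enabled. A tiny stand-in (names illustrative, no transformers dependency):

```python
def join_chunks(chunks, spaces_between_special_tokens=True):
    """Join already-decoded text chunks and verbatim special tokens,
    mirroring the spaces_between_special_tokens switch in the slow
    tokenizers' _decode."""
    sep = " " if spaces_between_special_tokens else ""
    return sep.join(chunks)
```

So `join_chunks(["[CLS]", "hello world", "[SEP]"])` gives `"[CLS] hello world [SEP]"`, while passing `spaces_between_special_tokens=False` gives `"[CLS]hello world[SEP]"`.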
Hi @SaulLu,
I made DeBERTa v2 act like the fast tokenizer. It's very ad-hoc, since the behavior of the special tokens is a bit weird, but it passes all the tests 🙂
Would love to hear your feedback!
To clean up?
I'll remove the comment 🙂
What I don't necessarily know how to handle in this `convert_tokens_to_string` function is how to re-add spaces around each special token. I wonder if we should align with what the fast version of the tokenizer would output. If that seems relevant to you, I would add this test:
Initially, I didn't go for the "compare to the Rust tokenizer" approach, since not all the tokenizers in question have that luxury. Thinking about it now that you mention it, this check will definitely add another layer of reliability to the existing test.
Okay, after adding this test, a few previously passing tests failed. This is because of the space you mentioned in another comment. I removed the space, so now the code looks like this:
And now all the tests pass, including `T5Tokenizer`.

I will note that I changed calls to `self.sp_model.decode_pieces` to `self.sp_model.decode`, as `decode`, when needed, delegates to `decode_pieces`, and this makes the code uniform, in my opinion.
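To illustrate the delegation mentioned here: sentencepiece's Python wrapper dispatches `decode()` to the pieces-based path when given strings and to the ids-based path when given ints, so both spellings produce the same text. The toy class below mirrors that dispatch; it is not the real sentencepiece API, just a self-contained sketch.

```python
class ToySPModel:
    """Toy stand-in for sentencepiece.SentencePieceProcessor, showing
    why decode() can uniformly replace decode_pieces() calls."""

    def __init__(self, vocab):
        self.vocab = vocab  # id -> piece

    def decode_pieces(self, pieces):
        # Join pieces and turn the "▁" word-boundary marker into spaces.
        return "".join(pieces).replace("\u2581", " ").strip()

    def decode_ids(self, ids):
        return self.decode_pieces([self.vocab[i] for i in ids])

    def decode(self, inputs):
        # Dispatch on element type, as the real wrapper does.
        if inputs and isinstance(inputs[0], str):
            return self.decode_pieces(inputs)
        return self.decode_ids(inputs)
```

With `sp = ToySPModel({0: "▁Hello", 1: "▁world"})`, both `sp.decode(["▁Hello", "▁world"])` and `sp.decode([0, 1])` return `"Hello world"`.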