
T5Tokenizer Fast and Slow give different results with AddedTokens #16334

Closed
patrickvonplaten opened this issue Mar 22, 2022 · 5 comments · Fixed by #23909


@patrickvonplaten
Contributor

When adding a new token to T5TokenizerFast and/or T5Tokenizer, the two tokenizers unexpectedly produce different results.

E.g. running the following code:

from transformers import AutoTokenizer, AddedToken

tok = AutoTokenizer.from_pretrained("t5-small", use_fast=False)
tok_fast = AutoTokenizer.from_pretrained("t5-small", use_fast=True)

tok.add_tokens("$$$")
tok_fast.add_tokens(AddedToken("$$$", lstrip=False))

prompt = "Hello what is going on $$$ no ? We should"

print("Slow")
print(tok.decode(tok(prompt).input_ids))

print("Fast")
print(tok_fast.decode(tok_fast(prompt).input_ids))

yields a different result for each tokenizer:

Slow
Hello what is going on $$$ no? We should</s>
Fast
Hello what is going on$$$ no? We should</s>

Environment info

  • transformers version: 4.18.0.dev0
  • Platform: Linux-5.15.15-76051515-generic-x86_64-with-glibc2.34
  • Python version: 3.9.7
  • Huggingface_hub version: 0.4.0.dev0
  • PyTorch version (GPU?): 1.10.2+cu102 (True)
  • Tensorflow version (GPU?): 2.8.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.4.0 (cpu)
  • Jax version: 0.3.1
  • JaxLib version: 0.3.0
@patrickvonplaten
Contributor Author

cc @Narsil @SaulLu

@Narsil
Contributor

Narsil commented Mar 23, 2022

Hi, the behavior can be explained as follows: the encoder splits on whitespace and discards it, and the decoder then uses Metaspace (which matches the SentencePiece behavior), which does not prefix tokens with spaces, not even added tokens. The spaces are expected to already be contained within the tokens themselves.

We could certainly have parity on this, at least!

But I am not sure which tokenizer is right here; both decoded values look OK to me. The proposed AddedToken carries no information about spaces, so it is reasonable not to place one back by default (doing so would break cases where added tokens are specifically intended for content that contains no spaces).
In this particular instance, because we are coming from a sentence with a space, it of course makes more sense to put one back so the original string is recovered. But for decode([999, 998]) with 999="$(" and 998=")$", it is unclear to me whether a user wants "$( )$" or "$()$" when decoding. (I am just trying to pick a plausible example where the answer is unclear.)
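The explanation above can be illustrated with a small, self-contained sketch. This is not the actual transformers/tokenizers implementation, only a minimal imitation of the two decoding strategies under stated assumptions: a Metaspace-style decoder (the fast path), where the SentencePiece marker "▁" (U+2581) denotes a leading space and added tokens carry no marker, and a slow-path decoder that splits out added tokens and re-joins the pieces with spaces.

```python
# Minimal sketch (NOT the real implementation) of the two decode strategies.

def metaspace_decode(tokens):
    """Fast path: Metaspace-style decoding. '\u2581' marks a leading space;
    tokens without it (e.g. added tokens) are concatenated with no space."""
    return "".join(t.replace("\u2581", " ") for t in tokens).lstrip(" ")

def slow_decode(tokens, added_tokens):
    """Slow path: added tokens are split out and the pieces are
    re-joined with single spaces."""
    parts, current = [], []
    for t in tokens:
        if t in added_tokens:
            if current:
                parts.append(metaspace_decode(current))
                current = []
            parts.append(t)
        else:
            current.append(t)
    if current:
        parts.append(metaspace_decode(current))
    return " ".join(parts)

tokens = ["\u2581Hello", "\u2581what", "\u2581is", "\u2581going",
          "\u2581on", "$$$", "\u2581no"]
print(metaspace_decode(tokens))      # Hello what is going on$$$ no
print(slow_decode(tokens, {"$$$"}))  # Hello what is going on $$$ no
```

The added token "$$$" carries no "▁" marker, so the Metaspace-style path has no signal that a space preceded it, reproducing the fast/slow discrepancy reported above.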

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@wise-east

Should this be reopened if it's not resolved yet?

@SaulLu SaulLu reopened this Aug 8, 2022
@github-actions

github-actions bot commented Sep 1, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Labels: none yet · Projects: none yet · 4 participants