Adding new tokens to various models changes tokenization of adjacent elements in strings #14770

Closed
mawilson1234 opened this issue Dec 14, 2021 · 13 comments · Fixed by #23909

Comments

@mawilson1234

mawilson1234 commented Dec 14, 2021

Environment info

  • transformers version: 4.13.0
  • Platform: Windows-10-10.0.19043-SP0
  • Python version: 3.9.7
  • PyTorch version (GPU?): 1.10.0+cu113 (True)
  • Tensorflow version (GPU?): 2.7.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@LysandreJik @SaulLu

Information

Models I am using: DistilBERT, BERT, RoBERTa

The problem arises when using:

  • my own modified scripts: (see below)

The task I am working on is:

  • my own task or dataset: (see below)

To reproduce

Adding a new token to a tokenizer using the add_tokens function changes how adjacent parts of the string are tokenized in subtle ways (so far observed with DistilBERT, BERT, and RoBERTa). For DistilBERT and BERT, this might depend on do_basic_tokenize being set to False when creating the tokenizer, at least in the examples I've found. (This might be related to the issue reported in #11531, but that one specifically mentions T5.) See the code below for details.

This doesn't seem like intended behavior from what I can tell in the documentation, but it's possible I'm misunderstanding the right way to add new tokens to get the behavior I'd like. (Currently, to get the expected behavior, I've had to manually modify the vocab file (plus the merges file for RoBERTa) with additional scripting and load the tokenizer from the modified files. If it'd be of use, I could post the code for that workaround here, but I've left it out for now since it's a bit long and may not be relevant.)

Steps to reproduce the behavior:

(Distil)BERT:

from transformers import DistilBertTokenizer, BertTokenizer

new_word = 'mynewword'

# BERT
bt = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bt.tokenize('mynewword') # verify the new word doesn't yet exist
# ['my', '##ne', '##w', '##word']

bt.tokenize('testing.')
# ['testing', '##.'] (note that the period is tokenized as '##.')

bt.add_tokens(new_word)
bt.tokenize('mynewword') # verify the new token now exists
# ['mynewword']

bt.tokenize('mynewword.')
# ['mynewword', '.'] (note that the period is tokenized as '.' rather than the expected '##.')

# DistilBERT
dbt = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', do_basic_tokenize=False)
dbt.tokenize('mynewword')
# ['my', '##ne', '##w', '##word']

dbt.tokenize('testing.')
# ['testing', '##.']

dbt.add_tokens(new_word)
dbt.tokenize('mynewword')
# ['mynewword']

dbt.tokenize('mynewword.')
# ['mynewword', '.'] (expected: ['mynewword', '##.'])

RoBERTa:

from transformers import RobertaTokenizer

new_word = 'mynewword'
rt = RobertaTokenizer.from_pretrained('roberta-base')

rt.tokenize('mynewword') # verify the new word doesn't yet exist
# ['my', 'new', 'word']

rt.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa'] (note that the final token includes a preceding 'Ġ')

rt.add_tokens(new_word)
rt.tokenize('mynewword') # verify the new token was added
# ['mynewword']

rt.tokenize('A mynewword a')
# ['A', 'mynewword', 'a'] (note that the final token lacks a 'Ġ')

Expected behavior

Adding a token to a tokenizer should not affect tokenization of adjacent elements (when these are not part of the added token).

@montellasebastien

Environment

  • Python version: 3.6.13
  • Platform: Ubuntu 18.04.5 LTS
  • PyTorch version (GPU?): 1.10.1 (True)
  • transformers version: 3.3.1
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Hello!

I have noticed the same behavior with transformers (v3.3.1) and the BartTokenizer: tokenization of some existing words changes after adding new tokens, and the Ġ prefix disappears as well.

@SaulLu
Contributor

SaulLu commented Feb 22, 2022

Thank you very much for the detailed issue. Unfortunately, it seems to us that there is no simple way to add tokens in the way you describe.

Currently, added tokens are not added to the vocabulary of the tokenization model (here WordPiece) but are carved out at the very beginning of tokenization, no matter which tokenization model is used afterwards. To put it simply, if you added the 'mynewword' token and then asked to tokenize the example "This is a example with mynewwords token inside", the first thing your tokenizer does is pre-tokenize it into ["This is a example with ", "mynewword", "s token inside"], and only then is the tokenization model applied to "This is a example with " and "s token inside".
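
To make that split concrete, here is a minimal sketch (a plain regular expression, not the library's actual implementation) that reproduces the described pre-tokenization:

import re

added_tokens = ['mynewword']
text = 'This is a example with mynewwords token inside'

# The added token is carved out of the raw text first; everything around it
# is handed to the tokenization model as independent fragments.
pattern = '(' + '|'.join(re.escape(t) for t in added_tokens) + ')'
fragments = [frag for frag in re.split(pattern, text) if frag]
print(fragments)
# ['This is a example with ', 'mynewword', 's token inside']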

However, if you see how an easy solution could be implemented, we would be happy to discuss it!

@egsotic

egsotic commented Mar 8, 2023

Hi
I've encountered a similar problem specifically when adding a new token t_new which is a prefix of an existing token t_old.
This is very counterintuitive because (1) t_old is in the vocabulary but is now tokenized into multiple sub-words, and (2) these sub-words are not actually word pieces but are treated as completely different tokens (as you mentioned, this happens in the pre-tokenization phase).

Putting aside a solution, maybe a warning message should be added in tokenizer.add_tokens when tokenization changes for some in-vocabulary word?
This could be a simple and quite general way of letting the user know that something could go "weird" when tokenizing an in-vocabulary word.

Example code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(...)
new_tokens = [...]

# Record how every in-vocabulary word is tokenized before adding the new tokens.
vocab_before_add = list(tokenizer.vocab)
vocab_tokenization_before_add = [tuple(tokenizer.tokenize(w)) for w in vocab_before_add]

tokenizer.add_tokens(new_tokens)

# Re-tokenize the same words and collect those whose tokenization changed.
vocab_tokenization_after_add = [tuple(tokenizer.tokenize(w)) for w in vocab_before_add]

in_vocab_tokens_changed = [
    (w, before, after)
    for w, before, after in zip(vocab_before_add, vocab_tokenization_before_add, vocab_tokenization_after_add)
    if before != after
]

Thanks!

@SaulLu
Contributor

SaulLu commented Mar 8, 2023

cc @ArthurZucker

@ArthurZucker
Collaborator

Hey! Thanks for reporting. I'll have a look when I can.

@ArthurZucker
Collaborator

Regarding the original issue as well as the second one, it appears that a specific parameter exists to prevent the tokenizer from matching the new token in the middle of words: single_word ("Whether this token must be a single word or can break words"). However, after testing a bit, this parameter does not seem to take effect when set, which indeed greatly changes the behavior.

Also, regarding the spaces before and after, the rstrip and lstrip parameters are supposed to control the surrounding whitespace.
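
For reference, these flags are passed via AddedToken when adding the token; a minimal sketch of what that looks like (with the caveat above that single_word may not take effect as expected):

from transformers import AutoTokenizer, AddedToken

tok = AutoTokenizer.from_pretrained('bert-base-uncased')
# single_word=True is meant to stop the new token from matching inside a
# longer word; lstrip/rstrip control how surrounding whitespace is handled.
tok.add_tokens(AddedToken('mynewword', single_word=True, lstrip=False, rstrip=False))
print(tok.tokenize('mynewwords'))  # per the report above, may still split out 'mynewword'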

@tlapusan

Hi @ArthurZucker,

I'm using the latest version of transformers (4.37.2), but I'm still seeing the same odd behaviour that @mawilson1234 described in the initial comment.

I see that PR #23909 is merged into master and should already solve this. Do you have any thoughts? Thanks.

@mawilson1234, did you manage to make it work? Thanks also :)

Right now I'm reading another issue: google-research/bert#396. The suggested solution there is to replace [unused#] tokens in the vocabulary. I made a little experiment, changed a few [unused] tokens, and it seems to work correctly.
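
A rough sketch of that [unused#] approach, assuming a local copy of the bert-base-uncased vocab file at a hypothetical path (not the exact code of the experiment):

from transformers import BertTokenizer

vocab_path = 'vocab.txt'  # hypothetical local copy of the bert-base-uncased vocab

with open(vocab_path, encoding='utf-8') as f:
    vocab_lines = f.read().splitlines()

# Reuse one of the [unused#] slots so the new word becomes a real vocab entry
# instead of an "added token" handled during pre-tokenization.
vocab_lines[vocab_lines.index('[unused0]')] = 'mynewword'

with open(vocab_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(vocab_lines) + '\n')

bt = BertTokenizer(vocab_file=vocab_path, do_basic_tokenize=False)
print(bt.tokenize('mynewword.'))  # expected ['mynewword', '##.'], since '##.' is in the vocab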

@ArthurZucker
Collaborator

Do you have a reproducer, @tlapusan?

@tlapusan

Sure, I just created a Colab notebook: https://colab.research.google.com/drive/1fnn9gZgjI-UJdkfp5tAPwF06NGhs4pGc?usp=sharing

It's basically the same code that @mawilson1234 wrote in his initial comment.

@ArthurZucker
Collaborator

It seems to me that the slow and fast tokenizers do not give the same behaviour: the fast one will not add ## in front of the '.' (print(bt.tokenize('testing.')) does not add the ## with AutoTokenizer). If you look at the implementation of WordPiece: if mynewword were part of the vocab, it would add the ## to the subword '.'. But since the custom word is processed differently, the only string that reaches WordPiece is '.'. That is problematic indeed.
Would you like to open a PR to potentially better support added tokens?
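
For anyone following along, a simplified sketch of greedy WordPiece matching (not the actual transformers implementation) showing where the ## would come from if mynewword were part of the model's vocab:

def wordpiece(word, vocab):
    # Greedy longest-match-first: continuation pieces get a '##' prefix.
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = '##' + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return ['[UNK]']
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece('mynewword.', {'mynewword', '##.'}))
# ['mynewword', '##.'] -- but with add_tokens, only '.' ever reaches WordPiece,
# so it is looked up as a word start and comes out as '.' instead of '##.'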

@ArthurZucker
Collaborator

(Just for BERT, as it is a special case: tokenize could be changed a bit to add ## to the next sub-text if the previous one was processed by the added vocab.)

@tlapusan

But since the custom word is processed differently, the only string that reaches WordPiece is '.'. That is problematic indeed.

Could you please add more details here? Is it somehow related to the pre-tokenization step that @SaulLu also mentioned in a previous comment?

The pre-tokenization output looks the same with and without the new 'mynewword' added to the vocabulary. It seems to me that 'mynewword' will therefore be taken into account by the WordPiece tokenizer.

tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("This is a example with mynewword. token inside")
[('This', (0, 4)),
 ('is', (5, 7)),
 ('a', (8, 9)),
 ('example', (10, 17)),
 ('with', (18, 22)),
 ('mynewword', (23, 32)),
 ('.', (32, 33)),
 ('token', (34, 39)),
 ('inside', (40, 46))]


@ArthurZucker
Collaborator

The issue is not with the fast tokenizer, if I am not mistaken. My comment is valid for the slow implementation, not for the tokenizers-based one!
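
A quick way to compare the two implementations side by side (a sketch; exact outputs depend on the transformers version):

from transformers import BertTokenizer, BertTokenizerFast

slow = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
fast = BertTokenizerFast.from_pretrained('bert-base-uncased')

for tok in (slow, fast):
    tok.add_tokens('mynewword')
    # Compare how the period after the added token is handled in each backend.
    print(type(tok).__name__, tok.tokenize('mynewword.'))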
