Adding new tokens to various models changes tokenization of adjacent elements in strings #14770

Closed
mawilson1234 opened this issue Dec 14, 2021 · 13 comments · Fixed by #23909

Comments

@mawilson1234

mawilson1234 commented Dec 14, 2021

Environment info

  • transformers version: 4.13.0
  • Platform: Windows-10-10.0.19043-SP0
  • Python version: 3.9.7
  • PyTorch version (GPU?): 1.10.0+cu113 (True)
  • Tensorflow version (GPU?): 2.7.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@LysandreJik @SaulLu

Information

Models I am using: DistilBERT, BERT, RoBERTa

The problem arises when using:

  • my own modified scripts: (see below)

The task I am working on is:

  • my own task or dataset: (see below)

To reproduce

Adding a new token to a tokenizer using the add_tokens function changes how adjacent parts of the string are tokenized in subtle ways (so far observed with DistilBERT, BERT, and RoBERTa). For DistilBERT and BERT, this might depend on do_basic_tokenize being set to False when creating the tokenizer, at least in the examples I've found. (This might be related to the issue reported in #11531, but that one specifically mentions T5.) See the code below for details.

This doesn't seem like intended behavior from what I can tell in the documentation, but it's possible I'm misunderstanding the right way to add new tokens to get the behavior I'd like. (Currently, to get the expected behavior, I've had to manually modify the vocab file (plus the merges file for RoBERTa) with additional scripting and load the tokenizer from the modified files. If it'd be of use, I could post the code for that workaround here, but I've left it out for now since it's a bit long and may not be relevant.)

Steps to reproduce the behavior:

(Distil)BERT:

from transformers import DistilBertTokenizer, BertTokenizer

new_word = 'mynewword'

# BERT
bt = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bt.tokenize('mynewword') # verify the new word doesn't yet exist
# ['my', '##ne', '##w', '##word']

bt.tokenize('testing.')
# ['testing', '##.'] (note that the period is tokenized as '##.')

bt.add_tokens(new_word)
bt.tokenize('mynewword') # verify the new token now exists
# ['mynewword']

bt.tokenize('mynewword.')
# ['mynewword', '.'] (note that the period is tokenized as '.' rather than the expected '##.')

# DistilBERT
dbt = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', do_basic_tokenize=False)
dbt.tokenize('mynewword')
# ['my', '##ne', '##w', '##word']

dbt.tokenize('testing.')
# ['testing', '##.']

dbt.add_tokens(new_word)
dbt.tokenize('mynewword')
# ['mynewword']

dbt.tokenize('mynewword.')
# ['mynewword', '.'] (expected: ['mynewword', '##.'])

RoBERTa:

from transformers import RobertaTokenizer

new_word = 'mynewword'
rt = RobertaTokenizer.from_pretrained('roberta-base')

rt.tokenize('mynewword') # verify the new word doesn't yet exist
# ['my', 'new', 'word']

rt.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa'] (note that the final token includes a preceding 'Ġ')

rt.add_tokens(new_word)
rt.tokenize('mynewword') # verify the new token was added
# ['mynewword']

rt.tokenize('A mynewword a')
# ['A', 'mynewword', 'a'] (note that the final token lacks a 'Ġ')

Expected behavior

Adding a token to a tokenizer should not affect tokenization of adjacent elements (when these are not part of the added token).

@montellasebastien

Environment

  • Python version: 3.6.13
  • Platform: Ubuntu 18.04.5 LTS
  • PyTorch version (GPU?): 1.10.1 (True)
  • transformers version: 3.3.1
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Hello!

I have noticed the same behavior with transformers (v3.3.1) and the BartTokenizer: tokenization of some existing words changes after adding new tokens, and the Ġ prefix disappears as well.

@SaulLu
Contributor

SaulLu commented Feb 22, 2022

Thank you very much for the detailed issue. Unfortunately, it seems to us that there is no simple way to add tokens in the way you describe.

Currently, added tokens are not added to the vocabulary of the tokenization model (here WordPiece) but are carved out at the very beginning of tokenization, no matter which tokenization model is used afterwards. To put it simply, if you added the 'mynewword' token and then asked to tokenize the example "This is a example with mynewwords token inside", the first thing your tokenizer does is pre-tokenize it into ["This is a example with ", "mynewword", "s token inside"], and only then is the tokenization model applied to "This is a example with " and "s token inside".
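
To make that split concrete, here is a minimal sketch (a plain regular expression, not the library's actual implementation) that reproduces the described pre-tokenization:

import re

added_tokens = ['mynewword']
text = 'This is a example with mynewwords token inside'

# The added token is carved out of the raw text first; everything around it
# is handed to the tokenization model as independent fragments.
pattern = '(' + '|'.join(re.escape(t) for t in added_tokens) + ')'
fragments = [frag for frag in re.split(pattern, text) if frag]
print(fragments)
# ['This is a example with ', 'mynewword', 's token inside']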

However, if you see how an easy solution could be implemented, we would be happy to discuss it!

@egsotic

egsotic commented Mar 8, 2023

Hi
I've encountered a similar problem specifically when adding a new token t_new which is a prefix of an existing token t_old.
This is very counterintuitive because (1) t_old is in the vocabulary but is now tokenized into multiple sub-words, and (2) these sub-words are not actually word pieces but are treated as completely different tokens (as you mentioned, this happens in the pre-tokenization phase).

Putting aside a solution, maybe a warning message should be added in tokenizer.add_tokens when tokenization changes for some in-vocabulary word?
This could be a simple and quite general way of letting the user know that something could go "weird" when tokenizing an in-vocabulary word.

Example code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(...)
new_tokens = [...]

# Record how every in-vocabulary word is tokenized before adding the new tokens.
vocab_before_add = list(tokenizer.vocab)
vocab_tokenization_before_add = [tuple(tokenizer.tokenize(w)) for w in vocab_before_add]

tokenizer.add_tokens(new_tokens)

# Re-tokenize the same words and collect those whose tokenization changed.
vocab_tokenization_after_add = [tuple(tokenizer.tokenize(w)) for w in vocab_before_add]

in_vocab_tokens_changed = [
    (w, before, after)
    for w, before, after in zip(vocab_before_add, vocab_tokenization_before_add, vocab_tokenization_after_add)
    if before != after
]

Thanks!

@SaulLu
Contributor

SaulLu commented Mar 8, 2023

cc @ArthurZucker

@ArthurZucker
Collaborator

Hey! Thanks for reporting. I'll have a look when I can.

@ArthurZucker
Collaborator

Regarding the original issue as well as the second one, it appears that a specific parameter exists to prevent the tokenizer from matching the new token in the middle of words: single_word ("Whether this token must be a single word or can break words"). However, after testing a bit, this parameter does not seem to take effect when set, which indeed greatly changes the behavior.

Also, regarding the spaces before and after, the rstrip and lstrip parameters are supposed to control the surrounding whitespace.
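
For reference, these flags are passed via AddedToken when adding the token; a minimal sketch of what that looks like (with the caveat above that single_word may not take effect as expected):

from transformers import AutoTokenizer, AddedToken

tok = AutoTokenizer.from_pretrained('bert-base-uncased')
# single_word=True is meant to stop the new token from matching inside a
# longer word; lstrip/rstrip control how surrounding whitespace is handled.
tok.add_tokens(AddedToken('mynewword', single_word=True, lstrip=False, rstrip=False))
print(tok.tokenize('mynewwords'))  # per the report above, may still split out 'mynewword'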

@tlapusan

Hi @ArthurZucker,

I'm using the latest version of transformers (4.37.2), but I'm still seeing the same odd behaviour that @mawilson1234 described in the initial comment.

I see that PR #23909 is merged into master and should already solve this. Do you have any thoughts? Thanks.

@mawilson1234, did you manage to make it work? Thanks also :)

Right now I'm reading another issue: google-research/bert#396. The suggested solution there is to replace [unused#] tokens in the vocabulary. I made a little experiment, changed a few [unused] tokens, and it seems to work correctly.
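
A rough sketch of that [unused#] approach, assuming a local copy of the bert-base-uncased vocab file at a hypothetical path (not the exact code of the experiment):

from transformers import BertTokenizer

vocab_path = 'vocab.txt'  # hypothetical local copy of the bert-base-uncased vocab

with open(vocab_path, encoding='utf-8') as f:
    vocab_lines = f.read().splitlines()

# Reuse one of the [unused#] slots so the new word becomes a real vocab entry
# instead of an "added token" handled during pre-tokenization.
vocab_lines[vocab_lines.index('[unused0]')] = 'mynewword'

with open(vocab_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(vocab_lines) + '\n')

bt = BertTokenizer(vocab_file=vocab_path, do_basic_tokenize=False)
print(bt.tokenize('mynewword.'))  # expected ['mynewword', '##.'], since '##.' is in the vocab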

@ArthurZucker
Collaborator

Do you have a reproducer, @tlapusan?

@tlapusan

Sure, I just created a Colab notebook: https://colab.research.google.com/drive/1fnn9gZgjI-UJdkfp5tAPwF06NGhs4pGc?usp=sharing

It's basically the same code that @mawilson1234 wrote in his initial comment.

@ArthurZucker
Collaborator

It seems to me that the slow and fast tokenizers do not give the same behaviour: the fast one will not add ## in front of the '.' (print(bt.tokenize('testing.')) does not add the ## with AutoTokenizer). If you look at the implementation of WordPiece: if mynewword were part of the vocab, it would add the ## to the subword '.'. But since the custom word is processed differently, the only string that reaches WordPiece is '.'. That is problematic indeed.
Would you like to open a PR to potentially better support added tokens?
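
For anyone following along, a simplified sketch of greedy WordPiece matching (not the actual transformers implementation) showing where the ## would come from if mynewword were part of the model's vocab:

def wordpiece(word, vocab):
    # Greedy longest-match-first: continuation pieces get a '##' prefix.
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = '##' + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return ['[UNK]']
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece('mynewword.', {'mynewword', '##.'}))
# ['mynewword', '##.'] -- but with add_tokens, only '.' ever reaches WordPiece,
# so it is looked up as a word start and comes out as '.' instead of '##.'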

@ArthurZucker
Collaborator

(Just for BERT, as it is a special case: tokenize could be changed a bit to add ## to the next sub-text if the previous one was processed by the added vocab.)

@tlapusan

But since the custom word is processed differently, the only string that reaches WordPiece is '.'. That is problematic indeed.

Could you please add more details here? Is it somehow related to the pre-tokenization step that @SaulLu also mentioned in a previous comment?

The pre-tokenization output looks the same with and without the new 'mynewword' added to the vocabulary. It seems to me that 'mynewword' will therefore be taken into account by the WordPiece tokenizer.

tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("This is a example with mynewword. token inside")
[('This', (0, 4)),
 ('is', (5, 7)),
 ('a', (8, 9)),
 ('example', (10, 17)),
 ('with', (18, 22)),
 ('mynewword', (23, 32)),
 ('.', (32, 33)),
 ('token', (34, 39)),
 ('inside', (40, 46))]


@ArthurZucker
Collaborator

The issue is not with the fast tokenizer, if I am not mistaken. My comment is valid for the slow implementation, not for the tokenizers-based one!
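
A quick way to compare the two implementations side by side (a sketch; exact outputs depend on the transformers version):

from transformers import BertTokenizer, BertTokenizerFast

slow = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
fast = BertTokenizerFast.from_pretrained('bert-base-uncased')

for tok in (slow, fast):
    tok.add_tokens('mynewword')
    # Compare how the period after the added token is handled in each backend.
    print(type(tok).__name__, tok.tokenize('mynewword.'))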
