
[Bis] Adding new tokens while preserving tokenization of adjacent tokens #25225

Closed
2 of 4 tasks
Madjakul opened this issue Aug 1, 2023 · 1 comment


Madjakul commented Aug 1, 2023

System Info

  • transformers version: 4.31
  • Platform: Linux [...] 5.19.0-50-generic 50-Ubuntu x86_64 GNU/Linux
  • Python version: 3.10.12
  • Huggingface_hub version: 0.16.4
  • PyTorch version (GPU?): 2.0.1+cu118 (True)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

This issue is related to this HuggingFace post on the official forum (hence the similar title); to my knowledge, no answer was given there as to whether this is the intended tokenizer behavior. I ran into the same problem as the original poster when tokenizing a sentence after adding new tokens: the tokens adjacent to the newly added ones are no longer produced with their preceding escape symbol (Ġ).

>>> import transformers
>>> tok = transformers.RobertaTokenizer.from_pretrained("roberta-base")
>>> lotr_sent = 'Aragorn told Frodo to mind Lothlorien'
>>> tok.convert_ids_to_tokens(tok(lotr_sent)['input_ids'])
['<s>', 'Ar', 'ag', 'orn', 'Ġtold', 'ĠFro', 'do', 'Ġto', 'Ġmind', 'ĠL', 'oth', 'lor', 'ien', '</s>']
>>> tok.add_tokens(['Aragorn', 'Frodo', 'Lothlorien'])
3
>>> tok.convert_ids_to_tokens(tok(lotr_sent)['input_ids'])
['<s>', 'Aragorn', 'told', 'Frodo', 'to', 'Ġmind', 'Lothlorien', '</s>']
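For what it's worth, here is a minimal, simplified sketch of the mechanism as I understand it (`tokenize_with_added` and `tokenize_fragment` are hypothetical stand-ins, not the actual transformers implementation): added tokens are matched and split out of the text first, and each remaining fragment is then tokenized on its own with its surrounding spaces stripped, which is where the leading-space marker gets lost:

```python
import re

def tokenize_fragment(fragment):
    # Hypothetical stand-in for the real BPE step: mark every word except
    # the first with the "Ġ" space marker used by RoBERTa's byte-level BPE.
    return [("Ġ" + w if i > 0 else w)
            for i, w in enumerate(fragment.split(" ")) if w]

def tokenize_with_added(text, added_tokens):
    # Added tokens are matched first and the text is split around them.
    # Each remaining fragment is then tokenized independently, with its
    # surrounding spaces stripped, so the first word of a fragment loses
    # the leading space that would have become a "Ġ".
    pattern = "(" + "|".join(map(re.escape, added_tokens)) + ")"
    tokens = []
    for piece in re.split(pattern, text):
        if piece in added_tokens:
            tokens.append(piece)
        elif piece.strip():
            tokens.extend(tokenize_fragment(piece.strip()))
    return tokens

print(tokenize_with_added("Aragorn told Frodo to mind Lothlorien",
                          ["Aragorn", "Frodo", "Lothlorien"]))
# ['Aragorn', 'told', 'Frodo', 'to', 'Ġmind', 'Lothlorien']
```

Note that this toy model reproduces the observed output exactly: only `mind`, which is not the first word of its fragment, keeps its `Ġ`.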

Expected behavior

The tokens told, Frodo, to and Lothlorien should be preceded by a Ġ character, if I am not mistaken; e.g.:

>>> import transformers
>>> tok = transformers.RobertaTokenizer.from_pretrained("roberta-base")
>>> lotr_sent = 'Aragorn told Frodo to mind Lothlorien'
>>> tok.convert_ids_to_tokens(tok(lotr_sent)['input_ids'])
['<s>', 'Ar', 'ag', 'orn', 'Ġtold', 'ĠFro', 'do', 'Ġto', 'Ġmind', 'ĠL', 'oth', 'lor', 'ien', '</s>']
>>> tok.add_tokens(['Aragorn', 'Frodo', 'Lothlorien'])
3
>>> tok.convert_ids_to_tokens(tok(lotr_sent)['input_ids'])
['<s>', 'Aragorn', 'Ġtold', 'ĠFrodo', 'Ġto', 'Ġmind', 'ĠLothlorien', '</s>']
ArthurZucker (Collaborator) commented:

Hey! This has already been answered, and is a duplicate of #14770. Will be fixed by #23909.

@Madjakul Madjakul closed this as completed Aug 1, 2023