
[Doc] add_special_tokens's documentation is ambiguous #22935

Closed
zplizzi opened this issue Apr 22, 2023 · 4 comments · Fixed by #23909

zplizzi commented Apr 22, 2023

System Info

  • transformers version: 4.28.1
  • Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.31
  • Python version: 3.9.5
  • Huggingface_hub version: 0.13.2
  • Safetensors version: not installed
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
print(tok.bos_token)
print(tok.eos_token)
print(tok.bos_token_id)
print(tok.eos_token_id)

print(tok("the dog walked", add_special_tokens=True))

outputs

<|endoftext|>
<|endoftext|>
0
0
{'input_ids': [783, 4370, 7428], 'attention_mask': [1, 1, 1]}

Expected behavior

I expect it to output [0, 783, 4370, 7428, 0]. Or am I misunderstanding what add_special_tokens is supposed to do?

@danielemurgolo

The add_special_tokens argument, when set to True, is used to add special tokens at the beginning and end of the input sequence. In your case, since you are passing a single input sequence, the tokenizer will add the special tokens [CLS] and [SEP] at the beginning and end of the sentence, respectively.

Note that not all tokenizers support adding special tokens; if a tokenizer does not, setting add_special_tokens to True will have no effect.

You are using the "EleutherAI/pythia-70m" tokenizer, which does not have specific tokens for [CLS] and [SEP]. These roles are represented by the bos_token and eos_token, respectively. Hence, the output you are seeing is correct and corresponds to the tokenized input sequence with the added special tokens.
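As a side note, one quick way to see which special tokens a given tokenizer actually defines is to inspect its special_tokens_map (a standard attribute on Hugging Face tokenizers); for this checkpoint I would expect only bos/eos/unk entries, all mapped to <|endoftext|>, and no cls_token or sep_token:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
# Expect something like:
# {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
print(tok.special_tokens_map)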

If you want to add [CLS] and [SEP] tokens to your input sequence using this tokenizer, you can do so by explicitly specifying the token IDs for these tokens, like this:

# Tokenize without any special tokens, then manually prepend BOS and append EOS
input_ids = tok.encode("the dog walked", add_special_tokens=False)
input_ids = [tok.bos_token_id] + input_ids + [tok.eos_token_id]
attention_mask = [1] * len(input_ids)
output = {"input_ids": input_ids, "attention_mask": attention_mask}
print(output)
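With the Pythia tokenizer from the reproduction above (where bos_token_id and eos_token_id are both 0), this should print something like {'input_ids': [0, 783, 4370, 7428, 0], 'attention_mask': [1, 1, 1, 1, 1]}, which matches the output the issue author expected.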

@zplizzi
Author

zplizzi commented May 1, 2023

Thanks for explaining. Can this behavior be added to the docs for the transformers tokenizer class? Nowhere in the API docs does it say that add_special_tokens=True will add the CLS and SEP tokens; one might reasonably assume that BOS and EOS are the tokens placed before and after a sequence!

@ArthurZucker
Collaborator

ArthurZucker commented May 25, 2023

You can also define these tokens when initialising the tokenizer, or afterwards: tokenizer.cls_token = "[CLS]" should work. I agree that the doc should be clearer. Thanks for reporting the confusion!
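A minimal sketch of what that could look like (this either reuses the existing <|endoftext|> token or registers a new [CLS] token; whether the tokenizer then inserts it during encoding still depends on its post-processing, so treat this as registering the token rather than as a guaranteed change to the encoded output):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

# Option 1: reuse an existing token (here the BOS/EOS token) as the CLS token
tok.cls_token = tok.bos_token

# Option 2: register a brand-new [CLS] token; this grows the vocabulary,
# so a model using this tokenizer would need its embeddings resized as well
tok.add_special_tokens({"cls_token": "[CLS]"})

print(tok.cls_token, tok.cls_token_id)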

ArthurZucker changed the title from "tokenizer" to "[Doc] add_special_tokens's documentation is ambiguous" on Jun 1, 2023
ArthurZucker self-assigned this on Jun 1, 2023
huggingface deleted a comment from the github-actions bot on Jun 27, 2023
@ArthurZucker
Collaborator

I am waiting until the added-tokens refactoring is finished to make sure this is fixed, and will update the doc then!
