
[Doc] add_special_tokens's documentation is ambiguous #22935

Closed
zplizzi opened this issue Apr 22, 2023 · 4 comments · Fixed by #23909

zplizzi commented Apr 22, 2023

System Info

  • transformers version: 4.28.1
  • Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.31
  • Python version: 3.9.5
  • Huggingface_hub version: 0.13.2
  • Safetensors version: not installed
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
print(tok.bos_token)
print(tok.eos_token)
print(tok.bos_token_id)
print(tok.eos_token_id)

print(tok("the dog walked", add_special_tokens=True))

outputs

<|endoftext|>
<|endoftext|>
0
0
{'input_ids': [783, 4370, 7428], 'attention_mask': [1, 1, 1]}

Expected behavior

I expect it to output [0, 783, 4370, 7428, 0]. Or am I misunderstanding what add_special_tokens is supposed to do?

@danielemurgolo

The add_special_tokens argument, when set to True, is used to add special tokens at the beginning and end of the input sequence. In your case, since you are passing a single input sequence, the tokenizer will add the special tokens [CLS] and [SEP] at the beginning and end of the sentence, respectively.

Note that not all tokenizers support adding special tokens; if a tokenizer does not, setting add_special_tokens to True will have no effect.

You are using the "EleutherAI/pythia-70m" tokenizer, which does not have specific tokens for [CLS] and [SEP]. These roles are represented by the bos_token and eos_token, respectively. Hence, the output you are seeing is correct and corresponds to the tokenized input sequence with the added special tokens.
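As a side note, one quick way to see which special tokens a given tokenizer actually defines is to inspect its special_tokens_map (a standard attribute on Hugging Face tokenizers); for this checkpoint I would expect only bos/eos/unk entries, all mapped to <|endoftext|>, and no cls_token or sep_token:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
# Expect something like:
# {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
print(tok.special_tokens_map)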

If you want to add [CLS] and [SEP] tokens to your input sequence using this tokenizer, you can do so by explicitly specifying the token IDs for these tokens, like this:

# Tokenize without any special tokens, then manually prepend BOS and append EOS
input_ids = tok.encode("the dog walked", add_special_tokens=False)
input_ids = [tok.bos_token_id] + input_ids + [tok.eos_token_id]
attention_mask = [1] * len(input_ids)
output = {"input_ids": input_ids, "attention_mask": attention_mask}
print(output)
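With the Pythia tokenizer from the reproduction above (where bos_token_id and eos_token_id are both 0), this should print something like {'input_ids': [0, 783, 4370, 7428, 0], 'attention_mask': [1, 1, 1, 1, 1]}, which matches the output the issue author expected.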

@zplizzi
Author

zplizzi commented May 1, 2023

Thanks for explaining. Can this behavior be added to the docs for the transformers tokenizer class? Nowhere in the API docs does it say that add_special_tokens=True will add the CLS and SEP tokens; one might reasonably assume that BOS and EOS are the tokens placed before and after a sequence!

@ArthurZucker
Collaborator

ArthurZucker commented May 25, 2023

You can also define these tokens when initialising the tokenizer, or afterwards: tokenizer.cls_token = "[CLS]" should work. I agree that the doc should be clearer. Thanks for reporting the confusion!
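A minimal sketch of what that could look like (this either reuses the existing <|endoftext|> token or registers a new [CLS] token; whether the tokenizer then inserts it during encoding still depends on its post-processing, so treat this as registering the token rather than as a guaranteed change to the encoded output):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

# Option 1: reuse an existing token (here the BOS/EOS token) as the CLS token
tok.cls_token = tok.bos_token

# Option 2: register a brand-new [CLS] token; this grows the vocabulary,
# so a model using this tokenizer would need its embeddings resized as well
tok.add_special_tokens({"cls_token": "[CLS]"})

print(tok.cls_token, tok.cls_token_id)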

ArthurZucker changed the title from "tokenizer" to "[Doc] add_special_tokens's documentation is ambiguous" on Jun 1, 2023
ArthurZucker self-assigned this on Jun 1, 2023
huggingface deleted a comment from the github-actions bot on Jun 27, 2023
@ArthurZucker
Collaborator

I am waiting until the added-tokens refactoring is finished to make sure this is fixed, and will update the doc then!
