
Mistral Tokenizer.decode() adds a space when use_fast=True #29452

Closed
HugeHeart opened this issue Mar 5, 2024 · 6 comments · Fixed by #29453

HugeHeart commented Mar 5, 2024

# transformers version is 4.38.2

from transformers import AutoTokenizer
model_path="mistralai/Mistral-7B-v0.1"
fast = AutoTokenizer.from_pretrained(model_path, use_fast=True)
slow = AutoTokenizer.from_pretrained(model_path, use_fast=False)
text = "hi"
print(f"fast tokenize={fast.encode(text)}")
print(f"slow tokenize={slow.encode(text)}")

# fast decode adds a space after <s>
print(f"fast decode={fast.decode(fast.encode(text))}")
print(f"slow decode={slow.decode(slow.encode(text))}")

output is

fast tokenize=[1, 12014]
slow tokenize=[1, 12014]
fast decode=<s> hi
slow decode=<s>hi

The fast and slow tokenizers produce the same encoding, but the fast one adds a space after "<s>" when decoding.

I also noticed issue huggingface/tokenizers#1448, where @ArthurZucker said to "use metaspace with prepend_scheme="first" and no normalizer". That already exists in transformers version 4.38.2 and doesn't seem to work.

Is there any useful info I missed? How can I remove the space after "<s>" when use_fast=True?
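As a stopgap while the tokenizer behavior is unchanged, one option (a sketch, not a transformers API) is to post-process the decoded string and strip the single space that appears right after a special token. The token list below is hard-coded for illustration; in practice it would come from tokenizer.all_special_tokens:

```python
import re

# Hard-coded for illustration; in practice use tokenizer.all_special_tokens.
SPECIAL_TOKENS = ["<s>", "</s>", "<unk>"]

def strip_space_after_specials(decoded: str) -> str:
    """Remove the single space inserted right after a special token."""
    pattern = "(" + "|".join(map(re.escape, SPECIAL_TOKENS)) + ") "
    return re.sub(pattern, r"\1", decoded)

print(strip_space_after_specials("<s> hi"))  # -> <s>hi
```

Note this blindly removes one space after every special token, so it would also eat a space that was genuinely part of the original text.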

Thank you to those who have contributed to the Transformers lib.

@ArthurZucker (Collaborator)

Hey! Thanks for reporting.
If you check this:

In [4]: slow.convert_ids_to_tokens([1, 12014])
Out[4]: ['<s>', '▁hi']

This issue emerges because convert_tokens_to_string is "wrong":

In [14]: slow.convert_tokens_to_string(['<s>', '▁hi'])
Out[14]: '<s>hi'
    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        # since we manually add the prefix space, we have to remove it when decoding
        if tokens[0].startswith(SPIECE_UNDERLINE) and self.add_prefix_space:
            tokens[0] = tokens[0][1:]

        current_sub_tokens = []
        out_string = ""
        prev_is_special = False
        for i, token in enumerate(tokens):
            # make sure that special tokens are not decoded using sentencepiece model
            if token in self.all_special_tokens:
                if not prev_is_special and i != 0 and self.legacy:
                    out_string += " "
                out_string += self.sp_model.decode(current_sub_tokens) + token
                prev_is_special = True
                current_sub_tokens = []
            else:
+               if prev_is_special and i==1 and self.add_prefix_space:
+                   out_string += " "
                current_sub_tokens.append(token)
                prev_is_special = False
        out_string += self.sp_model.decode(current_sub_tokens)
        return out_string

This should fix it and bring both implementations closer, while keeping:

In [4]: slow.convert_tokens_to_string(['▁hi'])
Out[4]: 'hi'
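To see why the added branch produces the extra space after "<s>", here is a self-contained sketch of the patched method. StubSP is a stand-in for the real sentencepiece model (it joins pieces, turns the metaspace "▁" into a space, and drops the leading space), so this illustrates the control flow rather than the actual transformers code:

```python
SPIECE_UNDERLINE = "▁"

class StubSP:
    """Stand-in for sentencepiece decode: join pieces, map the
    metaspace to a space, and strip the leading space."""
    def decode(self, pieces):
        return "".join(pieces).replace(SPIECE_UNDERLINE, " ").lstrip(" ")

def convert_tokens_to_string(tokens, all_special_tokens, sp_model,
                             add_prefix_space=True, legacy=True):
    # Mirrors the patched slow-tokenizer method above (a sketch).
    if tokens[0].startswith(SPIECE_UNDERLINE) and add_prefix_space:
        tokens = [tokens[0][1:]] + tokens[1:]  # drop the manual prefix space
    current_sub_tokens, out_string, prev_is_special = [], "", False
    for i, token in enumerate(tokens):
        # special tokens must not be decoded by the sentencepiece model
        if token in all_special_tokens:
            if not prev_is_special and i != 0 and legacy:
                out_string += " "
            out_string += sp_model.decode(current_sub_tokens) + token
            prev_is_special = True
            current_sub_tokens = []
        else:
            # the patched branch: re-insert the space right after a
            # leading special token
            if prev_is_special and i == 1 and add_prefix_space:
                out_string += " "
            current_sub_tokens.append(token)
            prev_is_special = False
    out_string += sp_model.decode(current_sub_tokens)
    return out_string

print(convert_tokens_to_string(["<s>", "▁hi"], ["<s>", "</s>"], StubSP()))  # -> <s> hi
print(convert_tokens_to_string(["▁hi"], ["<s>", "</s>"], StubSP()))         # -> hi
```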


HugeHeart commented Mar 7, 2024


Thanks for your reply!
When a text contains special tokens, I want the result of the tokenizer's encode and decode round trip to match the original text.
But even with the change you suggested, I still can't achieve that.

Here is an example:

from transformers import AutoTokenizer
model_path="mistralai/Mistral-7B-v0.1"
fast = AutoTokenizer.from_pretrained(model_path, use_fast=True, add_bos_token=False)
slow = AutoTokenizer.from_pretrained(model_path, use_fast=False, add_bos_token=False)
text = "<s>user: hi</s>assistant: hello</s>"
print(f"text={text}")

print(f"fast tokenize={fast.encode(text)}")
print(f"slow tokenize={slow.encode(text)}")

fast_decode_text = fast.decode(fast.encode(text))
slow_decode_text = slow.decode(slow.encode(text))
print(f"fast decode text={fast_decode_text}")
print(f"slow decode text={slow_decode_text}")

output:

text=<s>user: hi</s>assistant: hello</s>
fast tokenize=[1, 2188, 28747, 12014, 2, 13892, 28747, 6312, 28709, 2]
slow tokenize=[1, 2188, 28747, 12014, 2, 13892, 28747, 6312, 28709, 2]
fast decode text=<s> user: hi</s> assistant: hello</s>
slow decode text=<s>  user: hi</s> assistant: hello</s>

Both the fast and slow decoded texts add a space after special tokens.

Is there a parameter that prevents adding a space after special tokens?

@ArthurZucker (Collaborator)

That is expected 😉
The problem here is mentioned in #28881, which is a PR supposed to fix it.
You can fix this with:

slow = AutoTokenizer.from_pretrained(model_path, use_fast=False, add_bos_token=False, legacy=False)

@ArthurZucker (Collaborator)

check the doc for this argument 🤗

@HugeHeart (Author)


Thank you for your prompt response. But when use_fast=True, a space still appears after special tokens.

Here is an example:

from transformers import AutoTokenizer
model_path="mistralai/Mistral-7B-v0.1"
fast = AutoTokenizer.from_pretrained(model_path, use_fast=True, add_bos_token=False, legacy=False)
slow = AutoTokenizer.from_pretrained(model_path, use_fast=False, add_bos_token=False, legacy=False)
text = "<s>user: hi</s>assistant: hello</s>"
print(f"text={text}")

print(f"fast tokenize={fast.encode(text)}")
print(f"slow tokenize={slow.encode(text)}")

fast_decode_text = fast.decode(fast.encode(text))
slow_decode_text = slow.decode(slow.encode(text))
# a space appear between "</s>" and "assistant"
print(f"fast decode text={fast_decode_text}")
print(f"slow decode text={slow_decode_text}")

output is

text=<s>user: hi</s>assistant: hello</s>
fast tokenize=[1, 2188, 28747, 12014, 2, 13892, 28747, 6312, 28709, 2]
slow tokenize=[1, 1838, 28747, 12014, 2, 489, 11143, 28747, 6312, 28709, 2]
fast decode text=<s> user: hi</s> assistant: hello</s>
slow decode text=<s> user: hi</s>assistant: hello</s>

Also, I found that the fast and slow encodings differ when legacy=False. Which one does mistralai/Mistral-7B-v0.1 use?
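To pin down where the two encodings diverge, a tiny helper (not part of transformers) can compare the token-id lists from the output above:

```python
def first_divergence(a, b):
    """Return the index where two sequences (token-id lists or
    strings) first differ, or None if they are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

# Token ids from the output above (legacy=False run).
fast_ids = [1, 2188, 28747, 12014, 2, 13892, 28747, 6312, 28709, 2]
slow_ids = [1, 1838, 28747, 12014, 2, 489, 11143, 28747, 6312, 28709, 2]
print(first_divergence(fast_ids, slow_ids))  # -> 1
```

Here the lists already differ at index 1, i.e. the very first token after "<s>", which matches the prefix-space handling being the culprit.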

@ArthurZucker (Collaborator)

Yes, as I said, this is being fixed by #28881, but it is not in main yet.
Mistral uses legacy=True by default; it was trained with it.
