Adding Llama FastTokenizer support. #22264

Merged: 4 commits into huggingface:main from fast_llama_tokenizer, Apr 6, 2023

Conversation

@Narsil (Contributor) commented Mar 20, 2023

How to test:

#! pip install -e git+https://github.com/huggingface/tokenizers@byte_fallback#egg=tokenizers

from transformers.convert_slow_tokenizer import convert_slow_tokenizer
from transformers import AutoTokenizer
from tokenizers import Tokenizer

tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b")

if False:
    new_tokenizer = Tokenizer.from_file("tok.json")
else:
    new_tokenizer = convert_slow_tokenizer(tokenizer)
    new_tokenizer.save("tok.json")

strings = [
    "This is a test",
    "生活的真谛是",
    "生活的真谛是[MASK]。",
    # XXX: This one is problematic because of special tokens
    # "<s> Something something",
]

for string in strings:
    encoded = tokenizer(string)["input_ids"]
    encoded2 = new_tokenizer.encode(string).ids

    assert encoded == encoded2, f"{encoded} != {encoded2}"

    decoded = tokenizer.decode(encoded)
    decoded2 = new_tokenizer.decode(encoded2)

    assert decoded.strip() == decoded2, f"{repr(decoded)} != {repr(decoded2)}"

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev commented Mar 20, 2023

The documentation is not available anymore as the PR was closed or merged.

@Narsil Narsil changed the title from "Adding Llama FastTokenizer support." to "[WIP] Adding Llama FastTokenizer support." on Mar 20, 2023
@Narsil Narsil marked this pull request as draft March 20, 2023 11:17
@sgugger (Collaborator) commented Mar 20, 2023

Thanks for the ping. We'll need the actual fast tokenizer file to merge this though 😅

@Narsil (Contributor, Author) commented Mar 20, 2023

True. I uncovered more issues around multiple-space handling; I'm nailing down the right pre_tokenizer combination for it.

@Narsil (Contributor, Author) commented Mar 21, 2023

More troublesome than anticipated.

When encoding " Hello" from a pure BPE perspective, tokenizers produces [259, 10994] (" " + "Hello"),
whereas spm produces [29871, 15043] (" " + " Hello"), which from a pure ids & merges perspective seems worse.

I thought of fixing that with a pre_tokenizer that splits each word into its own piece.

However, when encoding " ird", spm this time DOES produce [259, 1823].
It seems this is where the scores come into play.
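A quick way to see the discrepancy side by side (a rough sketch, assuming the slow checkpoint and the converted tok.json from the script above are available; exact ids depend on the vocabulary):

```python
from transformers import AutoTokenizer
from tokenizers import Tokenizer

slow = AutoTokenizer.from_pretrained("huggingface/llama-7b")  # spm-backed slow tokenizer
fast = Tokenizer.from_file("tok.json")                        # converted fast tokenizer

for text in [" Hello", " ird"]:
    # Compare raw ids without special tokens to isolate the BPE vs spm behavior.
    slow_ids = slow(text, add_special_tokens=False)["input_ids"]
    fast_ids = fast.encode(text, add_special_tokens=False).ids
    print(repr(text), slow_ids, fast_ids)
```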

@ArthurZucker (Collaborator) left a comment

Nicely done! 😄 I have to take care of a few things on the slow side and should be done

src/transformers/convert_slow_tokenizer.py (resolved)
# Options to consider in order to implement:
# - Change `add_prefix_space` to ternary, False, True, "force", "force" being
# the new version which always prefixes
# - Add a new extra pre_tokenizer which doesn't pretokenize but does this job.
Collaborator:

  • Add a new extra pre_tokenizer which doesn't pretokenize but does this job.
    Since this was added, we don't need that comment anymore, no?

Comment on lines 123 to 127
# These are known differences
self.assertEqual(pyth_tokenizer.decode([30112, 869]), "ا.")
# XXX Extra space
# self.assertEqual(rust_tokenizer._tokenizer.decode([30112, 869]), "ا .")
self.assertEqual(rust_tokenizer.decode([30112, 869]), "ا.")
Collaborator:

Should go away with the cleanup_tokenization_space changes in #22341.

Collaborator:

(flagging to take care of this test if this is merged first)

tests/models/llama/test_tokenization_llama.py (resolved)
@OlivierDehaene (Member) commented:

What is the status of this PR?

@Narsil Narsil force-pushed the fast_llama_tokenizer branch from 3f20703 to 3819151 on April 4, 2023 10:04
@Narsil Narsil changed the title from "[WIP] Adding Llama FastTokenizer support." to "Adding Llama FastTokenizer support." on Apr 4, 2023
@Narsil Narsil marked this pull request as ready for review April 4, 2023 10:04
setup.py (Outdated)
@@ -176,7 +176,7 @@
     "tf2onnx",
     "timeout-decorator",
     "timm",
-    "tokenizers>=0.11.1,!=0.11.3,<0.14",
+    "tokenizers==0.13.3rc1",
Collaborator:

Will need to be changed to a minimum pin.

Comment on lines 55 to 62
piece_score = vocab_scores.get(merge, None)
if piece_score:
    merges += [(piece_l, piece_r, piece_score)]
merges = sorted(merges, key=lambda val: val[2], reverse=reverse)
Collaborator:

This needs to be in its own PR, flagged as a breaking change.

Contributor (Author):

It's not breaking anymore.

Collaborator:

It still has a very strong potential to be breaking as it touches code functionality, and it will be easier to isolate it in a git bisect if it goes in its own PR. So I insist.
You can just reopen the PR you closed and amend it with those changes.

Contributor (Author):

Fair enough.

Comment on lines 1 to 15
from ...tokenization_utils_fast import PreTrainedTokenizerFast


class LlamaTokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.
    """

    def __init__(
        self,
        *args,
        clean_up_tokenization_spaces=False,
        **kwargs,
    ):
        super().__init__(*args, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs)
Collaborator:

Needs copyright, doc, etc.

Comment on lines +320 to +321
# This is excruciatingly slow since it has to recreate the entire merge
# list from the original vocabulary in spm
Collaborator:

Maybe we need a smaller tokenizer then?

Contributor (Author):

Well, we could create a dummy one, but then we're never going to be sure we have every argument exactly the same.

This is supposed to be a sanity check that the conversion matches some static reference value. I'm not sure checking this conversion all the time is necessary, but it's a nice test to have if regressions ever happen.

@sgugger (Collaborator) left a comment

Thanks for isolating the change in the conversion in #22582, this PR will need to be rebased after it's merged.

Still one comment on building a smaller tokenizer for the tests if possible, and on fleshing out the fast tokenizer module.

@@ -0,0 +1,19 @@
from ...tokenization_utils_fast import PreTrainedTokenizerFast
Collaborator:

Still missing copyright here.

Contributor (Author):

Done.


class LlamaTokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.
Collaborator:

Doc could be expanded here

Contributor (Author):

Expanded.

Comment on lines 15 to 44
        *args,
        clean_up_tokenization_spaces=False,
        **kwargs,
Collaborator:

We usually show the args and at least the special tokens kwargs in the signature of those.

Contributor (Author):

I did add the special tokens.

I have no idea what the args are supposed to be. PreTrainedTokenizerFast is also using *args.

Collaborator:

Here is what XLNet does:

class XLNetTokenizerFast(PreTrainedTokenizerFast):

Contributor (Author):

Loosely copied from there.
I removed the arguments we're not using and added clean_up_tokenization_spaces.

@Narsil (Contributor, Author) commented Apr 5, 2023

For the doc builder, we're going to need an update to the docker image so that it pulls tokenizers 0.13.3 to generate the doc.

Narsil added 2 commits April 5, 2023 16:07
- Requires huggingface/tokenizers#1183 version
- Only support byte_fallback for llama, raise otherwise (safety net).
- Lots of open questions around special tokens

The converter + some test script.

The test script.

Tmp save.

Adding Fast tokenizer + tests.

Adding the tokenization tests.

Correct combination.

Small fix.

Fixing tests.

Fixing with latest update.

Rebased.

fix copies + normalized added tokens  + copies.

Adding doc.

TMP.

Doc + split files.

Doc.

Versions + try import.

Fix Camembert + warnings -> Error.

Fix by ArthurZucker.

Not a decorator.
@Narsil Narsil force-pushed the fast_llama_tokenizer branch from 7e257e5 to ea90a99 on April 5, 2023 14:15
@sgugger (Collaborator) left a comment

Good to go once all tests pass. Thanks!

@Narsil Narsil merged commit 1670be4 into huggingface:main Apr 6, 2023
@Narsil Narsil deleted the fast_llama_tokenizer branch April 6, 2023 07:53
@stefan-it (Collaborator) commented Apr 6, 2023

Hi @Narsil,

the warnings.warn to raise RuntimeError change in src/transformers/convert_slow_tokenizer.py breaks a lot of things: I wanted to fine-tune an mT5 model and it is now no longer possible (I'm using the PyTorch example from the documentation).

How is it possible to robustify it? Also, DeBERTa v3 has a byte-fallback vocab (but I didn't test it yet) 🤔

@Narsil (Contributor, Author) commented Apr 6, 2023

> Hi @Narsil,
>
> the warnings.warn to raise RuntimeError change in src/transformers/convert_slow_tokenizer.py breaks a lot of things: I wanted to fine-tune an mT5 model and it is now no longer possible (I'm using the PyTorch example from the documentation).
>
> How is it possible to robustify it? Also, DeBERTa v3 has a byte-fallback vocab (but I didn't test it yet) 🤔

First of all, we could revert by all means, but since tokenizers now has ByteFallback we could make it 1-to-1 for those; that was the idea behind turning the warning into an error.

It's a relatively sizeable issue if there are models deployed out there which have inconsistent behavior regarding this though (slow using byte fallback, fast not using it). I'm not sure why it was a warning in the first place.

> DeBERTa v3

Let's have a look too.

As a user, what's your opinion here: should we just fix the various conversion scripts, or would you rather keep the warning, with the previous pitfalls?

@Narsil (Contributor, Author) commented Apr 6, 2023

Both are using Unigram with ByteFallback, which isn't supported yet.

@fxmarty (Contributor) commented Apr 7, 2023

@Narsil After this commit, AutoTokenizer.from_pretrained is extremely slow, spending time in convert_slow_tokenizer.py at every call. Is this expected? Or am I doing something wrong?

@Narsil (Contributor, Author) commented Apr 7, 2023

Which repo are you using? We need to create the fast files on the repo.

Converting from slow is super slow and there's nothing to be done about it (tokenizers needs to recreate a structure by doing an O(n²) search over the vocab, because spm does not store this information).
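
For intuition, here is a rough sketch of why that reconstruction is quadratic (it mirrors the shape of the conversion loop but is not the exact converter code; vocab_scores is assumed to map each spm token to its score):

```python
def reconstruct_merges(vocab_scores, reverse=True):
    # spm stores only tokens and scores, not the merge list BPE needs, so every
    # pair of vocabulary entries has to be tried: O(n^2) lookups over the vocab.
    merges = []
    for piece_l in vocab_scores:
        for piece_r in vocab_scores:
            merge = piece_l + piece_r
            piece_score = vocab_scores.get(merge, None)
            if piece_score:
                merges += [(piece_l, piece_r, piece_score)]
    # Higher-scored merges come first so they are applied first.
    merges = sorted(merges, key=lambda val: val[2], reverse=reverse)
    return [(piece_l, piece_r) for piece_l, piece_r, _ in merges]
```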

@Narsil (Contributor, Author) commented Apr 7, 2023

@ArthurZucker

@fxmarty (Contributor) commented Apr 7, 2023

I see, thanks!
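
One way to avoid paying the conversion cost on every call is to convert once and save the fast files locally (a sketch, assuming write access to a local directory; after saving, tokenizer.json is reused and the slow conversion is skipped):

```python
from transformers import AutoTokenizer

# First load triggers the slow spm -> fast conversion once.
tok = AutoTokenizer.from_pretrained("huggingface/llama-7b", use_fast=True)

# Persist the converted tokenizer (writes tokenizer.json next to the other files).
tok.save_pretrained("./llama-7b-fast-tokenizer")

# Later loads from the saved directory pick up tokenizer.json directly.
tok = AutoTokenizer.from_pretrained("./llama-7b-fast-tokenizer")
```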

novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
* Adding Llama FastTokenizer support.

* Fixing comments.

* Adding more to docstring.

* Doc rewriting.