Adding Llama FastTokenizer support. #22264
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks for the ping. We'll need the actual fast tokenizer file to merge this though 😅
True, I uncovered more issues around multiple-space handling; I'm nailing down the pre_tokenizer combination for it.
More troublesome than anticipated. When encoding, I thought of fixing that using a … However, on encoding …
Nicely done! 😄 I have to take care of a few things on the slow side and then it should be done.
# Options to consider in order to implement:
# - Change `add_prefix_space` to ternary: False, True, "force", "force" being
#   the new version which always prefixes
# - Add a new extra pre_tokenizer which doesn't pretokenize but does this job.
- Add a new extra pre_tokenizer which doesn't pretokenize but does this job.
Since this was added, we don't need that comment anymore, no?
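For context, a minimal sketch of the kind of wiring being discussed, assuming a `tokenizers` release (>= 0.13.3) that ships a `Prepend` normalizer; this is an illustration, not the exact converter code from the PR:

```python
from tokenizers import normalizers

# Always prepend the metaspace marker, then map spaces to it, so the prefix
# space is handled without a dedicated pre_tokenizer.
normalizer = normalizers.Sequence(
    [
        normalizers.Prepend("▁"),       # always prefix, the "force" behaviour discussed above
        normalizers.Replace(" ", "▁"),  # spm-style space handling
    ]
)
```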
# These are known differences
self.assertEqual(pyth_tokenizer.decode([30112, 869]), "ا.")
# XXX Extra space
# self.assertEqual(rust_tokenizer._tokenizer.decode([30112, 869]), "ا .")
self.assertEqual(rust_tokenizer.decode([30112, 869]), "ا.")
Should go away with `clean_up_tokenization_spaces`, see #22341.
(flagging to take care of this test if this is merged first)
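As a rough illustration of the cleanup being referred to (the token ids and the repo id are taken from the examples in this PR and may not resolve for everyone):

```python
from transformers import AutoTokenizer

# Decode with and without tokenization-space cleanup; the extra-space
# difference flagged in the test above should disappear with cleanup enabled.
tok = AutoTokenizer.from_pretrained("huggingface/llama-7b", use_fast=True)
raw = tok.decode([30112, 869], clean_up_tokenization_spaces=False)
clean = tok.decode([30112, 869], clean_up_tokenization_spaces=True)
print(repr(raw), repr(clean))
```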
What is the status of this PR?
Force-pushed from 3f20703 to 3819151.
setup.py (outdated)
@@ -176,7 +176,7 @@
     "tf2onnx",
     "timeout-decorator",
     "timm",
-    "tokenizers>=0.11.1,!=0.11.3,<0.14",
+    "tokenizers==0.13.3rc1",
Will need to be changed to a minimum pin.
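For illustration, a hypothetical minimum pin (the exact bounds are an assumption, to be decided once a final `tokenizers` release is out, not the merged values):

```python
# Hypothetical replacement for the setup.py dependency entry discussed above.
deps = [
    "tokenizers>=0.13.3,<0.14",  # minimum pin instead of ==0.13.3rc1
]
```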
piece_score = vocab_scores.get(merge, None)
if piece_score:
    merges += [(piece_l, piece_r, piece_score)]
merges = sorted(merges, key=lambda val: val[2], reverse=reverse)
This needs to be in its own PR with a flag for breaking change.
It's not breaking anymore.
It still has a very strong potential to be breaking as it touches code functionality, and it will be easier to isolate it in a git bisect if it goes in its own PR. So I insist.
You can just reopen the PR you closed and amend it with those changes.
Fair enough.
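For readers following along, a self-contained sketch of what the snippet above is part of: reconstructing a BPE merge list from spm-style vocab scores. The function name and toy vocab are illustrative, not the PR's exact code:

```python
# Rebuild a BPE merge list from an spm-style vocab: split every token into
# (left, right) pairs that are themselves in the vocab, keep the token's score,
# then order merges by score.
def rebuild_merges(vocab_scores, reverse=True):
    merges = []
    for merge, piece_score in vocab_scores.items():
        for i in range(1, len(merge)):
            piece_l, piece_r = merge[:i], merge[i:]
            if piece_l in vocab_scores and piece_r in vocab_scores:
                merges.append((piece_l, piece_r, piece_score))
    merges = sorted(merges, key=lambda val: val[2], reverse=reverse)
    return [(left, right) for left, right, _ in merges]

# Toy example with made-up scores.
print(rebuild_merges({"a": 0.0, "b": -1.0, "ab": -2.0, "abb": -3.0}))
```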
from ...tokenization_utils_fast import PreTrainedTokenizerFast


class LlamaTokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.
    """

    def __init__(
        self,
        *args,
        clean_up_tokenization_spaces=False,
        **kwargs,
    ):
        super().__init__(*args, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs)
Needs copyright, doc, etc.
# This is excruciatingly slow since it has to recreate the entire merge
# list from the original vocabulary in spm
Maybe we need a smaller tokenizer then?
Well, we could create a dummy one, but we're never going to be sure to have every argument the same.
This is supposed to be a sanity check that the conversion matches some static reference value. I'm not sure checking this conversion all the time is necessary, but it's a nice test to have if regressions ever happen.
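A rough sketch of the sanity check being described, based on the test script from the PR description (the file name and repo id are illustrative):

```python
# Convert the slow tokenizer once and compare against a saved static reference,
# so converter regressions show up without redoing the slow reconstruction
# of the merge list on every run.
from transformers import AutoTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer
from tokenizers import Tokenizer

slow = AutoTokenizer.from_pretrained("huggingface/llama-7b", use_fast=False)
converted = convert_slow_tokenizer(slow)      # slow: rebuilds merges from the spm vocab
reference = Tokenizer.from_file("tok.json")   # static reference produced earlier

sample = "This is a test"
assert converted.encode(sample).ids == reference.encode(sample).ids
```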
Thanks for isolating the change in the conversion in #22582; this PR will need to be rebased after it's merged.
There is still one comment open on building a smaller tokenizer for the tests if possible, and on fleshing out the fast tokenizer module.
@@ -0,0 +1,19 @@
from ...tokenization_utils_fast import PreTrainedTokenizerFast
Still missing copyright here.
Done.
class LlamaTokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.
Doc could be expanded here
Expanded.
        *args,
        clean_up_tokenization_spaces=False,
        **kwargs,
We usually show the args and at least the special tokens kwargs in the signature of those.
I did add the special tokens.
I have no idea what the args are supposed to be. `PreTrainedTokenizerFast` is also using `*args`.
Here is what XLNet does:
class XLNetTokenizerFast(PreTrainedTokenizerFast):
Loosely copied from there.
I removed the arguments we're not using and added `clean_up_tokenization_spaces`.
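For illustration, a sketch of the signature shape being discussed; the default special tokens and the exact argument list are assumptions loosely following the XLNet fast tokenizer, not the merged code:

```python
from transformers import PreTrainedTokenizerFast


class LlamaTokenizerFastSketch(PreTrainedTokenizerFast):
    # Expose the special tokens explicitly, keep *args for the underlying
    # PreTrainedTokenizerFast, and default clean_up_tokenization_spaces off.
    def __init__(
        self,
        *args,
        unk_token="<unk>",   # assumed Llama defaults, shown for illustration
        bos_token="<s>",
        eos_token="</s>",
        clean_up_tokenization_spaces=False,
        **kwargs,
    ):
        super().__init__(
            *args,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            **kwargs,
        )
```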
For the doc builder, we're going to need an update on the docker image so that it pulls 0.13.3 to generate the doc.
- Requires huggingface/tokenizers#1183 version
- Only support byte_fallback for llama, raise otherwise (safety net).
- Lots of open questions around special tokens

How to test:

```python
from transformers.convert_slow_tokenizer import convert_slow_tokenizer
from transformers import AutoTokenizer
from tokenizers import Tokenizer

tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b")

if False:
    new_tokenizer = Tokenizer.from_file("tok.json")
else:
    new_tokenizer = convert_slow_tokenizer(tokenizer)
    new_tokenizer.save("tok.json")

strings = [
    "This is a test",
    "生活的真谛是",
    "生活的真谛是[MASK]。",
    # XXX: This one is problematic because of special tokens
    # "<s> Something something",
]

for string in strings:
    encoded = tokenizer(string)["input_ids"]
    encoded2 = new_tokenizer.encode(string).ids
    assert encoded == encoded2, f"{encoded} != {encoded2}"

    decoded = tokenizer.decode(encoded)
    decoded2 = new_tokenizer.decode(encoded2)
    assert decoded.strip() == decoded2, f"{repr(decoded)} != {repr(decoded2)}"
```

Squashed commits: The converter + some test script. The test script. Tmp save. Adding Fast tokenizer + tests. Adding the tokenization tests. Correct combination. Small fix. Fixing tests. Fixing with latest update. Rebased. fix copies + normalized added tokens + copies. Adding doc. TMP. Doc + split files. Doc. Versions + try import. Fix Camembert + warnings -> Error. Fix by ArthurZucker. Not a decorator.
Force-pushed from 7e257e5 to ea90a99.
Good to go once all tests pass. Thanks!
Hi @Narsil, the … How is it possible to robustify it? -> also DeBERTa v3 has a byte fallback vocab (but I didn't test it yet) 🤔
> How is it possible to robustify it -> also DeBERTa v3 has a byte fallback vocab (but I didn't test it yet) 🤔

First of all, we could revert by all means, but as of now it's a relatively sizeable issue if there are models deployed out there which have inconsistent behavior in this regard (slow using byte fallback, fast not using it), though. I'm not sure why it was a warning in the first place.
Let's have a look too. As a user, what's your opinion here: should we just fix the various conversion scripts, or would you rather keep the warning with the previous pitfalls?
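One possible way to detect the situation up front, sketched here under the assumption that the installed `sentencepiece` proto exposes the `byte_fallback` flag (the file path is illustrative, and this is not the converter's actual code):

```python
# Inspect the sentencepiece model proto to see whether byte fallback is enabled
# before deciding how (or whether) to convert to a fast tokenizer.
from sentencepiece import sentencepiece_model_pb2 as model_pb2

proto = model_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    proto.ParseFromString(f.read())

if proto.trainer_spec.byte_fallback:
    print("This spm model relies on byte fallback; the fast conversion must support it.")
```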
Both are using Unigram with ByteFallback, which isn't supported yet.
@Narsil After this commit …
Which repo are you using? We need to create the fast files on the repo. Converting from slow is super slow and there's nothing to be done about it (tokenizers needs to recreate a structure by doing an O(n²) search over the vocab because spm does not store this information).
I see, thanks!
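To avoid paying that conversion cost on every load, a hedged sketch of converting once and saving the fast files, assuming a transformers version that includes this PR (paths and repo ids are placeholders):

```python
# Convert once, then save the fast tokenizer files so subsequent loads can read
# tokenizer.json directly instead of redoing the slow spm -> BPE reconstruction.
from transformers import LlamaTokenizerFast

fast = LlamaTokenizerFast.from_pretrained("path/to/local-llama-checkpoint")  # triggers the slow conversion once
fast.save_pretrained("path/to/local-llama-checkpoint")                       # writes tokenizer.json next to tokenizer.model
# Optionally publish it so other users skip the conversion too:
# fast.push_to_hub("your-org/your-llama-repo")
```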
* Adding Llama FastTokenizer support (commit message as above).
* Fixing comments.
* Adding more to docstring.
* Doc rewriting.
What does this PR do?
Adds Llama FastTokenizer support. Requires the tokenizers#1183 version. How to test: see the script in the commit message above.