Adding Llama FastTokenizer support. (huggingface#22264)
* Adding Llama FastTokenizer support.

- Requires huggingface/tokenizers#1183 version
- Only support byte_fallback for llama, raise otherwise (safety net).
- Lots of questions are special tokens

How to test:

```python
from transformers.convert_slow_tokenizer import convert_slow_tokenizer
from transformers import AutoTokenizer
from tokenizers import Tokenizer

tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b")

if False:
    new_tokenizer = Tokenizer.from_file("tok.json")
else:
    new_tokenizer = convert_slow_tokenizer(tokenizer)
    new_tokenizer.save("tok.json")

strings = [
    "This is a test",
    "生活的真谛是",
    "生活的真谛是[MASK]。",
    # XXX: This one is problematic because of special tokens
    # "<s> Something something",
]

for string in strings:
    encoded = tokenizer(string)["input_ids"]
    encoded2 = new_tokenizer.encode(string).ids

    assert encoded == encoded2, f"{encoded} != {encoded2}"

    decoded = tokenizer.decode(encoded)
    decoded2 = new_tokenizer.decode(encoded2)

    assert decoded.strip() == decoded2, f"{repr(decoded)} != {repr(decoded2)}"
```

The converter + some test script.

The test script.

Tmp save.

Adding Fast tokenizer + tests.

Adding the tokenization tests.

Correct combination.

Small fix.

Fixing tests.

Fixing with latest update.

Rebased.

fix copies + normalized added tokens + copies.

Adding doc.

TMP.

Doc + split files.

Doc.

Versions + try import.

Fix Camembert + warnings -> Error.

Fix by ArthurZucker.

Not a decorator.

* Fixing comments.

* Adding more to docstring.

* Doc rewriting.
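The byte-fallback behaviour this commit depends on can also be sanity-checked directly on the converted tokenizer. The sketch below is illustrative only: it reuses the `huggingface/llama-7b` name from the test script above (substitute whatever local Llama checkpoint you actually have) and relies on SentencePiece byte fallback surfacing out-of-vocabulary characters as byte pieces rather than `<unk>`.

```python
from transformers import AutoTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

# Checkpoint name taken from the test script above; replace with any local
# Llama checkpoint you have access to.
slow = AutoTokenizer.from_pretrained("huggingface/llama-7b", use_fast=False)

# convert_slow_tokenizer returns a `tokenizers.Tokenizer`. With byte fallback
# enabled, characters missing from the vocabulary are encoded as byte pieces
# of the form "<0x..>" instead of collapsing to "<unk>".
fast = convert_slow_tokenizer(slow)
encoding = fast.encode("生活的真谛是")
print(encoding.tokens)
assert "<unk>" not in encoding.tokens
```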
Showing 11 changed files with 266 additions and 26 deletions.
@@ -0,0 +1,82 @@

````python
# coding=utf-8
# Copyright 2020 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ...tokenization_utils_fast import PreTrainedTokenizerFast
from ...utils.versions import require_version


require_version("tokenizers>=0.13.3")

class LlamaTokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.

    This uses notably ByteFallback and no normalization.

    ```
    from transformers import LlamaTokenizerFast

    tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
    tokenizer.encode("Hello this is a test")
    >>> [1, 15043, 445, 338, 263, 1243]
    ```

    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
    refer to this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            [SentencePiece](https://github.com/google/sentencepiece) file (generally has a .model extension) that
            contains the vocabulary necessary to instantiate a tokenizer.
        tokenizer_file (`str`):
            [tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that
            contains everything needed to load the tokenizer.
        clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
            Whether to clean up spaces after decoding; cleanup consists of removing potential artifacts like extra
            spaces.
        bos_token (`str`, *optional*, defaults to `"<s>"`):
            The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier
            token.
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.
        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
    """

    padding_side = "left"

    def __init__(
        self,
        vocab_file=None,
        tokenizer_file=None,
        clean_up_tokenization_spaces=False,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        **kwargs,
    ):
        super().__init__(
            vocab_file=vocab_file,
            tokenizer_file=tokenizer_file,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
            **kwargs,
        )
````