Work on the BPE tokenizer #3252

Merged
merged 35 commits into master from falcon-tokenizer on Oct 3, 2023
Commits (35)
bfaab6f
Work on the BPE tokenizer
goerch Sep 18, 2023
89e74c6
Try to fix build problem
goerch Sep 18, 2023
7770423
Fix debug assertion failure
goerch Sep 18, 2023
37cf135
Fix MSVC Unicode BOM problem
goerch Sep 18, 2023
91a527a
Cleanup and an improvement
goerch Sep 18, 2023
208d3d7
Fix compiler warning
goerch Sep 18, 2023
407f76d
Cleanup
goerch Sep 18, 2023
311fcf1
Test doesn't work over the full range of Unicodes
goerch Sep 18, 2023
c85cb29
Update .gitignore and Makefile
goerch Sep 18, 2023
048e659
Another Makefile rule
goerch Sep 18, 2023
c0990bb
Testing Aquila
goerch Sep 19, 2023
1b7c369
Moving byte decoding back to `token_to_piece` ...
goerch Sep 19, 2023
a4e9448
Guarding some unusable code pathes
goerch Sep 19, 2023
17ca832
Streamlining code and adding some more assertions
goerch Sep 19, 2023
4abbfb5
Adding a comment
goerch Sep 19, 2023
59a30b7
Adding another assertion
goerch Sep 19, 2023
a6070b7
Fixed vocabulary guarding assertions
goerch Sep 19, 2023
16c06fe
Merge branch 'master' into falcon-tokenizer
goerch Sep 29, 2023
c09330e
Fix PR for recent change
goerch Sep 29, 2023
9cfb714
Fix PR for recent change
goerch Sep 29, 2023
607e3bf
Fix for compiler warning
goerch Sep 29, 2023
fad8a77
Fix PR for recent change
goerch Sep 29, 2023
6a16c36
Fix PR for recent change
goerch Sep 29, 2023
a2ddaad
Fix PR for recent change
goerch Sep 29, 2023
3fa8c55
Fix for compiler warning
goerch Sep 29, 2023
d6d7d0f
Fixes for more compiler warnings
goerch Sep 29, 2023
37af613
Remove unused code
goerch Oct 1, 2023
2117e23
Fix initialization of static maps
goerch Oct 1, 2023
28778f8
Add scores and token types back, adapt gptneox
goerch Oct 2, 2023
a9a2af9
Update llama.cpp
goerch Oct 2, 2023
dccd1db
Update unicode.h
goerch Oct 2, 2023
02b9ccf
Update unicode.h
goerch Oct 2, 2023
3d162cc
Ported Starcoder and added some assertions
goerch Oct 2, 2023
5aee498
Fix coding style
goerch Oct 2, 2023
3e518e2
Apply @jploski 's fix for missing tokens
goerch Oct 3, 2023
Files changed
3 changes: 2 additions & 1 deletion .gitignore
@@ -90,4 +90,5 @@ tests/test-quantize-perf
tests/test-sampling
tests/test-tokenizer-0-llama
tests/test-tokenizer-0-falcon
tests/test-tokenizer-1
tests/test-tokenizer-1-llama
tests/test-tokenizer-1-bpe
9 changes: 7 additions & 2 deletions Makefile
@@ -2,7 +2,7 @@
BUILD_TARGETS = main quantize quantize-stats perplexity embedding vdot q8dot train-text-from-scratch convert-llama2c-to-ggml simple batched save-load-state server embd-input-test gguf llama-bench baby-llama beam-search speculative benchmark-matmult parallel finetune export-lora tests/test-c.o

# Binaries only useful for tests
TEST_TARGETS = tests/test-llama-grammar tests/test-grammar-parser tests/test-double-float tests/test-grad0 tests/test-opt tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0-llama tests/test-tokenizer-0-falcon tests/test-tokenizer-1-llama
TEST_TARGETS = tests/test-llama-grammar tests/test-grammar-parser tests/test-double-float tests/test-grad0 tests/test-opt tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0-llama tests/test-tokenizer-0-falcon tests/test-tokenizer-1-llama tests/test-tokenizer-1-bpe

# Code coverage output files
COV_TARGETS = *.gcno tests/*.gcno *.gcda tests/*.gcda *.gcov tests/*.gcov lcov-report gcovr-report
@@ -62,9 +62,11 @@ test: $(TEST_TARGETS)
if [ "$$test_target" = "tests/test-tokenizer-0-llama" ]; then \
./$$test_target $(CURDIR)/models/ggml-vocab-llama.gguf; \
elif [ "$$test_target" = "tests/test-tokenizer-0-falcon" ]; then \
continue; \
./$$test_target $(CURDIR)/models/ggml-vocab-falcon.gguf; \
elif [ "$$test_target" = "tests/test-tokenizer-1-llama" ]; then \
continue; \
elif [ "$$test_target" = "tests/test-tokenizer-1-bpe" ]; then \
continue; \
else \
echo "Running test $$test_target..."; \
./$$test_target; \
@@ -667,6 +669,9 @@ tests/test-tokenizer-0-falcon: tests/test-tokenizer-0-falcon.cpp build-info.h gg
tests/test-tokenizer-0-llama: tests/test-tokenizer-0-llama.cpp build-info.h ggml.o llama.o common.o $(OBJS)
$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

tests/test-tokenizer-1-bpe: tests/test-tokenizer-1-bpe.cpp build-info.h ggml.o llama.o common.o $(OBJS)
$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

tests/test-tokenizer-1-llama: tests/test-tokenizer-1-llama.cpp build-info.h ggml.o llama.o common.o $(OBJS)
$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

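For context, the new tests/test-tokenizer-1-bpe binary is built by the rule above but skipped in the generic `make test` loop, presumably because (like the tokenizer-0 tests) it needs a vocab model argument. As a rough mental model only — the real test is C++ against the llama.cpp API — a tokenizer-1 style check asserts that every token id round-trips through detokenize/tokenize. A minimal Python sketch, where `tokenize` and `detokenize` are hypothetical stand-ins for the actual API calls:

```python
# Sketch only: tokenizer-1 style round-trip check with hypothetical callables.
def check_round_trip(tokenize, detokenize, vocab_size: int) -> None:
    for token_id in range(vocab_size):
        piece = detokenize([token_id])   # text produced by this single token
        reencoded = tokenize(piece)      # may come back as several ids
        # decoding the re-encoded ids must reproduce the same text
        assert detokenize(reencoded) == piece, f"round trip failed for id {token_id}"
```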
1 change: 1 addition & 0 deletions common/common.cpp
@@ -921,6 +921,7 @@ std::string llama_detokenize_bpe(llama_context * ctx, const std::vector<llama_to
result += piece;
}

// NOTE: the original tokenizer decodes bytes after collecting the pieces.
return result;
}

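The added NOTE refers to the GPT-2 style byte-to-unicode mapping: the reference tokenizer first concatenates the unicode-mapped pieces and only then maps each character back to its raw byte. Below is a small Python illustration of that final decode step, using the `bytes_to_unicode` helper from transformers (the same mapping the conversion scripts used to embed); it is not the llama.cpp code, which per the commit "Moving byte decoding back to `token_to_piece`" now decodes bytes per piece instead, so `llama_detokenize_bpe` can simply concatenate.

```python
# Illustration of GPT-2 style byte-level decoding (not the llama.cpp implementation):
# pieces are strings over the bytes_to_unicode() alphabet; decoding maps each
# character back to its original byte and then interprets the bytes as UTF-8.
from transformers.models.gpt2 import tokenization_gpt2

byte_decoder = {ch: b for b, ch in tokenization_gpt2.bytes_to_unicode().items()}

def decode_pieces(pieces: list[str]) -> str:
    text = "".join(pieces)                        # e.g. ["Ġhello"] -> "Ġhello"
    raw = bytes(byte_decoder[c] for c in text)    # "Ġ" maps back to the space byte 0x20
    return raw.decode("utf-8", errors="replace")
```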
47 changes: 7 additions & 40 deletions convert-falcon-hf-to-gguf.py
@@ -20,28 +20,6 @@
import gguf


def bytes_to_unicode():
# ref: https://github.com/openai/gpt-2/blob/master/src/encoder.py
"""
Returns list of utf-8 byte and a corresponding list of unicode strings.
The reversible bpe codes work on unicode strings.
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
This is a significant percentage of your normal, say, 32K bpe vocab.
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
And avoids mapping to whitespace/control characters the bpe code barfs on.
"""
bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
cs = bs[:]
n = 0
for b in range(2**8):
if b not in bs:
bs.append(b)
cs.append(2**8+n)
n += 1
return dict(zip(bs, (chr(n) for n in cs)))


def count_model_parts(dir_model: Path) -> int:
num_parts = 0
for filename in os.listdir(dir_model):
@@ -133,6 +111,8 @@ def parse_args() -> argparse.Namespace:
print("gguf: get tokenizer metadata")

tokens: list[bytearray] = []
scores: list[float] = []
toktypes: list[int] = []

tokenizer_json_file = dir_model / 'tokenizer.json'
if not tokenizer_json_file.is_file():
@@ -155,28 +135,15 @@ def parse_args() -> argparse.Namespace:
tokenizer = AutoTokenizer.from_pretrained(dir_model)

reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.vocab.items()}
byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

for i in range(vocab_size):
if i in reverse_vocab:
try:
text = bytearray([byte_decoder[c] for c in reverse_vocab[i]])
except KeyError:
text = bytearray()
for c in reverse_vocab[i]:
if ord(c) < 256: # single byte character
text.append(byte_decoder[ord(c)])
else: # multibyte special token character
text.extend(c.encode('utf-8'))
else:
print(f"Key {i} not in tokenizer vocabulary. Padding with an arbitrary token.")
pad_token = f"[PAD{i}]".encode("utf8")
text = bytearray(pad_token)

tokens.append(text)
tokens.append(reverse_vocab[i])
scores.append(0.0) # dummy
toktypes.append(gguf.TokenType.NORMAL)
Comment on lines +140 to +142

Collaborator: This doesn't work as-is, because scores and toktypes were removed from this file in a previous PR. Also, won't this throw KeyError now if the model's vocab_size is larger than reverse_vocab?

Collaborator: And in order to satisfy the needs specified in #3405, we will need to at least provide a way (via toktypes) to differentiate between added tokens (tokenizer.get_added_vocab()) and normal tokens (NORMAL vs USER_DEFINED?). I would be glad to make a PR if you think that's the right way to go about this.

Collaborator Author (goerch):

> This doesn't work as-is, because scores and toktypes were removed from this file in a previous PR. Also, won't this throw KeyError now if the model's vocab_size is larger than reverse_vocab?

Until now I've only thought about the one-to-one correspondence between piece and token id in llama.cpp. It would be surprising to me if this correspondence were violated in the original model. For Falcon and Aquila we seem to be good here.

Collaborator Author (goerch):

> And in order to satisfy the needs specified in #3405, we will need to at least provide a way (via toktypes) to differentiate between added tokens (tokenizer.get_added_vocab()) and normal tokens (NORMAL vs USER_DEFINED?). I would be glad to make a PR if you think that's the right way to go about this.

I wrote about my current understanding of added tokens here. In short: all added tokens have looked like CONTROL tokens to me so far (maybe because I have yet to see sentencepiece's USER_DEFINED tokens in the wild).

Collaborator Author (goerch):

> This doesn't work as-is, because scores and toktypes were removed from this file in a previous PR.

One more remark: as far as I remember, scores are needed for BPE merges. Token types NORMAL and CONTROL are used in this PR.

Collaborator: The scores and toktypes were only removed because they were placeholder values; the GGUF spec says they are optional. If they are used for something meaningful in the future they can certainly be added back.

Should these lines be removed if they do not apply to Falcon? They necessarily imply that vocab_size > len(reverse_vocab) may be true.

# The number of tokens in tokenizer.json can differ from the expected vocab size.
# This causes downstream issues with mismatched tensor sizes when running the inference
vocab_size = hparams["vocab_size"] if "vocab_size" in hparams else len(tokenizer_json["model"]["vocab"])

Collaborator Author (goerch):

> Should these lines be removed if they do not apply to Falcon? They necessarily imply that vocab_size > len(reverse_vocab) may be true.

I removed unused code from the conversion scripts but would not touch that code until we agree that tokenizer.json/tokenizer.model should be considered the source of truth for the tokenizers. Furthermore, I'm not sure about our strategy for consolidating the conversion scripts (I'm thinking about moving tokenizer support to gguf.py in a later PR, for example).

Collaborator Author (goerch):

> This doesn't work as-is, because scores and toktypes were removed from this file in a previous PR.

Oops, I missed the impact of this completely when merging main; tried to repair it just now.

gguf_writer.add_token_list(tokens)
gguf_writer.add_token_scores(scores)
gguf_writer.add_token_types(toktypes)

special_vocab = gguf.SpecialVocab(dir_model, load_merges = True)
special_vocab.add_to_gguf(gguf_writer)
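Regarding the review thread above about vocab_size padding and the #3405 request to distinguish added tokens: the following is a hedged sketch, not part of this PR, of what such a guard could look like in the conversion loop. The names `tokenizer`, `reverse_vocab`, `vocab_size`, `tokens`, `scores`, `toktypes`, and `gguf` are the ones already used in this script; whether added tokens should end up as CONTROL or USER_DEFINED is still the open question.

```python
# Hypothetical variant of the conversion loop above (not what this PR merges):
# pad missing ids and tag added tokens differently from normal vocab entries.
added_vocab = set(tokenizer.get_added_vocab().keys())  # HF fast-tokenizer API

for i in range(vocab_size):
    if i not in reverse_vocab:
        # avoids the KeyError raised when vocab_size > len(reverse_vocab)
        tokens.append(f"[PAD{i}]")
        toktypes.append(gguf.TokenType.USER_DEFINED)
    elif reverse_vocab[i] in added_vocab:
        tokens.append(reverse_vocab[i])
        toktypes.append(gguf.TokenType.CONTROL)  # or USER_DEFINED, per the open question
    else:
        tokens.append(reverse_vocab[i])
        toktypes.append(gguf.TokenType.NORMAL)
    scores.append(0.0)  # dummy, as in the PR
```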
48 changes: 7 additions & 41 deletions convert-gptneox-hf-to-gguf.py
@@ -19,29 +19,6 @@
sys.path.insert(1, str(Path(__file__).parent / 'gguf-py' / 'gguf'))
import gguf

# ref: https://github.com/openai/gpt-2/blob/master/src/encoder.py


def bytes_to_unicode():
"""
Returns list of utf-8 byte and a corresponding list of unicode strings.
The reversible bpe codes work on unicode strings.
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
This is a significant percentage of your normal, say, 32K bpe vocab.
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
And avoids mapping to whitespace/control characters the bpe code barfs on.
"""
bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
cs = bs[:]
n = 0
for b in range(2**8):
if b not in bs:
bs.append(b)
cs.append(2**8+n)
n += 1
return dict(zip(bs, (chr(n) for n in cs)))


def count_model_parts(dir_model: Path) -> int:
num_parts = 0
@@ -130,6 +107,8 @@ def parse_args() -> argparse.Namespace:
print("gguf: get tokenizer metadata")

tokens: list[bytearray] = []
scores: list[float] = []
toktypes: list[int] = []

tokenizer_json_file = dir_model / 'tokenizer.json'
if not tokenizer_json_file.is_file():
@@ -150,28 +129,15 @@ def parse_args() -> argparse.Namespace:
tokenizer = AutoTokenizer.from_pretrained(dir_model)

reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.vocab.items()}
byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

for i in range(vocab_size):
if i in reverse_vocab:
try:
text = bytearray([byte_decoder[c] for c in reverse_vocab[i]])
except KeyError:
text = bytearray()
for c in reverse_vocab[i]:
if ord(c) < 256: # single byte character
text.append(byte_decoder[ord(c)])
else: # multibyte special token character
text.extend(c.encode('utf-8'))
else:
print(f"Key {i} not in tokenizer vocabulary. Padding with an arbitrary token.")
pad_token = f"[PAD{i}]".encode("utf8")
text = bytearray(pad_token)

tokens.append(text)
tokens.append(reverse_vocab[i] if i in reverse_vocab else f"[PAD{i}]")
scores.append(0.0) # dummy
toktypes.append(gguf.TokenType.NORMAL)

gguf_writer.add_token_list(tokens)
gguf_writer.add_token_scores(scores)
gguf_writer.add_token_types(toktypes)

special_vocab = gguf.SpecialVocab(dir_model, load_merges = True)
special_vocab.add_to_gguf(gguf_writer)
47 changes: 7 additions & 40 deletions convert-starcoder-hf-to-gguf.py
@@ -20,28 +20,6 @@
import gguf


def bytes_to_unicode():
# ref: https://github.com/openai/gpt-2/blob/master/src/encoder.py
"""
Returns list of utf-8 byte and a corresponding list of unicode strings.
The reversible bpe codes work on unicode strings.
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
This is a significant percentage of your normal, say, 32K bpe vocab.
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
And avoids mapping to whitespace/control characters the bpe code barfs on.
"""
bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
cs = bs[:]
n = 0
for b in range(2**8):
if b not in bs:
bs.append(b)
cs.append(2**8+n)
n += 1
return dict(zip(bs, (chr(n) for n in cs)))


def count_model_parts(dir_model: Path) -> int:
num_parts = 0
for filename in os.listdir(dir_model):
@@ -117,6 +95,8 @@ def parse_args() -> argparse.Namespace:
print("gguf: get tokenizer metadata")

tokens: list[bytearray] = []
scores: list[float] = []
toktypes: list[int] = []

tokenizer_json_file = dir_model / 'tokenizer.json'
if not tokenizer_json_file.is_file():
@@ -139,28 +119,15 @@ def parse_args() -> argparse.Namespace:
tokenizer = AutoTokenizer.from_pretrained(dir_model)

reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.vocab.items()}
byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

for i in range(vocab_size):
if i in reverse_vocab:
try:
text = bytearray([byte_decoder[c] for c in reverse_vocab[i]])
except KeyError:
text = bytearray()
for c in reverse_vocab[i]:
if ord(c) < 256: # single byte character
text.append(byte_decoder[ord(c)])
else: # multibyte special token character
text.extend(c.encode('utf-8'))
else:
print(f"Key {i} not in tokenizer vocabulary. Padding with an arbitrary token.")
pad_token = f"[PAD{i}]".encode("utf8")
text = bytearray(pad_token)

tokens.append(text)
tokens.append(reverse_vocab[i] if i in reverse_vocab else f"[PAD{i}]")
scores.append(0.0) # dummy
toktypes.append(gguf.TokenType.NORMAL)

gguf_writer.add_token_list(tokens)
gguf_writer.add_token_scores(scores)
gguf_writer.add_token_types(toktypes)

special_vocab = gguf.SpecialVocab(dir_model, load_merges = True)
special_vocab.add_to_gguf(gguf_writer)
24 changes: 5 additions & 19 deletions convert.py
@@ -339,29 +339,15 @@ def __init__(self, fname_tokenizer: Path, fname_added_tokens: Path | None) -> No
def bpe_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:
tokenizer = self.bpe_tokenizer
from transformers.models.gpt2 import tokenization_gpt2 # type: ignore[import]
byte_encoder = tokenization_gpt2.bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}
score = 0.0
for i, item in enumerate(tokenizer):
text: bytes = item.encode("utf-8")
# FIXME: These shouldn't be hardcoded, but it's probably better than the current behavior?
if i <= 258 and text.startswith(b'<') and text.endswith(b'>'):
if i == 0 and text == b'<unk>':
toktype = gguf.TokenType.UNKNOWN
elif i == 1 or i == 2:
toktype = gguf.TokenType.CONTROL
elif i >= 3 and text.startswith(b'<0x'):
toktype = gguf.TokenType.BYTE
else:
toktype = gguf.TokenType.NORMAL
else:
toktype = gguf.TokenType.NORMAL
yield text, score, toktype
reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.items()}

for i, _ in enumerate(tokenizer):
yield reverse_vocab[i], 0.0, gguf.TokenType.NORMAL

def added_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:
for text in self.added_tokens_list:
score = -1000.0
yield text.encode("utf-8"), score, gguf.TokenType.USER_DEFINED
yield text.encode("utf-8"), score, gguf.TokenType.CONTROL

def all_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:
yield from self.bpe_tokens()
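The rewritten BpeVocab.bpe_tokens now yields the unicode-mapped vocab strings in token-id order and leaves byte decoding to the runtime. A toy illustration of the id-ordered iteration the generator relies on; the three-entry vocab below is hypothetical data, shaped like tokenizer.json's "model"/"vocab" mapping:

```python
# Toy data only; a BPE tokenizer vocab maps unicode-mapped pieces to token ids like this.
toy_vocab = {"<|endoftext|>": 0, "Ġhello": 1, "Ġworld": 2}
reverse_vocab = {tok_id: tok for tok, tok_id in toy_vocab.items()}

# Mirrors the rewritten generator: one (text, score, toktype) triple per id, in id order.
entries = [(reverse_vocab[i], 0.0, "NORMAL") for i, _ in enumerate(toy_vocab)]
assert [text for text, _, _ in entries] == ["<|endoftext|>", "Ġhello", "Ġworld"]
```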