Work on the BPE tokenizer #3252
This doesn't work as-is, because scores and toktypes were removed from this file in a previous PR. Also, won't this now throw a KeyError if the model's vocab_size is larger than len(reverse_vocab)?
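For illustration, a minimal sketch of one way the conversion loop could guard against that KeyError, assuming reverse_vocab is built from the HF tokenizer's vocab and vocab_size comes from the model's hparams; the model path and the [PAD{i}] placeholder name are assumptions here, not what the script actually does:

```python
from transformers import AutoTokenizer  # assumed: HF tokenizer, as in the convert scripts

dir_model = "tiiuae/falcon-7b"                        # illustrative model path
tokenizer = AutoTokenizer.from_pretrained(dir_model)
reverse_vocab = {idx: piece for piece, idx in tokenizer.get_vocab().items()}

vocab_size = 65024  # illustrative value for hparams["vocab_size"]; may exceed len(reverse_vocab)

tokens = []
for i in range(vocab_size):
    if i in reverse_vocab:
        tokens.append(reverse_vocab[i].encode("utf-8"))
    else:
        # placeholder piece for ids the tokenizer does not define; the name is illustrative only
        tokens.append(f"[PAD{i}]".encode("utf-8"))
```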
And in order to satisfy the needs specified in #3405, we will need to at least provide a way (via toktypes) to differentiate between added tokens (tokenizer.get_added_vocab()) and normal tokens (NORMAL vs USER_DEFINED?). I would be glad to make a PR if you think that's the right way to go about this.
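For illustration only, a sketch of what that could look like, assuming the convert script keeps a toktypes list parallel to the token list; the integer values mirror llama.cpp's token-type enum (NORMAL = 1, USER_DEFINED = 4), and the model path is hypothetical:

```python
from transformers import AutoTokenizer  # assumed: HF tokenizer in the convert script

# values mirror llama.cpp's llama_token_type enum
TOKEN_TYPE_NORMAL = 1
TOKEN_TYPE_USER_DEFINED = 4

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")  # illustrative path
added_ids = set(tokenizer.get_added_vocab().values())          # ids of added tokens

reverse_vocab = {idx: piece for piece, idx in tokenizer.get_vocab().items()}
toktypes = [
    TOKEN_TYPE_USER_DEFINED if i in added_ids else TOKEN_TYPE_NORMAL
    for i in sorted(reverse_vocab)
]
```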
Until now I've only thought about the one-to-one correspondence between piece and token-id in llama.cpp. It would surprise me if this correspondence were violated in the original model. For Falcon and Aquila we seem to be good here.
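As a quick sanity check along those lines (a sketch, assuming the HF tokenizer is the reference), one could verify that no two pieces map to the same id before building reverse_vocab:

```python
from transformers import AutoTokenizer  # assumed reference tokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")   # illustrative model
vocab = tokenizer.get_vocab()                                    # piece -> id

reverse_vocab = {}
for piece, idx in vocab.items():
    if idx in reverse_vocab:
        # two different pieces sharing one id would break the assumed one-to-one mapping
        raise ValueError(f"id {idx} maps to both {reverse_vocab[idx]!r} and {piece!r}")
    reverse_vocab[idx] = piece
```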
I wrote about my current understanding of added tokens here. In short: all added tokens have looked like CONTROL tokens to me so far (maybe because I have yet to see sentencepiece's USER_DEFINED tokens in the wild).
One more remark: as far as I remember scores are needed for BPE merges. Token types NORMAL and CONTROL are used in this PR.
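To make the remark concrete: in the HF tokenizer.json layout the merges are stored in priority order, so a rank-derived score could be reconstructed from them if it turns out to be needed. A sketch under those assumptions (the file layout and the negative-rank encoding are assumptions here, not something this PR does):

```python
import json

with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab  = tok["model"]["vocab"]    # piece -> id
merges = tok["model"]["merges"]   # ["left right", ...] in merge-priority order

# earlier merges get a higher (less negative) score
scores = [0.0] * (max(vocab.values()) + 1)
for rank, merge in enumerate(merges):
    merged_piece = merge.replace(" ", "", 1)   # piece produced by applying this merge
    if merged_piece in vocab:
        scores[vocab[merged_piece]] = -float(rank)
```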
The scores and toktypes were removed only because they were placeholder values; the GGUF spec says they are optional. If they are used for something meaningful in the future, they can certainly be added back.
Should these lines be removed if they do not apply to Falcon? They necessarily imply that vocab_size > len(reverse_vocab) may be true.
I removed unused code from the conversion scripts but would not touch that code until we agree that tokenizer.json/tokenizer.model should be considered the source of truth for the tokenizers. Furthermore, I'm not sure about our strategy for the consolidation of the conversion scripts (I'm thinking about moving tokenizer support to gguf.py in a later PR, for example).
Oops, missed the impact of this completely when merging main: tried to repair just now.