-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempt to fix pre-tokenizer #5613
base: master
Are you sure you want to change the base?
Conversation
https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js Did you take a look on the javascript version and python version of transformer? I think it might be useful Great effort anyway |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, but needs more work. I think there is some unnecessary complexity in the implementation - indirections, classes, etc. These should be eliminated.
There should be standalone regex tests - see my comments.
Regarding the question about which models we support - I think nobody knows at this point. Every models picks randomly tokenization options and don't care. It's impossible to understand what options exists and are used - we'll implement features on a case-by-case basis and 3rd party projects can always fallback to the Python bloat for tokenization
} | ||
} | ||
|
||
std::vector<std::string> to_category_code(const std::vector<uint32_t> & UNICODE_TYPES) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be to_category_name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I made a mistake here.
}; | ||
} | ||
|
||
class UNICODE { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid this class
and have basic function calls + static containers that are lazy initialized on first use?
Prefix all functions with unicode_
. For example: `unicode_to_codepoints()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can work, but I'm not sure it's the best way to do it. See my comment below. The good thing about using a class is that we can make many instances of it, each one having different category definitions, and they won't mess with each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The good thing about using a class is that we can make many instances of it, each one having different category definitions, and they won't mess with each other.
Ideally, there should be a singe unicode configuration. In what case would we need to have differing category definitions?
#include "unicode.h" | ||
#include "unordered_set" | ||
|
||
class llm_regex { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for this class
- use functions with regex_
prefix
Move everything from unicode_regex.h
into unicode.h
} | ||
|
||
llm_regex() { | ||
unicode_engine.overload_category(REGEX_RANGES::Whitespace, "WHITESPACE"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this done explicitly instead of being done by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In developing a universal Unicode engine, I face a notable challenge: the Unicode 15.0
standard, as defined by the Unicode Consortium, does not specify a definition for whitespace. Furthermore, the interpretation of \s
whitespace can vary among different regex engines.
uint32_t get_category(const uint32_t & codepoint) { | ||
return category_implement(codepoint); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are such indirections necessary? Just implement get_category
- no need for private methods
|
||
class llm_regex { | ||
public: | ||
std::vector<std::string> gpt2_style(const std::string & str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These implementations should be put behind a common API that accepts a regex string. Something like:
std::vector<std::vector<uint32_t>> regex_split(const std::string & str, const std::string & regex);
We then check if the regex is one that is implemented - if not we throw an error.
We should add tests where we compare the outputs from regex_split
with reference Python implementations
BTW, when testing the tokenizer on large-scale datasets like |
UTF-8 can represent any valid Unicode, so surely this is not an issue with the use of encode - the string must already contain Unicode replacement characters because it was incorrectly decoded ( |
I have a different perspective on this. If the token string already contained Unicode replacement characters, I'm curious how combining two or more such tokens could still result in a valid UTF-8 sequence. It seems counterintuitive, doesn't it? Perhaps we can clarify this with a straightforward experiment to see what actually happens. |
@cebtenzzre You are right, I still get |
What can be done to move this forward? |
This marks my second effort at resolving the issues with the pre-tokenizer in llama.cpp. I've developed a universal
Unicode engine
alongside a specializedregex engine
. Whileregex engine
has its limitations, only supporting very limited functionalities, it serves our needs well and offers impressive speed.I have a question regarding tokenizers. Is the
falcon
model the only one utilizing theBPE
tokenizer at this point? I'm asking because if that's not the case, we might encounter some issues.My concern arises from the potential issues due to the diversity in pre-tokenization among models. The current understanding, as reflected in both the master branch and this pull request, suggests that the
bpe_gpt2_preprocess
function is exclusively for thefalcon
model. However, if other models also useBPE
, this assumption could lead to complications.