Integration into 🤗 Transformers.js (+ generalization/minor improvements) #9
Thanks for the kind words and good job with the adaptation! I'm happy to hear that my code is (sort of) going into Transformers.js. One major difference in your adaptation is that I'm running the byte-pair encoding only once for the entire input string, whereas you are running it multiple times (once per whitespace-separated word if I'm not mistaken, please correct me if I'm wrong). I'm referring to this in your PR where you call
Have you checked if this change has any performance implications? The performance impact here is probably negligible, but I would check anyway. Your version might be even faster. I was also wondering if this is guaranteed to produce the same results? I believe it's guaranteed to produce the same results for all tokenization schemes which never have whitespace in the middle of a token (e.g. "_grabbed" has whitespace at the beginning, whereas "grab_bed" would have whitespace in the middle).
Thanks for spotting this issue. When I wrote this code I must have thought "nobody ever needs to tokenize more than 100 000 characters at a time", but it only took a few weeks for all those superhot models to come out, and here we are. Note that your fractional divider is word length, which works great, because words aren't very large (at least not any real words in any real language)... but if I used your fix directly in my version, my divider would be the entire input string length. For example, if the input string was 1 000 000 characters long, then my divider would be 1 000 000. I'm not sure at what point we get numerical stability issues when we multiply / divide large numbers. I adapted your fix into the v1.1.3 release now in a slightly hacky way, where I tried to avoid numerical instability issues like this:
This should support inputs up to 10 000 000 000 characters in length without numerical stability issues (and inputs at that length will be too large to process anyway, so I think this should be good enough for all real inputs).
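Roughly the idea, as a sketch (the constant, names, and exact formula below are just illustrative, not the actual v1.1.3 code):

```js
// Illustrative sketch only: scale the positional tiebreak by a fixed constant
// so the bias always stays below 1 and can never reorder two different merge ranks.
const MAX_SUPPORTED_INPUT_LENGTH = 10_000_000_000;

function mergePrio(rank, origPos) {
  // rank dominates; origPos only breaks ties between equal ranks
  return rank + origPos / MAX_SUPPORTED_INPUT_LENGTH;
}

// Lower score = merged first (min-priority queue), so a rank-5 merge always
// beats a rank-6 merge, no matter where in the input it occurs:
console.log(mergePrio(5, 9_999_999_999) < mergePrio(6, 0)); // true
```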
Thanks for pointing this out, fixed in v1.1.3 now.
By the way, thanks for crediting me in the code comments at the llama tests and PriorityQueue! Could I ask you to add a similar code comment credit to the
So, this depends on the tokenizer: more specifically, the (1) pretokenization and (2) splitting on special characters. For example:
And it is these "pretokenized" texts which each run through the
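In other words, something like the sketch below (the regex is a toy stand-in rather than the actual GPT-4 pretokenizer pattern, and `bpe` stands in for the per-chunk encoding function):

```js
// Toy pretokenizer sketch: split the input into word/number/punctuation chunks,
// then byte-pair-encode each chunk independently (merges never cross chunks).
function tokenize(text, bpe) {
  const chunks = text.match(/ ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+/g) ?? [];
  return chunks.flatMap(chunk => bpe(chunk));
}
```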
Yeah, I actually did do quite a bit of performance testing! Although the GPT-4 and Llama tokenizers have many differences (e.g., vocab size: GPT-4 = 100277, Llama 2 = 32000), GPT-4's tokenization is much faster than Llama's (only noticeable with longer pieces of text). An even more important benefit of pretokenization is that it's much more likely for smaller words/texts to be reused, meaning we can implement a very basic caching system. As an example, to tokenize 500k characters, GPT-4 takes ~350ms vs. Llama at ~800ms (even with the >3x larger vocabulary/merge lists). TL;DR: pretokenization helps a lot for larger texts, but the difference isn't noticeable for smaller texts!
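The caching idea is as simple as it sounds; a minimal sketch (again with `bpe` as an assumed per-chunk encoder, not a specific library function):

```js
// Memoize per-chunk BPE results: pretokenized chunks are mostly common words,
// so the same chunk shows up over and over in long texts.
const bpeCache = new Map();

function cachedBpe(chunk, bpe) {
  let ids = bpeCache.get(chunk);
  if (ids === undefined) {
    ids = bpe(chunk);
    bpeCache.set(chunk, ids);
  }
  return ids;
}
```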
I've written an extensive set of unit tests, which only pass if they exactly match the Python implementation, so I believe the implementation does operate correctly (there is also no splitting on whitespace, as mentioned above).
Exactly! I doubt this problem will ever cause any issue in reality, but I just prefer it over a "magic number" approach. The "divide by word length" should hopefully future-proof it.
Good point to consider. Though, since the denominator is the same in all cases (and it's only the numerator which changes), I don't think there's any possible situation where this condition fails. 😅
Absolutely! 🤗 Sorry, I forgot to update the JSDoc description for it.
Ah, very nice explanation. So the GPT-4 tokenizer uses a pretokenizer with a regex that splits on whitespace and punctuation, leading to a situation where you end up calling bpe on individual words. That is more performant than calling bpe on the entire input string, and it also allows you to set up simple caching for an additional performance gain. And the LLaMa tokenizer doesn't use a pretokenizer, so your implementation ends up calling bpe on the entire input string, same as in my implementation. You're probably right that numerical stability issues won't lead to incorrect results when dividing both numbers by the same denominator. I simplified the mergePrio calculation now, using the same fraction as you did in your implementation.
Exactly! 🔥
Looks good 👍 Thanks again for your amazing work! I'll close the issue since I think everything's resolved now 😇
First of all, I just wanted to thank you for this implementation! It is significantly faster than my original version of the Llama tokenizer (which would often just break for extremely large input texts). I've decided to generalize your implementation so that it can be applied to other BPE-based tokenizers (e.g., GPT-4). Here's my current PR for it to be added to Transformers.js.
Here's an example playground for some other tokenizers: demo.
I also made a few fixes/improvements to your version:
- Instead of using the `x * 100000 + leftNode.origPos` "trick", I instead add a fractional "positional bias" to the score, which allows tiebreaks to be resolved correctly even for long inputs. With the original trick, if the input text is > 100k characters (so origPos can exceed 100000), then in a rare situation where the difference in bpe_ranks is low, this could result in incorrect behaviour (see the numeric sketch after this list).
- `top` is not defined in llama-tokenizer-js/llama-tokenizer.js (Line 140 in a9c05be).
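A quick numeric sketch of the first point (made-up ranks and positions, assuming a min-priority queue where the lowest score is merged first):

```js
const oldPrio = (rank, origPos) => rank * 100000 + origPos;
const newPrio = (rank, origPos, chunkLength) => rank + origPos / chunkLength;

// With the integer trick, a rank-6 merge near the start can outrank a
// rank-5 merge sitting past position 100000:
console.log(oldPrio(6, 0) < oldPrio(5, 200001)); // true -> wrong merge wins

// With the fractional bias, the bias stays below 1, so rank 5 always wins
// and position only breaks ties between equal ranks:
console.log(newPrio(5, 200001, 500000) < newPrio(6, 0, 500000)); // true
```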
Feel free to use any of this in your version too!