byte_decoder -> byte_encoder #3021
Thanks for looking into this. Would like some more eyes on this to confirm it's OK to change. Btw, I don't see any difference in the output of
@ggerganov To give an uncommon example: when converting to ggml format, a vocab entry that mixes Chinese and English in a single subword triggers the bug.
I'm not as fast as you guys, but I fully agree that changes without accompanying tests showing the effects are rarely acceptable.
@goerch, should we add a test case for the convert scripts?
All we really need is an example of a model (and a prompt, if conversion succeeds either way) that can reproduce the incorrect behavior with the old code and correct behavior with the new code. Realistic unit tests for model conversion are not straightforward.
I added a test and some fixes for Falcon-7B, i.e. the
The tokenizer conversion on master does not work with e.g. mpt-7b, which I am trying to write a GGUF conversion script for. byte_decoder is obviously wrong here: it maps unicode characters to integers, but the script tries to index it with an integer:
But this solution does not work either:
Because text is a bytearray and byte_encoder returns a unicode string. Edit: I think the correct fix is to replace
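For readers following along, the two failure modes described in this comment can be reproduced in isolation. This is only a sketch: the two-entry tables below are hand-built for illustration (the real tables have 256 entries), `try_call` is a hypothetical helper, and the snippet does not reproduce the convert script's actual code.

```python
# Hand-built single-entry tables, for illustration only (assumption:
# the real tables map all 256 byte values; here we keep just the space byte).
byte_encoder = {0x20: "\u0120"}  # byte value -> unicode character
byte_decoder = {ch: b for b, ch in byte_encoder.items()}  # unicode character -> byte value


def try_call(fn):
    """Hypothetical helper: return the name of the exception fn raises, or None."""
    try:
        fn()
    except Exception as exc:
        return type(exc).__name__
    return None


text = bytearray()

# Indexing the decoder with an integer fails: its keys are characters.
decoder_by_int = try_call(lambda: text.append(byte_decoder[0x20]))

# Indexing the encoder with an integer succeeds, but yields a str,
# which a bytearray cannot append.
encoder_append = try_call(lambda: text.append(byte_encoder[0x20]))

# Indexing the decoder with the character itself yields an int byte value.
text.append(byte_decoder["\u0120"])

print(decoder_by_int, encoder_append, bytes(text))
```

The first call raises because the key type is wrong, the second because the value type is wrong for a bytearray. Which lookup the script should use depends on what the surrounding loop is meant to produce.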
byte_encoder maps byte values 0-255 -> unicode characters.
byte_decoder maps unicode characters -> byte values 0-255.
ord() returns the integer code point of a unicode character, so a lookup keyed by ord(c) should use byte_encoder.
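To make the direction of the two tables concrete, here is a minimal sketch assuming they are built the way the GPT-2 byte-level BPE tokenizer builds them (a `bytes_to_unicode` table and its inverse). That construction is an assumption about the convert script, not something confirmed in this thread.

```python
def bytes_to_unicode():
    """Map each byte value 0-255 to a printable unicode character (GPT-2 style)."""
    # Bytes that already correspond to printable characters map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\u00a1"), ord("\u00ac") + 1))
        + list(range(ord("\u00ae"), ord("\u00ff") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control characters, space, etc.) are shifted to 256 and up.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()                         # int (0-255) -> str
byte_decoder = {ch: b for b, ch in byte_encoder.items()}  # str -> int (0-255)

c = "A"
# ord(c) is an int, so it is a valid key for byte_encoder, not for byte_decoder.
key_in_encoder = ord(c) in byte_encoder
key_in_decoder = ord(c) in byte_decoder

# Round trip: every byte value survives encoder followed by decoder.
round_trip_ok = all(byte_decoder[byte_encoder[b]] == b for b in range(256))
```

Under this assumption, `byte_encoder[0x20]` is `"\u0120"` (the character often seen as a leading marker on tokens that start with a space), and `byte_decoder` only accepts such characters as keys.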
@ggerganov take a look~