byte_decoder -> byte_encoder #3021

simonJJJ · 2023-09-05T04:11:22Z

byte_encoder maps byte values 0-255 to unicode characters.
byte_decoder maps unicode characters back to 0-255.
ord returns the integer code point of a unicode character, so the lookup should use byte_encoder.
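
To make the two directions concrete, here is a minimal sketch. It assumes the table is built the way the GPT-2 reference tokenizer builds it; the local bytes_to_unicode below is a standalone stand-in for illustration, not code taken from the convert scripts.

```python
def bytes_to_unicode():
    """Byte value (0-255) -> printable unicode character, GPT-2 style (assumed)."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to an unused code point at 256 and above.
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()                          # int -> str
byte_decoder = {ch: b for b, ch in byte_encoder.items()}   # str -> int

print(byte_encoder[ord("A")])  # printable ASCII maps to itself
print(byte_encoder[32])        # the space byte maps to a stand-in character
print(byte_decoder["Ġ"])       # and the stand-in maps back to the byte value
```

The keys of byte_encoder are integers and the keys of byte_decoder are one-character strings, so an integer such as ord(c) can only ever be a key of byte_encoder.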

@ggerganov, please take a look.

ggerganov · 2023-09-05T06:55:15Z

Thanks for looking into this. I would like some more eyes on this to confirm it's OK to change.

Btw, I don't see any difference in the output of test-tokenizer-0 with this change.
What would be the expected effects of this change?

simonJJJ · 2023-09-05T07:17:43Z

@ggerganov To give an uncommon example: when converting to ggml format, a vocab entry that mixes Chinese and English in a single subword triggers a bug in the current code.

ggerganov · 2023-09-05T07:53:56Z

Thanks for clarifying!
Let's wait for @klosax or @goerch to see if they have any feedback and we can merge.
I just lack the understanding of how unicode works and this is a blind change for me.

goerch · 2023-09-05T20:40:32Z

> Thanks for clarifying! Let's wait for @klosax or @goerch to see if they have any feedback and we can merge. I just lack the understanding of how unicode works and this is a blind change for me.

I'm not as fast as you guys, but I fully agree that changes without accompanying tests showing the effects are rarely acceptable.

simonJJJ · 2023-09-06T03:36:09Z

@goerch, should we add a test case for the convert scripts?

cebtenzzre · 2023-09-06T04:25:22Z

> should we add a test case for the convert scripts?

All we really need is an example of a model (and a prompt, if conversion succeeds either way) that can reproduce the incorrect behavior with the old code and correct behavior with the new code. Realistic unit tests for model conversion are not straightforward.

goerch · 2023-09-18T20:38:04Z

> @goerch, should we add a test case for the convert scripts?

I added a test and some fixes for Falcon-7B, i.e. the GPT2 tokenizer, which seems to work for me. Would be great if you could test if this helps.

cebtenzzre · 2023-09-27T20:35:09Z

The tokenizer conversion on master does not work with e.g. mpt-7b, for which I am trying to write a GGUF conversion script. Using byte_decoder there is obviously wrong: it maps unicode characters to integers, but the code tries to index it with an integer:

Traceback (most recent call last):
  File "convert_mpt_hf_to_gguf.py", line 121, in <module>
    text = bytearray([byte_decoder[c] for c in reverse_vocab[i]])
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "convert_mpt_hf_to_gguf.py", line 121, in <listcomp>
    text = bytearray([byte_decoder[c] for c in reverse_vocab[i]])
                      ~~~~~~~~~~~~^^^
KeyError: ' '

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "convert_mpt_hf_to_gguf.py", line 126, in <module>
    text.append(byte_decoder[ord(c)])
                ~~~~~~~~~~~~^^^^^^^^
KeyError: 32

But the change in this PR does not work either:

Traceback (most recent call last):
  File "convert_mpt_hf_to_gguf.py", line 126, in <module>
    text.append(byte_encoder[ord(c)])
TypeError: 'str' object cannot be interpreted as an integer

This fails because text is a bytearray and byte_encoder returns a unicode string.

Edit: I think the correct fix is to replace ord(c) with just c, to match what the code does when no exception is thrown.
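
The three failures in the tracebacks can be reproduced in isolation with the sketch below. It assumes the GPT-2 style byte table; bytes_to_unicode and the try_lookup helper are hypothetical stand-ins written for this example, not the code in convert_mpt_hf_to_gguf.py.

```python
def bytes_to_unicode():
    """Byte value (0-255) -> printable unicode character, GPT-2 style (assumed)."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()                          # int -> str
byte_decoder = {ch: b for b, ch in byte_encoder.items()}   # str -> int


def try_lookup(fn):
    """Hypothetical helper: return fn()'s result, or the name of the exception raised."""
    try:
        return fn()
    except (KeyError, TypeError) as exc:
        return type(exc).__name__


c = " "  # the raw space from the traceback

# 1) A raw space is not a character in the table, so the str lookup fails.
first = try_lookup(lambda: byte_decoder[c])
# 2) byte_decoder has str keys, so an int key fails as well.
second = try_lookup(lambda: byte_decoder[ord(c)])
# 3) byte_encoder[ord(c)] succeeds but yields a str, which bytearray.append rejects.
third = try_lookup(lambda: bytearray().append(byte_encoder[ord(c)]))

# Characters that are in the table decode cleanly when indexed with c itself.
decoded = bytes(byte_decoder[ch] for ch in "Ġhello")

print(first, second, third, decoded)
```

Indexing byte_decoder with the character itself only covers characters that appear in the table; under this assumed table a raw space is not one of them, which is what the first KeyError shows.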

byte_decoder -> byte_encoder

cbdc564

ggerganov requested a review from klosax September 5, 2023 06:50

ggerganov added the "need feedback" label (Testing and feedback with results are needed) Sep 5, 2023

KerfuffleV2 mentioned this pull request Sep 19, 2023

Work on the BPE tokenizer #3252

Merged

cebtenzzre closed this Sep 29, 2023

cebtenzzre mentioned this pull request Oct 1, 2023

MPT support in llama.cpp #3417

Merged
