Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more tokenizer tests #3742

Merged
merged 7 commits into from
Oct 24, 2023
Merged

Conversation

Galunid
Copy link
Collaborator

@Galunid Galunid commented Oct 23, 2023

Conversion scripts used 96981f3

Issues:

  • Persimmon script doesn't allow for --vocab-only,

  • GPT-Neox tokenizer fails with std::unordered_map illegal access seen in other gpt2 tokenizer based models. I applied the fix from Missing tokenizer tests #3730 (comment) and the test passed. I didn't include the passing version here, only the failing one, let me know which one you want @goerch

  • Refact fails with byte not found in vocab (you can see in CI)

  • Starcoder fails with byte not found in vocab (you can see in CI)

Models used:

closes #3730

@goerch
Copy link
Collaborator

goerch commented Oct 23, 2023

I'm testing refact and test-tokenizer-1-bpe is failing

    // TODO: why doesn't this work for the full range of Unicodes?
    // for (uint32_t cp = 0x10000; cp < 0x0010ffff; ++cp) {
    for (uint32_t cp = 0x10000; cp < 0x00080000; ++cp) {
[...]

for codepoint 0x40000. That doesn't look like a random problem. I'm not sure if we should only check assigned Unicode planes here (my unicode-fu is weak) or if this is a real problem?

@goerch
Copy link
Collaborator

goerch commented Oct 23, 2023

I'm not sure if we should only check assigned Unicode planes here (my unicode-fu is weak) or if this is a real problem?

Restricting test-tokenizer-1-bpe to the assigned Unicode planes like so

    // NOTE: only testing assigned Unicode planes
    // for (uint32_t cp = 0x10000; cp < 0x0010ffff; ++cp) {
    for (uint32_t cp = 0x10000; cp < 0x00040000; ++cp) {
        std::string str = codepoint_to_utf8(cp);
        std::vector<llama_token> tokens = llama_tokenize(ctx, str, false);
        std::string check = llama_detokenize_bpe(ctx, tokens);
        if (str != check) {
            fprintf(stderr, "%s : error: codepoint %x detokenizes to '%s'(%zu) instead of '%s'(%zu)\n",
                __func__, cp, check.c_str(), check.length(), str.c_str(), str.length());
            return 4;
        }
    }
    for (uint32_t cp = 0x000e0000; cp < 0x0010ffff; ++cp) {
        std::string str = codepoint_to_utf8(cp);
        std::vector<llama_token> tokens = llama_tokenize(ctx, str, false);
        std::string check = llama_detokenize_bpe(ctx, tokens);
        if (str != check) {
            fprintf(stderr, "%s : error: codepoint %x detokenizes to '%s'(%zu) instead of '%s'(%zu)\n",
                __func__, cp, check.c_str(), check.length(), str.c_str(), str.length());
            return 4;
        }
    }

fixes refact and is working for all my models (didn't check starcoder).

@Galunid
Copy link
Collaborator Author

Galunid commented Oct 23, 2023

After applying your change to the test starcoder passes too

$ make test
Running tests...
Test project /home/kris/llama.cpp/build
      Start  1: test-quantize-fns
 1/18 Test  #1: test-quantize-fns ................   Passed    0.02 sec
      Start  2: test-quantize-perf
 2/18 Test  #2: test-quantize-perf ...............   Passed    0.07 sec
      Start  3: test-sampling
 3/18 Test  #3: test-sampling ....................   Passed    0.01 sec
      Start  4: test-tokenizer-0-llama
 4/18 Test  #4: test-tokenizer-0-llama ...........   Passed    0.07 sec
      Start  5: test-tokenizer-0-falcon
 5/18 Test  #5: test-tokenizer-0-falcon ..........   Passed    0.29 sec
      Start  6: test-tokenizer-1-llama
 6/18 Test  #6: test-tokenizer-1-llama ...........   Passed    2.88 sec
      Start  7: test-tokenizer-1-baichuan
 7/18 Test  #7: test-tokenizer-1-baichuan ........   Passed    3.09 sec
      Start  8: test-tokenizer-1-falcon
 8/18 Test  #8: test-tokenizer-1-falcon ..........   Passed    1.88 sec
      Start  9: test-tokenizer-1-aquila
 9/18 Test  #9: test-tokenizer-1-aquila ..........   Passed    1.95 sec
      Start 10: test-tokenizer-1-mpt
10/18 Test #10: test-tokenizer-1-mpt .............   Passed    1.33 sec
      Start 11: test-tokenizer-1-bloom
11/18 Test #11: test-tokenizer-1-bloom ...........   Passed    4.56 sec
      Start 12: test-tokenizer-1-gpt-neox
12/18 Test #12: test-tokenizer-1-gpt-neox ........Subprocess aborted***Exception:   0.91 sec
      Start 13: test-tokenizer-1-refact
13/18 Test #13: test-tokenizer-1-refact ..........   Passed    1.24 sec
      Start 14: test-tokenizer-1-starcoder
14/18 Test #14: test-tokenizer-1-starcoder .......   Passed    1.26 sec
      Start 15: test-grammar-parser
15/18 Test #15: test-grammar-parser ..............   Passed    0.00 sec
      Start 16: test-llama-grammar
16/18 Test #16: test-llama-grammar ...............   Passed    0.00 sec
      Start 17: test-grad0
17/18 Test #17: test-grad0 .......................   Passed    4.41 sec
      Start 18: test-rope
18/18 Test #18: test-rope ........................   Passed    0.04 sec

94% tests passed, 1 tests failed out of 18

Total Test time (real) =  24.01 sec

@goerch
Copy link
Collaborator

goerch commented Oct 23, 2023

Start 12: test-tokenizer-1-gpt-neox
12/18 Test #12: test-tokenizer-1-gpt-neox ........Subprocess aborted***Exception: 0.91 sec

Interesting. I see

      Start 12: test-tokenizer-1-gpt-neox
12/17 Test #12: test-tokenizer-1-gpt-neox ........   Passed    1.87 sec

Hm.

@Galunid
Copy link
Collaborator Author

Galunid commented Oct 23, 2023

Pay no attention to neox, it's not relevant here (uses old model that was failing on map illegal access)

@Galunid
Copy link
Collaborator Author

Galunid commented Oct 23, 2023

Should I also include your patch to restrict tokenizer tests to unicode planes in this PR?

@goerch
Copy link
Collaborator

goerch commented Oct 23, 2023

Should I also include your patch to restrict tokenizer tests to unicode planes in this PR?

Yes, please.

Copy link
Collaborator

@goerch goerch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done!

@goerch
Copy link
Collaborator

goerch commented Oct 23, 2023

@Galunid : sorry for mixing it up when using Github online editor (will never try again;). May I ask you to repair the mess (don't know enough about git to fix it on your branch)? Thanks again.

@Galunid Galunid force-pushed the add-tokenizer-tests branch from efd3c22 to 1244b00 Compare October 23, 2023 17:57
@Galunid
Copy link
Collaborator Author

Galunid commented Oct 23, 2023

@goerch Done, no worries ;)

@goerch
Copy link
Collaborator

goerch commented Oct 23, 2023

OK, I'll wait until tomorrow morning (+10 hours) before merging.

@goerch goerch merged commit daab3d7 into ggerganov:master Oct 24, 2023
32 checks passed
@Galunid Galunid mentioned this pull request Oct 24, 2023
12 tasks
mattgauf added a commit to mattgauf/llama.cpp that referenced this pull request Oct 27, 2023
* master: (350 commits)
  speculative : ensure draft and target model vocab matches (ggerganov#3812)
  llama : correctly report GGUFv3 format (ggerganov#3818)
  simple : fix batch handling (ggerganov#3803)
  cuda : improve text-generation and batched decoding performance (ggerganov#3776)
  server : do not release slot on image input (ggerganov#3798)
  batched-bench : print params at start
  log : disable pid in log filenames
  server : add parameter -tb N, --threads-batch N (ggerganov#3584) (ggerganov#3768)
  server : do not block system prompt update (ggerganov#3767)
  sync : ggml (conv ops + cuda MSVC fixes) (ggerganov#3765)
  cmake : add missed dependencies (ggerganov#3763)
  cuda : add batched cuBLAS GEMM for faster attention (ggerganov#3749)
  Add more tokenizer tests (ggerganov#3742)
  metal : handle ggml_scale for n%4 != 0 (close ggerganov#3754)
  Revert "make : add optional CUDA_NATIVE_ARCH (ggerganov#2482)"
  issues : separate bug and enhancement template + no default title (ggerganov#3748)
  Update special token handling in conversion scripts for gpt2 derived tokenizers (ggerganov#3746)
  llama : remove token functions with `context` args in favor of `model` (ggerganov#3720)
  Fix baichuan convert script not detecing model (ggerganov#3739)
  make : add optional CUDA_NATIVE_ARCH (ggerganov#2482)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing tokenizer tests
3 participants