
ERROR: Invalid token: 51865 #85

Closed
smoores-dev opened this issue Dec 24, 2024 · 8 comments


smoores-dev commented Dec 24, 2024

We've got a few Storyteller users reporting this error (sorry about the mangled stack trace, recently had to start bundling parts of echogarden to reduce Storyteller's image size):

ERROR: Invalid token: 51865
    err: {
      "type": "Error",
      "message": "Invalid token: 51865",
      "stack":
          Error: Invalid token: 51865
              at Whisper.assertIsValidToken (/app/.next/standalone/web/work-dist/worker.cjs:69550:17)
              at /app/.next/standalone/web/work-dist/worker.cjs:69507:41
              at Array.forEach (<anonymous>)
              at Whisper.tokensToText (/app/.next/standalone/web/work-dist/worker.cjs:69507:16)
              at Whisper.tokenToText (/app/.next/standalone/web/work-dist/worker.cjs:69504:21)
              at parseResultObject (/app/.next/standalone/web/work-dist/worker.cjs:67585:33)
              at async ChildProcess.<anonymous> (/app/.next/standalone/web/work-dist/worker.cjs:67533:38)
    }

I believe all users reporting this error are using a CUDA (12.6) build of whisper.cpp, and are using the large-v3-turbo model. The invalid token is always 51865, which one user noted 1 token past the vocabulary size of whisper.

A little bit more info from one user:

My workaround has always been to change the model from large-v3-turbo to large-v3-turbo-q_5, hit continue, and then after it transcribes a few more mp3s, it'll give the same error and I'll change it back and continue in this way until it finishes. I'm using CUDA 12.6

Let me know if it would be helpful to share the offending audio files!


rotemdan commented Dec 24, 2024

For a multilingual model like large-v3-turbo, the end of the valid token range (exclusive) is:

timestampTokensEnd: 50364 + 1501 // exactly 51865

Then the validation is:

	isValidToken(token: number) {
		return token < this.tokenConfig.timestampTokensEnd
	}

Meaning that 51865 is one above the highest accepted timestamp token, so it is rejected.

I've never seen a token with a value of 51865 before (and so had never seen this error). It doesn't look right, since it would decode past the 30.0s range: 1501 * 0.02 = 30.02s.

It's possible that my assumption wasn't correct here. I'm not sure.
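The range arithmetic can be sketched as follows (a minimal illustration with assumed constant names, not echogarden's actual config object):

```typescript
// Illustrative constants for a multilingual Whisper model (assumed layout).
const timestampTokensStart = 50364      // token id of <|0.00|>
const timestampTokensEnd = 50364 + 1501 // 51865, exclusive upper bound

// Exclusive-bound validation, as in the original code.
function isValidToken(token: number): boolean {
	return token < timestampTokensEnd
}

// A timestamp token advances in 0.02s steps from <|0.00|>.
function timestampTokenToSeconds(token: number): number {
	return (token - timestampTokensStart) * 0.02
}

console.log(isValidToken(51864)) // true  (<|30.00|>, the last defined timestamp)
console.log(isValidToken(51865)) // false (would decode to ~30.02s, past the 30s window)
```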

In v2.0.14 I changed the code to accept this particular unusual value, but to clamp the resulting timestamp to 30s when converting to seconds:

	isValidToken(token: number) {
		return token <= this.tokenConfig.timestampTokensEnd
	}
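A minimal sketch of the clamping behavior described above (constant names are assumptions mirroring the earlier snippet, not the actual implementation):

```typescript
// Illustrative constants (assumed layout for a multilingual model).
const timestampTokensStart = 50364
const timestampTokensEnd = 50364 + 1501 // 51865, now treated as inclusive

// The relaxed check from v2.0.14: token 51865 is accepted.
function isValidToken(token: number): boolean {
	return token <= timestampTokensEnd
}

// When converting to seconds, clamp the unusual 30.02s value back to 30s.
function timestampTokenToSeconds(token: number): number {
	const seconds = (token - timestampTokensStart) * 0.02
	return Math.min(seconds, 30.0)
}

console.log(timestampTokenToSeconds(51865)) // 30 (clamped from ~30.02)
```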

I'd rather still error on larger values, since the error reporting can be helpful to find bugs and other odd edge cases.

@smoores-dev (Author)

Amazing, thanks @rotemdan. I'll make a new Storyteller release and confirm that this fixes the issue. Erroring on larger values makes sense to me; I'll report back if we see that popping up.

@rotemdan (Member)

The reference Python implementation bounds the special token range like this:

    specials = [
        "<|endoftext|>",
        "<|startoftranscript|>",
        *[f"<|{lang}|>" for lang in list(LANGUAGES.keys())[:num_languages]],
        "<|translate|>",
        "<|transcribe|>",
        "<|startoflm|>",
        "<|startofprev|>",
        "<|nospeech|>",
        "<|notimestamps|>",
        *[f"<|{i * 0.02:.2f}|>" for i in range(1501)],
    ]

Since range(1501) yields indices 0 through 1500, index 1501 is excluded, so the timestamp token one past <|30.00|> never receives a string identifier.

In my implementation, this token is not in the range of valid tokens and should never be output. I verified that the ONNX model returns 51865 elements in the logit vector (indices 0 to 51864) for multilingual models and 51864 elements (indices 0 to 51863) for English-only models.
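The 51865 figure can be reconstructed from the token layout above (a sketch; the base vocabulary size and special-token counts are assumptions based on the reference tokenizer's `specials` list):

```typescript
// Assumed breakdown of the multilingual vocabulary.
const baseVocab = 50257      // BPE text tokens
const numLanguages = 99      // <|en|>, <|zh|>, ... language tags
const fixedSpecials = 8      // <|endoftext|>, <|startoftranscript|>, <|translate|>,
                             // <|transcribe|>, <|startoflm|>, <|startofprev|>,
                             // <|nospeech|>, <|notimestamps|>
const timestampTokens = 1501 // <|0.00|> through <|30.00|>, per range(1501)

const multilingualLogitLength = baseVocab + fixedSpecials + numLanguages + timestampTokens
console.log(multilingualLogitLength) // 51865, matching the observed logit vector length
```

Note that baseVocab + fixedSpecials + numLanguages = 50364, which is exactly the id of the first timestamp token, <|0.00|>.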

The out-of-range token seems to be received from the output of whisper.cpp, so it's possible it is an internal issue with whisper.cpp. I don't know.

@smoores-dev (Author)

Happy to open an issue there (those folks are quite a lot less responsive than you are, unfortunately; I've still never heard back about the other token issue I opened months ago) and see if they have any thoughts!

smoores-dev added a commit to smoores-dev/storyteller that referenced this issue Dec 24, 2024

rotemdan commented Dec 24, 2024

If I open the ONNX models in Netron (an ONNX model viewer), the output tensor has a fixed length for its last axis: 51865 for the tiny multilingual model:

[Screenshot: Netron showing a last-axis length of 51865 for the tiny multilingual model]

And 51864 for tiny.en:

[Screenshot: Netron showing a last-axis length of 51864 for tiny.en]

However, I just realized that the large-v3-turbo ONNX model does have a last-axis length of 51866!

[Screenshot: Netron showing a last-axis length of 51866 for large-v3-turbo]

Very surprising.

I'm not sure what the last token means. Since the turbo model was added to the reference Python implementation fairly quietly, this may not be well documented. I'll need to go into the Python code to find any explanation for it.

@smoores-dev (Author)

Ah, woah! That explains why folks are only seeing this on large-v3-turbo, I guess


rotemdan commented Dec 24, 2024

I also looked at the ONNX graph for whisper-large-v3, and it too has the extra token in the logit tensor.

I haven't seen any mention in the Python source code of what it means, and web searches didn't turn up anything. A discussion just confirmed what I said before:

[Screenshot: discussion thread confirming the extra token in the large models' logit tensor]

Even if 51865 is a valid token for the large models, I don't know why or when it should be output. It's a mystery for now, I guess.

@smoores-dev (Author)

🤷 well, the good news is that your patch worked, and Storyteller is working with the turbo model now!
