Unicode Error for Hindi transcription #1700
Comments
This is an example of a file on which it fails: https://drive.google.com/file/d/1_BFuNOAqM3yv4P2A0i8KOT_RZ6LnYSCt/view?usp=sharing
I'll explore this further. There might be an issue with the tokenizer.
I looked at the issue you referenced; it looks similar to #1313 (comment).
A side observation: I have found the accuracy of Hindi transcription via whisper.cpp to be much lower than when using the OpenAI API directly.
Agreed. You don't even need to use their API. Just by using
When transcribing a Hindi file, I encounter invalid Unicode characters in the output.
I have noticed this with many Hindi files.
I used the whisper-large-v2 model for inference on CPU, and I have seen the same issue when running inference on GPU.
My guess is that the Whisper model's token output (BPE-encoded) is not being mapped correctly to Unicode characters.
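To make the guess above concrete, here is a minimal Python sketch of how this kind of corruption can arise. It is not whisper.cpp's code, and the byte boundaries are hypothetical; it only assumes that byte-level BPE tokens can end in the middle of a multi-byte UTF-8 character (each Devanagari code point takes 3 bytes). Decoding each piece separately then produces U+FFFD replacement characters, while joining the bytes first, or using an incremental decoder, recovers the text.

```python
import codecs

# Sample Hindi text; each Devanagari code point is 3 bytes in UTF-8.
text = "नमस्ते"
raw = text.encode("utf-8")

# Hypothetical token boundaries that cut through characters
# (assumption for illustration, not real Whisper token output).
pieces = [raw[:4], raw[4:11], raw[11:]]


def decode_per_piece(chunks):
    """Decode every chunk on its own; partial characters become U+FFFD."""
    return "".join(c.decode("utf-8", errors="replace") for c in chunks)


def decode_joined(chunks):
    """Concatenate all bytes first, then decode once."""
    return b"".join(chunks).decode("utf-8")


def decode_incremental(chunks):
    """Streaming decode: incomplete byte sequences are buffered across chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    out = [decoder.decode(c) for c in chunks]
    out.append(decoder.decode(b"", final=True))
    return "".join(out)


broken = decode_per_piece(pieces)
joined = decode_joined(pieces)
streamed = decode_incremental(pieces)

print(repr(broken))
print(joined)
print(streamed)
```

If the invalid characters in the transcript come from this mechanism, they should appear only where text is emitted piece by piece, and buffering incomplete byte sequences until the next piece arrives would be one possible remedy.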