JSON Output Contains Garbled Characters for Chinese Audio Transcription #2180

Open
tylike opened this issue May 25, 2024 · 1 comment

tylike commented May 25, 2024

Environment:

  • OS: Windows 11 (cmd with UTF-8 output)
  • Whisper.cpp Version: 1.6.0
  • Model: V3 large

Command Used:

main.exe -m ggml-large-v3.bin -of d:\VideoInfo\80\subtitle -d 60000 -osrt -ojf -otxt d:\VideoInfo\80\80.wav -l auto --prompt "这是**简体中文**内容,每一段落尽量长,使用标点符号逗号、句号、感叹号、问号、双引号等。**不要乱码**"

Issue:

  • The .txt and .srt files are generated correctly.
  • The .json file contains garbled/incorrect characters.

Additional Details:

  • When using English audio, the .json, .srt, and .txt files are all generated correctly.

Steps to Reproduce:

  1. Run the command provided with a Chinese audio file.
  2. Check for garbled characters in the .json file.
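
Step 2 can be automated. The following is a minimal sketch, assuming the JSON has the `transcription` / `tokens` / `text` layout shown in the excerpt further down this thread; the helper name `find_garbled_tokens` is made up for illustration. It reports every token whose text contains U+FFFD, the replacement character displayed as "�".

```python
import json

REPLACEMENT = "\ufffd"  # U+FFFD, displayed as the replacement character


def find_garbled_tokens(doc):
    """Return (segment_index, token_index) pairs whose token text contains U+FFFD."""
    hits = []
    for si, segment in enumerate(doc.get("transcription", [])):
        for ti, token in enumerate(segment.get("tokens", [])):
            if REPLACEMENT in token.get("text", ""):
                hits.append((si, ti))
    return hits


# Tiny inline sample with the same shape as the excerpt in this issue.
sample = json.loads(
    '{"transcription": [{"text": "6月", "tokens": '
    '[{"text": "6"}, {"text": "月"}, {"text": "\\ufffd"}]}]}'
)
print(find_garbled_tokens(sample))
```

For a real file, the bytes on disk may be invalid UTF-8 rather than a literal U+FFFD, so it is probably safer to read it with `open(path, encoding="utf-8", errors="replace")` before calling `json.loads`.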

en_subtitle.json
en_subtitle.srt.txt
80.zip

tamo (Contributor) commented May 26, 2024

Apparently, escape_double_quotes_and_backslashes does not handle multibyte strings correctly.
Maybe we should use a replace or replace_all function for escaping instead.
There may be other problems as well.
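
One way to probe the escaping hypothesis: in UTF-8, every byte of a multibyte sequence has its high bit set, so the ASCII bytes for `"` (0x22) and `\` (0x5C) cannot occur inside one. The sketch below uses a hypothetical byte-wise escaper (`escape_bytes`, a stand-in written for this example, not whisper.cpp's actual function) to illustrate that escaping of this kind, taken by itself, leaves Chinese text decodable.

```python
def escape_bytes(raw: bytes) -> bytes:
    """Hypothetical byte-wise escaper: prefix each quote or backslash byte with a backslash."""
    out = bytearray()
    for b in raw:
        if b in (0x22, 0x5C):  # double quote or backslash
            out.append(0x5C)
        out.append(b)
    return bytes(out)


text = '瑞典 "quoted" back\\slash'
# A strict decode succeeds only if the escaped bytes are still valid UTF-8.
escaped = escape_bytes(text.encode("utf-8")).decode("utf-8")
print(escaped)
```

If the real function works on single bytes in a similar way, the escaping step alone would not be expected to corrupt multibyte characters.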

EDIT:
It seems to be a tokenizer problem.
The two characters 瑞典 became four tokens, each of which is written out as an invalid character:

	"transcription": [
		{
...
			"text": "945年6月,瑞典著名犹太人建筑师马克思·甘佩尔接到了一份邀请函。",
			"tokens": [
...
				{
					"text": "9",
...
				},
				{
					"text": "45",
...
				},
				{
					"text": "年",
...
				},
				{
					"text": "6",
...
				},
				{
					"text": "月",
...
				},
				{
					"text": ",",
...
				},
				{
					"text": "�",
...
				},
				{
					"text": "�",
...
				},
				{
					"text": "�",
...
				},
				{
					"text": "�",
...
				},
				{
					"text": "著",

Already reported here
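
The pattern in the excerpt above (segment-level "text" intact, individual token "text" values garbled) is what one would expect if tokens are byte-level pieces that do not align with character boundaries. A minimal sketch, assuming a hypothetical split of the six UTF-8 bytes of 瑞典 into four pieces (the real token boundaries are not shown in this issue):

```python
raw = "瑞典".encode("utf-8")  # two characters, three bytes each

# Hypothetical split into four byte-level tokens; the actual boundaries
# chosen by the tokenizer are an assumption here.
pieces = [raw[0:2], raw[2:3], raw[3:5], raw[5:6]]

# Decoding each token on its own: no piece is a complete character,
# so every result contains the replacement character U+FFFD.
per_token = [p.decode("utf-8", errors="replace") for p in pieces]

# Joining the bytes first and decoding once recovers the original text.
joined = b"".join(pieces).decode("utf-8")

print(per_token)
print(joined)
```

Under this assumption, the per-token output would need either raw bytes or some buffering until a complete character is available, whereas the concatenated segment text decodes cleanly.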
