JSON Output Contains Garbled Characters for Chinese Audio Transcription #2180

tylike · 2024-05-25T02:04:59Z

Environment:

OS: Windows 11 (cmd with UTF-8 output)
Whisper.cpp Version: 1.6.0
Model: V3 large

Command Used:

main.exe -m ggml-large-v3.bin -of d:\VideoInfo\80\subtitle -d 60000 -osrt -ojf -otxt d:\VideoInfo\80\80.wav -l auto --prompt "这是**简体中文**内容,每一段落尽量长,使用标点符号逗号、句号、感叹号、问号、双引号等。**不要乱码**"

Issue:

The .txt and .srt files are generated correctly.
The .json file contains garbled/incorrect characters.

Additional Details:

When using English audio, the .json, .srt, and .txt files are all generated correctly.

Steps to Reproduce:

Run the command provided with a Chinese audio file.
Check for garbled characters in the .json file.

en_subtitle.json
en_subtitle.srt.txt
80.zip


tamo · 2024-05-26T04:53:16Z

~~Apparently escape_double_quotes_and_backslashes is not valid for multibyte strings.~~
~~Maybe we should use replace or replace_all function for escaping.~~
~~Also there may be other problems too.~~

EDIT:
It seems to be a tokenizer problem.
The two characters 瑞典 were split into four tokens, each of which shows up as � in the JSON:

	"transcription": [
		{
...
			"text": "945年6月,瑞典著名犹太人建筑师马克思·甘佩尔接到了一份邀请函。",
			"tokens": [
...
				{
					"text": "9",
...
				},
				{
					"text": "45",
...
				},
				{
					"text": "年",
...
				},
				{
					"text": "6",
...
				},
				{
					"text": "月",
...
				},
				{
					"text": ",",
...
				},
				{
					"text": "�",
...
				},
				{
					"text": "�",
...
				},
				{
					"text": "�",
...
				},
				{
					"text": "�",
...
				},
				{
					"text": "著",

Already reported here
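
To illustrate why per-token text can look like this, here is a minimal Python sketch. It assumes the tokens are byte-level pieces that do not line up with UTF-8 character boundaries; the particular split below is hypothetical and not taken from whisper.cpp. Decoding each piece separately gives replacement characters, while joining the bytes first gives the original text, which would also be consistent with the segment-level "text" field being correct.

```python
# Hypothetical illustration: the byte split below is an assumption,
# not the actual tokenization produced by whisper.cpp.
text = "瑞典"
raw = text.encode("utf-8")  # 6 bytes, 3 per character

# Assumed split into four byte-level pieces; none is valid UTF-8 by itself.
pieces = [raw[0:1], raw[1:3], raw[3:5], raw[5:6]]

# Decoding each piece separately yields U+FFFD replacement characters.
per_token = [p.decode("utf-8", errors="replace") for p in pieces]

# Joining the bytes first and decoding once recovers the original text.
joined = b"".join(pieces).decode("utf-8")

print(per_token)
print(joined)
```

If that assumption holds, the per-token "text" entries cannot be fixed by escaping alone; consecutive tokens would have to be merged until they form complete UTF-8 sequences before being written out.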
