ASR model output often loses text fragments #1214

Open
RichardQin1 opened this issue Dec 24, 2024 · 4 comments

Comments

@RichardQin1

RichardQin1 commented Dec 24, 2024

After running the model for ASR, parts of the transcript are often missing.
Audio link: https://share-github.tos-cn-beijing.volces.com/test.mp3

import whisperx
from faster_whisper import WhisperModel

# Decode the file to a 16 kHz mono float32 waveform
mp3_audio = whisperx.load_audio('test.mp3')
prompt = ' 新闻今日谈 林秀芹 李炜 时事评论员 '
language = 'zh'
# Assumed offset (seconds) added to every timestamp; the original snippet never defined it
add_time = 0.0

asr_model = WhisperModel("large-v2", device='cuda', compute_type='float16')
segments, info = asr_model.transcribe(
    mp3_audio,
    beam_size=5,
    vad_filter=True,  # VAD removes audio it classifies as non-speech before decoding
    language=language,
    initial_prompt=prompt,
    hotwords=prompt,
)

tmp_segments = []
for segment in segments:
    entry = {"start": add_time + segment.start,
             "end": add_time + segment.end,
             "text": segment.text}
    # segment.words is only populated when word_timestamps=True is passed
    if segment.words:
        entry["words"] = segment.words
    tmp_segments.append(entry)
asr_result = {'segments': tmp_segments, 'language': language}

Current output:

{
'language': 'zh',
 'segments': [
    {'end': 21.89, 'start': 17.49, 'text': '我是林秀芹 首先联合话题关注的是中德关系的新的进展'},   ......    
    {'end': 755.53, 'start': 748.93, 'text': '当然 谢谢李伟先生带来的分析 我们先休息下来 但关注的是世界经济论坛非洲峰会的相关话题 稍后再见'}, 
    {'end': 787.29, 'start': 781.09, 'text': '谈非洲峰会呢 六号在南非闭幕 这一次的非洲峰会可以说是吸引全世界一个关注目光'},
 ... ... ]}

Expected output:

{
'language': 'zh',
 'segments': [
    {'end': 17.49, 'start': 14.8, 'text': '大家好 欢迎收看今天的 新闻今日谈'},   # lost content
    {'end': 21.89, 'start': 17.49, 'text': '我是林秀芹 首先联合话题关注的是中德关系的新的进展'},   ......    
    {'end': 755.53, 'start': 748.93, 'text': '当然 谢谢李伟先生带来的分析 我们先休息下来 但关注的是世界经济论坛非洲峰会的相关话题 稍后再见'}, 
    {'end': 781, 'start': 778, 'text': '欢迎回来 世界经济论坛'}, # lost content
    {'end': 787.29, 'start': 781.09, 'text': '非洲峰会呢 六号在南非闭幕 这一次的非洲峰会可以说是吸引全世界一个关注目光'},
 ... ... ]}
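A quick way to locate candidate drop-outs in a result like the one above is to scan for long gaps between consecutive segments. The sketch below is only an illustration: `find_gaps` and the 5-second `min_gap` default are hypothetical, and a gap may just as well be real silence or music as lost speech.

```python
from typing import Dict, List, Tuple


def find_gaps(segments: List[Dict], min_gap: float = 5.0) -> List[Tuple[float, float]]:
    """Return (end, next_start) pairs where consecutive segments are
    separated by at least min_gap seconds."""
    gaps = []
    for prev, nxt in zip(segments, segments[1:]):
        if nxt["start"] - prev["end"] >= min_gap:
            gaps.append((prev["end"], nxt["start"]))
    return gaps


# Two consecutive segments from the "current output" above (text omitted)
current = [
    {"start": 748.93, "end": 755.53},
    {"start": 781.09, "end": 787.29},
]
for gap_start, gap_end in find_gaps(current):
    print(f"no transcript between {gap_start:.2f}s and {gap_end:.2f}s")
```

Each reported span can then be listened to in the source audio to decide whether speech was actually dropped there.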

Environment:

faster-whisper 1.1.0

How can I adjust the parameters or modify the code so that the complete transcript is returned? Any help would be appreciated.

@Purfview
Contributor

Check whether VAD cut off those missing segments.

@RichardQin1
Author

> Check whether VAD cut off those missing segments.

How do I check VAD? Sorry, I'm a beginner.
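One possible way to run that check is to call the VAD on its own and print the speech spans it keeps. This is a sketch, assuming the `decode_audio`, `VadOptions`, and `get_speech_timestamps` helpers that ship with faster-whisper. The names `to_seconds`, `fmt`, and `detect_speech` are made up for illustration.

```python
from typing import Dict, List, Tuple

SAMPLING_RATE = 16000  # the VAD operates on 16 kHz mono audio


def to_seconds(timestamps: List[Dict[str, int]],
               sampling_rate: int = SAMPLING_RATE) -> List[Tuple[float, float]]:
    """Convert VAD spans expressed in samples to (start, end) in seconds."""
    return [(t["start"] / sampling_rate, t["end"] / sampling_rate) for t in timestamps]


def fmt(seconds: float) -> str:
    """Format seconds as MM:SS.mmm."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:06.3f}"


def detect_speech(path: str) -> List[Tuple[float, float]]:
    """Run the VAD with default options and return the speech spans in seconds."""
    # Imported lazily so the helpers above work without faster-whisper installed
    from faster_whisper.audio import decode_audio
    from faster_whisper.vad import VadOptions, get_speech_timestamps

    audio = decode_audio(path, sampling_rate=SAMPLING_RATE)
    # Default VadOptions, matching transcribe(vad_filter=True) without vad_parameters
    return to_seconds(get_speech_timestamps(audio, VadOptions()))
```

Calling `detect_speech("test.mp3")` and printing each span with `fmt` shows which parts of the file reach the model. Any expected sentence whose timestamps fall outside every span was removed by VAD before transcription.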

@RichardQin1
Author

RichardQin1 commented Dec 25, 2024

> Check whether VAD cut off those missing segments.

[image]
Does this show that the missing parts of the audio were discarded by VAD?
How should I tune it, @Purfview?
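One knob that may help is making the VAD less strict through the `vad_parameters` argument of `transcribe`. The sketch below only builds the keyword arguments. `relaxed_vad_kwargs` is a hypothetical helper, and the values 0.3 and 800 are untuned guesses rather than recommendations.

```python
from typing import Any, Dict


def relaxed_vad_kwargs(threshold: float = 0.3, speech_pad_ms: int = 800) -> Dict[str, Any]:
    """Build transcribe() keyword arguments for a more permissive VAD.

    A lower threshold lets quieter frames count as speech; a larger
    speech_pad_ms keeps more audio around each detected span.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return {
        "vad_filter": True,
        "vad_parameters": {"threshold": threshold, "speech_pad_ms": speech_pad_ms},
    }


# Usage (not executed here; needs the model and audio from the original snippet):
# segments, info = asr_model.transcribe(mp3_audio, language="zh", beam_size=5,
#                                       **relaxed_vad_kwargs())
```

As a diagnostic, transcribing once with `vad_filter=False` also shows whether the missing sentences come back when no audio is filtered out at all.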

@jschoen42

The audio is probably too quiet for reliable speech detection.

  • With your original audio (converted to mono WAV, 16 kHz):

original

-> VAD result: [00:17.424 -> 12:36.176], [13:02.288 -> 25:36.400]

  • The same audio normalized to 0 dB:

normalized

-> VAD result: [00:14.864 -> 12:36.144], [13:01.264 -> 25:35.088]
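In the results above, the first speech span starts at 00:14.864 after normalization instead of 00:17.424, which covers the opening sentence reported missing at 14.8 s. The comment does not say which tool did the normalization, so the following is a minimal sketch assuming plain peak normalization of the decoded waveform. `peak_gain` and `normalize_peak` are illustrative names.

```python
def peak_gain(peak: float, target_dbfs: float = 0.0) -> float:
    """Linear gain that moves a signal with the given absolute peak to target_dbfs."""
    if peak <= 0.0:
        return 1.0  # silent input: leave it unchanged
    return 10.0 ** (target_dbfs / 20.0) / peak


def normalize_peak(audio, target_dbfs: float = 0.0):
    """Scale a float waveform (e.g. a NumPy array in [-1, 1]) so that its
    largest absolute sample sits at target_dbfs (0 dBFS = full scale)."""
    peak = float(abs(audio).max())
    return audio * peak_gain(peak, target_dbfs)


# Usage (not executed here; mp3_audio is the array from whisperx.load_audio):
# mp3_audio = normalize_peak(mp3_audio)
```

Peak normalization applies a single gain to the whole file, so one loud transient limits how much the quiet passages are raised. Loudness-based normalization or compression may behave differently.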
