model.transcribe() modified to perform batch inference on audio files #662
17 comments · 32 replies
-
How does it work for you? I tried to test it, but it doesn't seem to work; I'm getting an error.
-
Hello, I have received your message and will handle it as soon as possible. Thank you!
-
With `results = model.transcribe(["audio1.mp3", "audio2.mp3"], initial_prompt=prompt)`, the prompt does not work. See #277.
-
Hi @Blair-Johnson! Thank you very much. Your code ran quickly, but when I used beam_size=5 it returned an error. Can you help me fix this?
-
Hey @Blair-Johnson, thanks for your great contribution! When I try to run batch processing using your fork (with a batch size of 16), I'm seeing significantly lower inference speed compared to vanilla inference (it takes longer with batch processing than with single-example inference). I'm trying to transcribe ~10 hrs of data segmented into ~50 s segments. Do you have any pointers? Thanks!
-
This is incredible!
-
To use this repo, I need to create a new environment and, instead of installing the official whisper, install the whisper from this repo, right?
-
Hi there! OpenAI just released its latest large-v3 model in whisper. Is it possible for batch-whisper to support this update?
-
Hello, I have received your message and will handle it as soon as possible. Thank you!
-
@Blair-Johnson Maybe you should open a draft PR anyway, it'd be a bit easier to inspect the diff and so on :)
-
Hello, I have received your message and will handle it as soon as possible. Thank you!
-
I get a `ValueError` with this implementation:
-
Hello, I have received your message and will handle it as soon as possible. Thank you!
-
BTW, we have just added support for batched long-form transcription to Transformers: huggingface/transformers#27658. It will be in today's release. With batched transcription and a batch size of 16, Transformers `generate` is 4x faster than this codebase. Check the usage section in the PR description to try it out: huggingface/transformers#27658
-
It doesn't work for me, I get an error.
-
In case people didn't see my comment, here's what I did to get it to work for a list of audio files:

```python
import numpy as np
import torch
from datasets import load_dataset, Audio
from transformers import WhisperForConditionalGeneration, AutoProcessor

# Assuming `files` is the list of audio files
files = ["audio1.mp3", "audio2.mp3", "audio3.mp3", "audio4.mp3", "audio5.mp3"]

# Load the audio files and resample everything to 16 kHz
ds = load_dataset("/path/to/files/", data_files=files)["train"]
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
raw_audio = [x["array"].astype(np.float32) for x in ds["audio"]]

# Pad the batch to the longest clip and move it to the GPU in float16
processor = AutoProcessor.from_pretrained("openai/whisper-medium.en")
inputs = processor(raw_audio, return_tensors="pt", truncation=False, padding="longest", return_attention_mask=True, sampling_rate=16_000)
inputs = inputs.to("cuda", torch.float16)

model_medium = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en", torch_dtype=torch.float16)
model_medium.to("cuda")

# Activate the temperature fallback and repetition-detection filters
# (conditioning on previous tokens is disabled here)
result = model_medium.generate(**inputs, condition_on_prev_tokens=False, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), logprob_threshold=-1.0, compression_ratio_threshold=1.35, return_timestamps=True)
decoded = processor.batch_decode(result, skip_special_tokens=True)
```
-
I am not able to use the code below; it's throwing an error for me. How do I use it?
-
What's Different
A few people have posted questions asking how whisper could be used to efficiently process audio clips in parallel. The underlying encoder and decoder both support batched inference, so it should be possible to batch clips together for increased throughput on GPUs. I modified the implementation of the `transcribe()` function to branch to an alternate `batch_transcribe()` whenever the user supplies a list of audio files. Everything is essentially the same as the default `model.transcribe()` implementation, however the mel spectrograms and conditional prompts are batched together for the model inference stages of the pipeline. This allows us to get substantially sub-linear scaling of throughput on GPUs that have additional headroom when running serial transcription.

Example usage:
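A minimal usage sketch, assuming the fork is installed in place of the official whisper package and that the batched call returns one result per input file (the file names below are placeholders):

```python
import whisper

# Load a model as usual; the fork keeps the same loading API.
model = whisper.load_model("large-v1")

# Passing a list of audio files routes transcribe() to the batched
# batch_transcribe() path; a single path still uses the serial implementation.
files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
results = model.transcribe(files)

# Assumption: one result dict per input file, in input order.
for path, result in zip(files, results):
    print(path, result["text"][:80])
```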
Results
Testing transcription on a 3.5 hour podcast batched together with itself in groups of 1, 2, 4, 8, 16, and 32, we can see that we get significant speedups through batching on an NVIDIA A100 (this is the `large-v1` model). We see sub-linear scaling until a batch size of 16, after which the GPU becomes saturated and the scaling becomes linear (but still 3-5x higher throughput than serial).

When clips of different lengths are used, the internal model batch size is reduced whenever a shorter clip is done being transcribed. This means that it is more efficient to batch-transcribe clips of a similar length together. A possible pipeline for many audio clips would involve sorting them by length and batching neighboring clips by some optimal batch size for the GPU in question.
You can check out the batched version of whisper in my fork here.
I haven't tested every use-case to verify that these modifications are 100% non-breaking so there's no PR at the moment.