model.transcribe() modified to perform batch inference on audio files #662
17 comments · 32 replies
-
How does it work for you? I tried to test it, but it doesn't seem to work; I'm getting an error.
-
Hello, I have received your message and will handle it as soon as possible. Thank you!
-
With `results = model.transcribe(["audio1.mp3", "audio2.mp3"], initial_prompt=prompt)`, the prompt does not work. See #277.
-
Hi @Blair-Johnson! Thank you very much. Your code ran quickly, but when I used beam_size=5 it returned an error. Can you help me fix this?
-
Hey @Blair-Johnson, thanks for your great contribution! When I try to run batch processing using your fork (with a batch size of 16), I'm seeing significantly lower inference speed compared to vanilla inference (it takes longer with batch processing than with single-example inference). I'm trying to transcribe ~10 hrs of data segmented into ~50 s segments. Do you have any pointers? Thanks!
-
This is incredible!
-
To use this repo, I need to create a new environment and, instead of installing the official whisper, install the whisper from this repo, right?
-
Hi there! OpenAI just released its latest large-v3 model in whisper. Is it possible for batch-whisper to support this update?
-
Hello, I have received your message and will handle it as soon as possible. Thank you!
-
@Blair-Johnson Maybe you should open a draft PR anyway, it'd be a bit easier to inspect the diff and so on :)
-
Hello, I have received your message and will handle it as soon as possible. Thank you!
-
I get a `ValueError` with this implementation:
-
Hello, I have received your message and will handle it as soon as possible. Thank you!
-
BTW, we have just added support for batched long-form transcription to Transformers: huggingface/transformers#27658. It will be in today's release. With batched transcription and a batch size of 16, Transformers `generate` is 4x faster than this codebase. Check the usage section in the PR description to try it out: huggingface/transformers#27658
-
It doesn't work for me, I get an error.
-
In case people didn't see my comment, here's what I did to get it to work for a list of audio files:

```python
import numpy as np
import torch
from datasets import load_dataset, Audio
from transformers import WhisperForConditionalGeneration, AutoProcessor

# Assuming `files` is the list of audio files
files = ["audio1.mp3", "audio2.mp3", "audio3.mp3", "audio4.mp3", "audio5.mp3"]

# Load the audio files and resample everything to 16 kHz
ds = load_dataset("/path/to/files/", data_files=files)["train"]
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
raw_audio = [x["array"].astype(np.float32) for x in ds["audio"]]

# Pad the batch to the longest clip and move it to the GPU in float16
processor = AutoProcessor.from_pretrained("openai/whisper-medium.en")
inputs = processor(raw_audio, return_tensors="pt", truncation=False, padding="longest", return_attention_mask=True, sampling_rate=16_000)
inputs = inputs.to("cuda", torch.float16)

model_medium = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en", torch_dtype=torch.float16)
model_medium.to("cuda")

# Activate the temperature fallback and repetition-detection filters
# (conditioning on previous tokens is disabled here)
result = model_medium.generate(**inputs, condition_on_prev_tokens=False, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), logprob_threshold=-1.0, compression_ratio_threshold=1.35, return_timestamps=True)
decoded = processor.batch_decode(result, skip_special_tokens=True)
```
-
I am not able to use the code below; it's throwing an error for me. How do I use it?
-
What's Different
A few people have posted questions asking how whisper could be used to efficiently process audio clips in parallel. The underlying encoder and decoder both support batched inference, so it should be possible to batch clips together for increased throughput on GPUs. I modified the implementation of the `transcribe()` function to branch to an alternate `batch_transcribe()` whenever the user supplies a list of audio files. Everything is essentially the same as the default `model.transcribe()` implementation, however the mel spectrograms and conditional prompts are batched together for the model inference stages of the pipeline. This allows us to get substantially sub-linear scaling of throughput on GPUs that have additional headroom when running serial transcription.

Example usage:
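A minimal usage sketch, assuming the fork is installed in place of the official whisper package and that the batched call returns one result per input file (the file names below are placeholders):

```python
import whisper

# Load a model as usual; the fork keeps the same loading API.
model = whisper.load_model("large-v1")

# Passing a list of audio files routes transcribe() to the batched
# batch_transcribe() path; a single path still uses the serial implementation.
files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
results = model.transcribe(files)

# Assumption: one result dict per input file, in input order.
for path, result in zip(files, results):
    print(path, result["text"][:80])
```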
Results
Testing transcription on a 3.5 hour podcast batched together with itself in groups of 1, 2, 4, 8, 16, and 32, we can see that we get significant speedups through batching on an NVIDIA A100 (this is the `large-v1` model). We see sub-linear scaling until a batch size of 16, after which the GPU becomes saturated and the scaling becomes linear (but still 3-5x higher throughput than serial).

When clips of different lengths are used, the internal model batch size is reduced whenever a shorter clip is done being transcribed. This means that it is more efficient to batch-transcribe clips of a similar length together. A possible pipeline for many audio clips would involve sorting them by length and batching neighboring clips by some optimal batch size for the GPU in question.
You can check out the batched version of whisper in my fork here.
I haven't tested every use-case to verify that these modifications are 100% non-breaking so there's no PR at the moment.