OpenAI Whisper - Up to 3x CPU Inference Speedup using Dynamic Quantization #454
-
I tried your fork on the large model; the results are promising and I'm still validating the output. As I understand it, you convert the model from fp16 to int8, which reduces quality a bit but makes execution faster. NVIDIA Tensor Cores also support int8 and int4: https://youtu.be/yyR0ZoCeBO8?t=20 PS: please correct me if I'm wrong; I'm new to Python and machine learning.
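For context, the pattern under discussion is PyTorch's post-training dynamic quantization. A minimal sketch of it, assuming the torch and openai-whisper packages and a placeholder audio.mp3 input (note that, as I understand it, the fork adjusts Whisper's custom Linear subclass so the layers are plain nn.Linear; eager-mode quantize_dynamic matches module types exactly, so this matters):

```python
import torch
import whisper

# Load on CPU; dynamic quantization is a CPU-oriented optimization.
model = whisper.load_model("large", device="cpu")

# Swap nn.Linear weights to int8; activations stay in floating point
# and are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# "audio.mp3" is a placeholder input file.
result = quantized_model.transcribe("audio.mp3", fp16=False)
print(result["text"])
```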
-
I tested on an M1 Mac, and the model size didn't change for any of the models. The speed is the same.
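One possible explanation, offered as a hedged guess rather than a confirmed diagnosis: stock Whisper defines its own Linear subclass in whisper/model.py, and PyTorch's eager-mode quantize_dynamic swaps modules by exact type, so on the unmodified repo the call may convert nothing at all. A quick check:

```python
import torch
import whisper

model = whisper.load_model("base", device="cpu")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Count how many Linear layers were actually replaced by their
# dynamically quantized int8 counterparts; zero means the call was a no-op.
n = sum(
    isinstance(m, torch.nn.quantized.dynamic.Linear)
    for m in quantized.modules()
)
print(f"{n} dynamically quantized Linear layers")
```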
-
Looking at the table, the tiny model is slower after applying dynamic quantization. Have you tried quantizing the whole model except the convolutional layers? I had a similar problem with wav2vec2 base and distilled (the models with fewer parameters): quantizing the whole model made the times worse, but skipping the 8 convolutional layers made them slightly better. This behaviour did not happen with the large model.
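For what it's worth, with PyTorch's quantize_dynamic the module types to convert are opted in explicitly, so the convolutions are excluded simply by not listing them; as far as I know, PyTorch's dynamic mode covers Linear and recurrent layers but not conv layers anyway. A sketch verifying that Whisper's encoder Conv1d layers stay in floating point:

```python
import torch
import whisper

model = whisper.load_model("tiny", device="cpu")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8  # convs not listed, so untouched
)

# Confirm the conv layers kept their floating-point weights.
for name, module in quantized.named_modules():
    if isinstance(module, torch.nn.Conv1d):
        print(name, module.weight.dtype)  # expect torch.float32
```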
-
I tried to quantize the model on a V100 GPU, but I didn't get a significant speedup.
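That matches expectations, as far as I can tell: PyTorch's dynamic quantization runs through CPU kernel backends (fbgemm on x86, qnnpack on ARM), and there is no CUDA backend for these ops, so a GPU such as the V100 sees no benefit. You can list the available engines:

```python
import torch

# The engines are CPU backends such as 'fbgemm' (x86) or 'qnnpack' (ARM);
# there is no CUDA entry in this list.
print(torch.backends.quantized.supported_engines)
print(torch.backends.quantized.engine)
```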
-
Hello! Is there any specific reason this optimization hasn't been merged into the main repository (even if it's not the default behavior)?
-
Can someone help me with the installation? I'm new.
-
I'm using this approach together with a suggestion to reduce hallucinations, but I'm not getting those speedups on CPU. I don't get it; quantization should work even with compressed data. Any ideas?
-
Applying the simple post-training dynamic quantization process included with PyTorch to OpenAI Whisper provides great speedups for CPU-based deployment. This is of particular interest to people running OpenAI Whisper models on laptops that lack hardware acceleration. Anecdotal results show that accuracy for the smaller models is the same, if not slightly higher, after quantization, but is very slightly reduced for the largest model.
The results below are for transcribing 30 seconds of audio:
Others have found even greater speedups for the large model, around roughly 3.25x (see openai-whisper-cpu on GitHub).
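For anyone who wants to reproduce these numbers on their own machine, a rough timing sketch, assuming a model whose Linear layers are plain nn.Linear (see the caveat earlier in the thread) and a local 30-second clip named audio.wav:

```python
import time
import torch
import whisper

model = whisper.load_model("base", device="cpu")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Compare wall-clock transcription time, fp32 vs. int8 dynamic.
for label, m in [("fp32", model), ("int8 dynamic", quantized)]:
    start = time.perf_counter()
    m.transcribe("audio.wav", fp16=False)  # placeholder audio file
    print(f"{label}: {time.perf_counter() - start:.1f}s")
```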