OpenAI Whisper - Up to 3x CPU Inference Speedup using Dynamic Quantization #454
-
I tried your fork on the large model; the results are promising and I'm still validating the output. As I understand it, you convert the model from fp16 to int8, which reduces quality a bit but makes execution faster. NVIDIA Tensor Cores also support int8 and int4: https://youtu.be/yyR0ZoCeBO8?t=20 PS: please correct me if I'm wrong; I'm new to Python and machine learning.
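For context, the pattern under discussion is PyTorch's post-training dynamic quantization. A minimal sketch of it, assuming the torch and openai-whisper packages and a placeholder audio.mp3 input (note that, as I understand it, the fork adjusts Whisper's custom Linear subclass so the layers are plain nn.Linear; eager-mode quantize_dynamic matches module types exactly, so this matters):

```python
import torch
import whisper

# Load on CPU; dynamic quantization is a CPU-oriented optimization.
model = whisper.load_model("large", device="cpu")

# Swap nn.Linear weights to int8; activations stay in floating point
# and are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# "audio.mp3" is a placeholder input file.
result = quantized_model.transcribe("audio.mp3", fp16=False)
print(result["text"])
```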
-
I tested on an M1 Mac, and the model size didn't change for any of the models. The speed is the same.
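One possible explanation, offered as a hedged guess rather than a confirmed diagnosis: stock Whisper defines its own Linear subclass in whisper/model.py, and PyTorch's eager-mode quantize_dynamic swaps modules by exact type, so on the unmodified repo the call may convert nothing at all. A quick check:

```python
import torch
import whisper

model = whisper.load_model("base", device="cpu")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Count how many Linear layers were actually replaced by their
# dynamically quantized int8 counterparts; zero means the call was a no-op.
n = sum(
    isinstance(m, torch.nn.quantized.dynamic.Linear)
    for m in quantized.modules()
)
print(f"{n} dynamically quantized Linear layers")
```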
-
Looking at the table, the tiny model is slower after applying dynamic quantization. Have you tried quantizing the whole model except the convolutional layers? I had a similar problem with wav2vec2 base and distilled (the models with fewer parameters): quantizing the whole model made the times worse, but skipping the 8 convolutional layers made them slightly better. This behaviour did not happen with the large model.
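For what it's worth, with PyTorch's quantize_dynamic the module types to convert are opted in explicitly, so the convolutions are excluded simply by not listing them; as far as I know, PyTorch's dynamic mode covers Linear and recurrent layers but not conv layers anyway. A sketch verifying that Whisper's encoder Conv1d layers stay in floating point:

```python
import torch
import whisper

model = whisper.load_model("tiny", device="cpu")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8  # convs not listed, so untouched
)

# Confirm the conv layers kept their floating-point weights.
for name, module in quantized.named_modules():
    if isinstance(module, torch.nn.Conv1d):
        print(name, module.weight.dtype)  # expect torch.float32
```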
-
I tried to quantize the model on a V100 GPU, but I didn't get a significant speedup.
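That matches expectations, as far as I can tell: PyTorch's dynamic quantization runs through CPU kernel backends (fbgemm on x86, qnnpack on ARM), and there is no CUDA backend for these ops, so a GPU such as the V100 sees no benefit. You can list the available engines:

```python
import torch

# The engines are CPU backends such as 'fbgemm' (x86) or 'qnnpack' (ARM);
# there is no CUDA entry in this list.
print(torch.backends.quantized.supported_engines)
print(torch.backends.quantized.engine)
```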
-
Hello! Is there any specific reason this optimization hasn't been merged into the main repository (even if it's not the default behavior)?
-
Can someone help me with the installation? I'm new.
-
I'm using this approach together with a suggestion to reduce hallucinations, but I'm not getting those speedups on CPU. I don't get it; quantization should work even with compressed data. Any ideas?
-
Applying the simple post-training dynamic quantization process included with PyTorch to OpenAI Whisper provides great speedups for CPU-based deployment. This is of particular interest to people running OpenAI Whisper models on laptops that lack hardware acceleration. Anecdotal results show that accuracy for the smaller models is the same, if not slightly higher, after quantization, but is very slightly reduced for the largest model.
The results below are for transcribing 30 seconds of audio:
Others have found even greater speedups for the large model, around roughly 3.25x (see openai-whisper-cpu on GitHub).
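For anyone who wants to reproduce these numbers on their own machine, a rough timing sketch, assuming a model whose Linear layers are plain nn.Linear (see the caveat earlier in the thread) and a local 30-second clip named audio.wav:

```python
import time
import torch
import whisper

model = whisper.load_model("base", device="cpu")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Compare wall-clock transcription time, fp32 vs. int8 dynamic.
for label, m in [("fp32", model), ("int8 dynamic", quantized)]:
    start = time.perf_counter()
    m.transcribe("audio.wav", fp16=False)  # placeholder audio file
    print(f"{label}: {time.perf_counter() - start:.1f}s")
```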