whisper-large-v3 compatibility #1530
OpenAI Whisper large-v3 changes the number of mel input features from 80 to 128. Exposing n_mels is required to propagate the input size to the audio feature extractor.
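A minimal sketch of why the feature count has to be configurable: the encoder input shape depends on the number of mel bins, so a value fixed at 80 no longer fits large-v3. The helper name `expected_feature_shape` and the plain-dict config are hypothetical and not part of this PR; `num_mel_bins` is assumed to be the key that carries the value (as in the Hugging Face Whisper config), and 3000 frames is assumed to correspond to a 30-second window at a 10 ms hop.

```python
def expected_feature_shape(config: dict, n_frames: int = 3000) -> tuple:
    """Return the (batch, n_mels, n_frames) shape the encoder would expect.

    Hypothetical helper for illustration only.
    """
    # Fall back to 80 bins, the value used before large-v3.
    n_mels = config.get("num_mel_bins", 80)
    return (1, n_mels, n_frames)


# A config without the key keeps the old behaviour; large-v3 sets 128.
legacy_shape = expected_feature_shape({})
large_v3_shape = expected_feature_shape({"num_mel_bins": 128})
```

Reading the value from the model configuration rather than a constant is the design choice being illustrated; the real feature extractor may take the value through a different path.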
Hi, could you also add the large-v3 alignment heads here: https://github.com/OpenNMT/CTranslate2/blob/master/python/ctranslate2/converters/transformers.py#L1929-L2042
They can be obtained from here: https://github.com/openai/whisper/blob/fcfeaf1b61994c071bba62da47d7846933576ac9/whisper/__init__.py#L45
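As a rough sketch of how the (layer, head) pairs could be derived from the linked file: this assumes the alignment heads there are stored as a base85-encoded, gzip-compressed flat boolean mask with one byte per entry, laid out row-major as layers × heads. The function name `decode_alignment_heads` is hypothetical, and the input below is synthetic, not a real model's mask.

```python
import base64
import gzip


def decode_alignment_heads(dump: bytes, n_heads: int) -> list:
    """Turn an encoded boolean mask into a list of (layer, head) pairs.

    Assumes a row-major mask of shape (n_layers, n_heads), one byte per entry.
    """
    raw = gzip.decompress(base64.b85decode(dump))
    return [(i // n_heads, i % n_heads) for i, flag in enumerate(raw) if flag]


# Synthetic 2-layer, 3-head mask used only to exercise the round trip.
mask = [0, 1, 0,
        0, 0, 1]
dump = base64.b85encode(gzip.compress(bytes(mask)))
pairs = decode_alignment_heads(dump, n_heads=3)
```

The resulting pairs have the same form as the entries in the converter's table, for example `(26, 12)`.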
The current check for multilingual support seems to be hardcoded to a specific vocabulary size. For instance, I believe the whisper-large-v3 model has a vocabulary size of 51866, which is one more than the hardcoded value. This discrepancy could lead to the multilingual feature being incorrectly disabled for this model. A more dynamic check probably needs to be implemented to ensure compatibility with future models. Edit: oh sorry, I did not notice it's already fixed in the PR.
A fix is already in this PR: CTranslate2/src/models/whisper.cc, line 73 in 7615e41
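One way such a check can avoid breaking on larger vocabularies is to compare against a lower bound instead of testing for equality. The sketch below is in Python for illustration and is not necessarily the code at the referenced line of whisper.cc; the constant and function names are hypothetical, the threshold 51865 is inferred from the comment above, and it assumes English-only models have a smaller vocabulary.

```python
# Smallest vocabulary size treated as multilingual (inferred from the
# discussion above; English-only models are assumed to fall below it).
MULTILINGUAL_MIN_VOCAB = 51865


def is_multilingual(vocab_size: int) -> bool:
    """Return True for any vocabulary at least as large as the threshold."""
    return vocab_size >= MULTILINGUAL_MIN_VOCAB


# large-v3 (51866 tokens) passes, whereas an equality test against 51865
# would have rejected it.
large_v3_ok = is_multilingual(51866)
```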
@vince62s Can you merge this?
@@ -2039,4 +2039,16 @@ def main():
(26, 12),
(27, 15),
],
"openai/whisper-large-v3": [
Possibly worth adding a comment on the source of these, since it's different from the source on L1928.
OpenAI Whisper large-v3 changes the number of mel input features from 80 to 128.
Exposing n_mels is required to propagate the input size to the audio feature extractor.
We also need to add the large-v3 alignment heads.
A fix is required in the computation of _is_multilingual.