feat: code for whisper-large-v3 #548
Conversation
fyi @gradjitta -- using your converted model here
Is this enough without a CTranslate2 update?
I think that PR updates some docstrings and adds the `num_languages` param. What would be helpful to add in CTranslate2 is the corresponding `num_languages` attribute on the model.
I don't think that propagating num_languages is necessary, as it only matters for one language. We can assume that people forcing the inference to yue know that they should not use large-v2 or large-v1. yue needs to be added to _LANGUAGE_CODES. n_mels can be inferred from the model params, which is why I am updating CTranslate2 to propagate this param.
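For context, a rough sketch of the first change mentioned above (only the new entry is shown; the existing codes in faster_whisper/tokenizer.py stay as they are):

```python
# faster_whisper/tokenizer.py keeps a tuple of accepted language codes;
# large-v3's Cantonese code has to be appended to it.
_LANGUAGE_CODES = (
    # ... "af", "am", ..., "zh" (existing codes unchanged) ...
    "yue",  # new in large-v3
)
```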
v3 supports code-switching https://twitter.com/jeremyphoward/status/1721696652506100175 so multi-language support does matter.
When trying to use .transcribe(language='de'), an error is returned saying the model is "not multilingual".
Can you point me to precisely where in your PR num_languages is necessary? I don't think your PR is complete: I can only see a change that is more of a warning for a user passing an unsupported language in the params, which would not make the inference fail. A change might be required when loading the Hugging Face tokenizer; is the v3-compatible Hugging Face tokenizer available yet?
Reproduced, also with another conversion: https://github.com/guillaumekln/faster-whisper/pull/549/files#diff-dff5046df32208d1eaf3b13702e9c9a4c5b44ac2451aa59a6f31aebf8b4d66e3R24. Seems like the conversion is tricky to get the is_multilingual flag right.
In CTranslate2, is_multilingual is inferred from the size of the vocabulary, and that condition is not met anymore.
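A rough illustration of the check being described (a sketch, not the actual CTranslate2 source; the counts are the usual Whisper vocabulary sizes): multilingual checkpoints up to large-v2 have 51865 tokens, while large-v3 adds one extra language token, so a strict equality test stops matching.

```python
# Sketch only: how a vocabulary-size based multilingual check breaks for large-v3.
MULTILINGUAL_VOCAB_SIZE = 51865   # large-v1 / large-v2 multilingual vocabulary
LARGE_V3_VOCAB_SIZE = 51866       # one extra <|yue|> language token

def is_multilingual_strict(vocab_size: int) -> bool:
    return vocab_size == MULTILINGUAL_VOCAB_SIZE   # False for large-v3

def is_multilingual_relaxed(vocab_size: int) -> bool:
    return vocab_size >= MULTILINGUAL_VOCAB_SIZE   # also True for large-v3
```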
> Can you give me precisely where in your PR the num_languages is necessary?
You're right - after thinking about it more, I think it has no effect on the current release. However, if the next version adds another language, then we will be in the same boat. I think it's useful to keep this logic here; I took it from the upstream Python repo and feel like it's more future-proof.
In the openai repo they have the language logic because they wrapped the tiktoken tokenizer. I agree that is_multilingual is not very robust, especially if a model is released with fewer languages or with a different number of tokens. I think that the logic should remain in CTranslate2 but could be improved if another issue comes up in the future. I will submit a PR when the HF model is released and after having run some tests.
It's now released: https://huggingface.co/openai/whisper-large-v3
i have a working conversion uploaded to bababababooey/faster-whisper-large-v3; yeah, i added a manual workaround.
okay, i noticed quality with the English-only model/tokenizer was significantly worse (missing punctuation, not segmenting sentences, etc.) than when forcing the multilingual tokenizer, so i'm using multilingual for everything. seems like i'm getting the same quality as the openai whisper package now. i think the only thing missing now is getting feature_size/n_mels configurable.
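A minimal sketch of the manual workaround being described, assuming the FeatureExtractor class from faster_whisper/feature_extractor.py and that large-v3 expects 128 mel bins instead of 80:

```python
from faster_whisper.feature_extractor import FeatureExtractor

# large-v3 was trained on 128-bin mel spectrograms; earlier checkpoints use 80.
feature_extractor = FeatureExtractor(feature_size=128)
```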
@stillmatic Opened basically the same PR yesterday lol
The HF model was released, but the tokenizer is not in the required format and I am not able to convert it to the fast tokenizer format (tokenizer.json). I will wait for the official files. If people are in a rush to use large-v3 they can use any of the proposed PRs.
As I mentioned on CTranslate2/pull/1530, I could make the model work by doing the changes suggested by @flyingleafe and removing the last line of the vocabulary.json.
edit: oh sorry, did not notice it's already fixed in the @funboarder13920 PR.
cool - let's wait for the @funboarder13920 PR to be merged; the multilingual fix might be the only thing necessary there? the other question is how to handle num_languages.
fyi - pytorch weights are also available on hub: https://huggingface.co/openai/whisper-large-v3/blob/main/pytorch_model.bin
I will make a num_languages available in the ctranslate2 model. I don't see when we will need the num_languages though.
I think we should be able to succeed now just on this ... testing on CTranslate2 head commit now
faster_whisper/utils.py
Outdated
@@ -21,6 +21,7 @@
     "large-v1": "guillaumekln/faster-whisper-large-v1",
     "large-v2": "guillaumekln/faster-whisper-large-v2",
     "large": "guillaumekln/faster-whisper-large-v2",
+    "large-v3": "gradjitta/ct2-whisper-large-v3",
This repo is not official, it was not generated from the official HF openai/whisper-large-v3 repo, and the tokenizer.json is likely wrong: there is no yue in the languages.
which would you suggest using? @bungerr has one at bababababooey/faster-whisper-large-v3 but didn't copy the preprocessor_config.json over, which is useful to get the feature extractor set up correctly.
oop, i just added it in; i had to copy the alignment heads over from there into config.json & forgot.
also accidentally wiped my local openai/whisper-large-v3 repo, pulling again & will reconvert when the new wheels finish building: https://github.com/OpenNMT/CTranslate2/actions/runs/6801730648
note: tokenizer.json is not provided in the official repo, so if you want to run the ct2 conversion script you'll have to either use hf's openai_to_hf converter script (which may have been updated since i last checked(?)), or just do what i did and load+save from AutoTokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("openai/whisper-large-v3")
tokenizer.save_pretrained("./tokenizerjson")
I am still waiting for the large-v3 HF hub repo to be aligned with all the other models, specifically the tokenizer.json, before trying anything.
@funboarder13920 There is a very simple way to obtain the tokenizer.json: load the tokenizer with AutoTokenizer and save it back out, as shown above. This yields all the necessary files for the tokenizer, including tokenizer.json.
@funboarder13920 alternatively you can use the conversion script from my PR to transformers, heh (huggingface/transformers#27336). I pinged Sanchit there regarding this as well.
@flyingleafe what about the remaining conversion issue?
@Purfview I figured the issue out; basically it comes down to a difference around the BOS token, so by the converter's logic the model was not handled correctly. I opened the PR in CTranslate2 fixing this: OpenNMT/CTranslate2#1546. The other issue which was present in the converter is that it did not get the suppressed token values from the HF model's generation config.
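For the second point, here is an illustrative sketch of pulling the suppressed token ids from the Hugging Face side rather than hard-coding them (the file name and keys follow the usual layout of recent HF Whisper checkpoints such as large-v3; treat them as assumptions):

```python
import json

# generation_config.json ships alongside recent HF Whisper checkpoints.
with open("generation_config.json", encoding="utf-8") as f:
    generation_config = json.load(f)

suppress_tokens = generation_config.get("suppress_tokens", [])
begin_suppress_tokens = generation_config.get("begin_suppress_tokens", [])
print(len(suppress_tokens), begin_suppress_tokens)
```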
@flyingleafe I'm not an expert, but your fixes make sense. Are you sure that there are no other quirks with tokens?
@flyingleafe Thanks! I'm gonna have a look later today!
@flyingleafe I didn't notice any difference so far.
@Purfview can you send me some of your test samples maybe? maybe my samples are too easy; so far the results I see are virtually identical. half of the new commit to the HF checkpoint makes sense (bos/eos/pad token ids are wrong indeed), but the reduction of the vocabulary size actually does not make sense; i will test the HF version today with and without this change to figure out why they did that.
Check this.
I ran some benchmarks to compare large-v2 with large-v3. large-v2 is currently way better than large-v3.
Next week I will benchmark openai/large-v3 against hf/large-v3 against ct2/large-v3, and openai/large-v2 against openai/large-v3, to narrow down the possibilities.
@funboarder13920 can confirm: on the example provided by @Purfview, large-v3 hallucinates a lot of BS at the end and in-between phrases, while large-v2 does not; no problems with punctuation though.
After this long discussion, is there any update on the final result? I want to be sure it's OK to update to v3, at least with no negative impact compared to v2.
What language do you use to test? Do you test on multiple languages or just English?
From my own testing, the same problem occurs with the Whisper v3 translation (JP to EN) task. For the most part, Whisper v3 has better grammar, but it tends to hallucinate and fill in words by itself, which can cause the context to change at times. This occurs even with Hugging Face's implementation, so I am not sure if the problem lies with the conversion from OpenAI's model to HF or with the model itself. Whisper v2 is slightly better in terms of "word for word", so the context doesn't change as much, but hallucination is still a problem.
@flyingleafe
@@ -113,6 +114,9 @@ def __init__(
        are saved in the standard Hugging Face cache directory.
      local_files_only: If True, avoid downloading the file and return the path to the
        local cached file if it exists.
+     feature_size: Number of mel filters to use for feature extraction. If not set,
Not used anymore
@@ -142,7 +146,25 @@ def __init__(
            "openai/whisper-tiny" + ("" if self.model.is_multilingual else ".en")
        )

-       self.feature_extractor = FeatureExtractor()
+       feature_extractor_file = os.path.join(model_path, "preprocessor_config.json")
Maybe move that into a specific method? And use the n_mels from ct2 as a fallback?
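Something along these lines could work, as a sketch (names are illustrative, not the final implementation): read preprocessor_config.json when the converted model ships one, otherwise fall back to the FeatureExtractor defaults or to the n_mels stored in the CTranslate2 model.

```python
import json
import os

# Keys accepted by FeatureExtractor in faster_whisper/feature_extractor.py.
_FEATURE_EXTRACTOR_KEYS = {"n_fft", "hop_length", "feature_size", "sampling_rate", "chunk_length"}


def load_feature_extractor_kwargs(model_path: str) -> dict:
    """Return FeatureExtractor kwargs read from preprocessor_config.json, if present."""
    config_path = os.path.join(model_path, "preprocessor_config.json")
    if not os.path.isfile(config_path):
        return {}  # fall back to defaults (or to n_mels taken from the ct2 model)
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return {k: v for k, v in config.items() if k in _FEATURE_EXTRACTOR_KEYS}
```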
I'll try to get to this tomorrow. extremely limited bandwidth with American holidays currently
@@ -1,5 +1,5 @@
 av==10.*
-ctranslate2>=3.17,<4
+ctranslate2>=3.21,<4
 huggingface_hub>=0.13
 tokenizers>=0.13,<0.15
Add support for the tokenizers 0.15 version, otherwise it is inconvenient to install faster-whisper together with transformers/tokenizers.
do you know if any compatibility fixes are necessary? or will just tokenizers>=0.13 suffice?
I guess Guillaume did that because ctranslate2 relies heavily on the tokenizers, to prevent a new tokenizers version from introducing breaking changes into faster-whisper. We can go for tokenizers>=0.13,<0.16.
We ran some tests on our side; tokenizers 0.15 works normally, so tokenizers>=0.13,<0.16 looks good to me.
@@ -1,3 +1,3 @@
 """Version information."""

 __version__ = "0.9.0"
this commit is for the release
@@ -21,6 +21,7 @@
     "large-v1": "guillaumekln/faster-whisper-large-v1",
     "large-v2": "guillaumekln/faster-whisper-large-v2",
     "large": "guillaumekln/faster-whisper-large-v2",
+    "large-v3": "bababababooey/faster-whisper-large-v3",
Who owns that?
Can Systran create its own hub organization to keep ownership of the models and easily update old models and upload new models?
> Who owns that?

Some user posted that link in the issues.
Best would be to keep the model under an official account like "systran"; maybe move all the models there.
Good point. We are waiting for the release on CTranslate2, then we will push the newly converted model tomorrow (with the fix from OpenNMT/CTranslate2#1546) to the Systran organization.
FYI, several models are now available in the Systran organization: https://huggingface.co/Systran (including large-v3, converted with the latest CTranslate2 3.22.0).
@nguyendc-systran thank you! So essentially just waiting for @stillmatic to do the last fixes you mentioned before merge?
@Purfview I think you accidentally pasted the same URL.
@AvivSham Fixed it.
Not sure if I missed anything, but in tokenizer.json there is a difference in token 50363: nospeech vs nocaptions.
The same difference is in vocabulary.json.
The ctranslate2 large-v2 uses nocaptions, which is the same as what flyingleafe is using.
However, hf-large-v3 uses nospeech, which is the same as what Systran is using.
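(For anyone wanting to verify this locally, a small sketch comparing the entry for id 50363 in the two files; the file paths are assumptions.)

```python
import json

with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)
with open("vocabulary.json", encoding="utf-8") as f:
    vocab = json.load(f)

# HF fast-tokenizer files list special tokens under "added_tokens" with their ids.
special = {t["id"]: t["content"] for t in tok.get("added_tokens", [])}
print("tokenizer.json 50363:", special.get(50363))

# The CTranslate2 vocabulary.json is a flat list indexed by token id.
if isinstance(vocab, list):
    print("vocabulary.json 50363:", vocab[50363])
```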
> @nguyendc-systran thank you! So essentially just waiting for @stillmatic to do the last fixes you mentioned before merge?

IMHO, yes.
Another point that may be interesting/relevant is this benchmark: #548 (comment).
Not sure if @funboarder13920 has had a chance to get to that?
Yep, the inference results between openai/hf/faster_whisper are not exactly the same but very similar. I guess I was witnessing the differences between v2 and v3.
@@ -1,5 +1,5 @@
 av==10.*
-ctranslate2>=3.17,<4
+ctranslate2>=3.21,<4
A new release 3.22.0 of CTranslate2 is in progress: OpenNMT/CTranslate2#1561
Please update this accordingly to get this fix when converting HF to CT2.
will do, please ping when it's on pypi
for k in [
    "n_fft",
    "hop_length",
    "feature_size",
    "sampling_rate",
    "chunk_length",
]
Minor remark: in this new method, could you make these parameters less hard-coded, since they come from this class: https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/feature_extractor.py#L8-L12
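One way to do that, sketched under the assumption that FeatureExtractor's __init__ keyword arguments are exactly the accepted keys:

```python
import inspect

from faster_whisper.feature_extractor import FeatureExtractor


def valid_feature_extractor_keys() -> set:
    # Derive the accepted keys from the class itself instead of hard-coding the list.
    params = inspect.signature(FeatureExtractor.__init__).parameters
    return set(params) - {"self"}


# e.g. filter a parsed preprocessor_config.json dict:
# kwargs = {k: v for k, v in config.items() if k in valid_feature_extractor_keys()}
```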
@nguyendc-systran in https://huggingface.co/Systran/faster-whisper-large-v3/tree/main the vocabulary file is in a different format than in the older model repos.
Hello, CT2 supports the vocabulary in both formats.
Added some changes and reflections from this thread in a new PR #578
Hi @stillmatic, @Oscaarjs et al.
Still, the nospeech vs nocaptions difference in token 50363 wasn't answered.
I think these are the necessary changes: make feature_size/num_mels configurable, loaded from preprocessor_config.json, which comes from the HF repos.
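Once feature_size/num_mels is picked up from preprocessor_config.json, using large-v3 should look the same as any other size. A minimal usage sketch (assuming the "large-v3" alias resolves to the converted repo discussed above):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", beam_size=5)
print("Detected language:", info.language, info.language_probability)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```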