
feat: code for whisper-large-v3 #548

Closed · wants to merge 9 commits

Conversation

@stillmatic commented Nov 7, 2023

I think these are the necessary changes:

  • makes feature_size/num_mels configurable; it is loaded from preprocessor_config.json, which comes from the HF repos (see the sketch after this list)
  • adds the HF large-v3 conversion from @bungerr
  • adds the yue (Cantonese) language token
  • bumps the version to 0.10.0
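
For reference, roughly what the feature_size change amounts to (a sketch, not the actual diff; model_dir is a hypothetical local folder with the converted model):

import json
import os

from faster_whisper.feature_extractor import FeatureExtractor

model_dir = "path/to/faster-whisper-large-v3"  # hypothetical local model folder

# preprocessor_config.json ships with the HF repo; large-v3 declares feature_size=128
config_path = os.path.join(model_dir, "preprocessor_config.json")
if os.path.isfile(config_path):
    with open(config_path) as f:
        preprocessor_config = json.load(f)
    feature_extractor = FeatureExtractor(feature_size=preprocessor_config.get("feature_size", 80))
else:
    feature_extractor = FeatureExtractor()  # previous behaviour: 80 mel bins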

@stillmatic
Author

fyi @gradjitta -- using your translated model here

@Purfview
Contributor

Purfview commented Nov 7, 2023

Is this enough without OpenNMT/CTranslate2#1530?

@stillmatic
Author

stillmatic commented Nov 7, 2023

Is this enough without OpenNMT/CTranslate2#1530?

I think that PR updates some docstrings and adds the n_mels() function, which reports how many mel bins the model expects. That is helpful, but none of the logic inside encode changes. In the faster-whisper code, the feature extraction is done in Python (and it already has an n_mels arg); it just passes the extracted array to CTranslate2, which works.

What would be helpful to add in CTranslate2 is a corresponding num_languages function, instead of relying on the hack here.
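
Roughly the flow I mean (a sketch, not the PR's exact code; feature_size=128 matches what large-v3's preprocessor config declares, and the model path is a placeholder):

import ctranslate2
import numpy as np

from faster_whisper.audio import decode_audio
from faster_whisper.feature_extractor import FeatureExtractor

extractor = FeatureExtractor(feature_size=128)  # 128 mel bins for large-v3
audio = decode_audio("sample.wav", sampling_rate=extractor.sampling_rate)
features = extractor(audio)[:, : extractor.nb_max_frames].astype(np.float32)

# the extracted mel array is handed to CTranslate2 as-is
model = ctranslate2.models.Whisper("path/to/ct2-whisper-large-v3")
features_view = ctranslate2.StorageView.from_array(np.ascontiguousarray(features[np.newaxis, ...]))
encoder_output = model.encode(features_view, to_cpu=False)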

@funboarder13920

I don't think that propagating num_languages is necessary, as it only matters for one language. We can assume that people forcing the inference to yue know that they should not use large-v2 or large-v1.
large-v3 will be set as the new default large model.

yue needs to be added to _LANGUAGE_CODES.

n_mels can be inferred from the model params, which is why I am updating ctranslate2 to propagate this param.

@stillmatic
Author

I don't think that propagating num_languages is necessary as it only matters for one language. We can assume that people forcing the inference to yue know that they should not use large-v2 or large-v1.

v3 supports code-switching (https://twitter.com/jeremyphoward/status/1721696652506100175), so multi-language support does matter.

@ThomasKluiters

When trying to use .transcribe(language='de') an error is returned, as the model is "not multilingual".

@funboarder13920

Can you point me to precisely where in your PR num_languages is necessary?

I don't think your PR is complete. I can only see a change that amounts to a warning for a user sending an unsupported language in the params, which would not make the inference fail.
You didn't add the new language to _LANGUAGE_CODES, so your change has no effect.

A change might be required when loading the Hugging Face tokenizer. Is the v3-compatible Hugging Face tokenizer available yet?

@stillmatic
Author

When trying to use .transcribe(language='de') an error is returned, as the model is "not multilingual".

reproduced, also with another conversion: https://github.com/guillaumekln/faster-whisper/pull/549/files#diff-dff5046df32208d1eaf3b13702e9c9a4c5b44ac2451aa59a6f31aebf8b4d66e3R24

seems like it's tricky for the conversion to get is_multilingual support right (this logic hasn't changed in the PR)

You didn't add the new language to _LANGUAGE_CODES, so your change has no effect.

https://github.com/guillaumekln/faster-whisper/blob/9c378b681c5c73795e01f7d6aaec368ece6499cf/faster_whisper/tokenizer.py#L278

@funboarder13920

funboarder13920 commented Nov 7, 2023

In ctranslate2, is_multilingual is inferred from the size of the vocabulary; the condition is not met anymore.

@funboarder13920

Can you point me to precisely where in your PR num_languages is necessary?

@stillmatic
Author

you're right - after thinking about it more, I think it has no effect on the current release. However, if the next version adds another language, then we will be in the same boat. I think it's useful to keep this logic here; I took it from the upstream Python repo and it feels more future-proof. The is_multilingual logic could be handled similarly, using the same logic as upstream.
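
For reference, this is roughly how the upstream repo derives both values from the vocabulary size (a sketch from memory, not the exact upstream code):

# multilingual Whisper checkpoints have at least 51865 vocab entries;
# the language tokens sit right after the first 51765 text/special tokens
def is_multilingual(vocab_size: int) -> bool:
    return vocab_size >= 51865

def num_languages(vocab_size: int) -> int:
    return vocab_size - 51765 - int(is_multilingual(vocab_size))

# large-v1/v2: num_languages(51865) == 99; large-v3: num_languages(51866) == 100 (adds yue)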

@funboarder13920

In the openai repo they have the language logic because they wrapped the tiktoken tokenizer.
Faster-whisper uses the Hugging Face tokenizer and does not really handle the language encoding itself; everything is done in the Hugging Face tokenizer.
I don't think there will be any change in the Hugging Face tokenizer, and it won't require any parameter change either; we just need the new vocabulary file in the download folder and the tokenizer should be v3 compatible.
In the future, if openai releases a new model with different languages, just having the new vocab file and the weights in the ctranslate2 format will be enough.

I agree that is_multilingual is not very robust, especially if a model is released with fewer languages or a different number of tokens. I think that the logic should remain in ctranslate2, but it could be improved if another issue comes up in the future.

I will submit a PR when the HF model is released and after I have run some tests.

@Purfview
Contributor

Purfview commented Nov 7, 2023

when the HF model is released

It's now released: https://huggingface.co/openai/whisper-large-v3

@bungerr

bungerr commented Nov 7, 2023

i have a working conversion uploaded to bababababooey/faster-whisper-large-v3
(see: #544 )

yeah, the is_multilingual check never hits, so you'll get something like "The current model is English-only but the language parameter is set to 'yue'; using 'en' instead." if you try to use another lang. i just kind of lazily commented out all the multilingual checks in transcribe.py to test, and it didn't break anything

i added a manual is_multilingual flag to WhisperModel() init as a temporary hack in the meantime

bungerr@8bafc9f

@bungerr

bungerr commented Nov 8, 2023

okay, i noticed quality with the English-only model/tokenizer was significantly worse (missing punctuation, not segmenting sentences, etc.) than when forcing the multilingual tokenizer, so i'm using multilingual for everything. seems like i'm getting the same quality as the openai whisper package now

i think the only thing missing now is getting is_multilingual properly set in the ct2 model? it breaks language detection (currently defaults to 'en'), and if you try to force it, it throws a RuntimeError:

  File "C:\git\faster-whisper-3\faster_whisper\transcribe.py", line 335, in transcribe
    results = self.model.detect_language(encoder_output)[0]
RuntimeError: detect_language can only be called on multilingual models

@stillmatic @funboarder13920

@flyingleafe

@stillmatic Opened basically the same PR yesterday lol
I think it is best to determine is_multilingual and num_languages, as well as the mel parameters, systematically from the configs provided by HF, if those are available (the first two are derived from vocab_size in config.json; the latter are given in preprocessor_config.json). The latter part is done in my PR; feel free to merge that in here and I will close mine, since yours was first.

@funboarder13920

The HF model was released, but the tokenizer is not in the required format, and I am not able to convert it to the fast tokenizer format (tokenizer.json).
The pytorch_model.bin is not available either, but I was able to convert the model.safetensors to the bin format.

I will wait for the official files. If people are in a rush to use large-v3, they can use any of the proposed PRs.

@RafaRed

RafaRed commented Nov 8, 2023

As I mentioned on CTranslate2/pull/1530:

The current check for multilingual support seems to be hardcoded with a specific vocabulary size:
_is_multilingual = vocabulary.size() == 51865;
Link to code

For instance, I believe the whisper-large-v3 model has a vocabulary size of 51866, which is one more than the hardcoded value. This discrepancy could lead to the multilingual feature being incorrectly disabled for this model.

Probably a more dynamic check needs to be implemented to ensure compatibility with future models.

I could make the model work by applying the changes suggested by @flyingleafe and removing the last line of vocabulary.json ("<|30.02|>"), but of course that affects the model and it's not a solution.

edit: oh sorry, did not notice it's already fixed in the @funboarder13920 PR.

@stillmatic
Author

stillmatic commented Nov 8, 2023

cool - let's wait for @funboarder13920's PR to be merged; the multilingual fix might be the only thing necessary there?

the other question is how to handle num_languages and the related tokenizer things. I would like to get that from CTranslate2, as that would be consistent with other operations that rely on vocab size (i.e. the is_multilingual check).

question on the preprocessor_config.json -- can we get this from the config.json that already exists on the hub? we're already downloading it here: https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/utils.py#L78C10-L78C16 and it has num_mel_bins. let me try to update to use this file, so that we do not need to download an additional one. -- makes sense to just use preprocessor_config; I've updated here, ty for the pointer

fyi - pytorch weights are also available on hub: https://huggingface.co/openai/whisper-large-v3/blob/main/pytorch_model.bin

@funboarder13920

I will make num_languages available in the ctranslate2 model. I don't see when we will need num_languages, though.

@stillmatic
Author

I think we should be able to succeed now just on this ... testing on CTranslate2 head commit now

@@ -21,6 +21,7 @@
"large-v1": "guillaumekln/faster-whisper-large-v1",
"large-v2": "guillaumekln/faster-whisper-large-v2",
"large": "guillaumekln/faster-whisper-large-v2",
"large-v3": "gradjitta/ct2-whisper-large-v3",


This repo is not official; it was not generated from the official HF openai/whisper-large-v3 repo, and the tokenizer.json is likely wrong: there is no yue in the languages.

Author

which would you suggest using? @bungerr has one at bababababooey/faster-whisper-large-v3 but didn't copy the preprocessor_config.json over, which is needed to set the feature extractor up correctly.

@bungerr Nov 8, 2023

oop i just added it in, had to copy the alignment heads over from there into config.json & forgot

also accidentally wiped my local openai/whisper-large-v3 repo, pulling again & will reconvert when the new wheels finish building https://github.com/OpenNMT/CTranslate2/actions/runs/6801730648


note: tokenizer.json is not provided in the official repo, so if you want to run the ct2 conversion script, you'll have to either use hf's openai_to_hf converter script (which may have been updated since i last checked?), or just do what i did and load+save via AutoTokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/whisper-large-v3")
tokenizer.save_pretrained("./tokenizerjson")

@funboarder13920

I am still waiting for the large-v3 HF hub repo to be aligned with all the other models, specifically the tokenizer.json config, before trying anything.

@flyingleafe

@funboarder13920 There is a very simple way to obtain the tokenizer.json file from the HF model.

from transformers import WhisperTokenizer, WhisperTokenizerFast

slow = WhisperTokenizer.from_pretrained("openai/whisper-large-v3")
fast = WhisperTokenizerFast(__slow_tokenizer=slow)
fast.save_pretrained("/path/to/dir")

This yields all the necessary files for the tokenizer, including tokenizer.json.

@flyingleafe

@funboarder13920 alternatively, you can use the conversion script from my PR to transformers, heh (huggingface/transformers#27336)

I pinged Sanchit there though regarding tokenizer.json in the official checkpoint to save everybody some hassle, so I think the issue should be resolved soon.

@Purfview
Contributor

@flyingleafe what about the config.json issue mentioned there -> #548 (comment)?

@flyingleafe

@Purfview I figured the issue out. Basically, the difference between the large-v2 and large-v3 tokenizer checkpoints is that the <|endoftext|> token is included in the additional special tokens list for previous checkpoint versions, but excluded from it for large-v3.

This token is the BOS token, so by the logic of the tokenizers library it should not be in this list in the first place; the fact that it was there before is apparently some kind of bug. But the CTranslate2 converter relied on this token being present in the list.
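
A quick way to see the difference (a sketch; both checkpoints are pulled from the HF hub):

from transformers import WhisperTokenizer

v2 = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")
v3 = WhisperTokenizer.from_pretrained("openai/whisper-large-v3")

# <|endoftext|> only shows up in the additional special tokens of the older checkpoints
print("<|endoftext|>" in v2.additional_special_tokens)  # True
print("<|endoftext|>" in v3.additional_special_tokens)  # False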

I opened the PR in CTranslate2 fixing this: OpenNMT/CTranslate2#1546

The other issue present in the converter is that it did not take the suppressed token values from the HF model's generation_config. Those are important for proper transcription (all the faster-whisper checkpoints by @guillaumekln so far included them, and they were equivalent to the HF ones). I fixed that too.
https://huggingface.co/flyingleafe/faster-whisper-large-v3 <- this checkpoint was made using those fixes; together with this PR it seems to be working well. @ThomasKluiters please feel free to test your punctuation issues on this checkpoint, I suspect they could be due to a nonexistent suppress_tokens set in your previous attempts.
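
For reference, the suppressed ids the converter should pick up live in the HF generation config (a sketch; attribute names are as published in openai/whisper-large-v3's generation_config.json):

from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("openai/whisper-large-v3")

# these token ids should end up in the converted model's suppress lists
print(len(gen_config.suppress_tokens), gen_config.begin_suppress_tokens)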

@Purfview
Contributor

@flyingleafe I'm not an expert, but your fixes make sense. Are you sure that there are no other quirks with tokens?
Going to test it, I'll report later.

@ThomasKluiters

@flyingleafe Thanks! I'm gonna have a look later today!

@Purfview
Contributor

@flyingleafe I didn't notice any difference so far.
Btw, there is a new commit: https://huggingface.co/openai/whisper-large-v3/commit/99ef777c8f3c81bade67c14cfa359a3d6303b788

@flyingleafe

@Purfview can you maybe send me some of your test samples? Maybe my samples are too easy; so far the results I see are virtually identical to large-v2.

Half of the new commit to the HF checkpoint makes sense (the bos/eos/pad token ids are indeed wrong), but the reduction of the vocabulary size does not. I will test the HF version today with and without this change to figure out why they did that.

@Purfview
Contributor

send me some of your test samples

Check this.

@funboarder13920

funboarder13920 commented Nov 16, 2023

I ran some benchmarks to compare large-v2 with large-v3. large-v2 is currently way better than large-v3.
Some possible explanations:

  • large-v3 is not accurate
  • there is an issue with the conversion (openai to hf or hf to ct2)
  • there is an issue with the code from HF we are using
  • there is an issue with this code
  • the default inference params are different from large-v2?

Next week I will benchmark openai/large-v3 against hf/large-v3 against ct2/large-v3 and openai/large-v2 against openai/large-v3 to narrow down the possibilities.

@flyingleafe

@funboarder13920 can confirm: on the example provided by @Purfview, large-v3 hallucinates a lot of BS at the end and in between phrases, while large-v2 does not; no problems with punctuation though.

@iorilu

iorilu commented Nov 18, 2023

after this long discussion, any update on the final result?

I want to be sure it's ok to update to v3, at least with no negative impact compared to v2.

@iorilu

iorilu commented Nov 18, 2023

I ran some benchmarks to compare large-v2 with large-v3. large-v2 is currently way better than large-v3. [...] Next week I will benchmark openai/large-v3 against hf/large-v3 against ct2/large-v3 and openai/large-v2 against openai/large-v3 to narrow down the possibilities.

what language do you use to test? do you test on multiple languages or just English?

@blackpolarz

From my own testing, the same problem occurs with the whisper v3 translation (JP to EN) task. For the most part, whisper v3 has better grammar, but it tends to hallucinate and fill in words by itself, which can change the context at times. This occurs even with the Hugging Face implementation, so I am not sure if the problem lies with the conversion from openai's model to hf or with the model itself. Whisper v2 is slightly better in terms of being "word for word", so the context doesn't change as much, but hallucination is still a problem.
I also noticed that "avg_logprob" in whisper v3 tends to be lower than when testing the same audio on whisper v2.

@nguyendc-systran self-assigned this Nov 21, 2023
@doublex

doublex commented Nov 22, 2023

@flyingleafe
Did you use this model in your test?
https://huggingface.co/flyingleafe/faster-whisper-large-v3

@@ -113,6 +114,9 @@ def __init__(
are saved in the standard Hugging Face cache directory.
local_files_only: If True, avoid downloading the file and return the path to the
local cached file if it exists.
feature_size: Number of mel filters to use for feature extraction. If not set,


Not used anymore

@@ -142,7 +146,25 @@ def __init__(
"openai/whisper-tiny" + ("" if self.model.is_multilingual else ".en")
)

self.feature_extractor = FeatureExtractor()
feature_extractor_file = os.path.join(model_path, "preprocessor_config.json")


Maybe move that into a specific method?
And use the n_mels from ct2 as a fallback?
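
For illustration, one possible shape (a hypothetical sketch; it assumes the ct2 model exposes model.n_mels, as added in OpenNMT/CTranslate2#1530):

import json
import os

def load_feature_extractor_config(model_path, ct2_model):
    # prefer preprocessor_config.json if it was copied into the converted repo
    config_path = os.path.join(model_path, "preprocessor_config.json")
    if os.path.isfile(config_path):
        with open(config_path) as f:
            return json.load(f)
    # otherwise fall back to the number of mel bins reported by the ct2 model
    return {"feature_size": getattr(ct2_model, "n_mels", 80)}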

Author

I'll try to get to this tomorrow. extremely limited bandwidth with American holidays currently

@@ -1,5 +1,5 @@
av==10.*
ctranslate2>=3.17,<4
ctranslate2>=3.21,<4
huggingface_hub>=0.13
tokenizers>=0.13,<0.15


Add support for the tokenizers 0.15 version; otherwise it is inconvenient to install faster-whisper alongside transformers/tokenizers.

Author

do you know if any compatibility fixes are necessary? or will just tokenizers>=0.13 suffice?


I guess Guillaume did that because ctranslate2 relies heavily on tokenizers, to prevent a new tokenizers version from introducing breaking changes into faster-whisper. We can go for tokenizers>=0.13,<0.16.

Collaborator

We ran some tests on our side; tokenizers 0.15 works normally, so tokenizers>=0.13,<0.16 looks good to me.

@@ -1,3 +1,3 @@
"""Version information."""

__version__ = "0.9.0"


this commit is for the release

@@ -21,6 +21,7 @@
"large-v1": "guillaumekln/faster-whisper-large-v1",
"large-v2": "guillaumekln/faster-whisper-large-v2",
"large": "guillaumekln/faster-whisper-large-v2",
"large-v3": "bababababooey/faster-whisper-large-v3",


Who owns that?
Can Systran create its own hub to keep ownership of the models and easily update old models and upload new models?

Contributor

Who owns that?

Some user posted that link on issues.

Best would be to keep the model under an official account like "systran", maybe move all models there.

Collaborator

@nguyendc-systran Nov 22, 2023

Good point. We are waiting for the release on CTranslate2, then we will push the new converted model tomorrow (with the fix from OpenNMT/CTranslate2#1546) to the Systran organization.

Collaborator

FYI, several models are now available in the Systran organization: https://huggingface.co/Systran (including large-v3, converted with the latest CTranslate2 3.22.0).

Contributor

@nguyendc-systran thank you! So essentially just waiting for @stillmatic to do the last fixes you mentioned before merge?


@Purfview I think you accidentally pasted the same URL.

Contributor

@AvivSham Fixed it.

@blackpolarz Nov 23, 2023

Not sure if I missed anything, but in tokenizer.json there is a difference in token 50363: nospeech vs nocaptions.
The same difference is in vocabulary.json.
The ctranslate2 large-v2 conversion uses nocaptions, which is the same as what flyingleafe is using.
However, hf-large-v3 uses nospeech, which is the same as what Systran is using.

Collaborator

@nguyendc-systran Nov 23, 2023

@nguyendc-systran thank you! So essentially just waiting for @stillmatic to do the last fixes you mentioned before merge?

IMHO, yes.
Another point that may be interesting/relevant is this benchmark: #548 (comment).
Not sure if @funboarder13920 has had a chance to look at that?


Yep, the inference results between openai/hf/faster_whisper are not exactly the same, but very similar. I guess I was witnessing the differences between v2 and v3.

@@ -1,5 +1,5 @@
av==10.*
ctranslate2>=3.17,<4
ctranslate2>=3.21,<4
Collaborator

@nguyendc-systran Nov 22, 2023

New release 3.22.0 of CTranslate2 is in progress: OpenNMT/CTranslate2#1561
Please update this accordingly to pick up the fix when converting HF to CT2.

Author

will do, please ping when it's on pypi

Contributor

@hoonlight Nov 23, 2023

Comment on lines +155 to +161
for k in [
"n_fft",
"hop_length",
"feature_size",
"sampling_rate",
"chunk_length",
]
Collaborator

Minor remark: in this new method, could you make these parameters less hard-coded, since they come from this class: https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/feature_extractor.py#L8-L12
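
For example, something along these lines (a sketch; preprocessor_config stands for the dict loaded from preprocessor_config.json, with large-v3's values shown):

import inspect

from faster_whisper.feature_extractor import FeatureExtractor

# as loaded from preprocessor_config.json (large-v3 values)
preprocessor_config = {
    "feature_size": 128,
    "hop_length": 160,
    "chunk_length": 30,
    "n_fft": 400,
    "sampling_rate": 16000,
    "padding_value": 0.0,  # not a FeatureExtractor parameter, gets filtered out below
}

# derive the accepted parameter names from the class itself instead of hard-coding them
valid_params = set(inspect.signature(FeatureExtractor.__init__).parameters) - {"self"}
kwargs = {k: v for k, v in preprocessor_config.items() if k in valid_params}
feature_extractor = FeatureExtractor(**kwargs)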

@Oscaarjs
Contributor

@nguyendc-systran in https://huggingface.co/Systran/faster-whisper-large-v3/tree/main the vocabulary is a .json, but for the other models, e.g. https://huggingface.co/Systran/faster-whisper-large-v2/tree/main, it's a .txt file. Is this intended?

@minhthuc2502
Collaborator

minhthuc2502 commented Nov 23, 2023

@nguyendc-systran in https://huggingface.co/Systran/faster-whisper-large-v3/tree/main the vocabulary is a .json, but for the other models, e.g. https://huggingface.co/Systran/faster-whisper-large-v2/tree/main, it's a .txt file. Is this intended?

Hello, CT2 supports the vocabulary in both formats, json and txt. Recently, after this PR, all conversions generate vocabulary.json instead of txt. That's why in https://huggingface.co/Systran/faster-whisper-large-v3/tree/main we have vocabulary.json.

@Oscaarjs mentioned this pull request Nov 24, 2023
@Oscaarjs
Contributor

Added some changes and reflections from this thread in a new PR #578

@nguyendc-systran
Collaborator

Hi @stillmatic, @Oscaarjs et al.
Thank you very much for the contribution and all the discussion in this thread.
I am taking the liberty of closing this discussion, as the remaining work has been done and merged in #578.
Thanks both of you ;)

@Purfview
Contributor

Purfview commented Nov 24, 2023

Still, the nospeech vs nocaptions difference in token 50363 wasn't answered.

@flyingleafe
