Language selection #2

ArtyomZemlyak · 2022-09-26T09:35:18Z

I'm glad you shared this implementation.
It gives a steep increase in CPU performance relative to torch.

You may already know this, but I found out how to enable recognition of a specific language.
We can just put this at line 2012 of main.cpp:

std::vector<whisper_vocab::id> prompt = { vocab.token_sot, vocab.token_lang, vocab.token_task };

These 3 tokens are formed here:
https://github.com/openai/whisper/blob/8cf36f3508c9acd341a45eb2364239a3d81458b9/whisper/tokenizer.py#L324-L331

For specific use in main.cpp, you can simply specify the desired index manually. But for regular users, it would be cool to specify which language they would prefer to see in the output.
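As a rough illustration of specifying the desired index manually, the sketch below builds the three-token prompt from a language code. It assumes the layout used by the linked tokenizer code, where the language tokens directly follow the start-of-transcript token in a fixed order. The struct `vocab_ids`, the helper `make_prompt`, the numeric ids, and the short language table are all illustrative assumptions, not code taken from main.cpp.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the token ids held by whisper_vocab.
// The values are assumed to be those of the original multilingual vocabulary.
struct vocab_ids {
    int token_sot        = 50258; // start of transcript
    int token_translate  = 50358; // task: translate to English
    int token_transcribe = 50359; // task: transcribe in the source language
};

// Position of a few languages in the tokenizer's language list (assumed order).
static const std::map<std::string, int> k_lang_index = {
    {"en", 0}, {"zh", 1}, {"de", 2}, {"es", 3}, {"ru", 4}, {"ko", 5}, {"fr", 6},
};

// Build { sot, lang, task }. The language token is assumed to sit at sot + 1 + index.
std::vector<int> make_prompt(const vocab_ids & v, const std::string & lang, bool translate) {
    const auto it = k_lang_index.find(lang);
    assert(it != k_lang_index.end() && "unknown language code");
    const int token_lang = v.token_sot + 1 + it->second;
    const int token_task = translate ? v.token_translate : v.token_transcribe;
    return { v.token_sot, token_lang, token_task };
}
```

Under these assumed ids, `make_prompt(v, "fr", true)` gives `{50258, 50265, 50358}`. A user-facing option would only need to map a language code to its index in this way.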

ggerganov · 2022-09-26T10:07:48Z

Thanks for looking into this. I will definitely add support for language selection.
I wasn't 100% sure if it's just the starting tokens that I need to modify to make it work with other languages, but it looks like this is the case.

I will probably add a CLI argument to be able to select the language.

Regarding the performance:
Yes, I suppose the speed for the smaller models should be better compared to torch. For the bigger models however, my matrix multiplication cannot match the performance of the original implementation. I measured about 2-3 times slower performance for 1024 x 4096 matrix sizes on my M1 MacBook.

Also, my sampling strategy is very basic - this is another thing that makes whisper.cpp go faster, but of course, the results won't be as good as a proper beam search implementation.
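To illustrate the trade-off, here is a minimal sketch of a greedy decoding step, assuming that this is roughly what a "very basic" sampling strategy amounts to (the thread does not spell it out). The function name `sample_greedy` is hypothetical. Each step keeps only the single highest-scoring token, which is cheap, whereas beam search keeps several candidate sequences alive and scores them jointly.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Return the index of the largest logit, i.e. the most probable next token.
int sample_greedy(const std::vector<float> & logits) {
    assert(!logits.empty());
    const auto best = std::max_element(logits.begin(), logits.end());
    return static_cast<int>(std::distance(logits.begin(), best));
}
```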

ArtyomZemlyak · 2022-09-26T10:14:10Z

I can share the latest tests.

Accuracy:

Little difference on "ideal" recordings.
But rather strong degradation on bad recordings with a lot of noise.

Device

CPU: i7 11800H
GPU: RTX 3080 Laptop
Python: 3.8.12
torch: 1.8.2

Torch

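| Model | Time, s | CPU/GPU | RAM, GB | VRAM, GB | DISK, GB |
|---|---|---|---|---|---|
| tiny | 488 | CPU | 0.5 | | 0.074 |
| base | 564 | CPU | 1 | | 0.142 |
| small | *3 | CPU | 2.5 | | 0.472 |
| medium | *20 | CPU | 6 | | 1.492 |
| large | *30 | CPU | 10 | | 3.014 |
| tiny | 24 | GPU | 3.4 | 2.7 | 0.074 |
| base | 29 | GPU | 3.4 | 2.7 | 0.142 |
| small | 41 | GPU | 3.6 | 3.5 | 0.472 |
| medium | 89 | GPU | 4.3 | 6.1 | 1.492 |
| large | - | GPU | - | - | 3.014 |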

C++ ggml

| Model | Time, s | CPU/GPU | RAM, GB | VRAM, GB | DISK, GB |
|---|---|---|---|---|---|
| tiny | 18 | CPU | 0.4 | | 0.076 |
| large | 520 | CPU | 2.5 | | 3.022 |

ggerganov · 2022-09-26T10:25:34Z

Much appreciated!

A few things:

- Your CPU has 8 cores, so make sure to use -t 8 to run whisper.cpp with 8 threads. By default it uses 4.
- What does the *30 mean for the large / CPU run with Torch?
- How long is the audio recording that you used?

ArtyomZemlyak · 2022-09-26T10:30:54Z

- Yes, -t 8 was used for this test.
- Time = 488*3, 488*20, 488*30. I did not want to wait for these tests to finish, since it was already clear that processing was very slow. I just noted roughly how many times slower each run is than the tiny model.
- 7 files, 200 seconds in total (for ggml, I subtracted the model loading time from this result).

ggerganov · 2022-09-28T18:21:02Z

I just added an option to be able to select a language.
For example, the following command will translate French audio to English using the small model:

./download-ggml-model.sh small
./main -m models/ggml-small.bin -f fr0.wav --language fr --translate

Additionally, I was able to reduce the memory usage at runtime using flash attention and 16-bit float key/value memory. Inference speed has also improved as a result.
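For a sense of scale, the sketch below estimates the size of a decoder self-attention key/value cache with 32-bit and 16-bit elements. The dimensions are assumed to match the large model, and the struct `decoder_dims` and helper `kv_bytes` are hypothetical. This is a back-of-the-envelope illustration of why halving the element size halves that buffer, not whisper.cpp's actual memory accounting.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical decoder dimensions, assumed to match the "large" model.
struct decoder_dims {
    std::size_t n_layer = 32;   // decoder layers
    std::size_t n_ctx   = 448;  // maximum number of text tokens
    std::size_t n_state = 1280; // hidden width
};

// Bytes for keys and values: 2 tensors * layers * positions * width * element size.
std::size_t kv_bytes(const decoder_dims & d, std::size_t elem_size) {
    assert(elem_size > 0);
    return 2 * d.n_layer * d.n_ctx * d.n_state * elem_size;
}
```

With these assumed dimensions, `kv_bytes(d, 4)` is 146,800,640 bytes (140 MiB) and `kv_bytes(d, 2)` is 73,400,320 bytes (70 MiB).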

ArtyomZemlyak · 2022-09-28T18:57:34Z

New test on the same device.

| Model | Time, s | CPU/GPU |
|---|---|---|
| tiny | 16 | CPU |
| base | 32 | CPU |
| large | 472 | CPU |

About 10 % better for the large model.
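The comparison can be checked with a short calculation. The helper names `improvement_pct` and `real_time_factor` are hypothetical, and the real-time factor assumes this run used the same 200 seconds of audio as the earlier test.

```cpp
#include <cassert>

// Relative improvement of a new timing over an old one, in percent.
double improvement_pct(double old_s, double new_s) {
    assert(old_s > 0.0);
    return 100.0 * (old_s - new_s) / old_s;
}

// Processing time divided by audio duration; below 1 means faster than real time.
double real_time_factor(double processing_s, double audio_s) {
    assert(audio_s > 0.0);
    return processing_s / audio_s;
}
```

For the large model, going from 520 s to 472 s is an improvement of about 9.2 %, and for tiny, 18 s to 16 s is about 11.1 %. Over 200 s of audio, the real-time factor is 2.36 for large and 0.08 for tiny.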

frankiedrake · 2023-01-05T13:31:44Z

@ggerganov Language question again :) Would it be possible to specify an array of languages, one for each input file, so that different files can use different languages?

ggerganov · 2023-01-05T19:46:39Z

It's relatively easy to add this functionality.
Feel free to open an issue with a feature request.
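One way such a feature could behave is sketched below. This is purely hypothetical and not part of whisper.cpp: the function `pair_files_with_languages` and its rules (a single language applies to every file, otherwise exactly one language per file) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Pair each input file with a language code.
// A single language is applied to all files; otherwise the counts must match.
std::vector<std::pair<std::string, std::string>> pair_files_with_languages(
        const std::vector<std::string> & files,
        const std::vector<std::string> & langs) {
    if (langs.size() != 1 && langs.size() != files.size()) {
        throw std::invalid_argument("expected one language, or one language per file");
    }
    std::vector<std::pair<std::string, std::string>> out;
    out.reserve(files.size());
    for (std::size_t i = 0; i < files.size(); ++i) {
        out.emplace_back(files[i], langs.size() == 1 ? langs[0] : langs[i]);
    }
    return out;
}
```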

* Fix MSVC compile error C3688

  Instead of simply using 'add_compile_options(/utf-8)' to address the MSVC compile error C3688, a better approach would be to handle it in a way that prevents passing '/utf-8' to NVCC.

* Significantly improve inference quality

  In the function `log_mel_spectrogram_worker_thread`, there's an array out-of-bounds issue occurring during the calculation of complex number moduli. This issue is causing disruptions in the FFT spectrum, which, in turn, is reducing the quality of inference.

* Significantly improve inference quality

  At last, I've pinpointed the actual source of the problem. Given that the frequency spectrum generated from real input data is symmetrical around the Nyquist frequency, there's a for-loop within the `log_mel_spectrogram_worker_thread` function that attempts to fold the frequency spectrum. Regrettably, a bug within this for-loop is causing a frame shift in the frequency spectrum. The previous attempt to remedy this, which involved using `fft_size + 1` when calculating the modulus, was merely a band-aid solution and did not address the underlying issue.

* Addressed a few minor issues

  Fixed the issue of `fft_out` continuously expanding. Resolved the fallback caused by using 'break' instead of `fft_in[j] = 0`.

* Significantly improve inference quality

  Thanks for your patience everyone. It's finally sorted out. Now, the right side of the FFT spectrum is being flipped over to the left, and the amplitudes at corresponding positions on the left and right are added together (the spectrum on the left needs to be shifted by one position), then the average is calculated. FFT_OUT[0] is no longer discarded, making full use of the limited space to pack in more information.

* Add annotation and performance improvement
* Calculate FFT only when fft_in are not all zero
* Some minor performance improvement
* Fixed a bug impacting inference quality
* The first version after all the analysis is completed.
* Fix some bugs and add debug mode
* Fixed several bugs
* Temporarily disable speed-up mode and add debug mode.
* Add debug mode
* Disable speed-up mode and add debug mode
* Fix CI error (#1)
* Fix error
* Fix error
* Fixed several bugs including [BLANK_AUDIO] problem
* Remove Hard-coded hann window
* Some Final Fix (#2)
* Fix error
* Fix error
* Probably the last commit
* Probably the last commit
* whisper : minor coding style changes
* whisper : remove debug from public API

---------

Co-authored-by: Georgi Gerganov <[email protected]>


alpezajosip · 2024-04-25T12:03:53Z

Is there a way to set it up so that it doesn't translate to English? I am trying to get it to work from Croatian to Croatian, but right now it translates to English.

Thank you :)

jilljenn · 2024-07-26T06:27:11Z

> Is there a way to set it up so it doesn't translate to English, I am trying to get to work from Croatian to Croatian, right now it translates it to English?
>
> Thank you :)

I guess you should just remove the --translate option.

ggerganov added the enhancement New feature or request label Sep 26, 2022

ggerganov added a commit that referenced this issue Sep 28, 2022

Flash + language support (ref #2)

47538f1

- Achieved big performance improvement + memory usage reduction - Can now translate / transcribe different languages

ggerganov added a commit that referenced this issue Sep 28, 2022

Flash + language support (ref #2)

f888c23

- Achieved big performance improvement + memory usage reduction - Can now translate / transcribe different languages

ArtyomZemlyak closed this as completed Oct 1, 2022

ANMahmood mentioned this issue Mar 27, 2023

GGML_ASSERT: \whisper.cpp\ggml.c:3904: !ggml_is_transposed(a) #661

Closed

cjia4 mentioned this issue Mar 28, 2023

ggml_new_tensor_impl: not enough space in the scratch memory #671

Closed

anandijain pushed a commit to anandijain/whisper.cpp that referenced this issue Apr 28, 2023

Flash + language support (ref ggerganov#2)

b6344df

- Achieved big performance improvement + memory usage reduction - Can now translate / transcribe different languages

huapingchen mentioned this issue May 13, 2023

segmentation fault when use Core ML #919

Closed

jacob-salassi mentioned this issue May 15, 2023

Core ML support #566

Merged

10 tasks

This was referenced May 31, 2023

Run talk example failed #782

Closed

Segmentation fault while running talk on mac m1 max #974

Open

warkcod mentioned this issue Jun 8, 2023

OpenCL clCreateCommandQueue error -30 on MacOS 13.4 intel #996

Open

Mingkun-Lu mentioned this issue Jul 4, 2023

How does Android interrupt quickly while running fullTranscribe and free up memory #1079

Open

ToSeven mentioned this issue Oct 14, 2023

whisper_init_state: ggml_metal_init() failed #1367

Closed

jettoblack pushed a commit to jettoblack/whisper.cpp that referenced this issue Feb 8, 2024

Merge pull request ggerganov#2 from bobqianic/patch

7047d32

Patch

bradmit mentioned this issue May 23, 2024

Crash with multiple whisper states running at the same time CUDA #2177

Closed
