is it possible to run openai-whisper ggml model on raspberry pi hardware? #7

Closed
nyadla-sys opened this issue Sep 30, 2022 · 156 comments
Labels
build Build related issues enhancement New feature or request

Comments

@nyadla-sys

nyadla-sys commented Sep 30, 2022

Is it possible to run this ggml model on Raspberry Pi hardware?

@nyadla-sys
Author

@ggerganov could you please help with this?

@ggerganov
Owner

It will probably work - why don't you give it a try?

@ggerganov
Owner

Good news!

I just tried it on a Raspberry Pi 4 Model B from 2018 and it works!

The tiny.en model takes 140 seconds to transcribe 30 seconds of audio, but I think this can be improved, because I disabled all SIMD instructions to make it compile. I will improve this in the following days.

If you want to try it, use the raspberry branch:

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
git checkout raspberry
make tiny.en

@ggerganov ggerganov added the enhancement New feature or request label Sep 30, 2022
@nyadla-sys
Author

I don't currently have a Raspberry Pi board, but I will run it as soon as I get one.

@nyadla-sys nyadla-sys changed the title is it possible to run this gghml model on rasberry pi hardware? is it possible to run this ggml model on rasberry pi hardware? Sep 30, 2022
@ggerganov
Owner

You can try running it on whatever Raspberry you have - use the same instructions.

@nyadla-sys
Author

@ggerganov Thanks, I appreciate your quick response.

@nyadla-sys nyadla-sys changed the title is it possible to run this ggml model on rasberry pi hardware? is it possible to run openai-whisper ggml model on raspberry pi hardware? Sep 30, 2022
@ggerganov
Owner

Some more experiments - enabling NEON instructions reduces the time all the way down to just ~15 seconds to process a 30 second audio.
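To put the two figures side by side, here is a minimal sketch that turns them into a real-time factor (processing time divided by audio duration, so anything below 1.0 is faster than real time). The function name is just an illustration, not something from whisper.cpp, and the 15 second figure is the approximate value quoted above.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (below 1.0 means faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures quoted in this thread for tiny.en on a Raspberry Pi 4 with 30 s of audio.
rtf_no_simd = real_time_factor(140.0, 30.0)  # SIMD disabled
rtf_neon = real_time_factor(15.0, 30.0)      # NEON enabled (approximate, "~15 seconds")
speedup = rtf_no_simd / rtf_neon

print(f"no SIMD: {rtf_no_simd:.2f}x, NEON: {rtf_neon:.2f}x, speedup: {speedup:.1f}x")
```

So the NEON build goes from well over real time to about half of real time, roughly a 9x improvement on these numbers.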

@nyadla-sys
Author

this is awesome

@nyadla-sys
Author

nyadla-sys commented Sep 30, 2022

@ggerganov is it possible to do the audio streaming on Raspberry Pi and live convert it to captions?

@WilliamTambellini
Contributor

@ggerganov
perhaps have a look at
https://github.com/oneapi-src/oneDNN
It would resolve all the compilation issues by letting oneDNN do the optimizations for the local CPU at runtime, whatever the CPU model or brand.

@nyadla-sys
Author

@ggerganov On a linux computer, I tried the following commands, and streaming performed as expected.
$ git clone https://github.com/ggerganov/whisper.cpp.git
$ bash ./download-ggml-model.sh tiny.en
$ sudo apt-get install libsdl2-dev
$ make
$ make stream -lSDL2
$ ./stream -m models/ggml-tiny.en.bin

@ggerganov ggerganov added the build Build related issues label Oct 5, 2022
@nyadla-sys
Author

nyadla-sys commented Oct 6, 2022

@ggerganov I used the following command to run stream on a Raspberry Pi 4, but decoding is slow (performance is poor).
Perhaps further improvements are needed. (Currently, inference on each 30 seconds of audio takes around 15 seconds.)
./stream -m models/ggml-tiny.en.bin

@ggerganov
Owner

@nyadla-sys
The performance can be improved if the CPU supports the Armv8.2-A architecture - it provides 16-bit floating point vector arithmetic. The whisper.cpp implementation already supports this, so you just need the correct hardware.

Based on this table, you need a device with a Cortex-A75 CPU:

https://en.wikipedia.org/wiki/Comparison_of_Armv8-A_processors

From a quick google search, none of the existing Raspberry products comes with this processor.

There are rumours that Raspberry Pi 5 will use ARM Cortex-A75 or ARM Cortex-A76 so if that is the case, you should definitely give it a try. I expect the performance to be much better.
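If you want to check a particular board instead of relying on the table, here is a rough sketch that reads the `Features` line of `/proc/cpuinfo`. As far as I know, 64-bit Linux kernels list `asimdhp` there when half-precision vector arithmetic is available; the function names are made up for this example, and on a 32-bit OS the flag names differ, so treat it as a starting point only.

```python
def cpu_features(cpuinfo_text: str) -> set[str]:
    """Collect the flags from every 'Features' line of /proc/cpuinfo-style text."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            features.update(value.split())
    return features


def has_fp16_vector_arithmetic(features: set[str]) -> bool:
    # "asimdhp" is the flag 64-bit Linux kernels report for half-precision
    # Advanced SIMD arithmetic (the Armv8.2-A FP16 extension).
    return "asimdhp" in features


# Excerpt of the Raspberry Pi Zero W output posted later in this thread.
PI_ZERO = """\
model name      : ARMv6-compatible processor rev 7 (v6l)
Features        : half thumb fastmult vfp edsp java tls
"""

feats = cpu_features(PI_ZERO)
print(sorted(feats), has_fp16_vector_arithmetic(feats))
```

On a real device you would pass in the file contents, for example `cpu_features(open("/proc/cpuinfo").read())`.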

@ggerganov ggerganov pinned this issue Oct 9, 2022
@nyadla-sys
Author

@ggerganov
Is it possible to convert the ggml model from fp16 to int8 activations and int8 weights?

@ggerganov
Owner

8-bit is not supported yet - maybe in the future

@trholding
Contributor

Do you need me to test this on a raspi-zero? I bet it would be very very slow.

@ggerganov
Owner

It will be very slow - yes. But still interesting to see how long it would take to process jfk.wav.

@trholding
Contributor

trholding commented Oct 12, 2022

No cigar

I have the old Raspberry Pi Zero W, and it was not connected to the internet, so its clock was never updated.

Short Log:
make: warning: Clock skew detected. Your build may be incomplete.
whisper_model_load: ggml ctx size = 84.99 MB
Illegal instruction

Full Log:

pi@zero:~/X/whisper.pi $ time make tiny.en
make: Warning: File 'Makefile' has modification time 6366223 s in the future
cc  -O3 -std=c11   -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access   -c ggml.c
g++ -O3 -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread -c whisper.cpp
In file included from /usr/include/c++/10/bits/stl_algo.h:61,
                 from /usr/include/c++/10/algorithm:62,
                 from whisper.cpp:5:
/usr/include/c++/10/bits/stl_heap.h: In function ‘void std::__adjust_heap(_RandomAccessIterator, _Distance, _Distance, _Tp, _Compare) [with _RandomAccessIterator = __gnu_cxx::__normal_iterator<std::pair<double, int>*, std::vector<std::pair<double, int> > >; _Distance = int; _Tp = std::pair<double, int>; _Compare = __gnu_cxx::__ops::_Iter_comp_iter<whisper_sample_best(const whisper_vocab&, const float*, bool)::<lambda(const std::pair<double, int>&, const std::pair<double, int>&)> >]’:
/usr/include/c++/10/bits/stl_heap.h:223:5: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<std::pair<double, int>*, std::vector<std::pair<double, int> > >’ changed in GCC 7.1
  223 |     __adjust_heap(_RandomAccessIterator __first, _Distance __holeIndex,
      |     ^~~~~~~~~~~~~
/usr/include/c++/10/bits/stl_heap.h: In function ‘void std::__adjust_heap(_RandomAccessIterator, _Distance, _Distance, _Tp, _Compare) [with _RandomAccessIterator = __gnu_cxx::__normal_iterator<std::pair<double, int>*, std::vector<std::pair<double, int> > >; _Distance = int; _Tp = std::pair<double, int>; _Compare = __gnu_cxx::__ops::_Iter_comp_iter<whisper_sample_timestamp(const whisper_vocab&, const float*)::<lambda(const std::pair<double, int>&, const std::pair<double, int>&)> >]’:
/usr/include/c++/10/bits/stl_heap.h:223:5: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<std::pair<double, int>*, std::vector<std::pair<double, int> > >’ changed in GCC 7.1
In file included from /usr/include/c++/10/vector:72,
                 from whisper.cpp:15:
/usr/include/c++/10/bits/vector.tcc: In member function ‘void std::vector<_Tp, _Alloc>::_M_realloc_insert(std::vector<_Tp, _Alloc>::iterator, _Args&& ...) [with _Args = {std::pair<double, int>}; _Tp = std::pair<double, int>; _Alloc = std::allocator<std::pair<double, int> >]’:
/usr/include/c++/10/bits/vector.tcc:426:7: note: parameter passing for argument of type ‘std::vector<std::pair<double, int> >::iterator’ changed in GCC 7.1
  426 |       vector<_Tp, _Alloc>::
      |       ^~~~~~~~~~~~~~~~~~~
/usr/include/c++/10/bits/vector.tcc: In function ‘whisper_vocab::id whisper_sample_best(const whisper_vocab&, const float*, bool)’:
/usr/include/c++/10/bits/vector.tcc:121:21: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<std::pair<double, int>*, std::vector<std::pair<double, int> > >’ changed in GCC 7.1
  121 |    _M_realloc_insert(end(), std::forward<_Args>(__args)...);
      |    ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/10/bits/vector.tcc: In function ‘whisper_vocab::id whisper_sample_timestamp(const whisper_vocab&, const float*)’:
/usr/include/c++/10/bits/vector.tcc:121:21: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<std::pair<double, int>*, std::vector<std::pair<double, int> > >’ changed in GCC 7.1
  121 |    _M_realloc_insert(end(), std::forward<_Args>(__args)...);
      |    ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/10/bits/vector.tcc: In member function ‘void std::vector<_Tp, _Alloc>::_M_realloc_insert(std::vector<_Tp, _Alloc>::iterator, _Args&& ...) [with _Args = {whisper_result}; _Tp = whisper_result; _Alloc = std::allocator<whisper_result>]’:
/usr/include/c++/10/bits/vector.tcc:426:7: note: parameter passing for argument of type ‘std::vector<whisper_result>::iterator’ changed in GCC 7.1
  426 |       vector<_Tp, _Alloc>::
      |       ^~~~~~~~~~~~~~~~~~~
/usr/include/c++/10/bits/vector.tcc: In member function ‘void std::vector<_Tp, _Alloc>::_M_realloc_insert(std::vector<_Tp, _Alloc>::iterator, _Args&& ...) [with _Args = {whisper_segment}; _Tp = whisper_segment; _Alloc = std::allocator<whisper_segment>]’:
/usr/include/c++/10/bits/vector.tcc:426:7: note: parameter passing for argument of type ‘std::vector<whisper_segment>::iterator’ changed in GCC 7.1
/usr/include/c++/10/bits/vector.tcc: In function ‘int whisper_full(whisper_context*, whisper_full_params, const float*, int)’:
/usr/include/c++/10/bits/vector.tcc:121:21: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<whisper_result*, std::vector<whisper_result> >’ changed in GCC 7.1
  121 |    _M_realloc_insert(end(), std::forward<_Args>(__args)...);
      |    ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/10/bits/vector.tcc:121:21: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<whisper_segment*, std::vector<whisper_segment> >’ changed in GCC 7.1
  121 |    _M_realloc_insert(end(), std::forward<_Args>(__args)...);
      |    ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/10/bits/vector.tcc:121:21: note: parameter passing for argument of type ‘__gnu_cxx::__normal_iterator<whisper_segment*, std::vector<whisper_segment> >’ changed in GCC 7.1
  121 |    _M_realloc_insert(end(), std::forward<_Args>(__args)...);
      |    ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
g++ -O3 -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wno-unused-function -pthread main.cpp whisper.o ggml.o -o main
./main -h

usage: ./main [options] file0.wav file1.wav ...

options:
  -h,       --help           show this help message and exit
  -s SEED,  --seed SEED      RNG seed (default: -1)
  -t N,     --threads N      number of threads to use during computation (default: 1)
  -o N,     --offset N       offset in milliseconds (default: 0)
  -v,       --verbose        verbose output
            --translate      translate from source language to english
  -otxt,    --output-txt     output result in a text file
  -ovtt,    --output-vtt     output result in a vtt file
  -osrt,    --output-srt     output result in a srt file
  -ps,      --print_special  print special tokens
  -nt,      --no_timestamps  do not print timestamps
  -l LANG,  --language LANG  spoken language (default: en)
  -m FNAME, --model FNAME    model path (default: models/ggml-base.en.bin)
  -f FNAME, --file FNAME     input WAV file path

bash ./download-ggml-model.sh tiny.en
Downloading ggml model tiny.en ...
Model tiny.en already exists. Skipping download.

===============================================
Running tiny.en on all samples in ./samples ...
===============================================

----------------------------------------------
[+] Running base.en on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
----------------------------------------------

whisper_model_load: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 244.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size =  84.99 MB
Illegal instruction

make: warning:  Clock skew detected.  Your build may be incomplete.

real    5m28.556s
user    5m21.719s
sys     0m4.545s

pi@zero:~/X/whisper.pi $ cat /proc/cpuinfo 
processor       : 0
model name      : ARMv6-compatible processor rev 7 (v6l)
BogoMIPS        : 996.14
Features        : half thumb fastmult vfp edsp java tls 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xb76
CPU revision    : 7

Hardware        : BCM2835
Revision        : 9000c1
Serial          : XXXXXX (Serial Removed)
Model           : Raspberry Pi Zero W Rev 1.1

I think it's this flag: -mfpu=neon-fp-armv8, as we are on ARMv6...

Extremely unwell. Will continue experiments soon. I hope this will help you.

@ggerganov
Owner

I think it's this flag: -mfpu=neon-fp-armv8, as we are on ARMv6...

Yes - you are probably right.
What is the output of: gcc -c -Q -mcpu=native --help=target and cat /proc/cpuinfo

@trholding
Contributor

trholding commented Oct 13, 2022

GCC info:

pi@zero:~ $ gcc -c -Q -mcpu=native --help=target 
The following options are target specific:
  -mabi=                                aapcs-linux
  -mabort-on-noreturn                   [disabled]
  -mandroid                             [disabled]
  -mapcs                                [disabled]
  -mapcs-frame                          [disabled]
  -mapcs-reentrant                      [disabled]
  -mapcs-stack-check                    [disabled]
  -march=                               armv6kz+fp
  -marm                                 [enabled]
  -masm-syntax-unified                  [disabled]
  -mbe32                                [enabled]
  -mbe8                                 [disabled]
  -mbig-endian                          [disabled]
  -mbionic                              [disabled]
  -mbranch-cost=                        -1
  -mcallee-super-interworking           [disabled]
  -mcaller-super-interworking           [disabled]
  -mcmse                                [disabled]
  -mcpu=                                arm1176jzf-s
  -mfdpic                               [disabled]
  -mfix-cortex-m3-ldrd                  [disabled]
  -mflip-thumb                          [disabled]
  -mfloat-abi=                          hard
  -mfp16-format=                        none
  -mfpu=                                vfp
  -mgeneral-regs-only                   [disabled]
  -mglibc                               [enabled]
  -mhard-float                          -mfloat-abi=hard
  -mlittle-endian                       [enabled]
  -mlong-calls                          [disabled]
  -mmusl                                [disabled]
  -mneon-for-64bits                     [disabled]
  -mpic-data-is-text-relative           [enabled]
  -mpic-register=             
  -mpoke-function-name                  [disabled]
  -mprint-tune-info                     [disabled]
  -mpure-code                           [disabled]
  -mrestrict-it                         [disabled]
  -msched-prolog                        [enabled]
  -msingle-pic-base                     [disabled]
  -mslow-flash-data                     [disabled]
  -msoft-float                          -mfloat-abi=soft
  -mstructure-size-boundary=            8
  -mthumb                               [disabled]
  -mthumb-interwork                     [disabled]
  -mtls-dialect=                        gnu
  -mtp=                                 cp15
  -mtpcs-frame                          [disabled]
  -mtpcs-leaf-frame                     [disabled]
  -mtune=                     
  -muclibc                              [disabled]
  -munaligned-access                    [enabled]
  -mvectorize-with-neon-double          [disabled]
  -mvectorize-with-neon-quad            [enabled]
  -mword-relocations                    [disabled]

  Known ARM ABIs (for use with the -mabi= option):
    aapcs aapcs-linux apcs-gnu atpcs iwmmxt

  Known __fp16 formats (for use with the -mfp16-format= option):
    alternative ieee none

  Known ARM FPUs (for use with the -mfpu= option):
    auto crypto-neon-fp-armv8 fp-armv8 fpv4-sp-d16 fpv5-d16 fpv5-sp-d16 neon neon-fp-armv8 neon-fp16 neon-vfpv3 neon-vfpv4 vfp
    vfp3 vfpv2 vfpv3 vfpv3-d16 vfpv3-d16-fp16 vfpv3-fp16 vfpv3xd vfpv3xd-fp16 vfpv4 vfpv4-d16

  Valid arguments to -mtp=:
    auto cp15 soft

  Known floating-point ABIs (for use with the -mfloat-abi= option):
    hard soft softfp

  TLS dialect to use:
    gnu gnu2

CPU Info:

pi@zero:~ $ cat /proc/cpuinfo
processor       : 0
model name      : ARMv6-compatible processor rev 7 (v6l)
BogoMIPS        : 996.14
Features        : half thumb fastmult vfp edsp java tls 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xb76
CPU revision    : 7

Hardware        : BCM2835
Revision        : 9000c1
Serial          : XXXXXXXXXXX (Removed)
Model           : Raspberry Pi Zero W Rev 1.1

Info: https://gist.github.com/fm4dd/c663217935dc17f0fc73c9c81b0aa845
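For comparing what gcc detects against what the Makefile passes, a small sketch like the following can pull the interesting values out of the `--help=target` dump. The parser and the `SAMPLE` excerpt are only an illustration (the lines are copied from the output above); nothing here is part of whisper.cpp.

```python
import re


def parse_gcc_target_options(text: str) -> dict[str, str]:
    """Map option names such as '-mfpu=' to the value gcc reports for them."""
    options = {}
    for line in text.splitlines():
        # Option lines are indented and start with "-m"; headers and lists are skipped.
        match = re.match(r"^\s+(-m[\w-]+=?)\s*(.*?)\s*$", line)
        if match:
            options[match.group(1)] = match.group(2)
    return options


SAMPLE = """\
The following options are target specific:
  -march=                               armv6kz+fp
  -mcpu=                                arm1176jzf-s
  -mfp16-format=                        none
  -mfpu=                                vfp
  -mpic-register=
  -munaligned-access                    [enabled]
"""

opts = parse_gcc_target_options(SAMPLE)
makefile_fpu = "neon-fp-armv8"  # the flag the build above was compiled with
native_fpu = opts["-mfpu="]
if native_fpu != makefile_fpu:
    print(f"Makefile uses -mfpu={makefile_fpu}, but gcc reports -mfpu={native_fpu} for this CPU")
```

Here the mismatch (`neon-fp-armv8` requested, `vfp` detected) lines up with the "Illegal instruction" crash reported earlier.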

@ggerganov
Owner

Yeah, I'm not an expert when it comes to ARM architectures and compile flags. Maybe try replacing -mfpu=neon-fp-armv8 with -mfpu=vfp and see if it helps. But it is likely that some of the SIMD intrinsics I use are not supported on this chip.
Anyway, thanks for giving it a try.

@trholding
Contributor

Got Cigar after 35 Minutes!

But it was damn slow and I could take a screenshot only because I was using mosh.

At first it did not work with the Makefile compiler flags alone; I also had to comment out a line in ggml.c.

Makefile, line 38: removed -mfpu=neon-fp-armv8 and added -mfpu=vfp:

ifneq ($(filter armv6%,$(UNAME_M)),)
        # Raspberry Pi 0, 1, 2, 3 
        CFLAGS += -mfpu=vfp -mfp16-format=ieee -mno-unaligned-access
endif

ggml.c Line 70:
Comment out or #ifdef the following include on 32-bit boards (Raspberry Pi 0, 1, 2, 3):

// #include <immintrin.h>

[Screenshot: whisperPI0]

@trholding
Contributor

trholding commented Oct 13, 2022

I have a jailbroken Kindle with an Alpine distro and X on it; I'll try it on that too when I am well.

It has an i.MX 6ULL with a Cortex-A7 @ 528 MHz. I had already booted a custom x86_64 Linux with X on it, using qemu in Alpine ARM. To my surprise, it sort of worked well.

I think for whisper, it may be twice as fast as raspi 0.

How hard would it be for you to support OpenCL as an additional backend for ggml? It would be a great use case, as OpenCL could help accelerate it even on Raspberry Pis, AMD systems, Android phones and other low-power devices that have a GPU.

@trholding
Contributor

@nyadla-sys I think the answer is yes and probably this could be closed. :)

@fquirin

fquirin commented May 2, 2023

Could you run a simple test for comparison:
./main -m "models/ggml-tiny.bin" -f samples/jfk.wav -t 4 -l en --beam-size 1

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 |
...
whisper_print_timings:     load time =   152.68 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   301.16 ms
whisper_print_timings:   sample time =    26.09 ms /    25 runs (    1.04 ms per run)
whisper_print_timings:   encode time =  1798.86 ms /     1 runs ( 1798.86 ms per run)
whisper_print_timings:   decode time =   187.75 ms /    25 runs (    7.51 ms per run)
whisper_print_timings:    total time =  2549.35 ms
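For anyone comparing runs, here is a minimal sketch that parses the `whisper_print_timings` lines into numbers, so the stages can be compared without reading the logs by eye. The regular expression is written against the log format shown in this thread and is an assumption about that format, not an official interface.

```python
import re

# Matches lines such as "whisper_print_timings:   encode time =  1798.86 ms / ..."
TIMING_RE = re.compile(r"whisper_print_timings:\s+(\w+) time\s*=\s*([\d.]+) ms")


def parse_timings(log_text: str) -> dict[str, float]:
    """Return {stage: milliseconds} for each 'whisper_print_timings' line that reports a time."""
    return {m.group(1): float(m.group(2)) for m in TIMING_RE.finditer(log_text)}


LOG = """\
whisper_print_timings:     load time =   152.68 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   301.16 ms
whisper_print_timings:   sample time =    26.09 ms /    25 runs (    1.04 ms per run)
whisper_print_timings:   encode time =  1798.86 ms /     1 runs ( 1798.86 ms per run)
whisper_print_timings:   decode time =   187.75 ms /    25 runs (    7.51 ms per run)
whisper_print_timings:    total time =  2549.35 ms
"""

timings = parse_timings(LOG)
encode_share = timings["encode"] / timings["total"]
print(f"encode share of total: {encode_share:.0%}")
```

In this run the encoder accounts for roughly 70% of the total, so that is the stage where run-to-run differences matter most.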

@StuartIanNaylor

whisper_print_timings:     load time =  1384.94 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   328.22 ms
whisper_print_timings:   sample time =    25.17 ms /    25 runs (    1.01 ms per run)
whisper_print_timings:   encode time =  1203.73 ms /     1 runs ( 1203.73 ms per run)
whisper_print_timings:   decode time =   188.75 ms /    25 runs (    7.55 ms per run)
whisper_print_timings:    total time =  3290.21 ms

@StuartIanNaylor

StuartIanNaylor commented May 2, 2023

taskset -c 4-7 ./main -m "models/ggml-tiny.bin" -f samples/jfk.wav -t 4 -l en --beam-size 1

whisper_print_timings:     load time =   136.70 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   286.95 ms
whisper_print_timings:   sample time =    24.52 ms /    25 runs (    0.98 ms per run)
whisper_print_timings:   encode time =  1171.08 ms /     1 runs ( 1171.08 ms per run)
whisper_print_timings:   decode time =   119.49 ms /    25 runs (    4.78 ms per run)
whisper_print_timings:    total time =  1794.83 ms

Note that the load time is lower here because this was the 2nd run and the model was loaded from memory.
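For reference, `taskset -c 4-7` restricts the process to cores 4 to 7 (presumably the faster cores on this board, but that is an assumption). A rough Python equivalent, in case someone wants to script the benchmark, could look like this; `parse_cpu_list` and `pin_current_process` are hypothetical helper names, while `os.sched_setaffinity` is the standard library call and is only available on Linux.

```python
import os


def parse_cpu_list(spec: str) -> set[int]:
    """Expand a taskset-style CPU list such as '0,2,4-7' into a set of core indices."""
    cpus = set()
    for part in spec.split(","):
        first, sep, last = part.strip().partition("-")
        if sep:
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(first))
    return cpus


def pin_current_process(spec: str) -> None:
    """Restrict this process (pid 0 = the caller) to the listed cores. Linux only."""
    os.sched_setaffinity(0, parse_cpu_list(spec))


print(sorted(parse_cpu_list("4-7")))
```

Calling `pin_current_process("4-7")` before spawning `./main` should have the same effect as the `taskset` prefix, since child processes inherit the affinity mask.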

@fquirin

fquirin commented May 2, 2023

Thanks. Taskset doesn't really change anything for me, just the usual random fluctuations:

taskset -c 4-7 ./main -m "models/ggml-tiny.bin" -f samples/jfk.wav -t 4 -l en --beam-size 1

whisper_print_timings:     load time =   237.44 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   444.52 ms
whisper_print_timings:   sample time =    26.07 ms /    25 runs (    1.04 ms per run)
whisper_print_timings:   encode time =  1719.09 ms /     1 runs ( 1719.09 ms per run)
whisper_print_timings:   decode time =    96.07 ms /    25 runs (    3.84 ms per run)
whisper_print_timings:    total time =  2579.70 ms

I'm very confused 😅

@StuartIanNaylor

StuartIanNaylor commented May 2, 2023

Try the vendor-supplied distros.
Running with taskset is a tad faster than running without it. Without taskset:

whisper_print_timings:     load time =   135.55 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   293.48 ms
whisper_print_timings:   sample time =    24.70 ms /    25 runs (    0.99 ms per run)
whisper_print_timings:   encode time =  1196.19 ms /     1 runs ( 1196.19 ms per run)
whisper_print_timings:   decode time =   186.75 ms /    25 runs (    7.47 ms per run)
whisper_print_timings:    total time =  1892.93 ms

@fquirin

fquirin commented May 2, 2023

Did a sudo apt upgrade and it might have done a little bit ... or I simply get lucky shots from time to time:

whisper_print_timings:     load time =   198.58 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   276.80 ms
whisper_print_timings:   sample time =    26.17 ms /    25 runs (    1.05 ms per run)
whisper_print_timings:   encode time =  1247.96 ms /     1 runs ( 1247.96 ms per run)
whisper_print_timings:   decode time =    97.54 ms /    25 runs (    3.90 ms per run)
whisper_print_timings:    total time =  1902.06 ms

followed by 🤦‍♂️ :

whisper_print_timings:     load time =   148.12 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   428.73 ms
whisper_print_timings:   sample time =    27.10 ms /    25 runs (    1.08 ms per run)
whisper_print_timings:   encode time =  2160.85 ms /     1 runs ( 2160.85 ms per run)
whisper_print_timings:   decode time =   255.73 ms /    25 runs (   10.23 ms per run)
whisper_print_timings:    total time =  3101.65 ms

@ggerganov
Owner

@fquirin Is it more stable with -t 3?

@fquirin

fquirin commented May 2, 2023

Doesn't look like it:

whisper_print_timings:     load time =   205.04 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   538.64 ms
whisper_print_timings:   sample time =    25.99 ms /    25 runs (    1.04 ms per run)
whisper_print_timings:   encode time =  2296.48 ms /     1 runs ( 2296.48 ms per run)
whisper_print_timings:   decode time =    95.20 ms /    25 runs (    3.81 ms per run)
whisper_print_timings:    total time =  3244.47 ms

Directly after this:

whisper_print_timings:     load time =   225.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   366.83 ms
whisper_print_timings:   sample time =    26.04 ms /    25 runs (    1.04 ms per run)
whisper_print_timings:   encode time =  1449.13 ms /     1 runs ( 1449.13 ms per run)
whisper_print_timings:   decode time =    95.01 ms /    25 runs (    3.80 ms per run)
whisper_print_timings:    total time =  2245.81 ms

@fquirin

fquirin commented May 2, 2023

Before trying a completely new OS I gave the Ubuntu 23 slim Docker image a chance:

Got a new all-time low ^^:

whisper_print_timings:     load time =   147.25 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   279.83 ms
whisper_print_timings:   sample time =    26.91 ms /    25 runs (    1.08 ms per run)
whisper_print_timings:   encode time =  1149.59 ms /     1 runs ( 1149.59 ms per run)
whisper_print_timings:   decode time =    97.91 ms /    25 runs (    3.92 ms per run)
whisper_print_timings:    total time =  1781.86 ms

but average looks more like:

whisper_print_timings:     load time =   287.24 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   337.92 ms
whisper_print_timings:   sample time =    26.78 ms /    25 runs (    1.07 ms per run)
whisper_print_timings:   encode time =  1549.98 ms /     1 runs ( 1549.98 ms per run)
whisper_print_timings:   decode time =    97.53 ms /    25 runs (    3.90 ms per run)
whisper_print_timings:    total time =  2382.50 ms

I've seen everything from 1.7s to 3.3s in no particular order. Of course this could still be an issue with the host OS.
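To quantify the fluctuation, here is a quick sketch over the total times fquirin posted above. It is not a controlled benchmark, since some of these runs used taskset, `-t 3` or the Docker image, so the numbers only illustrate the spread.

```python
import statistics

# Total times (ms) posted by fquirin above, in order. The runs are not a controlled
# benchmark: some used taskset, -t 3, or a Docker image.
totals_ms = [2549.35, 2579.70, 1902.06, 3101.65, 3244.47, 2245.81, 1781.86, 2382.50]

mean = statistics.mean(totals_ms)
stdev = statistics.stdev(totals_ms)  # sample standard deviation
spread = max(totals_ms) / min(totals_ms)

print(f"min {min(totals_ms):.0f} ms, max {max(totals_ms):.0f} ms")
print(f"mean {mean:.0f} ms, stdev {stdev:.0f} ms, max/min {spread:.2f}x")
```

With the slowest run taking about 1.8 times as long as the fastest, a single measurement says very little; repeating each configuration several times and comparing means or medians would be more convincing.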

@ggerganov
Owner

Just tried the 8-bit model on my RPi4 which is running a 32-bit OS:

pi@raspberrypi:~/whisper.cpp $ getconf LONG_BIT
32
pi@raspberrypi:~/whisper.cpp $ ./main -m ./models/ggml-tiny.en-q8_0.bin ./samples/jfk.wav -t 3
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 7
whisper_model_load: type          = 1
whisper_model_load: mem required  =  172.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   43.18 MB
whisper_model_load: model size    =   43.14 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

system_info: n_threads = 3 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 3 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:08.000]   And so my fellow Americans ask not what your country can do for you
[00:00:08.000 --> 00:00:11.000]   ask what you can do for your country.


whisper_print_timings:     load time =   433.38 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =  1068.06 ms
whisper_print_timings:   sample time =   192.17 ms /    27 runs (    7.12 ms per run)
whisper_print_timings:   encode time =  9107.05 ms /     1 runs ( 9107.05 ms per run)
whisper_print_timings:   decode time =   762.21 ms /    27 runs (   28.23 ms per run)
whisper_print_timings:    total time = 11918.20 ms
pi@raspberrypi:~/whisper.cpp $ ./main -m ./models/ggml-tiny.en-q8_0.bin ./samples/jfk.wav -t 3
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 7
whisper_model_load: type          = 1
whisper_model_load: mem required  =  172.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   43.18 MB
whisper_model_load: model size    =   43.14 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

system_info: n_threads = 3 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 3 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:08.000]   And so my fellow Americans ask not what your country can do for you
[00:00:08.000 --> 00:00:11.000]   ask what you can do for your country.


whisper_print_timings:     load time =   429.34 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =  1062.75 ms
whisper_print_timings:   sample time =    77.46 ms /    27 runs (    2.87 ms per run)
whisper_print_timings:   encode time = 10014.02 ms /     1 runs (10014.02 ms per run)
whisper_print_timings:   decode time =   413.60 ms /    27 runs (   15.32 ms per run)
whisper_print_timings:    total time = 12351.25 ms
pi@raspberrypi:~/whisper.cpp $ ./main -m ./models/ggml-tiny.en-q8_0.bin ./samples/jfk.wav -t 3
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 7
whisper_model_load: type          = 1
whisper_model_load: mem required  =  172.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   43.18 MB
whisper_model_load: model size    =   43.14 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

system_info: n_threads = 3 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 3 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:08.000]   And so my fellow Americans ask not what your country can do for you
[00:00:08.000 --> 00:00:11.000]   ask what you can do for your country.


whisper_print_timings:     load time =   433.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   890.49 ms
whisper_print_timings:   sample time =    77.42 ms /    27 runs (    2.87 ms per run)
whisper_print_timings:   encode time =  9910.22 ms /     1 runs ( 9910.22 ms per run)
whisper_print_timings:   decode time =   417.30 ms /    27 runs (   15.46 ms per run)
whisper_print_timings:    total time = 12083.65 ms
pi@raspberrypi:~/whisper.cpp $ ./main -m ./models/ggml-tiny.en-q8_0.bin ./samples/jfk.wav -t 3
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 7
whisper_model_load: type          = 1
whisper_model_load: mem required  =  172.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =   43.18 MB
whisper_model_load: model size    =   43.14 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

system_info: n_threads = 3 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 3 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:08.000]   And so my fellow Americans ask not what your country can do for you
[00:00:08.000 --> 00:00:11.000]   ask what you can do for your country.


whisper_print_timings:     load time =   435.73 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =  1075.16 ms
whisper_print_timings:   sample time =    77.48 ms /    27 runs (    2.87 ms per run)
whisper_print_timings:   encode time =  8273.19 ms /     1 runs ( 8273.19 ms per run)
whisper_print_timings:   decode time =   414.45 ms /    27 runs (   15.35 ms per run)
whisper_print_timings:    total time = 10632.44 ms
pi@raspberrypi:~/whisper.cpp $  

The total time fluctuates around 12 s, but the variation is large: the last run dropped to 10.6 s.
I'm not sure what causes this variation.
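
To put a number on that variation, here is a minimal Python sketch that extracts the `total time` values from the three runs shown above and reports the mean, the sample standard deviation, and the real-time factor. The names `LOG`, `total_times_ms` and `AUDIO_SECONDS` are illustrative and not part of whisper.cpp, and three samples are too few for anything more than a rough impression.

```python
import re
import statistics

# "total time" lines copied from the three tiny.en-q8_0 runs above.
LOG = """\
whisper_print_timings:    total time = 12351.25 ms
whisper_print_timings:    total time = 12083.65 ms
whisper_print_timings:    total time = 10632.44 ms
"""

AUDIO_SECONDS = 11.0  # length of samples/jfk.wav as reported by main


def total_times_ms(log):
    """Extract every 'total time' value (in ms) from whisper.cpp output."""
    pattern = re.compile(r"total time\s*=\s*([\d.]+)\s*ms")
    return [float(m.group(1)) for m in pattern.finditer(log)]


times = total_times_ms(LOG)
mean_ms = statistics.mean(times)
stdev_ms = statistics.stdev(times)  # sample standard deviation
spread_pct = 100.0 * (max(times) - min(times)) / mean_ms
rtf = (mean_ms / 1000.0) / AUDIO_SECONDS  # above 1 means slower than real time

print(f"runs={len(times)} mean={mean_ms:.0f} ms stdev={stdev_ms:.0f} ms")
print(f"spread={spread_pct:.1f}% of mean, real-time factor={rtf:.2f}")
```

Piping the output of several `./main` runs into a file and feeding that file to `total_times_ms` would give a larger sample.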

@StuartIanNaylor

StuartIanNaylor commented May 2, 2023

Download the Focal OPi image to take the OS out of the equation.
I don't get how you see so much variance on the exact same SBC/board.

whisper_print_timings:    total time =  1791.24 ms
whisper_print_timings:    total time =  1797.02 ms
whisper_print_timings:    total time =  1780.25 ms
whisper_print_timings:    total time =  1791.39 ms
whisper_print_timings:    total time =  1784.43 ms

Consecutive runs.

@fquirin

fquirin commented May 2, 2023

Indeed, I quickly flashed Ubuntu Jammy server (Ubuntu 22.04.2 LTS) onto a SD card 😲:

taskset -c 4-7 ./main -m "models/ggml-tiny.bin" -f samples/jfk.wav -t 4 -l en --beam-size 1

whisper_print_timings:    total time =  1799.18 ms
whisper_print_timings:    total time =  1818.56 ms
whisper_print_timings:    total time =  1795.55 ms
whisper_print_timings:    total time =  1811.46 ms

[EDIT]
Benchmark is still pretty slow though (maybe the CPU is throttling due to heat idk):

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| OrangePi5 | Ubuntu 22.04.2 LTS | NEON | tiny | 4 | 111 | 3397 | 05bef0f |

@ggerganov
Owner

@fquirin and @StuartIanNaylor

Can you bench the Q8_0 model as well?

# quantize to 8-bits
./quantize models/ggml-tiny.bin models/ggml-tiny-q8_0.bin q8_0

@fquirin

fquirin commented May 2, 2023

With Q8_0 I'm getting pretty consistent results:
whisper_print_timings: total time = 1553.91 ms

and with Q5_0:
whisper_print_timings: total time = 1888.17 ms

@fquirin

fquirin commented May 2, 2023

Running the benchmark with only the encoder gives pretty stable results:

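| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| OPi5 | Ubuntu 22.04.2 LTS | NEON | tiny | 4 | 106 | 1179 | 05bef0f |
| OPi5 | Ubuntu 22.04.2 LTS | NEON | tiny-q5_0 | 4 | 77 | 1339 | 05bef0f |
| OPi5 | Ubuntu 22.04.2 LTS | NEON | tiny-q8_0 | 4 | 91 | 1027 | 05bef0f |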

Maybe the 'ggml_mul_mat' benchmark leads to CPU throttling after some time 🤔, but a drop from '3397' to '1179' seems pretty drastic.

@StuartIanNaylor

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON | tiny | 4 | 122 | 1184 | 05bef0f |
| <todo> | <todo> |  NEON | tiny-q5_0 | 4 | 91 | 1358 | 05bef0f |
| <todo> | <todo> |  NEON | tiny-q8_0 | 4 | 103 | 1042 | 05bef0f |
| <todo> | <todo> |  NEON | base | 4 | 167 | 2908 | 05bef0f |
| <todo> | <todo> |  NEON | base-q5_0 | 4 | 103 | 3203 | 05bef0f |
| <todo> | <todo> |  NEON | base-q8_0 | 4 | 125 | 2542 | 05bef0f |
| <todo> | <todo> |  NEON | small | 4 | 382 | 10883 | 05bef0f |
| <todo> | <todo> |  NEON | small-q5_0 | 4 | 190 | 11475 | 05bef0f |
| <todo> | <todo> |  NEON | small-q8_0 | 4 | 264 | 8009 | 05bef0f |
| <todo> | <todo> |  NEON | medium | 4 | 3253 | 35805 | 05bef0f |
| <todo> | <todo> |  NEON | medium-q5_0 | 4 | 441 | 37224 | 05bef0f |
| <todo> | <todo> |  NEON | medium-q8_0 | 4 | 5922 | 26390 | 05bef0f |
| <todo> | <todo> |  NEON | large | 4 | 46942 | 85866 | 05bef0f |
| <todo> | <todo> |  NEON | large-q5_0 | 4 | 826 | 69961 | 05bef0f |
| <todo> | <todo> |  NEON | large-q8_0 | 4 | 26708 | 47956 | 05bef0f |
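
The effect of quantization on encoder time is easier to see as a ratio. The sketch below copies the "Enc." column from the table above into a dictionary and divides the unquantized time by the quantized one. The names `ENC_MS` and `speedup` are made up for this illustration.

```python
# Encoder times in ms, copied from the "Enc." column of the table above
# (4 threads, commit 05bef0f). "full" is the unquantized model file.
ENC_MS = {
    "tiny":   {"full": 1184,  "q5_0": 1358,  "q8_0": 1042},
    "base":   {"full": 2908,  "q5_0": 3203,  "q8_0": 2542},
    "small":  {"full": 10883, "q5_0": 11475, "q8_0": 8009},
    "medium": {"full": 35805, "q5_0": 37224, "q8_0": 26390},
    "large":  {"full": 85866, "q5_0": 69961, "q8_0": 47956},
}


def speedup(model, quant):
    """Return unquantized encode time divided by quantized encode time."""
    row = ENC_MS[model]
    return row["full"] / row[quant]


for model in ENC_MS:
    q5 = speedup(model, "q5_0")
    q8 = speedup(model, "q8_0")
    print(f"{model:7s} q5_0 x{q5:.2f}  q8_0 x{q8:.2f}")
```

With these numbers, q8_0 encodes faster than the unquantized model at every size, while q5_0 is slower for everything except large.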

@fquirin

fquirin commented May 3, 2023

Since we get very similar and stable results in single runs now, I decided to investigate the degrading performance for longer runs a bit more. Here are 2 consecutive benchmark runs (encoder-only):

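| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
|  |  | NEON | tiny | 4 | 114 | 1173 | 05bef0f |
|  |  | NEON | tiny-q5_0 | 4 | 70 | 1342 | 05bef0f |
|  |  | NEON | tiny-q8_0 | 4 | 87 | 1035 | 05bef0f |
|  |  | NEON | small | 4 | 374 | 12469 | 05bef0f |
|  |  | NEON | medium | 4 | 1063 | 67746 | 05bef0f |

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
|  |  | NEON | tiny | 4 | 110 | 1399 | 05bef0f |
|  |  | NEON | tiny-q5_0 | 4 | 70 | 2060 | 05bef0f |
|  |  | NEON | tiny-q8_0 | 4 | 92 | 1756 | 05bef0f |
|  |  | NEON | small | 4 | 383 | 22989 | 05bef0f |
|  |  | NEON | medium | 4 | 1098 | 90906 | 05bef0f |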

And here are a few consecutive single runs with the small model:

whisper_print_timings:    total time = 13441.03 ms
whisper_print_timings:    total time = 13952.61 ms
whisper_print_timings:    total time = 14572.80 ms
whisper_print_timings:    total time = 16029.31 ms
whisper_print_timings:    total time = 17203.41 ms

I'd say this is a pretty strong indication that my Orange Pi 5 is throttling after about 30s of cooking the CPU 🤔.
I'm starting to think that Armbian is handling this throttling differently.
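
As a rough way to quantify the degradation, this sketch compares the encoder times of the two consecutive benchmark passes and checks how the five single runs with the small model drift. All numbers are copied from this comment; the names `FIRST_PASS`, `SECOND_PASS`, `SMALL_RUNS_MS` and `slowdown_pct` are illustrative only.

```python
# Encoder times (ms) from the two consecutive benchmark passes above.
FIRST_PASS = {"tiny": 1173, "tiny-q5_0": 1342, "tiny-q8_0": 1035,
              "small": 12469, "medium": 67746}
SECOND_PASS = {"tiny": 1399, "tiny-q5_0": 2060, "tiny-q8_0": 1756,
               "small": 22989, "medium": 90906}

# Total times (ms) of the consecutive single runs with the small model.
SMALL_RUNS_MS = [13441.03, 13952.61, 14572.80, 16029.31, 17203.41]


def slowdown_pct(before, after):
    """Return how much slower 'after' is than 'before', in percent."""
    return 100.0 * (after - before) / before


for model, first in FIRST_PASS.items():
    pct = slowdown_pct(first, SECOND_PASS[model])
    print(f"{model:10s} second pass is {pct:5.1f}% slower")

# A run-over-run increase in every step is what sustained heating would look like.
steadily_slower = all(b > a for a, b in zip(SMALL_RUNS_MS, SMALL_RUNS_MS[1:]))
drift = slowdown_pct(SMALL_RUNS_MS[0], SMALL_RUNS_MS[-1])
print(f"small model: steadily slower={steadily_slower}, drift={drift:.1f}%")
```

A steady increase like this is consistent with thermal throttling, but it does not prove it; logging temperature and clock speed during the runs would be the direct check.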

@StuartIanNaylor

StuartIanNaylor commented May 3, 2023

@fquirin Have you ever opened another CLI window and monitored the temps vs. clock speed? 75 °C is the throttle point, which is quite low for a CPU.
I have an extremely good cooling solution now with the armour case, but as it comes by default just about everything is wrong, and it was far inferior to the 40mm stick-on I used at first.

I can run stress-ng --cpu 8 --vm 2 --vm-bytes 128M --fork 4 constantly and settle at approx. 60 °C max.

phoronix-test-suite benchmark stockfish is supposedly, but the peak of the ggml_mul_mat benchmark can reach 11 watts at the plug, which is the highest I have seen on this SoC.

watch -n1 cat /sys/class/thermal/thermal_zone*/temp
watch -n 1 cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
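
If you would rather log both values side by side while a benchmark runs, the two `watch` commands can be approximated with a small Python sketch like the one below. It assumes the usual Linux sysfs units (temperatures in millidegrees Celsius, frequencies in kHz); the function names are made up for this example, and `cpuinfo_cur_freq` may only be readable as root on some kernels.

```python
import glob
import time

THERMAL_GLOB = "/sys/class/thermal/thermal_zone*/temp"
# Same path as the watch command above; it may need root on some systems.
FREQ_GLOB = "/sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq"


def read_int(path):
    """Return the integer stored in a sysfs file, or None if unreadable."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def snapshot(thermal_glob=THERMAL_GLOB, freq_glob=FREQ_GLOB):
    """Return (temperatures in degrees C, CPU frequencies in MHz)."""
    temps = [read_int(p) for p in sorted(glob.glob(thermal_glob))]
    freqs = [read_int(p) for p in sorted(glob.glob(freq_glob))]
    temps_c = [t / 1000.0 for t in temps if t is not None]    # millidegrees -> C
    freqs_mhz = [f / 1000.0 for f in freqs if f is not None]  # kHz -> MHz
    return temps_c, freqs_mhz


def monitor(samples=30, interval=1.0):
    """Print one timestamped line per sample, once per interval."""
    for _ in range(samples):
        temps_c, freqs_mhz = snapshot()
        print(time.strftime("%H:%M:%S"), "temps C:", temps_c, "MHz:", freqs_mhz)
        time.sleep(interval)
```

Calling `monitor()` in a second terminal while the benchmark runs in the first gives a timestamped log that can be lined up against the `whisper_print_timings` output.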

If you want to go all out on a cooling solution then https://www.amazon.com/dp/B0C2T9N9L2?
A 30-40mm stick-on heatsink with a fan is a much cheaper option, but it sounds like what you are using is not adequate.

@StuartIanNaylor

StuartIanNaylor commented May 3, 2023

@ggerganov

Hi, I tried running the bench with CLBlast:

memcpy: 9.33 GB/s (1 thread)
sum:    136902081526.000000

Running ggml_mul_mat benchmark with 8 threads


Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
Using Platform: ARM Platform Device: Mali-LODX r0p0
  64 x   64: Q4_0     0.5 GFLOPS (128 runs) | Q4_1     0.4 GFLOPS (128 runs) | Q4_2     0.4 GFLOPS (128 runs)
  64 x   64: Q5_0     0.5 GFLOPS (128 runs) | Q5_1     0.5 GFLOPS (128 runs) | Q8_0     0.5 GFLOPS (128 runs)
  64 x   64: F16      0.5 GFLOPS (128 runs) | F32      0.5 GFLOPS (128 runs)
 128 x  128: Q4_0     1.9 GFLOPS (128 runs) | Q4_1     1.6 GFLOPS (128 runs) | Q4_2     1.6 GFLOPS (128 runs)
 128 x  128: Q5_0     1.6 GFLOPS (128 runs) | Q5_1     1.8 GFLOPS (128 runs) | Q8_0     1.7 GFLOPS (128 runs)
 128 x  128: F16      1.8 GFLOPS (128 runs) | F32      2.1 GFLOPS (128 runs)
 256 x  256: Q4_0    11.7 GFLOPS (128 runs) | Q4_1    14.3 GFLOPS (128 runs) | Q4_2    12.8 GFLOPS (128 runs)
 256 x  256: Q5_0    11.9 GFLOPS (128 runs) | Q5_1    11.6 GFLOPS (128 runs) | Q8_0    12.9 GFLOPS (128 runs)
 256 x  256: F16     12.7 GFLOPS (128 runs) | F32     13.4 GFLOPS (128 runs)
 512 x  512: Q4_0    42.5 GFLOPS (128 runs) | Q4_1    42.0 GFLOPS (128 runs) | Q4_2    40.7 GFLOPS (128 runs)
 512 x  512: Q5_0    41.0 GFLOPS (128 runs) | Q5_1    41.7 GFLOPS (128 runs) | Q8_0    42.0 GFLOPS (128 runs)
 512 x  512: F16     38.3 GFLOPS (128 runs) | F32     39.0 GFLOPS (128 runs)
1024 x 1024: Q4_0    69.4 GFLOPS ( 33 runs) | Q4_1    71.0 GFLOPS ( 34 runs) | Q4_2    70.6 GFLOPS ( 33 runs)
1024 x 1024: Q5_0    70.4 GFLOPS ( 33 runs) | Q5_1    70.2 GFLOPS ( 33 runs) | Q8_0    69.2 GFLOPS ( 33 runs)
1024 x 1024: F16     66.4 GFLOPS ( 31 runs) | F32     70.0 GFLOPS ( 33 runs)
2048 x 2048: Q4_0    81.3 GFLOPS (  5 runs) | Q4_1    81.4 GFLOPS (  5 runs) | Q4_2    81.1 GFLOPS (  5 runs)
2048 x 2048: Q5_0    80.9 GFLOPS (  5 runs) | Q5_1    81.4 GFLOPS (  5 runs) | Q8_0    81.4 GFLOPS (  5 runs)
2048 x 2048: F16     78.9 GFLOPS (  5 runs) | F32     80.1 GFLOPS (  5 runs)
4096 x 4096: Q4_0    87.0 GFLOPS (  3 runs) | Q4_1    86.9 GFLOPS (  3 runs) | Q4_2    86.9 GFLOPS (  3 runs)
4096 x 4096: Q5_0    86.4 GFLOPS (  3 runs) | Q5_1    86.9 GFLOPS (  3 runs) | Q8_0    86.7 GFLOPS (  3 runs)
4096 x 4096: F16     85.6 GFLOPS (  3 runs) | F32     86.0 GFLOPS (  3 runs)
./extra/bench-all.sh: line 45: 2051349 Segmentation fault      ./bench -w 2 -t $n_threads 2>&1

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
./extra/bench-all.sh: line 50: 2082709 Segmentation fault      ./bench -m ./models/ggml-$model.bin -t $n_threads 2> /dev/null > /dev/null
./extra/bench-all.sh: line 50: 2082762 Segmentation fault      ./bench -m ./models/ggml-$model.bin -t $n_threads 2> /dev/null > /dev/null
./extra/bench-all.sh: line 50: 2082834 Segmentation fault      ./bench -m ./models/ggml-$model.bin -t $n_threads 2> /dev/null > /dev/null
orangepi@orangepi5:~$ clinfo
Number of platforms                               1
  Platform Name                                   ARM Platform
  Platform Vendor                                 ARM
  Platform Version                                OpenCL 2.1 v1.g6p0-01eac0.efb75e2978d783a80fe78be1bfb0efc1
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
  Platform Extensions function suffix             ARM
  Platform Host timer resolution                  1ns

  Platform Name                                   ARM Platform
Number of devices                                 1
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
  Device Name                                     Mali-LODX r0p0
  Device Vendor                                   ARM
  Device Vendor ID                                0xa8670000
  Device Version                                  OpenCL 2.1 v1.g6p0-01eac0.efb75e2978d783a80fe78be1bfb0efc1
  Device UUID                                     000067a8-0100-0000-0000-000000000000
  Driver UUID                                     d9495bef-ea91-7c52-8a43-8a3c2f7b49cc
  Valid Device LUID                               No
  Device LUID                                     0000-000000000000
  Device Node Mask                                0
  Device Numeric Version                          0x801000 (2.1.0)
  Driver Version                                  2.1
  Device OpenCL C Version                         OpenCL C 2.0 v1.g6p0-01eac0.efb75e2978d783a80fe78be1bfb0efc1
  Device C++ for OpenCL Numeric Version           0x400000 (1.0.0)
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               4
  Available core IDs                              0, 2, 16, 18
  Max clock frequency                             1000MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             1024
  Preferred work group size multiple (kernel)     16
  Max sub-groups per work group                   64
  Preferred / native vector sizes
    char                                                16 / 4
    short                                                8 / 2
    int                                                  4 / 1
    long                                                 2 / 1
    half                                                 8 / 2        (cl_khr_fp16)
    float                                                4 / 1
    double                                               0 / 0        (n/a)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    64, Little-Endian
  Global memory size                              3910270976 (3.642GiB)
  Error Correction support                        No
  Max memory allocation                           3910270976 (3.642GiB)
  Unified memory for Host and Device              Yes
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   No
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics
    SVM                                           0 bytes
    Global                                        0 bytes
    Local                                         0 bytes
  Max size for global variable                    65536 (64KiB)
  Preferred total size of global vars             0
  Global Memory cache type                        Read/Write
  Global Memory cache size                        1048576 (1024KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   32 bytes
    Pitch alignment for 2D image buffers          64 pixels
    Max 2D image size                             65536x65536 pixels
    Max 3D image size                             65536x65536x65536 pixels
    Max number of read image args                 128
    Max number of write image args                64
    Max number of read/write image args           64
  Max number of pipe args                         16
  Max active pipe reservations                    1
  Max pipe packet size                            1024
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max number of constant args                     128
  Max constant buffer size                        3910270976 (3.642GiB)
  Max size of kernel argument                     1024
  Queue properties (on host)
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Queue properties (on device)
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                2097152 (2MiB)
    Max size                                      16777216 (16MiB)
  Max queues on device                            1
  Max events on device                            1024
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Sub-group independent forward progress        Yes
    IL version                                    SPIR-V_1.0
    ILs with version                              SPIR-V                                                0x400000 (1.0.0)
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Built-in kernels with version                   (n/a)
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
  Device Extensions with Version                  cl_khr_global_int32_base_atomics                      0x400000 (1.0.0)
                                                  cl_khr_global_int32_extended_atomics                  0x400000 (1.0.0)
                                                  cl_khr_local_int32_base_atomics                       0x400000 (1.0.0)
                                                  cl_khr_local_int32_extended_atomics                   0x400000 (1.0.0)
                                                  cl_khr_byte_addressable_store                         0x400000 (1.0.0)
                                                  cl_khr_3d_image_writes                                0x400000 (1.0.0)
                                                  cl_khr_int64_base_atomics                             0x400000 (1.0.0)
                                                  cl_khr_int64_extended_atomics                         0x400000 (1.0.0)
                                                  cl_khr_fp16                                           0x400000 (1.0.0)
                                                  cl_khr_icd                                            0x400000 (1.0.0)
                                                  cl_khr_egl_image                                      0x400000 (1.0.0)
                                                  cl_khr_image2d_from_buffer                            0x400000 (1.0.0)
                                                  cl_khr_depth_images                                   0x400000 (1.0.0)
                                                  cl_khr_subgroups                                      0x400000 (1.0.0)
                                                  cl_khr_subgroup_extended_types                        0x400000 (1.0.0)
                                                  cl_khr_subgroup_non_uniform_vote                      0x400000 (1.0.0)
                                                  cl_khr_subgroup_ballot                                0x400000 (1.0.0)
                                                  cl_khr_il_program                                     0x400000 (1.0.0)
                                                  cl_khr_priority_hints                                 0x400000 (1.0.0)
                                                  cl_khr_create_command_queue                           0x400000 (1.0.0)
                                                  cl_khr_spirv_no_integer_wrap_decoration               0x400000 (1.0.0)
                                                  cl_khr_extended_versioning                            0x400000 (1.0.0)
                                                  cl_khr_device_uuid                                    0x400000 (1.0.0)
                                                  cl_arm_core_id                                        0x400000 (1.0.0)
                                                  cl_arm_printf                                         0x400000 (1.0.0)
                                                  cl_arm_non_uniform_work_group_size                    0x400000 (1.0.0)
                                                  cl_arm_import_memory                                  0x400000 (1.0.0)
                                                  cl_arm_import_memory_dma_buf                          0x400000 (1.0.0)
                                                  cl_arm_import_memory_host                             0x400000 (1.0.0)
                                                  cl_arm_integer_dot_product_int8                       0x400000 (1.0.0)
                                                  cl_arm_integer_dot_product_accumulate_int8            0x400000 (1.0.0)
                                                  cl_arm_integer_dot_product_accumulate_saturate_int8   0x400000 (1.0.0)
                                                  cl_arm_scheduling_controls                              0x3000 (0.3.0)
                                                  cl_arm_controlled_kernel_termination                  0x400000 (1.0.0)
                                                  cl_ext_cxx_for_opencl                                 0x400000 (1.0.0)

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  ARM Platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [ARM]
  clCreateContext(NULL, ...) [default]            Success [ARM]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-LODX r0p0

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.14
  ICD loader Profile                              OpenCL 3.0

@fquirin
fquirin commented May 3, 2023

Have you ever just opened up another Cli window and monitored the temps vs clock speed?

On Armbian I used armbianmonitor -m while running the benchmark and saw temperatures above 80°C pretty quickly. CPU clock speeds seemed to stay above 2 GHz, though. A 20% drop in clock speed could probably have a larger effect on the inference time, I guess. If you say it starts throttling at 75°C, then I'm definitely in that range.

watch -n1 cat /sys/class/thermal/thermal_zone*/temp
watch -n 1 cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
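
The two `watch` commands above can also be folded into one reading, so temperature and clock speed land on the same line. The following is only a sketch under a few assumptions: the function names are made up for illustration, the paths are the standard Linux sysfs ones, and it reads `scaling_cur_freq` rather than `cpuinfo_cur_freq` because the latter is often readable only by root.

```python
from pathlib import Path


def to_celsius(raw):
    """Convert a sysfs temperature string (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0


def to_mhz(raw):
    """Convert a cpufreq string (kHz) to MHz."""
    return int(raw.strip()) / 1000.0


def read_text(path):
    """Return the file contents, or None if the file is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


def snapshot(sys_root="/sys"):
    """Collect {zone: degC} and {cpu: MHz} from the sysfs tree under sys_root."""
    root = Path(sys_root)
    temps, freqs = {}, {}
    for zone in sorted(root.glob("class/thermal/thermal_zone*")):
        raw = read_text(zone / "temp")
        if raw is not None:
            temps[zone.name] = to_celsius(raw)
    # cpu[0-9]* skips sibling directories such as cpufreq/ and cpuidle/.
    for cpu in sorted(root.glob("devices/system/cpu/cpu[0-9]*")):
        raw = read_text(cpu / "cpufreq" / "scaling_cur_freq")
        if raw is not None:
            freqs[cpu.name] = to_mhz(raw)
    return temps, freqs


if __name__ == "__main__":
    temps, freqs = snapshot()
    hottest = max(temps.values(), default=float("nan"))
    slowest = min(freqs.values(), default=float("nan"))
    print(f"hottest zone: {hottest:5.1f} C   slowest core: {slowest:7.1f} MHz")
```

Saved under a name of your choice, it can be run under `watch -n1` in a second terminal while the benchmark is going.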

I'll test that again on the Ubuntu system 👍

If you want to go all out on a cooling solution then https://www.amazon.com/dp/B0C2T9N9L2?

Looks pretty fancy 😎

@StuartIanNaylor
StuartIanNaylor commented May 3, 2023

Yeah, I got one of the 'armor' cases and it's a strange implementation, as the fan sits flat on the metal base and has no space to push air through.
So I gave it a little space by adding some M3 nuts as spacers.
Even then it was terrible. I think the thermal pads are bad, and it also puts a lot of pressure on the board and warps it slightly.
So it got a full makeover with 2x Gelid thermal pads, one on each side of the CPU, to provide pressure from both sides.
It also got a 30mm PWM fan so it turns off when idle.

The above is probably the most OP cooler you can get for the OPi5, and after all the additions I probably spent about the same.
/sys/devices/virtual/thermal/thermal_zone0/trip_point_0_temp is 75000 (millidegrees, so 75°C) and I think that is the throttle temperature,
which is pretty low, but that is what it's set to.
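
One way to check whether that 75000 value really is the throttle point is to read the matching `trip_point_N_type` file next to each `trip_point_N_temp`: the kernel's thermal sysfs reports types such as `passive` (throttle via cooling devices), `active`, `hot` and `critical`. A minimal sketch, assuming the zone path quoted above exists on your board; the function names are illustrative only.

```python
from pathlib import Path


def parse_trip(temp_text, type_text):
    """Turn the raw contents of a trip point's _temp/_type files into (type, degC)."""
    return type_text.strip(), int(temp_text.strip()) / 1000.0


def list_trip_points(zone_dir):
    """Return sorted [(index, type, degC)] for every trip point in a thermal zone."""
    trips = []
    for temp_file in Path(zone_dir).glob("trip_point_*_temp"):
        try:
            # File names look like trip_point_0_temp; field 2 is the index.
            index = int(temp_file.name.split("_")[2])
            type_file = temp_file.with_name(f"trip_point_{index}_type")
            kind, celsius = parse_trip(temp_file.read_text(), type_file.read_text())
        except (OSError, ValueError):
            continue  # skip entries that are unreadable or oddly named
        trips.append((index, kind, celsius))
    return sorted(trips)


if __name__ == "__main__":
    zone = "/sys/devices/virtual/thermal/thermal_zone0"
    for index, kind, celsius in list_trip_points(zone):
        print(f"trip {index}: {kind:<10} {celsius:.1f} C")
```

If trip 0 comes back as `passive`, that would support reading 75°C as the point where throttling begins.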

But as said, a 30/40mm stick with a fan will suffice; the SoC is capable of a 12 W TDP, if I remember correctly.

orangepi@orangepi5:~/whisper.cpp$ taskset -c 4-7 ./main -m models/ggml-tiny-q8_0.bin -f ./samples/jfk.wav
whisper_init_from_file_no_state: loading model from 'models/ggml-tiny-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 7
whisper_model_load: type          = 1
whisper_model_load: mem required  =  172.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =   43.18 MB
whisper_model_load: model size    =   43.14 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 |

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:10.500]   And so my fellow Americans ask not what your country can do for you ask what you can do for your country.


whisper_print_timings:     load time =   112.34 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   290.00 ms
whisper_print_timings:   sample time =    24.98 ms /    25 runs (    1.00 ms per run)
whisper_print_timings:   encode time =  1050.98 ms /     1 runs ( 1050.98 ms per run)
whisper_print_timings:   decode time =   129.20 ms /    25 runs (    5.17 ms per run)
whisper_print_timings:    total time =  1667.61 ms
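
To compare runs like this one across boards, it helps to reduce the timings to a real-time factor (processing time divided by audio duration; below 1.0 means faster than real time). The sketch below parses the `whisper_print_timings` lines from the log above; the regular expression and the function names are my own and only assume the log format shown here.

```python
import re

# Timing lines copied from the run above; the audio length (11.0 s) comes
# from the "main: processing" line of the same log.
LOG = """\
whisper_print_timings:     load time =   112.34 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   290.00 ms
whisper_print_timings:   sample time =    24.98 ms /    25 runs (    1.00 ms per run)
whisper_print_timings:   encode time =  1050.98 ms /     1 runs ( 1050.98 ms per run)
whisper_print_timings:   decode time =   129.20 ms /    25 runs (    5.17 ms per run)
whisper_print_timings:    total time =  1667.61 ms
"""
AUDIO_SECONDS = 11.0

TIMING_RE = re.compile(r"whisper_print_timings:\s+(\w+) time\s+=\s+([\d.]+) ms")


def parse_timings(log):
    """Map each stage name (load, mel, encode, ...) to its time in milliseconds."""
    return {m.group(1): float(m.group(2)) for m in TIMING_RE.finditer(log)}


def real_time_factor(total_ms, audio_seconds):
    """Processing time divided by audio duration."""
    return (total_ms / 1000.0) / audio_seconds


if __name__ == "__main__":
    timings = parse_timings(LOG)
    rtf = real_time_factor(timings["total"], AUDIO_SECONDS)
    print(f"encode share: {timings['encode'] / timings['total']:.0%}")
    print(f"real-time factor: {rtf:.3f}")
```

For this run the factor comes out at roughly 0.15, with the encoder accounting for most of the total time.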

@Leeviber
Leeviber commented May 5, 2023

@StuartIanNaylor Hi, have you tried running whisper.cpp on the GPU of an RK3588? I'm having a lot of trouble cross-compiling CLBlast for my RK3588 dev board, so I'm wondering how much of a speed boost GPU acceleration with CLBlast can bring and whether it's worth doing.

@StuartIanNaylor
StuartIanNaylor commented May 5, 2023

Yeah, it's not all roses. I get:

WHISPER_CLBLAST=1 make -j4
I whisper.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  unknown
I UNAME_M:  aarch64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_CLBLAST
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:  -lclblast -lOpenCL
I CC:       cc (Debian 10.2.1-6) 10.2.1 20210110
I CXX:      g++ (Debian 10.2.1-6) 10.2.1 20210110

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_CLBLAST   -c ggml.c -o ggml.o
cc -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_CLBLAST -c ggml-opencl.c -o ggml-opencl.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c whisper.cpp -o whisper.o
CFLAGS   += -mcpu=native
make: CFLAGS: No such file or directory
make: *** [Makefile:181: ggml-opencl.o] Error 127
make: *** Waiting for unfinished jobs....

I just ignore the error and run the command again, and it works, even though it is slow.

I think there are problems with the driver and the Mali G610, because some of the CLBlast tuners fail with errors (I forget which ones). The tuners are located in /usr/bin, e.g. /usr/bin/clblast_tuner_copy_fast, and each is a separately named binary rather than one binary that takes a parameter. But the routines that are needed do work.

Out of curiosity I tried something non-AMD, as I don't have a Vega board: an Intel HD 630 installs the same way and seems to behave similarly.

Both run but are slow, and it's an open question how much is actually running on the GPU; I wonder whether it tries the GPU and gets a failure back.
There is a /proc or /sys entry that shows GPU load, I just forgot where it is located.
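
As a possible starting point for finding that entry: on Rockchip vendor kernels the GPU is usually exposed as a devfreq device with a `load` file that prints something like `35@800000000Hz` (percent busy at the current clock). Both the path pattern and that format are assumptions here, not something confirmed in this thread, and mainline kernels may not have the file at all. The sketch searches for it instead of hard-coding a device name.

```python
import re
from pathlib import Path

# Assumed format of the vendor-specific file: "<percent>@<frequency>Hz".
LOAD_RE = re.compile(r"^(\d+)@(\d+)Hz$")


def parse_devfreq_load(raw):
    """Parse 'PERCENT@FREQHz' into (percent, MHz); None if the text does not match."""
    match = LOAD_RE.match(raw.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)) / 1e6


def find_gpu_load_files(sys_root="/sys"):
    """Return devfreq 'load' files whose device name contains 'gpu'."""
    return sorted(Path(sys_root).glob("class/devfreq/*gpu*/load"))


if __name__ == "__main__":
    files = find_gpu_load_files()
    if not files:
        print("no devfreq GPU 'load' file found on this kernel")
    for load_file in files:
        try:
            raw = load_file.read_text()
        except OSError as err:
            print(f"{load_file}: {err}")
            continue
        parsed = parse_devfreq_load(raw)
        # Fall back to the raw text if the format differs from the assumption.
        print(load_file, parsed if parsed is not None else raw.strip())
```

Polling this during a transcription would show whether the GPU does any work or stays near 0%.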

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON BLAS | tiny | 4 | 219 | 2857 | 14bee39 |
| <todo> | <todo> |  NEON BLAS | tiny-q4_0 | 4 | 156 | 3330 | 14bee39 |
| <todo> | <todo> |  NEON BLAS | tiny-q4_1 | 4 | 167 | 3022 | 14bee39 |
| <todo> | <todo> |  NEON BLAS | tiny-q5_0 | 4 | 163 | 3650 | 14bee39 |
| <todo> | <todo> |  NEON BLAS | tiny-q5_1 | 4 | 169 | 3300 | 14bee39 |
| <todo> | <todo> |  NEON BLAS | tiny-q8_0 | 4 | 183 | 3259 | 14bee39 |
| <todo> | <todo> |  NEON BLAS | base | 4 | 291 | 4680 | 14bee39 |
| <todo> | <todo> |  NEON BLAS | base-q4_0 | 4 | 188 | 5261 | 14bee39 |
| <todo> | <todo> |  NEON BLAS | base-q4_1 | 4 | 198 | 4861 | 14bee39 |
| <todo> | <todo> |  NEON BLAS | base-q5_0 | 4 | 198 | 5161 | 14bee39 |
| <todo> | <todo> |  NEON BLAS | base-q5_1 | 4 | 196 | 5025 | 14bee39 |
| <todo> | <todo> |  NEON BLAS | base-q8_0 | 4 | 232 | 5340 | 14bee39 |
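
To make the table easier to read, the snippet below expresses each quantized model's encode time relative to the unquantized model of the same size. The numbers are copied from the Enc. column above; the table does not state a unit, but the ratios are unit-free. The helper name is illustrative.

```python
# Enc. column from the table above, keyed by model name.
ENCODE_TIME = {
    "tiny": 2857, "tiny-q4_0": 3330, "tiny-q4_1": 3022,
    "tiny-q5_0": 3650, "tiny-q5_1": 3300, "tiny-q8_0": 3259,
    "base": 4680, "base-q4_0": 5261, "base-q4_1": 4861,
    "base-q5_0": 5161, "base-q5_1": 5025, "base-q8_0": 5340,
}


def relative_to_unquantized(times, family):
    """Return {model: encode_time / unquantized_time} for one model size."""
    baseline = times[family]
    return {
        name: value / baseline
        for name, value in times.items()
        if name.startswith(family + "-")
    }


if __name__ == "__main__":
    for family in ("tiny", "base"):
        for name, ratio in relative_to_unquantized(ENCODE_TIME, family).items():
            print(f"{name:<10} {ratio:.2f}x the {family} encode time")
```

Every ratio comes out above 1.0, i.e. in this configuration the quantized models encode more slowly than the unquantized ones.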

Also, things sporadically go awry:

rock@rock-5b:~/whisper.cpp$ ./main -m models/ggml-tiny.bin -f samples/jfk.wav
whisper_init_from_file_no_state: loading model from 'models/ggml-tiny.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  201.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =   73.58 MB

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
Using Platform: ARM Platform Device: Mali-LODX r0p0
whisper_model_load: model size    =   73.54 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.000]   And so my fellow Americans?
[00:00:03.000 --> 00:00:04.000]   Are you not?
[00:00:04.000 --> 00:00:05.000]   Not.
[00:00:05.000 --> 00:00:08.000]   What your country can do for you?
[00:00:08.000 --> 00:00:11.000]   Ask what you can do for your country.


whisper_print_timings:     load time =   224.94 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   306.92 ms
whisper_print_timings:   sample time =    41.27 ms /    39 runs (    1.06 ms per run)
whisper_print_timings:   encode time =  3658.08 ms /     1 runs ( 3658.08 ms per run)
whisper_print_timings:   decode time =   397.50 ms /    39 runs (   10.19 ms per run)
whisper_print_timings:    total time =  4725.34 ms

It's probably still a bit early for the Mali G610, but I did successfully run the ArmNN OpenCL-based Wav2Vec example. It is very different from whisper, as nearly all the load is on the GPU, and it looks fantastic, being only slightly slower than the CPU.
The CPU does get tickled, but there is almost no load compared to running ArmNN on the CPU, whereas with the run above you're left wondering whether it actually used the GPU at all, apart from claiming that it did.

I think the Intel is similar, and it is far less fresh than the Mali G610, which really is still waiting for kernel changes.
https://www.collabora.com/news-and-blog/news-and-events/pancsf-a-new-drm-driver-for-mali-csf-based-gpus.html

To be honest, I have a gut feeling that OpenCL might be similar to OpenGL: not granular enough, and so it tends to limit performance.
I was sort of hoping that the GPU implementation would be CoreML/Metal & GGML/Vulkan, as maybe that would be more performant and more widely available.
I am not really sure what is happening with NPUs, as they don't seem to follow any common standard and each has its own specific framework.

Maybe what Nvidia describes is the way to go, and we should create open-source versions of https://developer.nvidia.com/blog/machine-learning-acceleration-vulkan-cooperative-matrices/

https://gist.github.com/itzmeanjan/84613bc7595372c5e6b6c22481d42f9a

https://github.com/bartwojcik/vulkano-matmul

https://www.khronos.org/assets/uploads/developers/presentations/Cooperative_Matrix_May22.pdf

@csukuangfj

FYI: You can run Whisper models with onnxruntime in C++ using sherpa-onnx on Raspberry Pi.

You can find the documentation at
https://k2-fsa.github.io/sherpa/onnx/pretrained_models/whisper/index.html

The following is the RTF (real-time factor) of tiny.en running on a Raspberry Pi 4 Model B:

[Image: Screenshot 2023-08-08 at 10 45 38]

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this issue Oct 24, 2023
Was using the wrong preprocessor macro
jettoblack pushed a commit to jettoblack/whisper.cpp that referenced this issue Feb 8, 2024
kultivator-consulting pushed a commit to KultivatorConsulting/whisper.cpp that referenced this issue Feb 12, 2024