fix coreml ANE optimized encoder #1716
Conversation
Thanks for looking into this - I will recheck the results now. For reference, here is the discussion back then: #548 (reply in thread)
Indeed, the ANE-optimized Core ML models work correctly and are faster than the original models. Here are the results that I get on an M2 Ultra with 76 GPU cores and 32 ANE cores (only the "Enc." column is relevant for this change):

master + Core ML ANE

PR + Core ML ANE
Notice however that the ANE-optimized Core ML models are not suitable for running on the GPU:

master + Core ML GPU

PR + Core ML GPU
For reference, here are the results for running the entire computation on the GPU with Metal (i.e. no Core ML):

Full Metal (no Core ML)
2024-01-05 09:24:22.023112+0800 [4575:1504107] Error: Transpose unit is not supported.

With this update, I generated a new Core ML encoder and ran it on an iPhone XR with iOS 16; it outputs the error above.
Hm, interesting. I actually didn't test whether the ANE models work on iOS. Maybe this is the problem that we observed in the past, and now there is an error actually being reported.
I just tested it on an iPhone 13 Mini (A15) with iOS 17.2.1 and it works without errors. Could it be related to the iOS version?
I think it could be the chip's problem; the iPhone XR is quite an old device, it uses the A12.
Not sure if it is due to changes in this PR or something else, but the second time running was quick. Overall performance on a 10-minute podcast was 157 seconds… on my little M1 Mac Mini, about the same as previously. The GPU was going all out according to Activity Monitor, but at least according to asitop, the ANE doesn't appear to be doing much? I see the
What happens if you switch to
Changed. Still not seeing much ANE usage in
On asitop it's expected to be mostly GPU and CPU, because the decoder is running on the GPU and is much more expensive than the encoder, and the pre/post-processing is all CPU. You should see a small amount of ANE usage though; in my testing the ANE is 10x more power-efficient than the GPU, so the usage is very minimal.
Is it good enough for realtime on iPhone now?
The ANE optimized encoder generated by
Maybe we need to update the files shared on Hugging Face?
As I reported in the old-device crash issue above, I think it needs a patch to avoid this.
I see. But this PR makes |
Transpose the result back to the format that's accepted by the decoder.
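The transpose step mentioned above can be sketched as follows. This is a minimal illustration, not code from the PR: the `(batch, channels, 1, seq_len)` layout is the ANE-friendly convention from Apple's ane_transformers work, and the concrete dimensions are assumed base-model-like values.

```python
import numpy as np

# Assumed dimensions for illustration (roughly base-model-sized).
batch, channels, seq_len = 1, 512, 1500

# ANE-optimized encoder output in the channels-first layout: (B, C, 1, S).
ane_out = np.zeros((batch, channels, 1, seq_len), dtype=np.float32)

# Drop the singleton axis and swap channels/sequence so the decoder
# receives the usual (B, S, C) layout.
decoder_in = ane_out.squeeze(2).transpose(0, 2, 1)

print(decoder_in.shape)  # (1, 1500, 512)
```

The reshape is cheap relative to the encoder itself, so paying for it once at the encoder/decoder boundary keeps the ANE-friendly layout inside the Core ML graph without changing the decoder's expectations.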
I tested with the tiny, small, and base models, and ran

./tests/run-tests.sh

and the results all look good. @ggerganov I am not sure why your previous attempt didn't work, can you double-check?

Performance-wise, this is my result on an M3 Pro with a 30-minute audio file and the base model (I used a longer audio file to get a better average encode time per segment, so you can ignore the initial Core ML model load overhead). The encode time is ~2x faster than Metal.
With ANE optimized model:
With the vanilla OpenAI Whisper model:
Metal: