I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I've been trying out the Metal implementation on an M1 Mac, and main works fine, but I would also like to be able to get embeddings. Accelerating this with Metal would be fantastic for me.
I tried to understand what would need to change, but I'm not conversant enough with the code to figure it out. Happy to try to make the changes myself and submit a PR if that would be helpful.
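For what it's worth, once I have the embeddings back (the embedding binary prints a vector of floats), the downstream use I have in mind is plain similarity scoring. A minimal sketch of that, assuming the vectors come in as Python lists of floats (this helper is my own, not part of llama.cpp):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors given as lists of floats.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 1.0
```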
Current Behavior
As far as I can tell, the embedding example does not use Metal. At least, GPU usage stays at 0% when I pass the -ngl 1 parameter.
I should also mention that using the llama-cpp-python wrapper to get embeddings also does not use the GPU, while 'normal' inference with the model does.
I haven't tested whether this is also the case with a CUDA backend, but I can if that would be useful information.
Environment and Context
I'm running on a 32 GB M1 MacBook Pro.
python = Python 3.10.10
make = GNU Make 3.81
cmake = cmake version 3.25.2
g++ = Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin22.5.0
Failure Information (for bugs)
I'm running ./bin/embedding -f abs -c 1024 -ngl 1 -m ./Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin
The content of abs is the following abstract:
Long noncoding RNAs (lncRNAs) regulate gene expression via their RNA product or through transcriptional interference, yet a strategy to differentiate these two processes is lacking. To address this, we used multiple small interfering RNAs (siRNAs) to silence GNG12-AS1, a nuclear lncRNA transcribed in an antisense orientation to the tumour-suppressor DIRAS3. Here we show that while most siRNAs silence GNG12-AS1 post-transcriptionally, siRNA complementary to exon 1 of GNG12-AS1 suppresses its transcription by recruiting Argonaute 2 and inhibiting RNA polymerase II binding. Transcriptional, but not post-transcriptional, silencing of GNG12-AS1 causes concomitant upregulation of DIRAS3, indicating a function in transcriptional interference. This change in DIRAS3 expression is sufficient to impair cell cycle progression. In addition, the reduction in GNG12-AS1 transcripts alters MET signalling and cell migration, but these are independent of DIRAS3. Thus, differential siRNA targeting of a lncRNA allows dissection of the functions related to the process and products of its transcription.
Steps to Reproduce
build with cmake ../ -DLLAMA_METAL=ON -DBUILD_SHARED_LIBS=ON
(shared libs is to work around an issue with the python binding - hopefully not relevant to this)
run ./bin/embedding -f abs -c 1024 -ngl 1 -m ./Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.bin
Failure Logs
Metal does appear to load, and I get embeddings, but there is no GPU usage.
llama.cpp: loading model from ./llms/guanaco-33B.bin
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '(null)'
ggml_metal_init: error: Error Domain=NSCocoaErrorDomain Code=258 "The file name is invalid."
@jacobfriedman Do you have ggml-metal.metal in the bin directory (or, I guess, next to wherever you're running embedding from)? If I move it out, I get that same error, and I saw the same thing with the llama-cpp-python wrapper until I found abetlen/llama-cpp-python#317 (comment).
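For anyone hitting the same NSCocoaErrorDomain Code=258 error: the workaround amounts to making sure ggml-metal.metal sits next to the binary (or in the working directory it runs from). A hypothetical helper sketch of that copy step — ensure_metal_shader, repo_root, and bin_dir are my own names, not part of llama.cpp:

```python
import os
import shutil

def ensure_metal_shader(repo_root, bin_dir):
    # ggml_metal_init loads the Metal shader source at runtime; if it can't
    # find ggml-metal.metal, it reports loading '(null)' and fails with
    # NSCocoaErrorDomain Code=258 ("The file name is invalid.").
    src = os.path.join(repo_root, "ggml-metal.metal")
    dst = os.path.join(bin_dir, "ggml-metal.metal")
    if not os.path.exists(dst):
        shutil.copy(src, dst)
    return dst
```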